Set a regex pattern to identify a repeating pattern of an enumeration, where the same match is repeated an unknown number of successive times

Question:

import re

input_text = "hjshhshs el principal, amplio, de gran importancia, y más costoso hotel de la zona costera. Es una sobrilla roja, bastante amplia y incluso cómoda de llevar. Hay autos rápidos, más costosos, y veloces. también, hay otro tipo de autos menos costosos"

direct_subject_modifiers = r"((?:\w+))"
modifier_connectors = r"(?:(?:,\s*|)y|(?:,\s*|)y|,)\s*(?:(?:(?:a[úu]n|todav[íi]a|incluso)\s+|)(?:de\s*gran|bastante|un\s*tanto|un\s*poco|)\s*(?:m[áa]s|menos)\s+|)"

regex = modifier_connectors + direct_subject_modifiers

matches = re.finditer(regex, input_text, re.MULTILINE | re.IGNORECASE)

input_text = re.sub(matches, lambda m: (f"((DESCRIP){m[1]})"), input_text, re.IGNORECASE)
print(repr(input_text))

How can I build a regex that detects a successive description of n elements, where each element matches these 2 patterns (regex = modifier_connectors + direct_subject_modifiers) and the match repeats an unknown number of times?

This is the output I expect after identifying the elements in the string and wrapping them in parentheses. Keep in mind that the same string can contain more than one enumeration that must be wrapped; in this example there are 3 of them.

"hjshhshs el ((DESCRIP)principal, amplio, de gran importancia, y más costoso) hotel de la zona costera. Es una sobrilla ((DESCRIP)roja, bastante amplia y incluso cómoda) de llevar. Hay autos ((DESCRIP)rápidos, más costosos, y veloces). también, hay otro tipo de autos menos costosos"
Asked By: Matt095


Answers:

First, some of the issues with the regex and the code:

  • As the first regex does not match spacing, the second regex should allow for spaces before and after the words it matches. This was not done consistently (e.g. after "de gran" no space was matched).

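  • As your text contains accented letters, you would need to apply the re.UNICODE modifier, so \w will also match those. (Python 3 already applies it by default to str patterns, so there the flag is redundant but harmless.)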

  • To avoid false positives, you’d better also add some \b (word boundary) assertions in the regex, to make sure y doesn’t match inside hay, and to avoid similar issues.

  • re.sub expects a regex as first argument, not the matches from re.finditer

  • re.sub expects the flags as its 5th argument, not its 4th (the 4th is count), so it is safest to pass them by keyword: flags=re.IGNORECASE

  • m[1] will only reproduce what the first capture group matched, while you need all the matched text to be reproduced here. This would be m[0], but see comment below.
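
To make a few of the points above concrete, here is a minimal sketch. The sample strings are made up for illustration and are not taken from the question.

```python
import re

text = "Hay pan Y vino"  # made-up sample string

# Without \b, the pattern also hits the "y" inside "Hay".
loose = re.findall(r"y", text, flags=re.IGNORECASE)
# With \b on both sides, only the standalone word matches.
strict = re.findall(r"\by\b", text, flags=re.IGNORECASE)
print(loose, strict)  # ['y', 'Y'] ['Y']

# The signature is re.sub(pattern, repl, string, count=0, flags=0):
# the first argument is a pattern, and flags are safest passed by keyword.
replaced = re.sub(r"\by\b", "&", text, flags=re.IGNORECASE)
print(replaced)  # Hay pan & vino

# m[0] is the whole match; m[1] is only what the first capture group matched.
m = re.search(r"(\w+), (\w+)", "roja, amplia")
print(m[0], "|", m[1])  # roja, amplia | roja
```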

Not a problem, but there is a ? operator you could use to make a group optional, instead of adding an empty alternative as in (...|).
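
As a small illustration (the sample strings are my own), both spellings of an optional comma accept exactly the same inputs:

```python
import re

# Made-up connector samples; the last one should be rejected by both patterns.
samples = ["y", ", y", ",y", ",  y", "; y"]

with_empty_alt = re.compile(r"(?:,\s*|)y")  # empty alternative, as in the question
with_optional = re.compile(r"(?:,\s*)?y")   # same meaning, using the ? quantifier

results = [
    (bool(with_empty_alt.fullmatch(s)), bool(with_optional.fullmatch(s)))
    for s in samples
]
print(results)
# [(True, True), (True, True), (True, True), (True, True), (False, False)]
```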

The re.sub callback is not needed. You can provide a string literal as second argument and reproduce the matched string with \g<0>.
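
A quick check on a made-up two-word sample, showing that the callback and the replacement string give the same result:

```python
import re

text = "roja, amplia"  # made-up sample
pattern = r"\w+, \w+"

# Callback version: m[0] is the whole matched text.
via_callback = re.sub(pattern, lambda m: f"((DESCRIP){m[0]})", text)
# Template version: \g<0> refers to the whole match; the parentheses are literal here.
via_template = re.sub(pattern, r"((DESCRIP)\g<0>)", text)

print(via_callback)                  # ((DESCRIP)roja, amplia)
print(via_callback == via_template)  # True
```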

As to the main question: you can use the {2,} quantifier to repeat a pattern at least twice.
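
Stripped down to a toy version (plain words separated by commas, which is a simplification of the patterns in the question), the idea looks like this:

```python
import re

item = r"\w+"
separator = r",\s*"

# One item, followed by at least two more "separator + item" repetitions,
# so an enumeration needs three or more items in total.
enumeration = item + "(?:" + separator + item + "){2,}"

found = re.findall(enumeration, "a, b and c, d, e, f")
print(found)  # ['c, d, e, f']
```

"a, b" is skipped because it has only one repetition after the first item.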

Here is the code I ended up with:

import re

input_text = "hjshhshs el principal, amplio, de gran importancia, y más costoso hotel de la zona costera. Es una sobrilla roja, bastante amplia y incluso cómoda de llevar. Hay autos rápidos, más costosos, y veloces. también, hay otro tipo de autos menos costosos"

direct_subject_modifiers = r"((?:\w+))"
# always match prefix and postfix spaces
modifier_connectors = r"\s*(?:\by\s+|,\s*(?:y\s+)?)\b(?:(?:a[úu]n|todav[íi]a|incluso)?(?:de\s*gran|m[áa]s|menosde\s*gran|bastante|(?:un\s*tanto|un\s*poco)?\s*(?:m[áa]s|menos))?)\b\s*"

regex = direct_subject_modifiers + "(?:" + modifier_connectors  + direct_subject_modifiers + "){2,}"

# Added Unicode mode, to match accented letters too
# First arg should be the regex.
# No need for callback argument. Just back reference with \g<0>
# No need to escape parentheses in the replacement string

input_text = re.sub(regex, r"((DESCRIP)\g<0>)", input_text, flags=re.I | re.U)
print(repr(input_text))

Output:

'hjshhshs el ((DESCRIP)principal, amplio, de gran importancia, y más costoso) hotel de la zona costera. Es una sobrilla ((DESCRIP)roja, bastante amplia y incluso cómoda) de llevar. Hay autos ((DESCRIP)rápidos, más costosos, y veloces). también, hay otro tipo de autos menos costosos'

On a final note: language grammar is too complex to be parsed with regular expressions. You’ll always bump into examples where the regex-based solution falls short… and the code will become hard to maintain. Tokenising the input and then applying rules via code will be easier to manage.
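
To give an idea of what that could look like, here is a rough sketch, not a complete solution. The word lists CONNECTORS and INTENSIFIERS and the helper names are my own assumptions for illustration, and multi-word modifiers such as "de gran" are not handled.

```python
import re

# Hypothetical word lists, for illustration only; a real grammar needs more.
CONNECTORS = {",", "y"}
INTENSIFIERS = {"más", "menos", "bastante", "incluso", "aún", "todavía"}

def tokenize(text):
    """Return (token, start, end) triples: words or single punctuation marks."""
    return [(m[0], m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", text)]

def is_word(tok):
    """True for a plain word that is neither a connector nor an intensifier."""
    return tok[0][0].isalnum() and tok[0].lower() not in CONNECTORS | INTENSIFIERS

def find_enumerations(text, min_items=3):
    """Return substrings of the form: item (connector+ intensifier* item)+."""
    tokens = tokenize(text)
    spans = []
    i = 0
    while i < len(tokens):
        if not is_word(tokens[i]):
            i += 1
            continue
        items, last, j = 1, i, i + 1
        while True:
            k = j
            # Rule 1: one or more connector tokens (",", "y" or ", y").
            while k < len(tokens) and tokens[k][0].lower() in CONNECTORS:
                k += 1
            if k == j:
                break  # no connector, so the enumeration ends here
            # Rule 2: optional intensifiers before the next item.
            while k < len(tokens) and tokens[k][0].lower() in INTENSIFIERS:
                k += 1
            # Rule 3: the next item must be a plain word.
            if k >= len(tokens) or not is_word(tokens[k]):
                break
            items, last, j = items + 1, k, k + 1
        if items >= min_items:
            spans.append(text[tokens[i][1]:tokens[last][2]])
            i = last + 1
        else:
            i += 1
    return spans

sample = "Es una sombrilla roja, bastante amplia y incluso cómoda de llevar."
print(find_enumerations(sample))
# ['roja, bastante amplia y incluso cómoda']
```

Each rule is an ordinary statement in the loop, so adding or changing one does not mean rewriting a single long pattern.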

Answered By: trincot