Why doesn't this regex capture group stop at the set condition instead of capturing until the end of the line?

Question:

import re

input_text = "((PL_ADVB)alrededor (NOUN)(del auto rojizo, algo grande y completamente veloz)). Luego dentro del baúl rápidamente abajo de una caja por sobre ello vimos una caña." #example input

#place_reference = r"((?i:\w\s*)+)?"
#place_reference = r"(?i:[\w,;.]\s*)+" <--- greedy regex
place_reference = r"(?i:[\w,;.]\s*)+?"


list_all_adverbs_of_place = ["adentro", "dentro", "al rededor", "alrededor", "abajo", "hacía", "hacia", "por sobre", "sobre"]
list_limiting_elements = list_all_adverbs_of_place + ["vimos", "hemos visto", "encontramos", "hemos encontrado", "rápidamente", "rapidamente", "intensamente", "durante", "luego", "ahora", ".", ":", ";", ",", "(", ")", "[", "]", "¿", "?", "¡", "!", "&", "="]

pattern = re.compile(rf"(?:(?<=\s)|^)({'|'.join(re.escape(x) for x in list_all_adverbs_of_place)})?(\s+{place_reference})\s*({'|'.join(re.escape(x) for x in list_limiting_elements)})", flags = re.IGNORECASE)

input_text = re.sub(pattern,
                    #lambda m: f"((PL_ADVB){m[1]}{m[2]}){m[3]}",
                    lambda m: f"((PL_ADVB){m[1]}{m[2]}){m[3]}" if m[2] else f"((PL_ADVB){m[1]} NO_DATA){m[3]}",
                    input_text)

print(repr(input_text)) #--> output

When I use lambda m: f"((PL_ADVB){m[1]}{m[2]}){m[3]}" if m[2] else f"((PL_ADVB){m[1]} NO_DATA){m[3]}" I get this wrong output:

'((PL_ADVB)alrededor (NOUN)(del auto rojizo, algo grande y completamente veloz)). Luego ((PL_ADVB)dentro del baúl rápidamente abajo de una caja por sobre ello vimos una caña).'

Notice that the capture group m[3] captured only the final "."

That is not correct, because not everything should end up inside the parentheses. The correct output would be:

"((PL_ADVB)alrededor ((NOUN)del auto rojizo, algo grande y completamente veloz)). Luego ((PL_ADVB)dentro del baúl) rápidamente ((PL_ADVB)abajo de una caja) ((PL_ADVB)por sobre ello) vimos una caña."

list_all_adverbs_of_place represents the start of the capturing group, and list_limiting_elements represents the end of the capturing group.

Asked By: Matt095

||

Answers:

If I understand your question correctly, the issue is that the text "por sobre ello" is not wrapped by the regular expression.

The regular expression is trying to find a word from the first list (an adverb of place), followed by the text we are interested in, followed by a word from the second list (a limiting element).

If we run your example, here are the matches it makes for the text given:

import re

input_text = "((PL_ADVB)alrededor (NOUN)(del auto rojizo, algo grande y completamente veloz)). Luego dentro del baúl rápidamente abajo de una caja por sobre ello vimos una caña."

list_all_adverbs_of_place = [
    "adentro",
    "dentro",
    "al rededor",
    "alrededor",
    "abajo",
    "hacía",
    "hacia",
    "por sobre",
    "sobre"]

list_limiting_elements = list_all_adverbs_of_place + [
    "vimos",
    "hemos visto",
    "encontramos",
    "hemos encontrado",
    "rápidamente", "rapidamente",
    "intensamente",
    "durante",
    "luego",
    "ahora", ".", ":", ";", ",", "(", ")", "[", "]", "¿", "?", "¡", "!", "&", "="]

# For the sake of this question, this could all be simplified
place_reference = r"(?i:[\w,;.]\s*)+?"  # lazy pattern from the question

pattern = re.compile(
    rf"(?:(?<=\s)|^)({'|'.join(re.escape(x) for x in list_all_adverbs_of_place)})?(\s+{place_reference})\s*({'|'.join(re.escape(x) for x in list_limiting_elements)})", flags = re.IGNORECASE)


for match in pattern.finditer(input_text):
    print(match.group(1, 2, 3))

This shows the results:

('dentro', ' del baúl ', 'rápidamente')
('abajo', ' de una caja ', 'por sobre')

And running your code above gives the output

'((PL_ADVB)alrededor (NOUN)(del auto rojizo, algo grande y completamente veloz)). Luego ((PL_ADVB)dentro del baúl )rápidamente ((PL_ADVB)abajo de una caja )por sobre ello vimos una caña.'

However, "por sobre ello" is not wrapped in parentheses, as you want.

If we take this output and feed it in again, the regular expression now does match the remaining part:

for match in pattern.finditer(input_text):
    print(match.group(1, 2, 3))

This shows:

('sobre', ' ello ', 'vimos')

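The issue is that "por sobre" was consumed as the limiting element (the third group) of the previous match. Regex matches cannot overlap, so it was no longer available to start the next match, and "por sobre ello" was missed. This can be fixed by specifying the limiting element in a look-ahead assertion, which checks that the word follows without consuming it.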
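To make the non-overlapping behaviour concrete, here is a minimal sketch on a shortened sentence. The names `starts`, `stops`, `consuming` and `peeking` are made up for this illustration, and the patterns are deliberately simpler than the one in the question.

```python
import re

text = "abajo de una caja por sobre ello vimos"

starts = r"abajo|por sobre"   # simplified stand-in for the adverb list
stops = r"por sobre|vimos"    # simplified stand-in for the limiting list

# The stop word is part of the match, so "por sobre" is used up by the
# first match and cannot begin a second one.
consuming = re.compile(rf"({starts})\s+(.+?)\s*({stops})")
consumed_matches = [m.group(1, 2) for m in consuming.finditer(text)]

# The stop word is only checked with a look-ahead, so the next search
# starts right at "por sobre" and finds it as a new start word.
peeking = re.compile(rf"({starts})\s+(.+?)\s*(?={stops})")
peeked_matches = [m.group(1, 2) for m in peeking.finditer(text)]

print(consumed_matches)  # [('abajo', 'de una caja')]
print(peeked_matches)    # [('abajo', 'de una caja'), ('por sobre', 'ello')]
```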

You can take the third word regular expression

(third|list|of|words)

and wrap it in a (?=...) assertion:

(?=(third|list|of|words))
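One detail worth checking is where the capture group sits relative to the look-ahead, because it changes what m[3] contains in the replacement function. The sketch below uses a tiny hypothetical pattern, not the full one from the question, to compare both placements.

```python
import re

text = "sobre ello vimos"

# Capture group INSIDE the look-ahead: group 3 holds the limiting word,
# even though that word is not part of the overall match.
inside = re.search(r"(sobre)(\s+\w+\s*)(?=(vimos))", text)

# Capture group AROUND the look-ahead: group 3 is always the empty string.
outside = re.search(r"(sobre)(\s+\w+\s*)((?=vimos))", text)

inside_groups = inside.group(0, 3)
outside_groups = outside.group(0, 3)

print(inside_groups)   # ('sobre ello ', 'vimos')
print(outside_groups)  # ('sobre ello ', '')
```

With the first form, a replacement that appends m[3] would write the limiting word a second time, since re.sub leaves the unconsumed word in place. With the second form, appending m[3] adds nothing, so the replacement lambda from the question can stay as it is.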

So that would make the final regular expression:

pattern = re.compile(
    rf"(?:(?<=\s)|^)({'|'.join(re.escape(x) for x in list_all_adverbs_of_place)})?(\s+{place_reference})\s*((?={'|'.join(re.escape(x) for x in list_limiting_elements)}))", flags = re.IGNORECASE)
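Putting it together, the following is a sketch of the whole substitution using the look-ahead pattern and the lists from the question. The helper `wrap` and the variables `starts` and `stops` are not from the original code; `wrap` is an assumed tweak that moves the trailing whitespace of group 2 outside the closing parenthesis, which seems to be what the desired output asks for.

```python
import re

input_text = (
    "((PL_ADVB)alrededor (NOUN)(del auto rojizo, algo grande y completamente veloz)). "
    "Luego dentro del baúl rápidamente abajo de una caja por sobre ello vimos una caña."
)

place_reference = r"(?i:[\w,;.]\s*)+?"

list_all_adverbs_of_place = ["adentro", "dentro", "al rededor", "alrededor", "abajo",
                             "hacía", "hacia", "por sobre", "sobre"]
list_limiting_elements = list_all_adverbs_of_place + [
    "vimos", "hemos visto", "encontramos", "hemos encontrado", "rápidamente",
    "rapidamente", "intensamente", "durante", "luego", "ahora",
    ".", ":", ";", ",", "(", ")", "[", "]", "¿", "?", "¡", "!", "&", "="]

# Build the two alternations once so the pattern itself stays readable.
starts = "|".join(re.escape(x) for x in list_all_adverbs_of_place)
stops = "|".join(re.escape(x) for x in list_limiting_elements)

# Same structure as the final pattern above: the limiting element is only
# checked with a look-ahead, so it can start the next match.
pattern = re.compile(
    rf"(?:(?<=\s)|^)({starts})?(\s+{place_reference})\s*((?={stops}))",
    flags=re.IGNORECASE,
)

def wrap(m):
    # Hypothetical helper: keep the trailing whitespace of group 2
    # outside the closing parenthesis.
    body = m[2]
    stripped = body.rstrip()
    return f"((PL_ADVB){m[1]}{stripped}){body[len(stripped):]}"

result = pattern.sub(wrap, input_text)
print(result)
```

On this input the part after "Luego" comes out as "((PL_ADVB)dentro del baúl) rápidamente ((PL_ADVB)abajo de una caja) ((PL_ADVB)por sobre ello) vimos una caña.", while the already tagged text at the start of the sentence is left untouched.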
Answered By: The Matt