How can I capture all the substrings matched by a regex group that repeats 2 or more consecutive times?


import re

input_text = "((PERS)Marcos) ssdsdsd sdsdsdsd sdsdsd le ((VERB)empujé) hasta ((VERB)dejarle) en ese lugar. A ((PERS)Marcos) le ((VERB)dijeron) y luego le ((VERB)ayudo)"

input_text = re.sub(r"\(\(PERS\)((?:\w\s*)+)\)\s*((?!el)\w+\s+){2,}(le)",
                    lambda m: print(f"{m[2]}"),
                    input_text, flags = re.IGNORECASE)

print(repr(input_text)) # --> output

Here I have used the quantifiers + (one or more repetitions) and * (zero or more repetitions), together with {2,} to require at least two repetitions.

Why does this code output only a single word (the last one matched), and not all the words that the pattern ((?!el)\w+\s+){2,} covers? This pattern should match when there are 2 or more words between "((PERS) )" and the final "le". The output I get is:

"sdsdsd "

And not this output, which is what I want to get:

" ssdsdsd sdsdsdsd sdsdsd "

How could I fix my regex to get this result when I print capturing group 2?

Asked By: Matt095



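Wrap the entire repeated part, (\s*((?!el)\w+\s+){2,}), in one capturing group. A capturing group that is repeated by a quantifier keeps only the text of its last iteration, so an outer group is needed to capture the whole run of words.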

m = re.search(r"\(\(PERS\)((?:\w\s*)+)\)(\s*((?!el)\w+\s+){2,})(le)",
              input_text, flags=re.IGNORECASE)
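
To see the difference side by side, here is a minimal sketch that runs both patterns against the question's input. The variable names (repeated, wrapped, last_only, whole_run) are made up for illustration, and the inner group is made non-capturing here, which is an optional variation on the pattern above rather than part of the original answer.

```python
import re

input_text = (
    "((PERS)Marcos) ssdsdsd sdsdsdsd sdsdsd le ((VERB)empujé) hasta "
    "((VERB)dejarle) en ese lugar. A ((PERS)Marcos) le ((VERB)dijeron) "
    "y luego le ((VERB)ayudo)"
)

# Question's pattern: group 2 is itself repeated by {2,},
# so after matching it holds only its last iteration.
repeated = re.compile(
    r"\(\(PERS\)((?:\w\s*)+)\)\s*((?!el)\w+\s+){2,}(le)",
    flags=re.IGNORECASE,
)

# Wrapped pattern: group 2 now encloses the whole repetition.
# The inner group is non-capturing (an assumption made for this sketch),
# so the final "le" stays in group 3.
wrapped = re.compile(
    r"\(\(PERS\)((?:\w\s*)+)\)(\s*(?:(?!el)\w+\s+){2,})(le)",
    flags=re.IGNORECASE,
)

last_only = repeated.search(input_text)[2]
whole_run = wrapped.search(input_text)[2]

print(repr(last_only))
print(repr(whole_run))
```

With the answer's pattern as written (inner group kept as capturing), the whole run is still in group 2, but the final "le" moves from group 3 to group 4.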
Answered By: Unmitigated