Why does capturing the capture group identified with this regex search pattern fail?

Question

import datetime
import re

input_text_substring = "durante el transcurso del mes de diciembre de 2350" #example 1
#input_text_substring = "durante el transcurso del mes de diciembre del año 2350" #example 2
#input_text_substring = "durante el transcurso del mes 12 2350" #example 3

# If the text contains neither "del año" + (any number of digits) nor (anything) + (a 4-digit year) at the very end
if not re.search(r"(?:(?:del|de[\s|]*el|el)[\s|]*(?:año|ano)[\s|]*\d*|.*\d{4}$)", input_text_substring):
    input_text_substring += " de " + datetime.datetime.today().strftime('%Y') + " "

#For when no previous phrase indicative of context was indicated, for example "del año" and the number of digits is not 4

some_text = r"(?:(?!\.\s*?\n)[^;])*" # a month number or some other text that contains no ";" and no "." followed by "\n" (it must also allow nothing at all, or only whitespace, in the middle)

#we need to capture the group in the position of the last \d*
m1 = re.search( r"(?:del[\s|]*mes|de[\s|]*el[\s|]*mes|de[\s|]*mes|\d{2})" + some_text + r"(?P<year>\d*)" , str(input_text_substring), re.IGNORECASE, )
#if m1: identified_year = str(m1.groups()["\g<year>"])
if m1: identified_year = str(m1.groups()[0])

input_text_substring = re.sub( r"(?:del[\s|]*mes|de[\s|]*el[\s|]*mes|de[\s|]*mes|\d{2})" + some_text + r"\d*", identified_year, input_text_substring )


print(repr(identified_year))
print(repr(input_text_substring))

This is the wrong output that I get with this code (tested with example 1):

''
'durante el transcurso '

And this is the correct output that I need:

'2350' #in example 1, 2 and 3
'durante el transcurso del mes de diciembre 2350' #in example 1 and 2
'durante el transcurso del mes 12 2350' #in example 3

Why can’t I capture the numeric value of the year, (?P<year>\d*), using the capture group references m1.groups()["\g<year>"] or m1.groups()[0]?

Asked By: Matias Nicolas Rodriguez


Answer 1

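The year group matches only the empty string because the preceding pattern, some_text, has already consumed the year: [^;] with a greedy * runs to the end of the input, and since \d* is satisfied by zero digits, the regex engine never has to backtrack.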
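A minimal sketch of that behaviour, using example 1 from the question with the patterns as they are presumably meant to be written (backslashes included). The names prefix, text and m are only illustrative and do not appear in the question.

```python
import re

text = "durante el transcurso del mes de diciembre de 2350"

# Same building blocks as in the question; "prefix" is an illustrative name.
prefix = r"(?:del[\s|]*mes|de[\s|]*el[\s|]*mes|de[\s|]*mes|\d{2})"
some_text = r"(?:(?!\.\s*?\n)[^;])*"  # greedy: takes every character that is not ';'

m = re.search(prefix + some_text + r"(?P<year>\d*)", text, re.IGNORECASE)

# The overall match runs to the end of the string, digits included...
print(repr(m.group(0)))       # 'del mes de diciembre de 2350'
# ...so nothing is left for the named group.
print(repr(m.group("year")))  # ''
```

Note also that groups() returns a plain tuple, so indexing it with a string raises a TypeError. A named group is read with m.group("year") or m["year"].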

One way to have the previous pattern not consume the year, is to extend the negative look-ahead as follows:

some_text = r"(?:(?!\.\s*?\n|\d{4})[^;])*"
#                           ^^^^^^
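As a rough check, the extended look-ahead can be run against the three example inputs from the question. This sketch leaves out the question's first step (appending the current year) and keeps the rest of the pattern unchanged; prefix, pattern, examples and years are illustrative names.

```python
import re

prefix = r"(?:del[\s|]*mes|de[\s|]*el[\s|]*mes|de[\s|]*mes|\d{2})"
# The look-ahead now also refuses to step onto a run of four digits,
# so some_text stops right before the year.
some_text = r"(?:(?!\.\s*?\n|\d{4})[^;])*"
pattern = re.compile(prefix + some_text + r"(?P<year>\d*)", re.IGNORECASE)

examples = [
    "durante el transcurso del mes de diciembre de 2350",       # example 1
    "durante el transcurso del mes de diciembre del año 2350",  # example 2
    "durante el transcurso del mes 12 2350",                    # example 3
]

years = []
for text in examples:
    m = pattern.search(text)
    years.append(m.group("year") if m else None)

print(years)  # ['2350', '2350', '2350']
```

One caveat of this approach: some_text now stops at the first run of four digits anywhere after the prefix, so an input with an earlier four-digit number would yield that number instead.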

In the expected results you want to keep "del mes…" in the final value of input_text_substring. If that is the case, just don’t strip that part of the string: remove the final re.sub call. But maybe you overlooked this in your question?

Finally, [\s|]* is not really what you want: it would match a literal | in your input. Moreover, you seem to want to match at least one whitespace character. So replace these occurrences with \s+.
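The difference between the two separators can be seen on a few short strings. This is only a sketch; old, new, samples and results are illustrative names.

```python
import re

old = re.compile(r"del[\s|]*mes")  # class holds whitespace AND a literal '|', zero or more times
new = re.compile(r"del\s+mes")     # one or more whitespace characters only

samples = ["del mes", "del   mes", "del|mes", "delmes"]

# For each sample: (matched by old pattern, matched by new pattern)
results = {s: (bool(old.fullmatch(s)), bool(new.fullmatch(s))) for s in samples}

for s, (old_ok, new_ok) in results.items():
    print(f"{s!r:12} old={old_ok} new={new_ok}")
```

Only the first two samples are accepted by both patterns; "del|mes" and "delmes" are accepted by the old pattern alone.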

Answered By: trincot
