Why does capturing the capture group identified with this regex search pattern fail?
Question:
import re
import datetime
input_text_substring = "durante el transcurso del mes de diciembre de 2350" #example 1
#input_text_substring = "durante el transcurso del mes de diciembre del año 2350" #example 2
#input_text_substring = "durante el transcurso del mes 12 2350" #example 3
##If it is NOT "del año" + "(it doesn't matter how many digits)" or if it is NOT "(it doesn't matter what comes before it)" + "(year of 4 digits)"
if not re.search(r"(?:(?:del|de[\s|]*el|el)[\s|]*(?:año|ano)[\s|]*\d*|.*\d{4}$)", input_text_substring):
    input_text_substring += " de " + datetime.datetime.today().strftime('%Y') + " "
#For when no previous phrase indicative of context was indicated, for example "del año" and the number of digits is not 4
some_text = r"(?:(?!\.\s*?\n)[^;])*" #a month number or some other text without dots (.), semicolons (;) or \n (it must also admit the case where there is nothing in the middle, or only a whitespace)
#we need to capture the group in the position of the last \d*
m1 = re.search( r"(?:del[\s|]*mes|de[\s|]*el[\s|]*mes|de[\s|]*mes|\d{2})" + some_text + r"(?P<year>\d*)" , str(input_text_substring), re.IGNORECASE, )
#if m1: identified_year = str(m1.groups()["g<year>"])
if m1: identified_year = str(m1.groups()[0])
input_text_substring = re.sub( r"(?:del[\s|]*mes|de[\s|]*el[\s|]*mes|de[\s|]*mes|\d{2})" + some_text + r"\d*", identified_year, input_text_substring )
print(repr(identified_year))
print(repr(input_text_substring))
This is the wrong output that I get with this code (tested in the example 1):
''
'durante el transcurso '
And this is the correct output that I need:
'2350' #in example 1, 2 and 3
'durante el transcurso del mes de diciembre 2350' #in example 1 and 2
'durante el transcurso del mes 12 2350' #in example 3
Why can’t I capture the numeric value of the year, (?P<year>\d*), using the capture group references m1.groups()["g<year>"] or m1.groups()[0]?
Answers:
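The <year> group matches nothing because the preceding pattern has already consumed the year: [^;] also matches digits, and with a greedy * it runs to the end of the string, leaving only an empty match for \d*.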
One way to keep the previous pattern from consuming the year is to extend the negative look-ahead as follows:
some_text = r"(?:(?!\.\s*?\n|\d{4})[^;])*"
#                           ^^^^^^
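To make the difference concrete, here is a minimal sketch that runs both versions of some_text against example 1. The names PREFIX, GREEDY_TEXT, GUARDED_TEXT and find_year are hypothetical helpers introduced only for this illustration, and the prefix already uses \s+ in place of [\s|]*.

```python
import re

TEXT = "durante el transcurso del mes de diciembre de 2350"  # example 1

# Hypothetical names, used only for this comparison.
PREFIX = r"(?:del\s+mes|de\s+el\s+mes|de\s+mes|\d{2})"
GREEDY_TEXT = r"(?:(?!\.\s*?\n)[^;])*"         # original: [^;] also eats the digits
GUARDED_TEXT = r"(?:(?!\.\s*?\n|\d{4})[^;])*"  # stops right before a 4-digit run


def find_year(some_text, text=TEXT):
    """Return the text captured by the named group, or None if nothing matched."""
    m = re.search(PREFIX + some_text + r"(?P<year>\d*)", text, re.IGNORECASE)
    return m.group("year") if m else None


print(repr(find_year(GREEDY_TEXT)))   # ''
print(repr(find_year(GUARDED_TEXT)))  # '2350'
```

With the guarded version, the repeated group stops as soon as four digits follow, so \d* still has something left to capture.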
In the expected results you want to keep "del mes…" in the final output of input_text_substring, but if that is the case then just don’t remove that part of the string with the last call of re.sub: remove that statement. But maybe you overlooked this in your question?
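As a rough sketch of that suggestion, the tail of the script could look like this once the final re.sub call is dropped. It assumes only the year needs to be extracted and the input string should stay untouched; the variable name pattern is hypothetical. It also reads the group by name with m1.group("year"): m1.groups() returns a plain tuple, so indexing it with a string such as "g<year>" raises a TypeError.

```python
import re

input_text_substring = "durante el transcurso del mes de diciembre de 2350"  # example 1

# "pattern" is a hypothetical name; the regex uses the extended look-ahead and \s+.
pattern = re.compile(
    r"(?:del\s+mes|de\s+el\s+mes|de\s+mes|\d{2})"
    r"(?:(?!\.\s*?\n|\d{4})[^;])*"
    r"(?P<year>\d*)",
    re.IGNORECASE,
)

identified_year = ""  # default, so the name exists even when nothing matches
m1 = pattern.search(input_text_substring)
if m1:
    identified_year = m1.group("year")  # access by group name
    # Equivalent ways to read the same group:
    assert identified_year == m1.group(1) == m1.groups()[0] == m1.groupdict()["year"]

print(repr(identified_year))       # '2350'
print(repr(input_text_substring))  # unchanged, because no re.sub is applied
```

Giving identified_year a default value also avoids a NameError later on when the search finds nothing.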
Finally, [\s|]* is not really what you want: it would match a literal | in your input. Moreover, you seem to want to match at least one white space character. So replace these occurrences with \s+.
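A small sketch of the difference between the two spellings; the names loose, strict, samples and results are hypothetical and exist only for this comparison:

```python
import re

# Hypothetical names for the two variants being compared.
loose = re.compile(r"del[\s|]*mes")  # character class: whitespace OR a literal "|", zero or more times
strict = re.compile(r"del\s+mes")    # one or more whitespace characters

samples = ["del mes", "del|mes", "delmes", "del   mes"]
results = {
    s: (bool(loose.fullmatch(s)), bool(strict.fullmatch(s)))
    for s in samples
}

for s, (loose_ok, strict_ok) in results.items():
    print(f"{s!r:12} loose={loose_ok} strict={strict_ok}")
```

Inside a character class, | is an ordinary character rather than alternation, which is why the loose pattern accepts "del|mes", and the * quantifier is why it also accepts "delmes".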