Why does this regex fail when trying to tell if there is a numeric character or more than one numeric character in the middle of a pattern?
Question:
import re
input_text = "hay alrededor de 4 coches el dia 7 del mes 5" #example 1
input_text = "Serian 4 unidades de mermelada para el dia 04 del 8 de este año 2023" #example 2
input_text = "Hay 10 unidades para el dia 15 del 12 y seran 9 ya para el 7 de noviembre" #example 3
input_text = "Hay 10 unidades para el 15 del 1º mes del año y seran alrededor de 9 para el 7º dia del mes de noviembre" #example 4
#for days
standard_number_of_digits_re_1 = r"(?:del dia|de el dia|el dia|del|de el|el)[s|]*(bd{1}b)"
standard_number_of_digits_re_2 = r"(bd{1}b)[s|]*º[s|]*dia"
re_1_and_re_2 = r"(?:" + standard_number_of_digits_re_1 + r"|" + standard_number_of_digits_re_2 + r")"
#for months
#standard_number_of_digits_re_3 = re_1_and_re_2 + r"(?:del mes|de el mes|del|de el)[s|]*(bd{1}b)"
#standard_number_of_digits_re_4 = re_1_and_re_2 + r"(?:del mes|de el mes|del|de el)[s|]*(bd{1}b)[s|]*(?:º[s|]*mes del año|º[s|]*mes)"
standard_number_of_digits_re_3 = r"(?:del mes|de el mes|del|de el)[s|]*(bd)"
standard_number_of_digits_re_4 = r"(?:del mes|de el mes|del|de el)[s|]*(bd)[s|]*(?:º[s|]*mes del año|º[s|]*mes)"
#replacement with this conditions, and put '0' in front a day number only if it is one number and not two(or more numbers)
# example: '1' --> '01' or '10' --> '10'
input_text = re.sub(standard_number_of_digits_re_3, r"01", input_text)
input_text = re.sub(standard_number_of_digits_re_4, r"01", input_text)
input_text = re.sub(standard_number_of_digits_re_1, r"01", input_text)
input_text = re.sub(standard_number_of_digits_re_2, r"01", input_text)
print(repr(input_text)) #output
The problem I’m having is that the regex don’t seem to work correctly since I can’t do the replacements, but trying in several code editors I notice these 2 things:
- In the Regex Debuger https://regex101.com/r/TfsKHZ/1 the regex appear to catch the correct text groups, but even so within the code they fail
- In the SublimeText editor, when I put an
r
in front of "(?:"
the whole line is immediately put as if it were a string of characters although you can see how this string is separated with operators in the middle in charge of the concatenation, however in StackOverflow that line is displayed correctly.
I do not know if these observations are important in the malfunction of these regex, but for some reason that I cannot find, the regex cannot extract the text to be edited by placing (if applicable) a 0 in front.
Leaving the output should be like this when printing them:
"hay alrededor de 4 coches el dia 07 del mes 05" #for example 1
"Serian 4 unidades de mermelada para el dia 04 del 08 de este año 2023" #for example 2
"Hay 10 unidades para el dia 15 del 12 y seran 9 ya para el 07 de noviembre" #for example 3
"Hay 10 unidades para el 15 del 01º mes del año y seran alrededor de 9 para el 07º dia del mes de noviembre" #for example 4
What modifications should I make to get these results? I think the problem is in this part of the regex (bd{1}b)
Edited code
import re
input_text = "hay alrededor de 4 coches el dia 7 del mes 5" #example 1
#input_text = "Serian 4 unidades de mermelada para el dia 04 del 8 de este año 2023" #example 2
input_text = "Hay 10 unidades para el dia 15 del 12 y seran 9 ya para el 7 de noviembre" #example 3
#input_text = "Hay 10 unidades para el 15 del 1º mes del año y seran alrededor de 9 para el 7º dia del mes de noviembre" #example 4
#for days
standard_number_of_digits_re_1 = r"(?:del dia|de el dia|el dia|del|de el|el)[s|]*(bd)"
standard_number_of_digits_re_2 = r"(bd)[s|]*º[s|]*dia"
#for months
standard_number_of_digits_re_3 = r"(?:del mes|de el mes|del|de el)[s|]*(bd)"
standard_number_of_digits_re_4 = r"(?:del mes|de el mes|del|de el)[s|]*(bd)[s|]*(?:º[s|]*mes del año|º[s|]*mes)"
#replacement with this conditions, and put '0' in front a day number only if it is one number and not two(or more numbers)
# example: '1' --> '01' or '10' --> '10'
input_text = re.sub(standard_number_of_digits_re_3, r"01", input_text)
print(repr(input_text)) #output
input_text = re.sub(standard_number_of_digits_re_4, r"01", input_text)
print(repr(input_text)) #output
input_text = re.sub(standard_number_of_digits_re_1, r"01", input_text)
print(repr(input_text)) #output
input_text = re.sub(standard_number_of_digits_re_2, r"01", input_text)
print(repr(input_text)) #output
The problem is for example in the example 3 :
'Hay 10 unidades para 015 012 y seran 9 ya para 07 de noviembre'
And the correct output is:
'Hay 10 unidades para 15 12 y seran 9 ya para 07 de noviembre'
Answers:
Get rid of b
after d{1}
. Python considers º
to be a word character, so there’s no word boundary between 7
and º
. You don’t need to match a word boundary, since the regexp only allows spaces or |
between the number and º
.
There’s also no need for {1}
. All patterns match 1 time unless they’re quantified, so {1}
is redundant.
(bd)[s|]*º[s|]*dia
import re
input_text = "hay alrededor de 4 coches el dia 7 del mes 5" #example 1
input_text = "Serian 4 unidades de mermelada para el dia 04 del 8 de este año 2023" #example 2
input_text = "Hay 10 unidades para el dia 15 del 12 y seran 9 ya para el 7 de noviembre" #example 3
input_text = "Hay 10 unidades para el 15 del 1º mes del año y seran alrededor de 9 para el 7º dia del mes de noviembre" #example 4
#for days
standard_number_of_digits_re_1 = r"(?:del dia|de el dia|el dia|del|de el|el)[s|]*(bd{1}b)"
standard_number_of_digits_re_2 = r"(bd{1}b)[s|]*º[s|]*dia"
re_1_and_re_2 = r"(?:" + standard_number_of_digits_re_1 + r"|" + standard_number_of_digits_re_2 + r")"
#for months
#standard_number_of_digits_re_3 = re_1_and_re_2 + r"(?:del mes|de el mes|del|de el)[s|]*(bd{1}b)"
#standard_number_of_digits_re_4 = re_1_and_re_2 + r"(?:del mes|de el mes|del|de el)[s|]*(bd{1}b)[s|]*(?:º[s|]*mes del año|º[s|]*mes)"
standard_number_of_digits_re_3 = r"(?:del mes|de el mes|del|de el)[s|]*(bd)"
standard_number_of_digits_re_4 = r"(?:del mes|de el mes|del|de el)[s|]*(bd)[s|]*(?:º[s|]*mes del año|º[s|]*mes)"
#replacement with this conditions, and put '0' in front a day number only if it is one number and not two(or more numbers)
# example: '1' --> '01' or '10' --> '10'
input_text = re.sub(standard_number_of_digits_re_3, r"01", input_text)
input_text = re.sub(standard_number_of_digits_re_4, r"01", input_text)
input_text = re.sub(standard_number_of_digits_re_1, r"01", input_text)
input_text = re.sub(standard_number_of_digits_re_2, r"01", input_text)
print(repr(input_text)) #output
The problem I’m having is that the regex don’t seem to work correctly since I can’t do the replacements, but trying in several code editors I notice these 2 things:
- In the Regex Debuger https://regex101.com/r/TfsKHZ/1 the regex appear to catch the correct text groups, but even so within the code they fail
- In the SublimeText editor, when I put an
r
in front of"(?:"
the whole line is immediately put as if it were a string of characters although you can see how this string is separated with operators in the middle in charge of the concatenation, however in StackOverflow that line is displayed correctly.
I do not know if these observations are important in the malfunction of these regex, but for some reason that I cannot find, the regex cannot extract the text to be edited by placing (if applicable) a 0 in front.
Leaving the output should be like this when printing them:
"hay alrededor de 4 coches el dia 07 del mes 05" #for example 1
"Serian 4 unidades de mermelada para el dia 04 del 08 de este año 2023" #for example 2
"Hay 10 unidades para el dia 15 del 12 y seran 9 ya para el 07 de noviembre" #for example 3
"Hay 10 unidades para el 15 del 01º mes del año y seran alrededor de 9 para el 07º dia del mes de noviembre" #for example 4
What modifications should I make to get these results? I think the problem is in this part of the regex (bd{1}b)
Edited code
import re
input_text = "hay alrededor de 4 coches el dia 7 del mes 5" #example 1
#input_text = "Serian 4 unidades de mermelada para el dia 04 del 8 de este año 2023" #example 2
input_text = "Hay 10 unidades para el dia 15 del 12 y seran 9 ya para el 7 de noviembre" #example 3
#input_text = "Hay 10 unidades para el 15 del 1º mes del año y seran alrededor de 9 para el 7º dia del mes de noviembre" #example 4
#for days
standard_number_of_digits_re_1 = r"(?:del dia|de el dia|el dia|del|de el|el)[s|]*(bd)"
standard_number_of_digits_re_2 = r"(bd)[s|]*º[s|]*dia"
#for months
standard_number_of_digits_re_3 = r"(?:del mes|de el mes|del|de el)[s|]*(bd)"
standard_number_of_digits_re_4 = r"(?:del mes|de el mes|del|de el)[s|]*(bd)[s|]*(?:º[s|]*mes del año|º[s|]*mes)"
#replacement with this conditions, and put '0' in front a day number only if it is one number and not two(or more numbers)
# example: '1' --> '01' or '10' --> '10'
input_text = re.sub(standard_number_of_digits_re_3, r"01", input_text)
print(repr(input_text)) #output
input_text = re.sub(standard_number_of_digits_re_4, r"01", input_text)
print(repr(input_text)) #output
input_text = re.sub(standard_number_of_digits_re_1, r"01", input_text)
print(repr(input_text)) #output
input_text = re.sub(standard_number_of_digits_re_2, r"01", input_text)
print(repr(input_text)) #output
The problem is for example in the example 3 :
'Hay 10 unidades para 015 012 y seran 9 ya para 07 de noviembre'
And the correct output is:
'Hay 10 unidades para 15 12 y seran 9 ya para 07 de noviembre'
Get rid of b
after d{1}
. Python considers º
to be a word character, so there’s no word boundary between 7
and º
. You don’t need to match a word boundary, since the regexp only allows spaces or |
between the number and º
.
There’s also no need for {1}
. All patterns match 1 time unless they’re quantified, so {1}
is redundant.
(bd)[s|]*º[s|]*dia