Remove the current year if another year was previously indicated after this regex pattern
Question:
This is my code, with some example inputs that simulate the environment where this program will run:
import re, datetime
#Example input cases
input_text = '[26 -- 31] de 10 del 200 de 2022' #example 1
input_text = '[26 -- 31] de 12 del 206 del 2022' #example 2
input_text = '[06 -- 11] del 09 del ano 2020 del 2022' #example 3
input_text = '[06 -- 06] del mes 09 del ano 20 del ano 2022' #example 4
input_text = '[16 -- 06] del mes 09 del 2022' #example 5 (must not be modified)
possible_year_num = r"\d*" #I need one or more numeric digits (never zero digits)
current_year = datetime.datetime.today().strftime('%Y')
month_context_regex = r"[\s|]*(?:del[\s|]*mes|de[\s|]*el[\s|]*mes|de[\s|]*mes|del|de[\s|]*el|de)[\s|]*"
year_context_regex = r"[\s|]*(?:del[\s|]*año|de[\s|]*el[\s|]*año|de[\s|]*año|del[\s|]*ano|de[\s|]*el[\s|]*ano|de[\s|]*ano|del|de[\s|]*el|de)[\s|]*"
#I combine those modular regex expressions to build a regex that identifies the substring in which the replacement must be performed
identity_replacement_case_regex = r"\[\d{2}" + " -- " + r"\d{2}\]" + month_context_regex + r"\d{2}" + year_context_regex + possible_year_num + year_context_regex + current_year
#Only in these cases do I need to replace with re.sub() and obtain this output string; for example 1, '[26 -- 31] de 10 del 200'
replacement_without_current_year = r"\[\d{2}" + " -- " + r"\d{2}\]" + month_context_regex + r"\d{2}" + year_context_regex + possible_year_num
input_text = re.sub(identity_replacement_case_regex, replacement_without_current_year, input_text)
print(repr(input_text)) # --> output
The correct outputs should look like this:
'[26 -- 31] de 10 del 200' #for example 1
'[26 -- 31] de 12 del 206' #for example 2
'[06 -- 11] del 09 del ano 2020' #for example 3
'[06 -- 06] del mes 09 del ano 20' #for example 4
'[16 -- 06] del mes 09 del 2022' #for example 5 (not modified)
How should I write this replacement in re.sub() to get these outputs?
I get this error when I try this replacement:
Traceback (most recent call last):
input_text_substring = re.sub(identity_replacement_case_regex, replacement_without_current_year, input_text_substring)
raise s.error('bad escape %s' % this, len(this))
re.error: bad escape \d at position 2
Answers:
Rule:
\D+(?<!\D\d{2} \S{3} )(?<!\D\d{2} \S{2} )2022
Demo: https://regex101.com/r/NRUEYO/1
Code:
import re
regex = r"\D+(?<!\D\d{2} \S{3} )(?<!\D\d{2} \S{2} )2022"
input_text = '[26 -- 31] de 10 del 200 de 2022' #example 1
input_text = '[26 -- 31] de 12 del 206 del 2022' #example 2
input_text = '[06 -- 11] del 09 del ano 2020 del 2022' #example 3
input_text = '[06 -- 06] del mes 09 del ano 20 del ano 2022' #example 4
input_text = '[16 -- 06] del mes 09 del 2022' #example 5 (must not be modified)
replace_text = ""
result = re.sub(regex, replace_text, input_text)
if result:
    print(result)
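The snippet above reassigns input_text five times, so only example 5 is actually run. A minimal sketch that checks all five inputs could look like the following; the helper name strip_trailing_year and its year parameter are assumptions added for illustration, not part of the original answer, and the year defaults to 2022 to match the examples.

```python
import re


def strip_trailing_year(text: str, year: str = "2022") -> str:
    """Remove `year` and the connector words before it, unless `year`
    directly follows the two-digit month plus a 2- or 3-character word."""
    # Same pattern as in the answer; only the hard-coded year is parameterised.
    pattern = (
        r"\D+"                    # connector text in front of the year
        r"(?<!\D\d{2} \S{3} )"    # not preceded by e.g. "09 del "
        r"(?<!\D\d{2} \S{2} )"    # not preceded by e.g. "09 de "
        + re.escape(year)
    )
    return re.sub(pattern, "", text)


examples = [
    "[26 -- 31] de 10 del 200 de 2022",               # example 1
    "[26 -- 31] de 12 del 206 del 2022",              # example 2
    "[06 -- 11] del 09 del ano 2020 del 2022",        # example 3
    "[06 -- 06] del mes 09 del ano 20 del ano 2022",  # example 4
    "[16 -- 06] del mes 09 del 2022",                 # example 5 (must not be modified)
]

for text in examples:
    print(repr(strip_trailing_year(text)))
```

The two lookbehinds assume single spaces and a connector of exactly two or three characters, so inputs that deviate from the five examples may need a different pattern.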
\D
=> Any non-digit character
\d
=> Any digit character
\d{2}
=> Exactly two digit characters
\S
=> Any non-whitespace character
\S{3}
=> Exactly three non-whitespace characters
(?<!A)2022
=> Negative lookbehind: 2022 must not be immediately preceded by the character "A"
(?<!\D\d{2} \S{3} )2022
=> 2022 must not be immediately preceded by a two-digit number followed by a three-character word (for example "09 del ")
(?<!\D\d{2} \S{3} )(?<!\D\d{2} \S{2} )2022
=> 2022 must not be immediately preceded by a two-digit number followed by a three- or two-character word (for example "09 del " or "09 de ")
\D+(?<!\D\d{2} \S{3} )(?<!\D\d{2} \S{2} )2022
=> Also match all the non-digit characters directly before 2022, so that they are removed together with it
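As for the "bad escape \d" error in the question: the second argument of re.sub() is a replacement template, not a pattern, so regex tokens such as \d are rejected there. One way to keep the modular regexes from the question is to wrap the part to keep in a capture group and refer to it with \g<1>. The sketch below assumes that approach; the function name drop_redundant_current_year, the current_year parameter, the trailing (?!\d) guard, and the switch from \d* to \d+ (to match the "one or more digits" comment) are additions not present in the original code.

```python
import datetime
import re

# Connector patterns taken from the question.
MONTH_CONTEXT = r"[\s|]*(?:del[\s|]*mes|de[\s|]*el[\s|]*mes|de[\s|]*mes|del|de[\s|]*el|de)[\s|]*"
YEAR_CONTEXT = (
    r"[\s|]*(?:del[\s|]*año|de[\s|]*el[\s|]*año|de[\s|]*año"
    r"|del[\s|]*ano|de[\s|]*el[\s|]*ano|de[\s|]*ano|del|de[\s|]*el|de)[\s|]*"
)


def drop_redundant_current_year(text, current_year=None):
    """Drop a trailing current year when an explicit year already precedes it."""
    if current_year is None:
        current_year = datetime.datetime.today().strftime("%Y")
    pattern = (
        # Group 1: everything that must be kept, up to the first year number.
        r"(\[\d{2} -- \d{2}\]" + MONTH_CONTEXT + r"\d{2}" + YEAR_CONTEXT + r"\d+)"
        # Not captured: the connector plus the current year, which get dropped.
        + YEAR_CONTEXT + re.escape(current_year) + r"(?!\d)"
    )
    # The replacement is a template, so only the group reference goes here.
    return re.sub(pattern, r"\g<1>", text)


print(repr(drop_redundant_current_year("[26 -- 31] de 10 del 200 de 2022", "2022")))
print(repr(drop_redundant_current_year("[16 -- 06] del mes 09 del 2022", "2022")))
```

Because the pattern requires a year number before the final connector, an input such as '[16 -- 06] del mes 09 del ano 2022' is left untouched, whereas the lookbehind pattern above would strip ' del ano 2022' from it.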