Remove the current year if another year was previously indicated after this regex pattern

Question:

This is my code, with some example inputs that simulate the environment in which the program will run:

import re, datetime

#Example input cases
input_text = '[26 -- 31] de 10 del 200 de 2022' #example 1
input_text = '[26 -- 31] de 12 del 206 del 2022' #example 2
input_text = '[06 -- 11] del 09 del ano 2020 del 2022' #example 3
input_text = '[06 -- 06] del mes 09 del ano 20 del ano 2022' #example 4
input_text = '[16 -- 06] del mes 09 del 2022' #example 5 (must not be modified)


possible_year_num = r"\d*" #I need one or more numbers (one or more numeric digits but never any number)

current_year = datetime.datetime.today().strftime('%Y')

month_context_regex = r"[\s|]*(?:del[\s|]*mes|de[\s|]*el[\s|]*mes|de[\s|]*mes|del|de[\s|]*el|de)[\s|]*"
year_context_regex = r"[\s|]*(?:del[\s|]*año|de[\s|]*el[\s|]*año|de[\s|]*año|del[\s|]*ano|de[\s|]*el[\s|]*ano|de[\s|]*ano|del|de[\s|]*el|de)[\s|]*"

#I combine those modular regex expressions to build a regex that serves to identify the substring in which the replacement must be performed
identity_replacement_case_regex = r"\[\d{2}" + " -- " + r"\d{2}\]" + month_context_regex + r"\d{2}" + year_context_regex + possible_year_num + year_context_regex + current_year

#Only in these cases do I need to replace with re.sub() and obtain this output string; for example 1, '[26 -- 31] de 10 del 200'
replacement_without_current_year = r"\[\d{2}" + " -- " + r"\d{2}\]" + month_context_regex + r"\d{2}" + year_context_regex + possible_year_num
input_text = re.sub(identity_replacement_case_regex, replacement_without_current_year, input_text)

print(repr(input_text))  # --> output

The correct outputs should look like this:

'[26 -- 31] de 10 del 200' #for example 1
'[26 -- 31] de 12 del 206' #for example 2
'[06 -- 11] del 09 del ano 2020' #for example 3
'[06 -- 06] del mes 09 del ano 20' #for example 4
'[16 -- 06] del mes 09 del 2022' #for example 5 (not modified)

How should I put this replacement in the re.sub() function to get these outputs?

I get this error when I try this replacement:

Traceback (most recent call last):
input_text_substring = re.sub(identity_replacement_case_regex, replacement_without_current_year, input_text_substring)
raise s.error('bad escape %s' % this, len(this))
re.error: bad escape \d at position 2

Answers:

Rule:

\D+(?<!\D\d{2} \S{3} )(?<!\D\d{2} \S{2} )2022

Demo: https://regex101.com/r/NRUEYO/1

Code:

import re

regex = r"\D+(?<!\D\d{2} \S{3} )(?<!\D\d{2} \S{2} )2022"

input_text = '[26 -- 31] de 10 del 200 de 2022' #example 1
input_text = '[26 -- 31] de 12 del 206 del 2022' #example 2
input_text = '[06 -- 11] del 09 del ano 2020 del 2022' #example 3
input_text = '[06 -- 06] del mes 09 del ano 20 del ano 2022' #example 4
input_text = '[16 -- 06] del mes 09 del 2022' #example 5 (must not be modified)

replace_text = ""

result = re.sub(regex, replace_text, input_text)

if result:
    print(result)
  • \D => Any non-digit character
  • \d => Any digit character
  • \d{2} => Two digit characters
  • \S => Any non-whitespace character
  • \S{3} => Three non-whitespace characters
  • (?<!A)2022 => 2022 must not be immediately preceded by an "A" character
  • (?<!\D\d{2} \S{3} )2022 => 2022 must not be preceded by a two-digit number followed by a three-character word (for example "09 del ").
  • (?<!\D\d{2} \S{3} )(?<!\D\d{2} \S{2} )2022 => 2022 must not be preceded by a two-digit number followed by a three- or two-character word (for example "09 del " or "09 de ").
  • \D+(?<!\D\d{2} \S{3} )(?<!\D\d{2} \S{2} )2022 => Match the run of non-digit characters immediately before 2022 as well, so the connecting words are removed together with the year.
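
As a quick check of the rule explained above, the following sketch applies it to all five inputs from the question in one loop. The list name `examples` and the list `results` are only illustrative, and the year is kept as the literal 2022 used in the rule.

```python
import re

# The rule from above, with the year kept as the literal 2022.
regex = r"\D+(?<!\D\d{2} \S{3} )(?<!\D\d{2} \S{2} )2022"

# The five inputs from the question (illustrative list name).
examples = [
    '[26 -- 31] de 10 del 200 de 2022',
    '[26 -- 31] de 12 del 206 del 2022',
    '[06 -- 11] del 09 del ano 2020 del 2022',
    '[06 -- 06] del mes 09 del ano 20 del ano 2022',
    '[16 -- 06] del mes 09 del 2022',  # must stay unchanged
]

# Remove the trailing year (and its connecting words) where the rule matches.
results = [re.sub(regex, "", text) for text in examples]

for result in results:
    print(repr(result))
```

Note that the lookbehinds assume single spaces between the month number, the connecting word and the year; inputs with other spacing would need a different rule.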
Answered By: Onur Uslu
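
A note on the error in the question: the second argument of re.sub() is a replacement template, not a pattern, so regex syntax such as \d is not valid there, which is what "bad escape \d" reports. One possible alternative, sketched below, keeps the question's modular patterns, wraps the part to keep in a capturing group, and refers to it as \g<1> in the replacement. The helper name `strip_trailing_year` is hypothetical, the year pattern is assumed to be `\d+` (one or more digits), and the year is passed in as a string instead of being read from datetime so the example is reproducible.

```python
import re

# Context patterns taken from the question.
month_context_regex = r"[\s|]*(?:del[\s|]*mes|de[\s|]*el[\s|]*mes|de[\s|]*mes|del|de[\s|]*el|de)[\s|]*"
year_context_regex = r"[\s|]*(?:del[\s|]*año|de[\s|]*el[\s|]*año|de[\s|]*año|del[\s|]*ano|de[\s|]*el[\s|]*ano|de[\s|]*ano|del|de[\s|]*el|de)[\s|]*"


def strip_trailing_year(text, year):
    """Hypothetical helper: drop `year` when an earlier year already follows the date range."""
    pattern = (
        r"(\[\d{2} -- \d{2}\]" + month_context_regex + r"\d{2}"
        + year_context_regex + r"\d+)"           # group 1: the part to keep
        + year_context_regex + re.escape(year)   # connecting words + year to drop
    )
    # \g<1> re-inserts whatever group 1 matched; no regex syntax in the template.
    return re.sub(pattern, r"\g<1>", text)


examples = [
    '[26 -- 31] de 10 del 200 de 2022',
    '[26 -- 31] de 12 del 206 del 2022',
    '[06 -- 11] del 09 del ano 2020 del 2022',
    '[06 -- 06] del mes 09 del ano 20 del ano 2022',
    '[16 -- 06] del mes 09 del 2022',  # no earlier year, so it stays unchanged
]

for text in examples:
    print(repr(strip_trailing_year(text, "2022")))
```

In the real program the year argument could be `datetime.datetime.today().strftime('%Y')`, as in the question.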