Catch the following capture groups with a regex and then reorder them with the re.sub method if the pattern is detected
Question:
import re
input_text = "05 del 07 del 2000 del 09 hhggh" #example 0 - Not modify!
input_text = "04 del 05 del 07 del 2000" #example 1 - Not modify!
input_text = "04 05 del 06 de 200" #example 2 - Yes modify!
input_text = "04 05 del 06 de 20076 55" #example 3 - Yes modify!
detection_regex_obligatory_preposition = r"\d{2}" + r"[\s|](?:del|de[\s|]el|de )[\s|]" + r"\d{2}" + r"[\s|](?:del|de[\s|]el|de )[\s|]" + r"\d*"
year, month, days_intervale_or_day = "", "", "" # = group()[2], group()[1], group()[0]
date_restructuring_structure = days_intervale_or_day + "-" + month + "-" + year
print(repr(date_restructuring_structure))
input_text = re.sub(detection_regex_obligatory_preposition, date_restructuring_structure, input_text)
print(repr(input_text)) # --> output
Correct outputs for each of these cases
""
"05 del 07 del 2000 del 09 hhggh" #example 0 - Not modify!
""
"04 del 05 del 07 del 2000" #example 1 - Not modify!
"05-06-200"
"04 05-06-200" #example 2 - Yes modify!
"05-06-20076"
"04 05-06-20076 55" #example 3 - Yes modify!
Example 1 should not be replaced, since more than one day is indicated in front of it, giving something like this:
\d{2} del \d{2} del \d{2} del \d*
and not this: \d{2} del \d{2} del \d*
Something similar happens in example 0, where no replacement is needed because the text has the form \d{2} del \d{2} del \d* de \d{2}
or \d{2} del \d{2} del \d* de \d*
and not this: \d{2} del \d{2} del \d*
How should I set up the capture groups and the regex so that the replacements are performed for examples 2 and 3, but not for examples 0 and 1?
Answers:
Demo: https://regex101.com/r/w7Yp7J/1
import re
#input_text = "05 del 07 del 2000 del 09 hhggh" #example 0 - Not modify!
#input_text = "04 del 05 del 07 del 2000" #example 1 - Not modify!
input_text = "04 05 del 06 de 200" #example 2 - Yes modify!
input_text = "04 05 del 06 de 20076 55" #example 3 - Yes modify!
detection_regex_obligatory_preposition = r"(?P<startDay>\d{2})[\s|](?P<finishDay>\d{2})[\s|](?:del|de[\s|]el|de )[\s|](?P<month>\d{2})[\s|](?:del|de[\s|]el|de)[\s|](?P<year>\d*)"
date_restructuring_structure = r"\g<startDay> \g<finishDay>-\g<month>-\g<year>"
input_text = re.sub(detection_regex_obligatory_preposition, date_restructuring_structure, input_text)
print(repr(input_text)) # --> output
To test your code on Regex101, I combined your rules into the following:
\d{2}[\s|](?:del|de[\s|]el|de )[\s|]\d{2}[\s|](?:del|de[\s|]el|de)[\s|]\d*
I found that it grabs exactly the inputs we do not want, like the following:
05 del 07 del 2000 del 09 hhggh #example 0 - Captured
04 del 05 del 07 del 2000 #example 1 - Captured
04 05 del 06 de 200 #example 2 - Not Captured
04 05 del 06 de 20076 55 #example 3 - Not Captured
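The capture behaviour listed above can be checked locally with re.search. This is a minimal sketch that uses the pattern exactly as written in the question (with `de ` followed by a second separator in both preposition slots); the `examples` dictionary and the `matched` name are illustrative and not part of the original code.

```python
import re

# Pattern exactly as written in the question: no leading day pair is required.
question_pattern = (
    r"\d{2}" + r"[\s|](?:del|de[\s|]el|de )[\s|]"
    + r"\d{2}" + r"[\s|](?:del|de[\s|]el|de )[\s|]" + r"\d*"
)

examples = {
    0: "05 del 07 del 2000 del 09 hhggh",
    1: "04 del 05 del 07 del 2000",
    2: "04 05 del 06 de 200",
    3: "04 05 del 06 de 20076 55",
}

# Record, per example, whether the pattern finds a match anywhere in the text.
matched = {
    number: re.search(question_pattern, text) is not None
    for number, text in examples.items()
}

for number, hit in matched.items():
    print(f"example {number}: {'captured' if hit else 'not captured'}")
```

With this pattern, examples 2 and 3 fail because the `de ` alternative already consumes the space that the following `[\s|]` would need.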
To grab the correct inputs, I modified your rule by adding a two-digit number rule (\d{2}) to the beginning:
\d{2}[\s|]\d{2}[\s|](?:del|de[\s|]el|de )[\s|]\d{2}[\s|](?:del|de[\s|]el|de)[\s|]\d*
Now it grabs the correct inputs, and we can turn to the replacement rules. There are two kinds of replacement references. The first is the numbered format (like \1 \2-\3-\4 in our case), which is the default: anything you wrap in parentheses becomes a numbered group. The second is the named format (like \g<startDay> \g<finishDay>-\g<month>-\g<year> in our case), which I prefer. To make named replacements, you need named capturing groups such as (?P<startDay>...).
Let’s add named capturing groups to our rule:
(?P<startDay>\d{2})[\s|](?P<finishDay>\d{2})[\s|](?:del|de[\s|]el|de )[\s|](?P<month>\d{2})[\s|](?:del|de[\s|]el|de)[\s|](?P<year>\d*)
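To make the two reference styles concrete, here is a small sketch that applies both to the same input. It assumes the four-group rule from this answer, written once with plain groups and once with named groups; `numbered_pattern`, `named_pattern`, `numbered` and `named` are illustrative names.

```python
import re

text = "04 05 del 06 de 200"

# The same rule twice: plain (numbered) groups and named groups.
numbered_pattern = (
    r"(\d{2})[\s|](\d{2})[\s|](?:del|de[\s|]el|de )[\s|]"
    r"(\d{2})[\s|](?:del|de[\s|]el|de)[\s|](\d*)"
)
named_pattern = (
    r"(?P<startDay>\d{2})[\s|](?P<finishDay>\d{2})[\s|](?:del|de[\s|]el|de )[\s|]"
    r"(?P<month>\d{2})[\s|](?:del|de[\s|]el|de)[\s|](?P<year>\d*)"
)

# Raw strings keep the backslashes intact so re.sub sees \1 and \g<name>.
numbered = re.sub(numbered_pattern, r"\1 \2-\3-\4", text)
named = re.sub(named_pattern, r"\g<startDay> \g<finishDay>-\g<month>-\g<year>", text)

print(repr(numbered))
print(repr(named))
```

Both calls should produce the same string; the named form is only easier to read when the pattern grows.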
The final code:
import re
#input_text = "05 del 07 del 2000 del 09 hhggh" #example 0 - Not modify!
#input_text = "04 del 05 del 07 del 2000" #example 1 - Not modify!
input_text = "04 05 del 06 de 200" #example 2 - Yes modify!
input_text = "04 05 del 06 de 20076 55" #example 3 - Yes modify!
detection_regex_obligatory_preposition = r"(?P<startDay>\d{2})[\s|](?P<finishDay>\d{2})[\s|](?:del|de[\s|]el|de )[\s|](?P<month>\d{2})[\s|](?:del|de[\s|]el|de)[\s|](?P<year>\d*)"
date_restructuring_structure = r"\g<startDay> \g<finishDay>-\g<month>-\g<year>"
input_text = re.sub(detection_regex_obligatory_preposition, date_restructuring_structure, input_text)
print(repr(input_text)) # --> output
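Because the snippet above keeps only the last assignment to input_text, it can help to run all four examples in one pass. The following is a sketch under that assumption; the `examples` and `results` lists are illustrative additions, while the pattern and replacement are the ones from this answer.

```python
import re

pattern = re.compile(
    r"(?P<startDay>\d{2})[\s|](?P<finishDay>\d{2})[\s|](?:del|de[\s|]el|de )[\s|]"
    r"(?P<month>\d{2})[\s|](?:del|de[\s|]el|de)[\s|](?P<year>\d*)"
)
replacement = r"\g<startDay> \g<finishDay>-\g<month>-\g<year>"

examples = [
    "05 del 07 del 2000 del 09 hhggh",  # example 0 - should stay unchanged
    "04 del 05 del 07 del 2000",        # example 1 - should stay unchanged
    "04 05 del 06 de 200",              # example 2 - should be rewritten
    "04 05 del 06 de 20076 55",         # example 3 - should be rewritten
]

# re.sub returns the input untouched when the pattern does not match.
results = [pattern.sub(replacement, text) for text in examples]

for before, after in zip(examples, results):
    print(repr(before), "->", repr(after))
```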
You can shorten the pattern to 4 capture groups:
(\d{2})\s(\d{2})\sdel?\s(\d{2})\sdel?\s(\d+)
And in the replacement use the groups: \1 \2-\3-\4
Or with named capture groups:
(?P<startDay>\d{2})\s(?P<finishDay>\d{2})\sdel?\s(?P<month>\d{2})\sdel?\s(?P<year>\d+)
Explanation
(?P<startDay>\d{2}) : Match 2 digits in group startDay
\s : Match a whitespace char
(?P<finishDay>\d{2}) : Match 2 digits in group finishDay
\sdel?\s : Match either de or del between whitespace chars
(?P<month>\d{2}) : Match 2 digits in group month
\sdel?\s : Match either de or del between whitespace chars
(?P<year>\d+) : Match 1+ digits in group year
See a Regex demo.
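As a quick check of the shortened pattern, the sketch below runs it over the four examples from the question with the numbered replacement. The `examples` list, `results`, and the extra "de el" input are illustrative additions. Note one difference from the longer pattern: `del?` accepts de or del but not the two-word form "de el", so that input is left unchanged here.

```python
import re

short_pattern = r"(\d{2})\s(\d{2})\sdel?\s(\d{2})\sdel?\s(\d+)"

examples = [
    "05 del 07 del 2000 del 09 hhggh",  # example 0 - should stay unchanged
    "04 del 05 del 07 del 2000",        # example 1 - should stay unchanged
    "04 05 del 06 de 200",              # example 2 - should be rewritten
    "04 05 del 06 de 20076 55",         # example 3 - should be rewritten
]

# Reorder the four numbered groups into "start finish-month-year".
results = [re.sub(short_pattern, r"\1 \2-\3-\4", text) for text in examples]
for text, result in zip(examples, results):
    print(repr(text), "->", repr(result))

# Hypothetical extra input: the two-word preposition "de el" is not covered by del?.
two_word = re.sub(short_pattern, r"\1 \2-\3-\4", "04 05 de el 06 de 200")
print(repr(two_word))
```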