Catch the following capture groups with a regex and then reorder them with the re.sub() method if the pattern is detected

Question:

import re

input_text = "05 del 07 del 2000 del 09 hhggh" #example 0 - Not modify!
input_text = "04 del 05 del 07 del 2000" #example 1 - Not modify!
input_text = "04 05 del 06 de 200" #example 2 - Yes modify!
input_text = "04 05 del 06 de 20076 55" #example 3 - Yes modify!

detection_regex_obligatory_preposition = r"\d{2}" + r"[\s|](?:del|de[\s|]el|de )[\s|]" + r"\d{2}" + r"[\s|](?:del|de[\s|]el|de )[\s|]" + r"\d*"

year, month, days_intervale_or_day = "", "", "" # = group()[2], group()[1], group()[0]
date_restructuring_structure = days_intervale_or_day + "-" + month + "-" + year
print(repr(date_restructuring_structure))

input_text = re.sub(detection_regex_obligatory_preposition, date_restructuring_structure, input_text)

print(repr(input_text)) # --> output

Correct outputs for each of these cases

""
"05 del 07 del 2000 del 09 hhggh" #example 0 - Not modify!

""
"04 del 05 del 07 del 2000" #example 1 - Not modify!

"05-06-200"
"04 05-06-200" #example 2 - Yes modify!

"05-06-20076"
"04 05-06-20076 55" #example 3 - Yes modify!

In example 1, nothing should be replaced, since there is more than one day indicated in front of the date, which gives a structure like this:
\d{2} del \d{2} del \d{2} del \d* and not this: \d{2} del \d{2} del \d*

Something similar happens in example 0, where no replacement is needed because the structure is \d{2} del \d{2} del \d* de \d{2} or \d{2} del \d{2} del \d* de \d*, and not \d{2} del \d{2} del \d*

How can I set up the capture groups and the regex so that the replacements are performed for examples 2 and 3, but not for examples 0 and 1?

Answers:

Demo: https://regex101.com/r/w7Yp7J/1

import re

#input_text = "05 del 07 del 2000 del 09 hhggh" #example 0 - Not modify!
#input_text = "04 del 05 del 07 del 2000" #example 1 - Not modify!
input_text = "04 05 del 06 de 200" #example 2 - Yes modify!
input_text = "04 05 del 06 de 20076 55" #example 3 - Yes modify!

detection_regex_obligatory_preposition = r"(?P<startDay>\d{2})[\s|](?P<finishDay>\d{2})[\s|](?:del|de[\s|]el|de )[\s|](?P<month>\d{2})[\s|](?:del|de[\s|]el|de)[\s|](?P<year>\d*)"

date_restructuring_structure = r"\g<startDay> \g<finishDay>-\g<month>-\g<year>"

input_text = re.sub(detection_regex_obligatory_preposition, date_restructuring_structure, input_text)

print(repr(input_text)) # --> output

To see your code on Regex101, I combined your rules as the following:

\d{2}[\s|](?:del|de[\s|]el|de )[\s|]\d{2}[\s|](?:del|de[\s|]el|de)[\s|]\d*

I noticed that it captures inputs that are the exact opposite of what we want, like the following:

05 del 07 del 2000 del 09 hhggh #example 0 - Captured
04 del 05 del 07 del 2000 #example 1 - Captured
04 05 del 06 de 200 #example 2 - Not Captured
04 05 del 06 de 20076 55 #example 3 - Not Captured

To capture the correct inputs, I modified your rule by adding a two-digit number rule (\d{2}) to the beginning:

\d{2}[\s|]\d{2}[\s|](?:del|de[\s|]el|de )[\s|]\d{2}[\s|](?:del|de[\s|]el|de)[\s|]\d*
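
As a quick check, the following sketch runs this rule against the four example inputs with re.search. It assumes the backslash-escaped form of the rule (\d, \s); the variable names are illustrative only.

```python
import re

# The rule with the leading two-digit number added (backslash-escaped form).
pattern = re.compile(
    r"\d{2}[\s|]\d{2}[\s|](?:del|de[\s|]el|de )[\s|]"
    r"\d{2}[\s|](?:del|de[\s|]el|de)[\s|]\d*"
)

examples = [
    "05 del 07 del 2000 del 09 hhggh",  # example 0 - Not modify!
    "04 del 05 del 07 del 2000",        # example 1 - Not modify!
    "04 05 del 06 de 200",              # example 2 - Yes modify!
    "04 05 del 06 de 20076 55",         # example 3 - Yes modify!
]

# True where the rule is found somewhere in the input.
detected = [pattern.search(text) is not None for text in examples]
print(detected)  # [False, False, True, True]
```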

Now it captures the correct inputs, and we can turn to the replacement rules. There are two kinds of replacement references. The first is the numbered form (\1 \2-\3-\4 in our case), which is the default behavior: anything you wrap in plain parentheses becomes a numbered group. The second is the named form (\g<startDay> \g<finishDay>-\g<month>-\g<year> in our case), which I prefer. To make named replacements, you need to use named capturing groups: (?P<startDay>***).
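
A minimal sketch of both reference styles with Python's re.sub, assuming the rule above with capture groups added: plain parentheses for the numbered form, (?P<name>...) for the named form. Raw strings are used so the backslashes reach the re module unchanged.

```python
import re

text = "04 05 del 06 de 20076 55"  # example 3

# Numbered form: plain parentheses, referenced as \1, \2, ... in the replacement.
numbered = re.sub(
    r"(\d{2})[\s|](\d{2})[\s|](?:del|de[\s|]el|de )[\s|]"
    r"(\d{2})[\s|](?:del|de[\s|]el|de)[\s|](\d*)",
    r"\1 \2-\3-\4",
    text,
)

# Named form: (?P<name>...) groups, referenced as \g<name> in the replacement.
named = re.sub(
    r"(?P<startDay>\d{2})[\s|](?P<finishDay>\d{2})[\s|](?:del|de[\s|]el|de )[\s|]"
    r"(?P<month>\d{2})[\s|](?:del|de[\s|]el|de)[\s|](?P<year>\d*)",
    r"\g<startDay> \g<finishDay>-\g<month>-\g<year>",
    text,
)

print(repr(numbered))  # '04 05-06-20076 55'
print(repr(named))     # '04 05-06-20076 55'
```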

Let’s add named capturing groups to our rule:

(?P<startDay>\d{2})[\s|](?P<finishDay>\d{2})[\s|](?:del|de[\s|]el|de )[\s|](?P<month>\d{2})[\s|](?:del|de[\s|]el|de)[\s|](?P<year>\d*)

The final code:

import re

#input_text = "05 del 07 del 2000 del 09 hhggh" #example 0 - Not modify!
#input_text = "04 del 05 del 07 del 2000" #example 1 - Not modify!
input_text = "04 05 del 06 de 200" #example 2 - Yes modify!
input_text = "04 05 del 06 de 20076 55" #example 3 - Yes modify!

detection_regex_obligatory_preposition = r"(?P<startDay>\d{2})[\s|](?P<finishDay>\d{2})[\s|](?:del|de[\s|]el|de )[\s|](?P<month>\d{2})[\s|](?:del|de[\s|]el|de)[\s|](?P<year>\d*)"

date_restructuring_structure = r"\g<startDay> \g<finishDay>-\g<month>-\g<year>"

input_text = re.sub(detection_regex_obligatory_preposition, date_restructuring_structure, input_text)

print(repr(input_text)) # --> output
Answered By: Onur Uslu

You can shorten the pattern to 4 capture groups:

(\d{2})\s(\d{2})\sdel?\s(\d{2})\sdel?\s(\d+)

And in the replacement use the groups: \1 \2-\3-\4

Regex demo


Or with named capture groups:

(?P<startDay>\d{2})\s(?P<finishDay>\d{2})\sdel?\s(?P<month>\d{2})\sdel?\s(?P<year>\d+)

Explanation

  • (?P<startDay>\d{2}) Match 2 digits in group startDay
  • \s Match a whitespace char
  • (?P<finishDay>\d{2}) Match 2 digits in group finishDay
  • \sdel?\s Match either de or del between whitespace chars
  • (?P<month>\d{2}) Match 2 digits in group month
  • \sdel?\s Match either de or del between whitespace chars
  • (?P<year>\d+) Match 1+ digits in group year
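
As a sketch of how the named pattern could be used from Python, the snippet below applies it with re.sub to the four inputs from the question. The replacement string and variable names are assumptions chosen to reproduce the outputs the question asks for.

```python
import re

pattern = re.compile(
    r"(?P<startDay>\d{2})\s(?P<finishDay>\d{2})\sdel?\s"
    r"(?P<month>\d{2})\sdel?\s(?P<year>\d+)"
)
# Keep both day numbers, then join the second day, month and year with hyphens.
replacement = r"\g<startDay> \g<finishDay>-\g<month>-\g<year>"

examples = [
    "05 del 07 del 2000 del 09 hhggh",  # example 0 - Not modify!
    "04 del 05 del 07 del 2000",        # example 1 - Not modify!
    "04 05 del 06 de 200",              # example 2 - Yes modify!
    "04 05 del 06 de 20076 55",         # example 3 - Yes modify!
]

results = [pattern.sub(replacement, text) for text in examples]
for result in results:
    print(repr(result))
# '05 del 07 del 2000 del 09 hhggh'
# '04 del 05 del 07 del 2000'
# '04 05-06-200'
# '04 05-06-20076 55'
```

Inputs without two adjacent two-digit numbers are left untouched, because re.sub returns the string unchanged when the pattern is not found.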

See a Regex demo.

Answered By: The fourth bird