Failure to identify and concatenate using a capture group identified with regex as reference

Question:

import re

input_text = 'desde las 15:00 del 2002-11-01 hasta las 16 hs' #example

I have placed the pattern (?:(?<=\s)|^) in front so that it only matches at the beginning of the string or when a whitespace character comes right before it. Then there are other parts that must be present. Finally there is the time, which is missing the minutes, and the program must add :00.

input_text = re.sub(r'(?:(?<=\s)|^)(?:a[\s|]*las|a[\s|]*la|de[\s|]*las|de[\s|]*la)\s*(\d{1,2})[\s|]*(?::|)[\s|]*(?:h\. s\.|h s\.|h\. s|h s|h\.s\.|hs\.|h\.s|hs|horas|hora)', r'\1:00 hs', input_text)

print(repr(input_text)) # ---> output

I couldn’t do a regular concatenation either because I don’t know what could be next in the string.

I’m not getting the proper replacement with this search pattern. The correct output is this:

'desde las 15:00 del 2002-11-01 hasta las 16:00 hs'

I think that the (\d{1,2}) capture group is failing, and that is why \1 is not replaced correctly.

Answers:

You may solve the current issue using

re.sub(r'(?<!\S)((?:de(?:sde)?|(?:hast)?a)\s*las?\s*\d{1,2})\s*(?::\s*)?(?:h\.? ?s\.?|horas?)(?!\B\w)', r'\1:00 hs', input_text)
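As a quick check, here is a minimal sketch that runs this substitution on the example string from the question. The names PATTERN and fix_hours are hypothetical and only used for illustration.

```python
import re

# The pattern from the answer, compiled once for reuse.
PATTERN = re.compile(
    r'(?<!\S)((?:de(?:sde)?|(?:hast)?a)\s*las?\s*\d{1,2})'
    r'\s*(?::\s*)?(?:h\.? ?s\.?|horas?)(?!\B\w)'
)

def fix_hours(text):
    # Group 1 keeps the preposition, the article and the hour ("hasta las 16"),
    # so the replacement only has to append the missing minutes and the unit.
    return PATTERN.sub(r'\1:00 hs', text)

input_text = 'desde las 15:00 del 2002-11-01 hasta las 16 hs'
print(repr(fix_hours(input_text)))
```

Note that the question's pattern cannot match "hasta las 16 hs" at all: its prefix only allows "a" or "de" directly after whitespace, and the "a" in "hasta" follows a "t". Even with a match, capturing only the digits would drop the words in front of them from the result.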

The pattern works as follows:

  • (?<!\S) – a left-hand whitespace boundary
  • ((?:de(?:sde)?|(?:hast)?a)\s*las?\s*\d{1,2}) – Group 1 (\1 refers to this value from the replacement pattern):
    • (?:de(?:sde)?|(?:hast)?a) – de, desde, hasta, a
    • \s* – zero or more whitespaces
    • las? – la or las
    • \s* – zero or more whitespaces
    • \d{1,2} – one or two digits (note you might want to use (?:[01]?[0-9]|2[0-3]) to only match numbers from 0 to 23, i.e. the 24h time format)
  • \s* – zero or more whitespaces
  • (?::\s*)? – an optional sequence of a colon and zero or more whitespaces
  • (?:h\.? ?s\.?|horas?) – h, then an optional ., then an optional space, then an s and then an optional ., or hora or horas
  • (?!\B\w) – an adaptive dynamic word boundary: a word boundary is required only if the match ends with a word char, so a match ending in . may be followed by anything.
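
The breakdown above can also be written directly into the pattern with re.VERBOSE. The following is only a sketch, assuming you want the stricter 0 to 23 hour check mentioned in the list; HOUR_PATTERN and add_minutes are hypothetical names. In verbose mode a literal space must be written as [ ], because unescaped whitespace in the pattern is ignored.

```python
import re

HOUR_PATTERN = re.compile(r'''
    (?<!\S)                        # left-hand whitespace boundary
    (                              # Group 1: the part kept in the output
      (?:de(?:sde)?|(?:hast)?a)    #   de, desde, hasta or a
      \s*las?\s*                   #   la or las, with optional whitespace
      (?:[01]?[0-9]|2[0-3])        #   hour limited to 0-23
    )
    \s*(?::\s*)?                   # optional whitespace and optional colon
    (?:h\.?[ ]?s\.?|horas?)        # hs, h.s., h. s., hora or horas
    (?!\B\w)                       # a word char may not follow a word char
''', re.VERBOSE)

def add_minutes(text):
    # Append ":00 hs" to every hour that was written without minutes.
    return HOUR_PATTERN.sub(r'\1:00 hs', text)

print(repr(add_minutes('desde las 15:00 del 2002-11-01 hasta las 16 hs')))
print(repr(add_minutes('hasta las 25 hs')))  # hour out of range, left as is
```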
Answered By: Wiktor Stribiżew