Add a leading "0" to single-digit numeric values in a string that match a regex pattern

Question:

import re

input_text = '2000_-_9_-_01 8:1 am'  #example 1
input_text = '(2000_-_1_-_01) 18:1 pm'  #example 2
input_text = '(20000_-_12_-_1) (1:1 am)'  #example 3


identificate_hours = r"(?:a\s*las|a\s*la|)\s*(\d{1,2}):(\d{1,2})\s*(?:(am)|(pm)|)"

date_format_00 = r"(\d*)_-_(\d{1,2})_-_(\d{1,2})"
identification_re_0 = r"(?:\(|)\s*" + date_format_00 + r"\s*(?:\)|)\s*(?:a\s*las|a\s*la|)\s*(?:\(|)\s*" + identificate_hours + r"\s*(?:\)|)"

input_text = re.sub(identification_re_0,
                    #lambda m: print(m[2]),
                    lambda m: (f"({m[1]}_-_{m[2]}_-_{m[3]}({m[4] or '00'}:{m[5] or '00'} {m[6] or m[7] or 'am'}))"),
                    input_text, flags=re.IGNORECASE)  # flags must be passed by keyword; the 4th positional argument of re.sub() is count

print(repr(input_text)) # --> output

There are 5 numeric values (year, month, day, hour, minutes) that may each need a "0" added, and each one has 2 possibilities (add a "0" or not), so there are 2^5 = 32 possible combinations. That is far too many to write 32 different regexes, each adding or not adding a "0" in front of the values that need it. For this reason, I feel that repeating the regex and changing only one "(\d{1,2})" at a time would not be a good solution to this problem.

I am trying to standardize date-time data that users enter in natural language so that it can be processed afterwards.

So, once the dates have been obtained in this format, I need the numeric values for month, day, hour and/or minutes that are left with a single digit to be standardized to 2 digits, by placing a "0" in front of them to make up for the missing digit.

The dates and times from the input should then be expressed like this in the output:

(YYYY_-_MM_-_DD(hh:mm am or pm))

'(2000_-_09_-_01(08:01 am))'   #for example 1
'(2000_-_01_-_01(18:01 pm))'   #for example 2
'(20000_-_12_-_01(01:01 am))'  #for example 3

I have used the re.sub() function because it handles the case where the same input_text contains more than one place that needs this kind of replacement. For example, for the input '2000_-_9_-_01 8:1 am 2000_-_9_-_01 8:1 am', the procedure should be performed 2 times, since there are 2 dates present (that is, the pattern appears 2 times), and the result should be '(2000_-_09_-_01(08:01 am)) (2000_-_09_-_01(08:01 am))'
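
To illustrate the re.sub() behaviour described above: when the replacement is a function, it is called once for every match, so each captured value can be padded on its own and one pattern covers all the combinations. This is only a minimal sketch; the time-only pattern and the helper name pad_time are hypothetical and not part of the code above.

```python
import re

def pad_time(match):
    # Each captured value is padded independently of the others
    hour, minute = match.groups()
    return f"{hour.zfill(2)}:{minute.zfill(2)}"

text = "8:1 am and 18:5 pm"
# The callback runs once per match, so both times are rewritten
result = re.sub(r"(\d{1,2}):(\d{1,2})", pad_time, text)
print(result)  # 08:01 am and 18:05 pm
```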

Answers:

I’m not sure I fully understood you, but I would solve it with datetime instead of regex. Note that datetime doesn’t support the year 20000 (it only goes up to the year 9999). Is that a typo, or are you planning way ahead? 😀

from datetime import datetime

testDates = [
    '2000_-_9_-_01 8:1 am',  #example 1
    '(2000_-_1_-_01) 18:1 pm',  #example 2
    '(2000_-_12_-_1) (1:1 am)',  #example 3
]

for testDate in testDates:
    testDateClean = testDate
    for rm in ('(', ')'):
        testDateClean = testDateClean.replace(rm, '')
    date = datetime.strptime(testDateClean, '%Y_-_%m_-_%d %H:%M %p')
    print(date.strftime('%Y_-_%m_-_%d(%H:%M %p)'))
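
Two limits of the datetime approach are worth checking before relying on it. The sketch below reuses the format string from the snippet above (the name FMT and the 'pm' test string are only for illustration) and shows that a five-digit year is rejected and that, when the hour is parsed with %H, the am/pm marker is read but does not change the hour.

```python
from datetime import datetime, MAXYEAR

FMT = '%Y_-_%m_-_%d %H:%M %p'  # same format string as in the snippet above

print(MAXYEAR)  # 9999, the largest year datetime can represent

# %Y only accepts four digits, so a five-digit year does not parse
try:
    datetime.strptime('20000_-_12_-_1 1:1 am', FMT)
    parsed_five_digit_year = True
except ValueError:
    parsed_five_digit_year = False
print(parsed_five_digit_year)  # False

# With %H the am/pm marker is consumed but the hour is taken as written
date = datetime.strptime('2000_-_12_-_1 1:1 pm', FMT)
print(date.hour)  # 1
```

So an input such as '1:1 pm' would come back with hour 1, and formatting it with %p would derive the marker from that hour rather than keep the one the user typed. The text produced by %p also depends on the locale.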

A regex solution which can handle all provided example strings:

import re

INPUT_DATES = [
    '(2000_-_09_-_01 (08:01 am)) (2001_-_10_-_01 (09:02 am))',
    '(20000_-_1_-_01) 18:1 pm',
    '2000_-_9_-_01 8:1 am',
    '(2000_-_12_-_1) (1:1 am)',
]

REGEX_SPLIT = re.compile(r'\(([\dpam_\- :()]{10,})\) \(([\dpam_\- :()]{10,})\)')
REGEX_DATE = re.compile(r'(?P<year>\d{4,})_-_(?P<month>\d{1,2})_-_(?P<day>\d{1,2}) (?P<hour>\d{1,2}):(?P<minute>\d{1,2}) (?P<apm>[apm]{2})')

for testDates in INPUT_DATES:
    testDates = REGEX_SPLIT.split(testDates)
    for testDate in testDates:
        if len(testDate) < 10:
            continue
        testDateClean = testDate
        for rm in ('(', ')'):
            testDateClean = testDateClean.replace(rm, '')
        date = REGEX_DATE.match(testDateClean).groupdict()
        print(f'parsed out: {date["year"]}_-_{date["month"]:>02}_-_{date["day"]:>02}({date["hour"]:>02}:{date["minute"]:>02} {date["apm"]}), from in: {testDate}')

output:

parsed out: 2000_-_09_-_01(08:01 am), from in: 2000_-_09_-_01 (08:01 am)
parsed out: 2001_-_10_-_01(09:02 am), from in: 2001_-_10_-_01 (09:02 am)
parsed out: 20000_-_01_-_01(18:01 pm), from in: (20000_-_1_-_01) 18:1 pm
parsed out: 2000_-_09_-_01(08:01 am), from in: 2000_-_9_-_01 8:1 am
parsed out: 2000_-_12_-_01(01:01 am), from in: (2000_-_12_-_1) (1:1 am)
Answered By: phibel