How to extract time data from a string using regex.group()?

Question:

I’m supposed to normalize time statements in input, converting them to a standard format. The input statements contain an hour, possibly minutes, and a part of day (morning or evening). The part of day can be expressed multiple ways. The hour might be based on a 12-hour clock.

The output must use a 24-hour clock and "am" or "pm" for the time of day. Extra characters (such as spaces) in the time statement should be kept. Minutes shouldn’t be added; if the original statement doesn’t include minutes, they shouldn’t appear in the result.

Sample data

#input examples:
inputs = [
    "6 de la manana hdhd", #example 1
    "hdhhd 06: de la manana hdhd", #example 2
    "hd:00 06 : de la manana hdhd", #example 3
    "hdhhd 6 de la manana hdhd", #example 4
    "hdhhd 06:00 de la manana hdhd", #example 5
    "hdhhd 06 : 18 de la manana hdhd", #example 6
    "hdhhd 18 de la manana hdhd", #example 7
    "hdhhd 18:18 de la manana hdhd", #example 8
    "hdhhd 18 : 00 de la manana hdhd", #example 9
    "hdhhd 19 : 19 de la noche hdhd", #example 10
    "hdhhd 6  de la noche hdhd", #example 11
    ]

There are two cases where the hour might need to be changed.

  • The input might contain mistakes, where the hour is in the evening but the part-of-day is given as the morning (example 7). In this case, the part-of-day should be changed to match the hour.
  • The input might also use a 12-hour clock, where the hour is <= 12 and the part-of-day is in the evening (example 11). In this case, the hour should be changed to match the part-of-day.

This is my code so far. I have the structure of the replacements in place, but I have not yet been able to extract the data that the process needs. The unfinished parts are written as pseudocode:

import re   #library for using regular expressions

am_list = ["manana", "mañana", "mediodia", "medio dia","madrugada","amanecer"]
pm_list = ["atardecer", "tarde", "ocaso", "noche", "anochecer"]

def fix_time(input_text):
    is_am_time, is_pm_time = False, False
    hour_number_fixed, civil_time_fixed = "", ""
    
    re_pattern_for_am = r"\d{1,2})[\s|:]*(\d{0,2})\s*(?:de la |de el)" + am_list
    if (identification condition for am):
        #extract with re.group()
        hour_number = int()  # <--- d{1,2}
        am_or_pm = str()     # <--- am_list
    
    re_pattern_for_pm = r"\d{1,2})[\s|:]*(\d{0,2})\s*(?:de la |de el)" + pm_list
    if (identification condition for pm):
        #extract with re.group()
        hour_number = int()  # <--- d{1,2}
        am_or_pm = str()     # <--- pm_list
    
    if (am_or_pm == one element in am_list):
        is_am_time = True
    elif (am_or_pm == one element in pm_list):
        is_pm_time = True
    
    if (is_am_time == True):
        if (hour_number >= 12):
            civil_time_fixed = "pm"
        else:
            civil_time_fixed = "am"
        hour_number_fixed = str(hour_number)
    elif (is_pm_time == True):
        if (hour_number < 12):
            hour_number_fixed = str(hour_number + 12 ) 
        civil_time_fixed = "pm"
    
    #replacement process
    input_text = input_text.replace(hour_number, hour_number_fixed, 1)
    input_text = input_text.replace(am_or_pm, civil_time_fixed, 1)
    
    return input_text

I need the program to check and correct the times, using the data (hour_number and am_or_pm) that it must extract from input_text with re.group(). This is what is giving me the most trouble. How can I get the regexes to capture the hour and part of day?

The correct output in each case:

"6 am hdhd"                   #for the example 1
"hdhhd 06: am hdhd"           #for the example 2
"hd:00 06 : am hdhd"          #for the example 3
"hdhhd 6 am hdhd"             #for the example 4
"hdhhd 06:00 am hdhd"         #for the example 5
"hdhhd 06 : 18 am hdhd"       #for the example 6
"hdhhd 18 pm hdhd"            #for the example 7
"hdhhd 18:18 pm hdhd"         #for the example 8
"hdhhd 18 : 00 pm hdhd"       #for the example 9
"hdhhd 19 : 19 pm hdhd"       #for the example 10
"hdhhd 18 pm hdhd"            #for the example 11

How do I do those data extractions with re.group() (or similar method) in this code?

Answers:

It seems imprudent to attempt a full solution, so here is an example of how to extract the hour using named groups with a simplified regex.

import re

input_text = "hdhhd 06:00 de la manana hdhd"
# named groups: 'hour' is one or two digits, 'minutes' is exactly two
match = re.search(r"(?P<hour>\d\d?):(?P<minutes>\d\d)", input_text)
hour = match.group('hour')
print(hour)    # 06

Other than that, which specific aspects of your problem are you struggling with?

Answered By: MikeM

First, note that normalizing the hour is beyond the capabilities of regular expressions, so that will need to be performed in Python. Fortunately, re.sub accepts a function to create the replacement string.
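
As a minimal sketch of that feature (the function name `to_24_hour` and the sample text are made up for illustration), `re.sub` calls the function once per match and uses its return value as the replacement:

```python
import re

def to_24_hour(match):
    # The match object exposes the captured groups; the returned
    # string replaces the matched text (here, only the hour digits).
    hour = int(match.group("hour"))
    return str(hour + 12)

# The lookahead keeps " pm" out of the match, so only the digits are replaced.
result = re.sub(r"(?P<hour>\d{1,2})(?= pm)", to_24_hour, "open 6 pm, close 9 pm")
print(result)  # open 18 pm, close 21 pm
```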

Regex

The sample regex has a few large issues:

  • The group to capture the time is missing an opening parenthesis to start the group.
  • You can’t add a string and a list; the lists must be joined with a separator.
  • The AM and PM word patterns can’t simply be appended to the main patterns; they each must be in a group so they can use alternation.

There’s also a minor issue: the pattern will fail for strings with ‘de el’ because there’s no space between ‘el’ and the AM or PM word.

Note you can combine the two regexes into one, and then check whether the AM or PM subpattern was matched. An easy way to do this is to use two named groups, one for AM and one for PM, with the words and phrases for each period in the corresponding group.

The sub-expressions to match the hour and minute can also be named, for clarity of access. The time expression could also be named.

This gives the following Python to create the pattern:

am_pattern = '|'.join(am_list)
pm_pattern = '|'.join(pm_list)
time_pattern = r"(?P<time>(?P<hour>\d{1,2})(?P<minute>[\s:|]*\d{0,2}))"
pattern = rf'{time_pattern}\s*(?:de la|de el)\s(?:(?P<am>{am_pattern})|(?P<pm>{pm_pattern}))'

Evaluated (in free-spacing mode, for clarity), the regex is:

(?P<time>
  (?P<hour>\d{1,2})
  (?P<minute>[\s:|]*\d{0,2})
)
\s*(?:de la|de el)\s
(?:
  (?P<am>manana|mañana|mediodia|medio dia|madrugada|amanecer)
|
  (?P<pm>atardecer|tarde|ocaso|noche|anochecer)
)
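
To see what the named groups capture, here is a small sketch that builds this first version of the pattern and inspects one of the question's samples. The word lists are copied from the question; everything else follows the construction above.

```python
import re

am_list = ["manana", "mañana", "mediodia", "medio dia", "madrugada", "amanecer"]
pm_list = ["atardecer", "tarde", "ocaso", "noche", "anochecer"]

am_pattern = '|'.join(am_list)
pm_pattern = '|'.join(pm_list)
time_pattern = r"(?P<time>(?P<hour>\d{1,2})(?P<minute>[\s:|]*\d{0,2}))"
pattern = rf"{time_pattern}\s*(?:de la|de el)\s(?:(?P<am>{am_pattern})|(?P<pm>{pm_pattern}))"

match = re.search(pattern, "hdhhd 06 : 18 de la manana hdhd")
# Exactly one of 'am' and 'pm' is non-None, which tells the two cases apart.
print(match.groupdict())
# {'time': '06 : 18', 'hour': '06', 'minute': ' : 18', 'am': 'manana', 'pm': None}
```

If the free-spacing layout were compiled as-is with re.VERBOSE, the literal spaces in 'de la' and 'medio dia' would need escaping (for example `de\ la`), since that flag ignores unescaped whitespace in the pattern.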

There are a few minor improvements that can be made, such as:

  • Anchoring the beginning of the time sub-pattern at a word boundary to prevent a match when there are more than two digits (the existing pattern will match ‘123: 45’, as the \d{1,2} will match the ‘23’ of ‘123’).
  • The time sub-pattern will match any string of 3 or 4 digits, as the separator isn’t required. Instead, require the separator and make the minute sub-pattern optional.
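
Both issues can be reproduced with cut-down patterns. This is only a sketch: the names `loose` and `strict` and the test strings are illustrative, and the part-of-day words are left out to keep the patterns short.

```python
import re

# Before: no word boundary, separator optional.
loose = re.compile(r"(?P<hour>\d{1,2})(?P<minute>[\s:|]*\d{0,2})\s*de la")
# After: word boundary before the hour, separator required, minute group optional.
strict = re.compile(r"(?P<hour>\b\d{1,2})(?P<minute>[\s:|]+\d{0,2})?\s*de la")

text = "123: 45 de la manana"
print(loose.search(text).group("hour"))   # 23
print(strict.search(text).group("hour"))  # 45

digits = "1234 de la tarde"
print(loose.search(digits).group(0))      # 1234 de la
print(strict.search(digits))              # None
```

Note that the stricter pattern still finds a match in the first string, starting at ‘45’; the boundary only stops a match from starting in the middle of a run of digits.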

With these changes, the regex construction becomes:

am_pattern = '|'.join(am_list)
pm_pattern = '|'.join(pm_list)
time_pattern = r"(?P<time>(?P<hour>\b\d{1,2})(?P<minute>[\s:|]+\d{0,2})?)"
pattern = rf'{time_pattern}\s*de (?:la|el)\s(?:(?P<am>{am_pattern})|(?P<pm>{pm_pattern}))'

Evaluated:

(?P<time>
  (?P<hour>\b\d{1,2})
  (?P<minute>[\s:|]+\d{0,2})?
)
\s*de (?:la|el)\s
(?:
  (?P<am>manana|mañana|mediodia|medio dia|madrugada|amanecer)
|
  (?P<pm>atardecer|tarde|ocaso|noche|anochecer)
)

Python

With the above regex to extract the necessary information, the replacement function has a few tasks:

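  • check which of the AM or PM groups matched, to pick the starting meridiem
  • check the meridiem against the hour, correcting it to "pm" when the hour is past 12
  • check the hour against the meridiem, adding 12 when a PM word accompanies a 12-hour-clock hour
  • trim trailing whitespace from the time before appending the meridiem
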
def matched_group(match, groups, default='', throw=False):
    """
    Return the name of the first named group from 'groups' that had a match.
    """
    for group in groups:
        if match.group(group):
            return group
    if throw:
        raise KeyError(f'no group found from ({groups})')
    return default # could also throw

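def repl_time(match):
    # 'am' or 'pm', depending on which word group matched
    meridiem = matched_group(match, ['am', 'pm'])
    time, hour, minute = match.group('time', 'hour', 'minute')
    hour = int(hour)
    if hour > 12:
        # a 24-hour-clock hour overrides the part-of-day word
        meridiem = 'pm'
    elif 'pm' == meridiem: # hour <= 12
        hour += 12
        # minute is None when the optional group didn't participate
        time = str(hour) + (minute or '')
    return time.rstrip() + ' ' + meridiem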

reTime = re.compile(pattern)
reTime.sub(repl_time, input_text)

Applying the above to the samples produces the desired results:

samples = [
    "6 de la manana hdhd",
    "hdhhd 06: de la manana hdhd",
    "hd:00 06 : de la manana hdhd",
    "hdhhd 6 de la manana hdhd",
    "hdhhd 06:00 de la manana hdhd",
    "hdhhd 06 : 18 de la manana hdhd",
    "hdhhd 18 de la manana hdhd",
    "hdhhd 18:18 de la manana hdhd",
    "hdhhd 18 : 00 de la manana hdhd",
    "hdhhd 19 : 19 de la noche hdhd",
    "hdhhd 6  de la noche hdhd",
    ]

[reTime.sub(repl_time, sample) for sample in samples]

Results:

[
    '6 am hdhd',
    'hdhhd 06: am hdhd',
    'hd:00 06 : am hdhd',
    'hdhhd 6 am hdhd',
    'hdhhd 06:00 am hdhd',
    'hdhhd 06 : 18 am hdhd',
    'hdhhd 18 pm hdhd',
    'hdhhd 18:18 pm hdhd',
    'hdhhd 18 : 00 pm hdhd',
    'hdhhd 19 : 19 pm hdhd',
    'hdhhd 18 pm hdhd'
]
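
For convenience, the pieces can be assembled into one self-contained script that checks every sample against the output requested in the question. Treat it as a sketch rather than a definitive version: the `cases` dictionary is just the question's examples paired with their expected results, the meridiem lookup is inlined instead of using `matched_group`, and the `minute or ''` guard is an added assumption for inputs where the optional minute group does not participate.

```python
import re

am_list = ["manana", "mañana", "mediodia", "medio dia", "madrugada", "amanecer"]
pm_list = ["atardecer", "tarde", "ocaso", "noche", "anochecer"]

am_pattern = '|'.join(am_list)
pm_pattern = '|'.join(pm_list)
time_pattern = r"(?P<time>(?P<hour>\b\d{1,2})(?P<minute>[\s:|]+\d{0,2})?)"
reTime = re.compile(
    rf"{time_pattern}\s*de (?:la|el)\s(?:(?P<am>{am_pattern})|(?P<pm>{pm_pattern}))"
)

def repl_time(match):
    # Exactly one of the 'am' / 'pm' groups participates in a match.
    meridiem = 'pm' if match.group('pm') else 'am'
    time, hour, minute = match.group('time', 'hour', 'minute')
    hour = int(hour)
    if hour > 12:
        meridiem = 'pm'            # the hour wins over the part-of-day word
    elif meridiem == 'pm':
        hour += 12                 # 12-hour clock with an evening word
        time = str(hour) + (minute or '')
    return time.rstrip() + ' ' + meridiem

# Question samples mapped to the outputs the question asks for.
cases = {
    "6 de la manana hdhd": "6 am hdhd",
    "hdhhd 06: de la manana hdhd": "hdhhd 06: am hdhd",
    "hd:00 06 : de la manana hdhd": "hd:00 06 : am hdhd",
    "hdhhd 6 de la manana hdhd": "hdhhd 6 am hdhd",
    "hdhhd 06:00 de la manana hdhd": "hdhhd 06:00 am hdhd",
    "hdhhd 06 : 18 de la manana hdhd": "hdhhd 06 : 18 am hdhd",
    "hdhhd 18 de la manana hdhd": "hdhhd 18 pm hdhd",
    "hdhhd 18:18 de la manana hdhd": "hdhhd 18:18 pm hdhd",
    "hdhhd 18 : 00 de la manana hdhd": "hdhhd 18 : 00 pm hdhd",
    "hdhhd 19 : 19 de la noche hdhd": "hdhhd 19 : 19 pm hdhd",
    "hdhhd 6  de la noche hdhd": "hdhhd 18 pm hdhd",
}

for source, expected in cases.items():
    result = reTime.sub(repl_time, source)
    assert result == expected, (source, result)
```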

Answered By: outis