How to extract and print, in their order of appearance within the input string, the substrings that match any one of several RegEx patterns?

Question:

I am having trouble printing the desired elements from a string, because these elements follow different patterns (simplified here to 3 regex patterns). The objective, as shown in the examples below, is to print what is stored in the substring_to_extract_1 and/or substring_to_extract_2 variables, and these extractions must be printed to the console in the order in which they appear in the input string (which is tricky, since there are 3 patterns to check inside the reading loop).

import re

input_text = "quizas seria mejor ir de 06 00 am a 11 59 ya que me gusta viajar a esas horas" #example 1

finding loop:

    continue_comparing = True

    # 1) The least restrictive RegEx; it is not very strict, but it must extract 2 substrings
    #    examples: "entre las 19hs y las 20 30", "entre las 18 y las 20", "entre la 1 y las 15hs"
    regex_1 = "(?:apartir de |de|entre |desde )\s*(?:las|la )\s*" substring_to_extract_1 "\s*(?:de la tarde|de la noche|de la mañana|)\s*(?:y las |y la |hasta las |hasta la |hasta )\s*" substring_to_extract_2 "\s*(?:de la tarde|de la noche|de la mañana|)"
    if (regex_1 == True and continue_comparing == True):
        print(repr(substring_to_extract_1))
        print(repr(substring_to_extract_2))
        continue_comparing = False
    
    # 2) The moderately restrictive RegEx; it requires an explicit "am" or "pm"
    #    examples:  "las 18:00am", "las 1800pm", "las 1800 p.m.", "las 08 00 am", "la 01 00 a.m.", "las 19 15 pm", "las 23 pm"
    regex_2 = "(?:las|la )\s*" substring_to_extract_1 "\s*(?:a\.m\.|a\.m|am\.|am|a m|p\.m\.|p\.m|pm\.|pm|p m)\s*(?:de la tarde|de la noche|de la mañana|)"
    if (regex_2 == True and continue_comparing == True):
        print(repr(substring_to_extract_1))
        continue_comparing = False

    # 3) The last and most restrictive RegEx; it extracts a single substring, like the second regex, but requires that the substring be preceded by one of several prefixes
    #    examples: "a las 17", "a las 6 y 15", "desde las 15 hs", "a las 15 y 45 hs", "hasta las 17 00", "antes de las 16:06"
    regex_3 = "(?:a eso de |a esas de |despues de |antes de |hasta |tipo |desde |apartir de |de |a )\s*(?:las|la )\s*" substring_to_extract_1 "\s*(?:de la tarde|de la noche|de la mañana|)"
    if (regex_3 == True and continue_comparing == True):
        print(repr(substring_to_extract_1))
        continue_comparing = False


In the regex patterns, substring_to_extract_1 marks the location where the data will be extracted. For the first regex, for example, I think something like r'(\d{1,2}\s*(?::| )\s*\d{1,2}\s*(?:am|pm))' would work there. I could then retrieve the captured text with the match object's .groups() method.
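
A minimal sketch of that idea, assuming the capture group is used on its own rather than embedded in one of the three larger patterns. The names time_pat and sample are hypothetical and only serve the illustration:

```python
import re

# Candidate capture group for an "HH MM am/pm" style time (hypothetical name).
time_pat = re.compile(r"(\d{1,2}\s*(?::| )\s*\d{1,2}\s*(?:am|pm))")

sample = "quizas seria mejor ir de 06 00 am a 11 59"

# search() returns the first match; groups() returns a tuple of captured groups.
m = time_pat.search(sample)
print(m.groups())  # ('06 00 am',)
```

Note that this group only matches times that end in "am" or "pm", so "11 59" in the same string is not captured by it.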

For the other 2 regex I am not sure, since they will depend a lot on how the reading loop should be structured.


Some examples of possible input strings to parse:

Example 1:

input_text = "quizas seria mejor ir de 06 00 am a 11 59 ya que me gusta viajar a esas horas"

In this case you must extract the following substrings one by one and print them in order of appearance…

output that I need:

"de 06 00 am"   <--- was extracted by the second regex pattern
"a 11 59"       <--- was extracted by the third regex pattern

Example 2:

input_text = "a eso de las 6 de la tarde o las 19 15 pm deberiamos estar alli, no crees? recuerda que desde las 15 hs ya empieza el evento, pero podemos estar alli antes de las 14 30 hs, aunque eso depende porque el show recien empezara entre las 15 hs y las 18pm"

output that I need:

"de las 6 de la tarde"    <--- was extracted by the third regex pattern
"las 19 15 pm"            <--- was extracted by the second regex pattern
"desde las 15"            <--- was extracted by the third regex pattern
"de las 14 30"            <--- was extracted by the third regex pattern
"entre las 15"            <--- was extracted by the first regex pattern
"y las 18pm"              <--- was extracted by the first regex pattern

Example 3:

input_text = "Hay que estar presentes entre las 19hs y las 20 30 hs, y seguro salimos a las 23 pm"

output that I need:

"entre las 19"            <--- was extracted by the first regex pattern
"y las 20 30"             <--- was extracted by the first regex pattern
"a las 23 pm"             <--- was extracted by the second regex pattern

Example 4:

input_text = "A las 19 salimos!! es importante llegar alla antes de las 20 30 hs"

output that I need:

"A las 19"                <--- was extracted by the third regex pattern
"de las 20 30"            <--- was extracted by the third regex pattern

Example 5:

input_text = "A las 19:30 salimos!! es importante llegar alla antes de las 20 30 hs, ya que a las 21: pm cierran algunos negocios, sin embargo el cine esta abierto hasta las 23:30 pm de la noche"

output that I need:

"A las 19:30"                    <--- was extracted by the third regex pattern
"de las 20 30"                   <--- was extracted by the third regex pattern
"a las 21: pm"                   <--- was extracted by the second regex pattern
"hasta las 23:30 pm de la noche" <--- was extracted by the second regex pattern

How should I read the input string? And consequently, how should I structure the code block that allows us to evaluate the patterns in these cases?

Answers:

Try (regex101):

import re

test_cases = [
    "a eso de las 6 de la tarde o las 19 15 pm deberiamos estar alli, no crees? recuerda que desde las 15 hs ya empieza el evento, pero podemos estar alli antes de las 14 30 hs, aunque eso depende porque el show recien empezara entre las 15 hs y las 18pm",
    "quizas seria mejor ir de 06 00 am a 11 59 ya que me gusta viajar a esas horas",
    "Hay que estar presentes entre las 19hs y las 20 30 hs, y seguro salimos a las 23 pm",
    "A las 19 salimos!! es importante llegar alla antes de las 20 30 hs",
]

pat = re.compile(
    r"\b(?:de las|entre las|desde las|y las|a las|las|de|a)\s+\d+(?:\s+\d+)?\s*(?:pm|am|de la tarde)?",
    flags=re.I,
)

for t in test_cases:
    x = pat.findall(t)
    print(t)
    print("-" * 80)
    print(*map(str.strip, x), sep="\n")
    print()

Prints:

a eso de las 6 de la tarde o las 19 15 pm deberiamos estar alli, no crees? recuerda que desde las 15 hs ya empieza el evento, pero podemos estar alli antes de las 14 30 hs, aunque eso depende porque el show recien empezara entre las 15 hs y las 18pm
--------------------------------------------------------------------------------
de las 6 de la tarde
las 19 15 pm
desde las 15
de las 14 30
entre las 15
y las 18pm

quizas seria mejor ir de 06 00 am a 11 59 ya que me gusta viajar a esas horas
--------------------------------------------------------------------------------
de 06 00 am
a 11 59

Hay que estar presentes entre las 19hs y las 20 30 hs, y seguro salimos a las 23 pm
--------------------------------------------------------------------------------
entre las 19
y las 20 30
a las 23 pm

A las 19 salimos!! es importante llegar alla antes de las 20 30 hs
--------------------------------------------------------------------------------
A las 19
de las 20 30
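
The order of appearance comes for free here: findall scans the string from left to right and returns non-overlapping matches in the order found. If the position of each match is also needed, finditer exposes it through the match objects. A small sketch, reusing the pattern above on one of the test strings (the names text and spans are only illustrative):

```python
import re

pat = re.compile(
    r"\b(?:de las|entre las|desde las|y las|a las|las|de|a)\s+\d+(?:\s+\d+)?\s*(?:pm|am|de la tarde)?",
    flags=re.I,
)

text = "Hay que estar presentes entre las 19hs y las 20 30 hs, y seguro salimos a las 23 pm"

# finditer yields match objects left to right; start() is the offset in text.
spans = [(m.start(), m.group().strip()) for m in pat.finditer(text)]

for start, substring in spans:
    print(start, repr(substring))
```

The start offsets are strictly increasing, which confirms that no extra sorting step is needed.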


EDIT: To save the substrings in a list instead of printing them:

out = []
for t in test_cases:
    x = pat.findall(t)
    out.append(list(map(str.strip, x)))

print(out)

Prints:

[
    [
        "de las 6 de la tarde",
        "las 19 15 pm",
        "desde las 15",
        "de las 14 30",
        "entre las 15",
        "y las 18pm",
    ],
    ["de 06 00 am", "a 11 59"],
    ["entre las 19", "y las 20 30", "a las 23 pm"],
    ["A las 19", "de las 20 30"],
]

EDIT 2: To also accept ":" between hours and minutes:

import re

test_cases = [
    "a eso de las 6 de la tarde o las 19 15 pm deberiamos estar alli, no crees? recuerda que desde las 15 hs ya empieza el evento, pero podemos estar alli antes de las 14 30 hs, aunque eso depende porque el show recien empezara entre las 15 hs y las 18pm",
    "quizas seria mejor ir de 06 00 am a 11 59 ya que me gusta viajar a esas horas",
    "Hay que estar presentes entre las 19hs y las 20 30 hs, y seguro salimos a las 23 pm",
    "A las 19 salimos!! es importante llegar alla antes de las 20 30 hs",
    "A las 19:30 salimos!! es importante llegar alla antes de las 20 30 hs, ya que a las 21: pm cierran algunos negocios, sin embargo el cine esta abierto hasta las 23:30 pm de la noche",
]

pat = re.compile(
    r"\b(?:de las|entre las|desde las|y las|a las|las|de|a)\s+\d+(?:[\s:]+)?(?:\d+)?\s*(?:pm|am|de la tarde)?",
    flags=re.I,
)

out = []
for t in test_cases:
    x = pat.findall(t)
    out.append(list(map(str.strip, x)))

print(out)

Prints:

[
    [
        "de las 6 de la tarde",
        "las 19 15 pm",
        "desde las 15",
        "de las 14 30",
        "entre las 15",
        "y las 18pm",
    ],
    ["de 06 00 am", "a 11 59"],
    ["entre las 19", "y las 20 30", "a las 23 pm"],
    ["A las 19", "de las 20 30"],
    ["A las 19:30", "de las 20 30", "a las 21: pm", "las 23:30 pm"],
]
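
If it also matters which kind of pattern produced each substring (as in the annotations in the question), one possible approach is to join the patterns into a single alternation of named groups and read m.lastgroup. This is only a sketch under assumptions: TIME, AMPM, with_ampm and with_prefix are hypothetical names, and the two sub-patterns are simplified stand-ins, not the three patterns from the question.

```python
import re

# Hypothetical building blocks, simplified for illustration.
TIME = r"\d{1,2}(?:[\s:]+\d{1,2})?"   # "19", "19:30", "20 30"
AMPM = r"[ap]\.?\s?m\b\.?"            # "am", "pm", "a.m.", "p m"
PREFIX = r"(?:antes de|desde|hasta|a)"

# More specific alternatives go first: at a given position the regex engine
# tries them in order and keeps the first one that matches.
patterns = {
    "with_ampm": rf"(?:{PREFIX}\s+)?(?:las|la)\s+{TIME}\s*{AMPM}",
    "with_prefix": rf"{PREFIX}\s+(?:las|la)\s+{TIME}",
}
combined = re.compile(
    r"\b(?:" + "|".join(f"(?P<{name}>{p})" for name, p in patterns.items()) + ")",
    flags=re.I,
)

text = "A las 19:30 salimos, antes de las 20 30 hs, ya que a las 23 pm cierran"

# One left-to-right scan keeps the order; lastgroup names the alternative used.
tagged = [(m.lastgroup, m.group().strip()) for m in combined.finditer(text)]
print(tagged)
```

Because everything is matched in a single scan, there is no need for a continue_comparing flag or for merging the results of separate searches afterwards.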
Answered By: Andrej Kesely