Make a regex pattern match only when it ends with a period, one or more spaces, or the end of the string

Question:

import re

#regex pattern
time_in_numbers = r"(?:por el|entrada el|entrado el|del|)\s*(?:a[\s|]*\.[\s|]*m[\s|]*\.|a[\s|]*m[\s|]*\.|a[\s|]*\.[\s|]*m|a[\s|]*m|p[\s|]*\.[\s|]*m[\s|]*\.|p[\s|]*m[\s|]*\.|p[\s|]*\.[\s|]*m|p[\s|]*m|)"

#if it detects the regex pattern condition in the input string then it performs a replacement with the re.sub() function
input_text = re.sub(time_in_numbers, "replacement!!!", input_text)

Some example cases:

input_text = "por el a.m.anecer"  #accept
input_text = "por el amanecer"  #not accept
input_text = "por el a.manecer" #not accept
input_text = "por el a.m anecer" #accept
input_text = "por el am anecer" #accept
input_text = "por el am.anecer" #accept
input_text = "por el a.m." #accept
input_text = "por el a.m" #accept
input_text = input_text + "jhfsjh"
input_text = "por el a.mjhfsjh" #accept

I tried appending "jhfsjh" to the input and to the end of the regex alternatives, for the cases where "am" or "pm" is not followed by a dot ".":

time_in_numbers = r"(?:por el|entrada el|entrado el|del|)\s*(?:|a[\s|]*\.[\s|]*mjhfsjh|a[\s|]*mjhfsjh|p[\s|]*\.[\s|]*mjhfsjh|p[\s|]*mjhfsjh|a[\s|]*\.[\s|]*m|a[\s|]*m\s|p[\s|]*\.[\s|]*m\s|p[\s|]*m\s|)"

input_text = re.sub(time_in_numbers, "replacement!!!", input_text)

input_text = input_text.replace("jhfsjh", "") #accept

Is there another way to require that the match ends with a period, one or more whitespace characters, or the end of the string (something like r"[\.|\s*|the end of the string]"), without resorting to this workaround?

Answers:

If you want to create a regex, you can use Regex101. With the Python flavor selected, the site matches and captures strings using Python regex syntax, so you don't need to run your Python script every time.

To create this answer I used this site. Here's the regex I got: (?:(?:por el)|(?:entrada el)|(?:entrado el)|(?:del)|)\s+(?:(?:a *\.? *m *[. ] *.*)|(?:a *\.? *m *[\s]))

By using this big string as input:

por el a.m.anecer
por el a.m anecer
por el am anecer
por el am.anecer
por el a.m.
por el a.m

por el amanecer
por el a.manecer

Only the first block is matched on the site. You can easily check this yourself by copying and pasting the regex and the input (Ctrl+C and Ctrl+V).
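The same check can also be run locally. The following is a minimal sketch, assuming the backslashes in the pattern above read `\s` and `\.`. The variable names `pattern`, `sample` and `matches` are illustrative only.

```python
import re

# The answer's pattern, with the backslashes assumed to be \s and \.
pattern = re.compile(
    r"(?:(?:por el)|(?:entrada el)|(?:entrado el)|(?:del)|)\s+"
    r"(?:(?:a *\.? *m *[. ] *.*)|(?:a *\.? *m *[\s]))"
)

# The same two blocks of input used on the site.
sample = """por el a.m.anecer
por el a.m anecer
por el am anecer
por el am.anecer
por el a.m.
por el a.m

por el amanecer
por el a.manecer"""

# strip() drops the trailing newline that the second alternative consumes.
matches = [m.group(0).strip() for m in pattern.finditer(sample)]
for text in matches:
    print(text)
```

Note that in this sketch the line "por el a.m" matches only because a newline follows it. If the same text sits at the very end of the input, this pattern does not match it.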

Answered By: Carl HR

There are some issues in your current regex:

  • It seems that with [\s|]* you want to make a space optional, but within square brackets a pipe symbol is taken as a literal character. Moreover, as you already have *, the space is already optional. So you can shorten this to just \s*

  • By putting | with no option to the left or right of it, you want to make the rest of the expression optional, but the ? operator exists for that purpose. So instead of (?:por el|entrada el|entrado el|del|) do (?:por el|entrada el|entrado el|del)?.

  • Your regex lists several possibilities separately, but these can be combined. For instance, you have the same options with a as with p. These can be combined by using [ap].

  • Your workaround to test the end of the string is not necessary. There is $ for that purpose. But this is a case you really don't need to test separately. All you want is to make sure that the m is not followed by another alphanumerical character. Again there is a provision for that: use \b.

  • As everything is optional in your regex, it will also match empty strings, which explains why your sub is resulting in so many "replacement!!!" insertions. Better make sure the regex is required to match something at least.
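A few of the points above can be checked in isolation. This is only a sketch: `all_optional` is a shortened, hypothetical stand-in for the original pattern, and the sample strings are made up for illustration.

```python
import re

# Inside a character class, "|" is a literal pipe, so [\s|]* also accepts pipes.
pipe_accepted = re.fullmatch(r"a[\s|]*m", "a|m") is not None
print(pipe_accepted)  # True

# When every part is optional, the pattern matches the empty string
# at each position, so re.sub() inserts the replacement everywhere.
all_optional = r"(?:por el|del|)\s*(?:am|pm|)"
flooded = re.sub(all_optional, "X", "hola")
print(flooded)  # XhXoXlXaX

# [ap] covers both letters, and \b rejects an "m" followed by a word character.
boundary = {
    s: re.search(r"[ap]m\b", s) is not None
    for s in ["am", "pm.", "am anecer", "amanecer"]
}
print(boundary)
```

Only "amanecer" fails the last check, because there the m is directly followed by another letter.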

I did not quite understand what you wanted to achieve with sub, but as your question was about the matching itself, I provide here a regex with a sub call that will insert parentheses around the parts it matched:

import re

time_in_numbers = r"(por el|entrada el|entrado el|del)?\s*([ap]\s*(?:\.\s*)?m\b(?:\s*\.)?)"

tests = [
    "por el a.m.anecer",  #accept
    "por el amanecer",  #not accept
    "por el a.manecer", #not accept
    "por el a.m anecer", #accept
    "por el am anecer", #accept
    "por el am.anecer", #accept
    "por el a.m.", #accept
    "por el a.m" #accept
]

for input_text in tests:
    # \1 and \2 refer to the two capture groups of the pattern
    result = re.sub(time_in_numbers, r"(\1)(\2)", input_text)
    print(result)

The output of this script is:

(por el)(a.m.)anecer
por el amanecer
por el a.manecer
(por el)(a.m) anecer
(por el)(am) anecer
(por el)(am.)anecer
(por el)(a.m.)
(por el)(a.m)

The lines that contain parentheses had a match; the other two lines did not.
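For comparison, the condition from the question (a period, whitespace, or the end of the string) can also be written out literally with a lookahead. This is a sketch of an assumed variant, not part of the answer above. The name `explicit_end` and the extra test string "por el am, claro" are hypothetical.

```python
import re

# Assumed variant: name the allowed followers explicitly instead of using \b.
# (?=[.\s]|$) requires a period, a whitespace character, or the end of the string.
explicit_end = re.compile(r"[ap]\s*(?:\.\s*)?m(?=[.\s]|$)")

tests = [
    "por el a.m.anecer",  # accept
    "por el amanecer",    # not accept
    "por el a.manecer",   # not accept
    "por el a.m anecer",  # accept
    "por el am anecer",   # accept
    "por el am.anecer",   # accept
    "por el a.m.",        # accept
    "por el a.m",         # accept
    "por el am, claro",   # a comma is not in the allowed set
]

results = {text: explicit_end.search(text) is not None for text in tests}
for text, matched in results.items():
    print(f"{text!r}: {matched}")
```

The two approaches differ only for followers such as a comma: \b accepts any non-word character after the m, while the lookahead accepts just the three listed cases.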

Answered By: trincot