How to generalize this regex so that it starts capturing substrings at the beginning of a string or if it is followed by some other word?

Question:

import re

name = "John"

#In these examples it works fine
input_sense_aux = "These sound system are too many, I think John can help us, otherwise it will be waiting for a while longer"
#input_sense_aux = "These sound system are too many but I know that John can help us, otherwise it will be waiting for a while longer"
#input_sense_aux = "These sound system are too many but I know that John can help us. otherwise it will be waiting for a while longer"
#input_sense_aux = "Do you know if John with the others could come this afternoon?"

#In these examples it does not work well
#input_sense_aux = "John can help us, otherwise it will be waiting for a while longer"
#input_sense_aux = "Can you help us, otherwise it will be waiting for a while longer for John"
#input_sense_aux = "sorry! can you help us? otherwise it will be waiting for a while longer for John"



regex_patron_m1 = r"\s*((?:\w\s*)+)\s*?" + name + r"\s*((?:\w\s*)+)\s*\??"
m1 = re.search(regex_patron_m1, input_sense_aux, re.IGNORECASE)  # Check whether the regex matches, i.e. whether the if block below is entered
if m1:
    something_1, something_2 = m1.groups()

    something_1 = something_1.strip()
    something_2 = something_2.strip()

    print(repr(something_1))
    print(repr(something_2))

I need the regex to grab the content before "John" like this:

(start of sentence|¿|¡|,|;|:|(|[|.) \s* "content for something_1" \s* John

And then:

John \s* "content for something_2" \s* (end of sentence|?|!|,|;|:|)|]|.)

In the first examples, the regex works fine:

'these teams are too many but I know that'
'can help us'
'Do you know if'
'with the others could come this afternoon'

But for the last 3 examples the regex does not return anything.

I need help generalizing my regex to cover all of these cases while still respecting the conditions under which it must extract the content of something_1 and something_2.

For the 3 last examples, the expected results are:

''
' can help us'
' otherwise it will be waiting for a while longer for '
''
' otherwise it will be waiting for a while longer for '
''

Answers:

Here is an improved version of the code, with some explanation so you can customize it as needed:

import re

name = "John"

#In these examples it works fine
# input_sense_aux = "These sound system are too many, I think John can help us, otherwise it will be waiting for a while longer"
#input_sense_aux = "These sound system are too many but I know that John can help us, otherwise it will be waiting for a while longer"
# input_sense_aux = "These sound system are too many but I know that John can help us. otherwise it will be waiting for a while longer"
# input_sense_aux = "Do you know if John with the others could come this afternoon?"

#In these examples it does not work well
# input_sense_aux = "John can help us, otherwise it will be waiting for a while longer"
# input_sense_aux = "Can you help us, otherwise it will be waiting for a while longer for John"
# input_sense_aux = "sorry! can you help us? otherwise it will be waiting for a while longer for John"



regex_patron_m1 = r"\s*([?:\w\s]+)?\s*" + name + r"\s*([?:\w\s]+)?\s*"
m1 = re.search(regex_patron_m1, input_sense_aux, re.IGNORECASE)  # Check whether the regex matches, i.e. whether the if block below is entered
if m1:
    something_1, something_2 = m1.groups()

    if something_1 is not None:
        something_1 = something_1.strip()
        print(repr(something_1))
    if something_2 is not None:
        something_2 = something_2.strip()
        print(repr(something_2))

In the first two examples that did not work, John is at the start or end of the string. This means that one of the two something variables can be None. I have fixed your code to check for that.
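
As a small illustration of why that check is needed, here is a sketch using a deliberately simplified pattern (hypothetical, not the pattern from the question): an optional group that takes no part in the match is reported as None, and passing a default to groups() is one way to get an empty string instead.

```python
import re

# Hypothetical, simplified pattern: one optional word before and after the name.
pattern = re.compile(r"(?:(\w+)\s+)?John(?:\s+(\w+))?")

m = pattern.search("John helps")

# Group 1 did not take part in the match, so it comes back as None.
before, after = m.groups()
print(before, after)                # None helps

# groups() accepts a default value used for groups that did not participate.
before, after = m.groups("")
print(repr(before), repr(after))    # '' 'helps'
```

With a default of "" the later .strip() calls are safe without separate None checks.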

Now to the regex:

This was the original: r"\s*((?:\w\s*)+)\s*?" + name + r"\s*((?:\w\s*)+)\s*\??"

I made the following changes:

  • Removed the trailing \?? from the end. It matches an optional literal question mark (the second ? is the quantifier "zero or one"). The \s* before it already allows zero or more whitespace characters, and the pattern works without it
  • Changed the inner statements from () to []. Round brackets are for groups, for example to capture a certain part of the string; square brackets are for character classes, which ask "is any one of these characters here?". The classes now accept word characters \w, whitespace \s, colons : and question marks ?. To accept more characters, add them inside the square brackets. Inside a character class only ] \ ^ (at the start) and - (between two characters) need a preceding backslash; outside a class, . + * ? ^ $ ( ) [ ] { } | \ must be escaped to match literally
  • Made the character classes optional with ?. When "John" is the first part of the string, there is nothing to match in front of it, so your regex fails. Making the before and after parts optional lets you match those strings as well
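
The difference between groups and character classes can be seen in a short sketch. The sample strings are made up for illustration and are not taken from the question.

```python
import re

# Round brackets form a group: (abc)+ matches the sequence "abc" repeated.
group_match = re.search(r"(abc)+", "abcabc cab")
print(group_match.group(0))    # abcabc

# Square brackets form a character class: [abc]+ matches any run of a, b or c.
class_match = re.search(r"[abc]+", "cab abc")
print(class_match.group(0))    # cab

# Inside a class, ? and : are literal characters, so [?:\w\s] is not a
# non-capturing group: it simply also accepts "?" and ":".
runs = re.findall(r"[?:\w\s]+", "who: me?")
print(runs)                    # ['who: me?']
```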

If you have any remaining questions feel free to ask in the comments.

Answered By: Leander Hass

You can use

import re

name = "John"

input_sense_auxs = [
    "These sound system are too many, I think John can help us, otherwise it will be waiting for a while longer",
    "These sound system are too many but I know that John can help us, otherwise it will be waiting for a while longer",
    "These sound system are too many but I know that John can help us. otherwise it will be waiting for a while longer",
    "Do you know if John with the others could come this afternoon?",

    "John can help us, otherwise it will be waiting for a while longer",
    "Can you help us, otherwise it will be waiting for a while longer for John",
    "sorry! can you help us? otherwise it will be waiting for a while longer for John"]

regex_patron_m1 = fr'(?:^|[?!¿¡,;:([.])\s*(?:(\w+(?:\s+\w+)*)\s*)?{name}(?:\s*(\w+(?:\s+\w+)*))?\s*(?:$|[]?!,;:).])'
# r"\s*((?:\w\s*)+)\s*?" + name + r"\s*((?:\w\s*)+)\s*\??"
for input_sense_aux in input_sense_auxs:
    print(f'--- {input_sense_aux} ---')
    m1 = re.search(regex_patron_m1, input_sense_aux, re.IGNORECASE)  # Check whether the regex matches, i.e. whether the if block below is entered
    if m1:
        something_1, something_2 = m1.groups()

        something_1 = something_1.strip() if something_1 else ""
        something_2 = something_2.strip() if something_2 else ""

        print(repr(something_1))
        print(repr(something_2))

Output:

--- These sound system are too many, I think John can help us, otherwise it will be waiting for a while longer ---
'I think'
'can help us'
--- These sound system are too many but I know that John can help us, otherwise it will be waiting for a while longer ---
'These sound system are too many but I know that'
'can help us'
--- These sound system are too many but I know that John can help us. otherwise it will be waiting for a while longer ---
'These sound system are too many but I know that'
'can help us'
--- Do you know if John with the others could come this afternoon? ---
'Do you know if'
'with the others could come this afternoon'
--- John can help us, otherwise it will be waiting for a while longer ---
''
'can help us'
--- Can you help us, otherwise it will be waiting for a while longer for John ---
'otherwise it will be waiting for a while longer for'
''
--- sorry! can you help us? otherwise it will be waiting for a while longer for John ---
'otherwise it will be waiting for a while longer for'
''

See the Python demo.

Details:

  • (?:^|[?!¿¡,;:([.])\s*(?:(\w+(?:\s+\w+)*)\s*)? – the prefix, the left-hand part, that matches
    • (?:^|[?!¿¡,;:([.]) – either start of string or a char from the ?!¿¡,;:([. set
    • \s* – zero or more whitespaces
    • (?:(\w+(?:\s+\w+)*)\s*)? – an optional occurrence of
      • (\w+(?:\s+\w+)*) – Group 1: one or more word chars and then zero or more sequences of one or more whitespaces and one or more word chars
      • \s* – zero or more whitespaces
  • John – the name
  • (?:\s*(\w+(?:\s+\w+)*))?\s*(?:$|[]?!,;:).]) – the right-hand part:
    • (?:\s*(\w+(?:\s+\w+)*))? – an optional occurrence of
      • \s* – zero or more whitespaces
      • (\w+(?:\s+\w+)*) – Group 2: one or more word chars and then zero or more occurrences of one or more whitespaces followed by one or more word chars
    • \s* – zero or more whitespaces
    • (?:$|[]?!,;:).]) – end of string or a char from the ]?!,;:). charset.
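
The pattern described above could be wrapped in a small helper, sketched below. The function name split_around_name is hypothetical, and two details are additions not present in the answer: re.escape (in case the name contains regex metacharacters) and \b word boundaries (so that "John" does not match inside "Johnson"). The \b tokens assume the name starts and ends with a word character.

```python
import re

def split_around_name(sentence, name):
    """Return (before, after) around name, or None when nothing matches."""
    escaped = re.escape(name)  # assumption: name may contain regex metacharacters
    pattern = (
        r"(?:^|[?!¿¡,;:(\[.])\s*"        # start of string or an opening delimiter
        r"(?:(\w+(?:\s+\w+)*)\s*)?"      # Group 1: optional words before the name
        rf"\b{escaped}\b"                # the name as a whole word (\b is an addition)
        r"(?:\s*(\w+(?:\s+\w+)*))?"      # Group 2: optional words after the name
        r"\s*(?:$|[\]?!,;:).])"          # end of string or a closing delimiter
    )
    m = re.search(pattern, sentence, re.IGNORECASE)
    if m is None:
        return None
    # Groups that did not take part in the match are returned as "".
    before, after = m.groups("")
    return before, after

print(split_around_name("John can help us, otherwise it will be waiting", "John"))
print(split_around_name("I think Johnson can help us.", "John"))
```

Whether a whole-word match is wanted depends on the data; removing the two \b tokens gives back the behaviour of the pattern as written in the answer.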

See the regex demo.

Answered By: Wiktor Stribiżew