Set regex pattern that concatenates one capture group or another depending on whether or not the input string starts with certain symbols

Question:

import re
word = ""

input_text = "Creo que July no se trata de un nombre" #example 1, should match with the Case 00
#input_text = "Creo que July Moore no se trata de un nombre" #example 2, should not match any case
#input_text = "Efectivamente esa es una lista de nombres. July Moore no se trata de un nombre" #example 3, should match with the Case 01
#input_text = "July Moore no se trata de un nombre" #example 4, should match with the Case 01

name_capture_pattern_00 = r"((?:\w+))?"         # does not tolerate whitespace in middle

#name_capture_pattern_01 = r"((?:\w\s*)+)"
name_capture_pattern_01 = r"(^[A-Z](?:\w\s*)+)"      # tolerates that there are spaces but forces it to be a word that begins with a capital letter

#Case 00
regex_pattern_00 = name_capture_pattern_00 + r"\s*(?i:no)\s*(?i:se\s*tratar[íi]a\s*de\s*un\s*nombre|se\s*trata\s*de\s*un\s*nombre|(?:ser[íi]a|es)\s*un\s*nombre)"
#Case 01
regex_pattern_01 = r"(?:^|[.;,]\s*)" + name_capture_pattern_01 + r"\s*(?i:no)\s*(?i:se\s*tratar[íi]a\s*de\s*un\s*nombre|se\s*trata\s*de\s*un\s*nombre|(?:ser[íi]a|es)\s*un\s*nombre)"

#Taking the regex pattern(case 00 or case 01), it will search the string and then try to extract the substring of interest using capturing groups.

n0 = re.search(regex_pattern_00, input_text)
if n0 and word == "":
    word, = n0.groups()
    word = word.strip()

print(repr(word)) # --> print the substring that I captured with the capturing group

n1 = re.search(regex_pattern_01, input_text)
if n1 and word == "":
    word, = n1.groups()
    word = word.strip()

print(repr(word)) # --> print the substring that I captured with the capturing group

If the pattern is preceded by .\s* , ,\s* or ;\s* , or if it sits at the very beginning of the input string, then use the capture pattern name_capture_pattern_01 = r"((?:\w\s*)+)?". If that is not the case, use the other capture pattern, name_capture_pattern_00 = r"((?:\w+))?"

I think that in case 00 something like (?:(?<=\s)|^) should be added at the beginning of the pattern.

That way you would get these 2 possible resulting patterns after concatenating, where perhaps an "or" condition | can be set inside the search pattern:

In Case 00

(?:\.|;|,) or the start of the string + ((?:\w\s*)+)? + r"\s*(?i:no)\s*(?i:se\s*tratar[íi]a\s*de\s*un\s*nombre|se\s*trata\s*de\s*un\s*nombre|(?:ser[íi]a|es)\s*un\s*nombre)"

In the other case (Case 01)…

((?:\w+))?? + r"\s*(?i:no)\s*(?i:se\s*tratar[íi]a\s*de\s*un\s*nombre|se\s*trata\s*de\s*un\s*nombre|(?:ser[íi]a|es)\s*un\s*nombre)"

But in both cases (Case 00 or Case 01, depending on what the program identifies) it should match the pattern and extract the capturing group to store it in the variable called word.

And the correct output for each of these cases would be the capture group that should be obtained and printed in each of these examples:

'July'         #for the example 1
''             #for the example 2
'July Moore'   #for the example 3
'July Moore'   #for the example 4

EDIT CODE:

This code, although it appears that the regex patterns are well established, fails by returning as output only the last part of the name, in this case "Moore", and not the full name "July Moore"

import re

#Here are 2 examples where you can see this "capture error" (uncomment one at a time)
#input_text = "HghD djkf ; July Moore no se trata de un nombre"
input_text = "July Moore no se trata de un nombre"

word = ""

#name_capture_pattern_01 = r"((?:\w\s*)+)"
name_capture_pattern_01 = r"([A-Z][a-z]+(?:\s*[A-Z][a-z]+)*)"

#Case 01
regex_pattern_01 = r"(?:^|[.;,]\s*)" + name_capture_pattern_01 + r"\s*(?i:no)\s*(?i:se\s*tratar[íi]a\s*de\s*un\s*nombre|se\s*trata\s*de\s*un\s*nombre|(?:ser[íi]a|es)\s*un\s*nombre)"

n1 = re.search(regex_pattern_01, input_text)
if n1 and word == "":
    word, = n1.groups()
    word = word.strip()

print(repr(word))

In both examples, since the input satisfies the prefix (?:^|[.;,]\s*) and the name starts with a capital letter as required by the pattern ([A-Z][a-z]+(?:\s*[A-Z][a-z]+)*), it should print the full name, July Moore, in the console. It's quite curious, but using this pattern makes it impossible for me to capture a complete name under the conditions established by the search pattern.

Answers:

If I understood correctly, you want to exclude cases where both of the following are true:

  • The name consists of more than one word; AND
  • The name does not occur at the start of a sentence

You could use just one regex and then inspect the match to decide whether the above condition occurs.
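As a minimal sketch of what "inspecting the match" can rely on in Python's re module: an optional group that did not take part in the match is reported as None, while a group that matched the empty string (for example via ^) is reported as ''. The pattern and the sample strings below are simplified stand-ins chosen for illustration, not the full pattern.

```python
import re

# Simplified illustration: group 1 is an optional sentence-start marker,
# group 2 is a single capitalized word.
pattern = re.compile(r"(^|[.;,]\s*)?([A-Z][a-z]+)")

at_start = pattern.search("July llega")           # name at start of string
after_comma = pattern.search("hola, July llega")  # name after a comma
mid_sentence = pattern.search("hola July llega")  # name in mid-sentence

print(repr(at_start[1]))      # '' (group matched the empty string at ^)
print(repr(after_comma[1]))   # ', '
print(repr(mid_sentence[1]))  # None (group was skipped)
```

This is why a test such as "is not None" can distinguish a name at the start of a sentence from one in the middle of it, even though '' is falsy.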

Here is a script I tested with:

import re

texts = [
    # Name is NOT at start of sentence, Name has SINGLE word: 
    "Creo que July no se trata de un nombre", 
    # Name is NOT at start of sentence, Name has MULTIPLE words: 
    "Creo que July Moore no se trata de un nombre", 
    # Name is at START of sentence, Name has MULTIPLE words: 
    "Efectivamente esa es una lista de nombres. July Moore no se trata de un nombre", 
    "July Moore Donald no se trata de un nombre",
    # Name is at START of sentence, Name has SINGLE word: 
    "July no se trata de un nombre",
]

for input_text in texts:
    regex = r"(^|[.;,]\s*)?([A-Z][a-z]+(\s*[A-Z][a-z]+)*)\s*(?i:no)\s*(?i:se\s*tratar[íi]a\s*de|se\s*trata\s*de|(?:ser[íi]a|es))\s*un\s*nombre"
    
    print("input:", input_text)
    for match in re.finditer(regex, input_text):
        word = ""
        # match[1] is not None => match is at start of a sentence.
        # match[3] is not None => match has name with more than one word.
        if match[1] is not None or not match[3]:
            word = match[2]
        print("    match:", repr(word) if word else "(no match)")
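The same logic could be wrapped in small helpers so that the result can be compared with the outputs expected in the question. This is only a sketch: the names NAME_REGEX, extract_names and extract_name are hypothetical and do not appear in the original script.

```python
import re

# Same pattern as in the script above, compiled once.
# Group 1: optional sentence-start marker (start of string, or . ; , plus spaces)
# Group 2: the candidate name (one or more capitalized words)
# Group 3: set only when the name has more than one word
NAME_REGEX = re.compile(
    r"(^|[.;,]\s*)?([A-Z][a-z]+(\s*[A-Z][a-z]+)*)"
    r"\s*(?i:no)\s*(?i:se\s*tratar[íi]a\s*de|se\s*trata\s*de|(?:ser[íi]a|es))"
    r"\s*un\s*nombre"
)

def extract_names(input_text):
    """Return every accepted name found in input_text (hypothetical helper)."""
    return [
        match[2]
        for match in NAME_REGEX.finditer(input_text)
        # Accept when at the start of a sentence, or when the name is one word.
        if match[1] is not None or not match[3]
    ]

def extract_name(input_text):
    """Return the first accepted name, or '' when none qualifies."""
    names = extract_names(input_text)
    return names[0] if names else ""

print(repr(extract_name("Creo que July no se trata de un nombre")))        # 'July'
print(repr(extract_name("Creo que July Moore no se trata de un nombre")))  # ''
print(repr(extract_name("July Moore no se trata de un nombre")))           # 'July Moore'
```

Returning a list from extract_names keeps the finditer behaviour, so an input with several qualifying sentences yields several names.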

Notes:

  • I used finditer as in theory there might be more than one match in an input string
  • The use of \s* instead of \s+ is odd, but in comments you indicated that this is intended as you want to capture cases where some space separation is left out.
  • Names can look more complex than just [A-Z][a-z]+. Some names include hyphens, apostrophes or other characters, not to mention letters from other alphabets. The letter following a hyphen might be upper or lower case… etc.
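To illustrate the last note, here is one possible way to loosen the name pattern so that it accepts hyphens, apostrophes and some accented letters. It is a sketch under stated assumptions, not a complete solution: the character sets UPPER and LOWER, and the helper looks_like_name, are hypothetical and cover only the accented letters common in Spanish. The standard re module has no \p{Lu} class, so the letters are listed explicitly.

```python
import re

# Assumed character sets: ASCII plus accented letters common in Spanish.
UPPER = "A-ZÁÉÍÓÚÜÑ"
LOWER = "a-záéíóúüñ"

# One name word: a capital, optional lowercase letters, then optional parts
# joined by a hyphen or apostrophe ("Jean-Luc", "O'Brien", "Anne-marie").
NAME_WORD = rf"[{UPPER}][{LOWER}]*(?:[-'][{UPPER}{LOWER}][{LOWER}]*)*"

# A full name: name words separated by optional whitespace (\s* as above).
NAME = rf"{NAME_WORD}(?:\s*{NAME_WORD})*"

def looks_like_name(text):
    """Return True when the whole text fits the loosened name pattern."""
    return re.fullmatch(NAME, text) is not None

for candidate in ["July Moore", "Jean-Luc Picard", "O'Brien", "José Núñez", "july"]:
    print(candidate, looks_like_name(candidate))
```

Because lowercase letters are optional after the capital, a run of capitals such as an acronym is also accepted, which may or may not be wanted. NAME could replace the [A-Z][a-z]+(\s*[A-Z][a-z]+)* part of the pattern, provided a capturing group is kept around the repeated words so that the multi-word check still works.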
Answered By: trincot