Set regex pattern that concatenates one capture group or another depending on whether or not the input string starts with certain symbols
Question:
import re
word = ""
input_text = "Creo que July no se trata de un nombre" #example 1, should match with the Case 00
#input_text = "Creo que July Moore no se trata de un nombre" #example 2, should not match any case
#input_text = "Efectivamente esa es una lista de nombres. July Moore no se trata de un nombre" #example 3, should match with the Case 01
#input_text = "July Moore no se trata de un nombre" #example 4, should match with the Case 01
name_capture_pattern_00 = r"((?:\w+))?" # does not tolerate whitespace in the middle
#name_capture_pattern_01 = r"((?:\w\s*)+)"
name_capture_pattern_01 = r"(^[A-Z](?:\w\s*)+)" # tolerates spaces but forces the capture to begin with a capital letter
#Case 00
regex_pattern_00 = name_capture_pattern_00 + r"\s*(?i:no)\s*(?i:se\s*tratar[íi]a\s*de\s*un\s*nombre|se\s*trata\s*de\s*un\s*nombre|(?:ser[íi]a|es)\s*un\s*nombre)"
#Case 01
regex_pattern_01 = r"(?:^|[.;,]\s*)" + name_capture_pattern_01 + r"\s*(?i:no)\s*(?i:se\s*tratar[íi]a\s*de\s*un\s*nombre|se\s*trata\s*de\s*un\s*nombre|(?:ser[íi]a|es)\s*un\s*nombre)"
#Taking the regex pattern (case 00 or case 01), search the string and then try to extract the substring of interest using capturing groups.
n0 = re.search(regex_pattern_00, input_text)
if n0 and word == "":
    word, = n0.groups()
    word = word.strip()
    print(repr(word)) # --> print the substring that I captured with the capturing group
n1 = re.search(regex_pattern_01, input_text)
if n1 and word == "":
    word, = n1.groups()
    word = word.strip()
    print(repr(word)) # --> print the substring that I captured with the capturing group
If the pattern is preceded by a .\s* , a ,\s* , a ;\s* , or if it is simply at the beginning of the input string, then use this capture pattern: name_capture_pattern_01 = r"((?:\w\s*)+)?". If that is not the case, use this other capture pattern: name_capture_pattern_00 = r"((?:\w+))?"
I think that in case 00 something like (?:(?<=\s)|^) should be added at the beginning of the pattern.
That way you would get these 2 possible resulting patterns after concatenation, where perhaps an or condition (|) can be set inside the search pattern:
In Case 00 …
(?:.|;|,) or the start of the string + ((?:\w\s*)+)? + r"\s*(?i:no)\s*(?i:se\s*tratar[íi]a\s*de\s*un\s*nombre|se\s*trata\s*de\s*un\s*nombre|(?:ser[íi]a|es)\s*un\s*nombre)"
In the other case (Case 01) …
((?:\w+))?? + r"\s*(?i:no)\s*(?i:se\s*tratar[íi]a\s*de\s*un\s*nombre|se\s*trata\s*de\s*un\s*nombre|(?:ser[íi]a|es)\s*un\s*nombre)"
But in both cases (Case 00 or Case 01, depending on what the program identifies) it should match the pattern and extract the capturing group to store it in the variable called word.
And the correct output for each of these cases would be the capture group that should be obtained and printed in each of these examples:
'July' #for the example 1
'' #for the example 2
'July Moore' #for the example 3
'July Moore' #for the example 4
EDIT CODE:
Although the regex patterns in this code appear to be well formed, it fails by returning only the last part of the name, in this case "Moore", and not the full name "July Moore".
import re
#Here are 2 examples where you can see this "capture error"
input_text = "HghD djkf ; July Moore no se trata de un nombre"
input_text = "July Moore no se trata de un nombre"
word = ""
#name_capture_pattern_01 = r"((?:\w\s*)+)"
name_capture_pattern_01 = r"([A-Z][a-z]+(?:\s*[A-Z][a-z]+)*)"
#Case 01
regex_pattern_01 = r"(?:^|[.;,]\s*)" + name_capture_pattern_01 + r"\s*(?i:no)\s*(?i:se\s*tratar[íi]a\s*de\s*un\s*nombre|se\s*trata\s*de\s*un\s*nombre|(?:ser[íi]a|es)\s*un\s*nombre)"
n1 = re.search(regex_pattern_01, input_text)
if n1 and word == "":
    word, = n1.groups()
    word = word.strip()
    print(repr(word))
In both examples, since the input starts with (?:^|[.;,]\s*) followed by a capitalized word matching ([A-Z][a-z]+(?:\s*[A-Z][a-z]+)*), it should print the full name July Moore in the console. Curiously, using this pattern makes it impossible for me to capture a complete name under the conditions established by the search pattern.
Answers:
If I understood correctly, you want to exclude cases where both of the following are true:
- The name consists of more than one word; AND
- The name does not occur at the start of a sentence
You could use just one regex and then inspect the match to decide whether the above condition occurs.
Here is a script I tested with:
import re
texts = [
    # Name is NOT at start of sentence, Name has SINGLE word:
    "Creo que July no se trata de un nombre",
    # Name is NOT at start of sentence, Name has MULTIPLE words:
    "Creo que July Moore no se trata de un nombre",
    # Name is at START of sentence, Name has MULTIPLE words:
    "Efectivamente esa es una lista de nombres. July Moore no se trata de un nombre",
    "July Moore Donald no se trata de un nombre",
    # Name is at START of sentence, Name has SINGLE word:
    "July no se trata de un nombre",
]
for input_text in texts:
    regex = r"(^|[.;,]\s*)?([A-Z][a-z]+(\s*[A-Z][a-z]+)*)\s*(?i:no)\s*(?i:se\s*tratar[íi]a\s*de|se\s*trata\s*de|(?:ser[íi]a|es))\s*un\s*nombre"
    print("input:", input_text)
    for match in re.finditer(regex, input_text):
        word = ""
        # match[1] is not None => match is at start of a sentence.
        # match[3] is not None => match has name with more than one word.
        if match[1] is not None or not match[3]:
            word = match[2]
        print("  match:", repr(word) if word else "(no match)")
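As a rough sketch of how this check could be packaged for reuse, the snippet below wraps the same regex in a helper and runs it on the four examples from the question. The function name extract_name and the decision to return only the first accepted match are assumptions made for illustration; they are not part of the script tested above.

```python
import re

# Same pattern as in the script above:
#   group 1 = optional sentence-start marker (start of string or . ; , plus spaces)
#   group 2 = the name
#   group 3 = last additional name word (None when the name is a single word)
REGEX = re.compile(
    r"(^|[.;,]\s*)?([A-Z][a-z]+(\s*[A-Z][a-z]+)*)"
    r"\s*(?i:no)\s*(?i:se\s*tratar[íi]a\s*de|se\s*trata\s*de|(?:ser[íi]a|es))\s*un\s*nombre"
)

def extract_name(text):
    """Hypothetical helper: return the first accepted name, or "" if there is none."""
    for match in REGEX.finditer(text):
        # Accept a name that starts a sentence, or any single-word name.
        if match[1] is not None or not match[3]:
            return match[2]
    return ""

examples = [
    "Creo que July no se trata de un nombre",          # example 1
    "Creo que July Moore no se trata de un nombre",    # example 2
    "Efectivamente esa es una lista de nombres. July Moore no se trata de un nombre",  # example 3
    "July Moore no se trata de un nombre",             # example 4
]
for text in examples:
    print(repr(extract_name(text)))
```

Returning "" for a rejected match mirrors the empty word variable in the question, so the caller can keep the existing `word == ""` checks.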
Notes:
- I used finditer as in theory there might be more than one match in an input string.
- The use of \s* instead of \s+ is odd, but in comments you indicated that this is intended as you want to capture cases where some space separation is left out.
- Names can look more complex than just [A-Z][a-z]+. Some names include hyphens, apostrophes or other characters, not to mention letters from other alphabets. The letter following a hyphen might be upper or lower case… etc.
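To illustrate the last note, here is a minimal sketch of a looser name pattern that accepts Spanish accented letters, hyphens and apostrophes. The character classes, the constant names and the looks_like_name helper are all assumptions for illustration; they cover only a small set of cases and are not a general solution for names.

```python
import re

# Assumption: basic Latin letters plus Spanish accented letters only.
UPPER = "A-ZÁÉÍÓÚÜÑ"
LOWER = "a-záéíóúüñ"

# One name word: a capital letter, optional lowercase letters, then optional
# parts joined by a hyphen or apostrophe whose letters may be either case.
NAME_WORD = rf"[{UPPER}][{LOWER}]*(?:[-'][{UPPER}{LOWER}]+)*"
# Several name words, separated by optional whitespace as in the regex above.
NAME = rf"{NAME_WORD}(?:\s*{NAME_WORD})*"

def looks_like_name(candidate):
    """Hypothetical check: True if the whole string fits the looser name pattern."""
    return re.fullmatch(NAME, candidate) is not None

for candidate in ["July Moore", "María José", "Jean-Pierre", "O'Brien", "july"]:
    print(candidate, looks_like_name(candidate))
```

NAME_WORD could replace each [A-Z][a-z]+ in the main regex, though a looser pattern also raises the chance of capturing capitalized words that are not names.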