Match words using this regex pattern, only if these words do not appear within a list of substrings
Question:
import re
input_text = "a áshgdhSdah saasas a corrEr, assasass a saltó sasass a sdssaa" #example
list_verbs_in_this_input = ["serías", "serían", "sería", "ser", "es", "corré", "corrió", "corría", "correr", "saltó", "salta", "salto", "circularías", "circularía", "circulando", "circula", "consiste", "consistían", "consistía", "consistió", "ladró", "ladrando", "ladra", "visualizar", "ver", "vieron", "vió"]
noun_pattern = r"((?:\w+))" # pattern that doesn't tolerate whitespace in the middle
input_text = re.sub(r"(?:^|\s+)a\s+" + noun_pattern,
                    "\(\g<0>\)",
                    input_text, re.IGNORECASE)
print(repr(input_text)) # --> output
I need the regex to identify and replace a substring containing no whitespace, "((?:\w+))", when it is at the beginning of the line or preceded by "a", "(?:^|\s+)a\s+", only if "((?:\w+))" does not match any of the strings inside the list list_verbs_in_this_input or a dot ".", using a regex pattern similar to this: re.compile(r"(?:" + rf"({'|'.join(list_verbs_in_this_input)})" + r"|[.;\n]|$)", flags = re.IGNORECASE)
And the correct output should look like this:
'(áshgdhSdah) saasas a corrEr, assasass a saltó sasass (sdssaa)'
Note that the substrings "a corrEr" and "a saltó" were not modified, since they contain words that are in the list_verbs_in_this_input list.
Answers:
To exclude some words, use a negative lookahead assertion at the start of the word you're about to match.
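As a minimal sketch of that idea (the `forbidden` list and the sample sentence are made up for illustration, not taken from the question), a negative lookahead placed at the start of a word rejects only whole forbidden words, provided the lookahead itself ends with `\b`:

```python
import re

# Hypothetical short list of words to exclude (illustration only)
forbidden = ["es", "ver"]

# \b anchors at the start of a word; the negative lookahead rejects the word
# only if a forbidden entry is followed by a word boundary (a whole-word hit)
pattern = re.compile(rf"\b(?!(?:{'|'.join(forbidden)})\b)\w+")

matches = pattern.findall("es escuela ver verde")
print(matches)  # ['escuela', 'verde']
```

Without the trailing `\b` inside the lookahead, "escuela" and "verde" would be rejected too, because they merely start with a forbidden word.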
A few things to correct:
- re.sub takes the flags as its 5th argument, not the 4th (the 4th is count), so pass them as flags=...
- "\(" is not an escape sequence, so you should just do "(\g<0>)" without "escaping" the parentheses: they have no special meaning in that string.
- r"(?:^|\s+)a\s+" will always require the a to be there. From your description I understood that the a could be optional when the word is at the start of a line, so use r"(?:\ba\s|^)\s*"
- In the regex that should match the forbidden words, make sure to require that the word ends right after the match, so add \b in the pattern.
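To illustrate the first point, here is a small sketch (the sample string is invented) of what happens when a flag lands in the 4th position of re.sub. That slot is `count`, and `re.IGNORECASE` has the integer value 2, so the call is treated as "replace at most two matches, case-sensitively". The example passes `count=` explicitly to show the equivalent behaviour:

```python
import re

text = "AaAa aaa"  # invented sample string

# Signature: re.sub(pattern, repl, string, count=0, flags=0)
# A flag passed in the 4th positional slot is interpreted as count.
as_count = re.sub("a", "x", text, count=int(re.IGNORECASE))  # same as count=2
as_flags = re.sub("a", "x", text, flags=re.IGNORECASE)

print(as_count)  # AxAx aaa
print(as_flags)  # xxxx xxx
```

Passing flags by keyword avoids the mix-up entirely.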
Here is what you could do:
import re
input_text = "a áshgdhSdah saasas a corrEr, assasass a saltó sasass a sdssaa" #example
list_verbs_in_this_input = ["serías", "serían", "sería", "ser", "es", "corré", "corrió", "corría", "correr", "saltó", "salta", "salto", "circularías", "circularía", "circulando", "circula", "consiste", "consistían", "consistía", "consistió", "ladró", "ladrando", "ladra", "visualizar", "ver", "vieron", "vió"]
noun_pattern = r"\w+"
exclude = rf"(?!\b(?:{'|'.join(list_verbs_in_this_input)})\b)"
article = r"(?:\ba\s|^)\s*"
regex = article + exclude + noun_pattern
input_text = re.sub(regex, r"(\g<0>)", input_text, flags=re.I|re.U)  # raw string keeps \g intact
print(repr(input_text))
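Note that \g<0> refers to the whole match, so the article a ends up inside the parentheses. If the goal is exactly the output shown in the question, one possible variant (a sketch, not part of the answer above) is to capture only the noun and reference that group in the replacement. The names `alternation`, `pattern` and `result` are introduced here for illustration, and `re.escape` is added as a precaution in case a list entry ever contains regex metacharacters:

```python
import re

input_text = "a áshgdhSdah saasas a corrEr, assasass a saltó sasass a sdssaa"
list_verbs_in_this_input = ["serías", "serían", "sería", "ser", "es", "corré", "corrió", "corría", "correr", "saltó", "salta", "salto", "circularías", "circularía", "circulando", "circula", "consiste", "consistían", "consistía", "consistió", "ladró", "ladrando", "ladra", "visualizar", "ver", "vieron", "vió"]

# Build the alternation of forbidden words, escaping each entry
alternation = "|".join(map(re.escape, list_verbs_in_this_input))
exclude = rf"(?!\b(?:{alternation})\b)"   # reject whole forbidden words only
article = r"(?:\ba\s+|^\s*)"              # "a" plus whitespace, or start of line

# Only the noun is captured, so the article can be dropped in the replacement
pattern = re.compile(article + exclude + r"(\w+)", flags=re.IGNORECASE)

result = pattern.sub(r"(\1)", input_text)
print(repr(result))
# '(áshgdhSdah) saasas a corrEr, assasass a saltó sasass (sdssaa)'
```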