Match words using this regex pattern, only if these words do not appear within a list of substrings

Question:

import re

input_text = "a áshgdhSdah saasas a corrEr, assasass a saltó sasass a sdssaa" #example

list_verbs_in_this_input = ["serías", "serían", "sería", "ser", "es", "corré", "corrió", "corría", "correr", "saltó", "salta", "salto", "circularías", "circularía", "circulando", "circula", "consiste", "consistían", "consistía", "consistió", "ladró", "ladrando", "ladra", "visualizar", "ver", "vieron", "vió"]

noun_pattern = r"((?:\w+))" # pattern that doesn't tolerate whitespace in the middle

imput_text = re.sub(r"(?:^|\s+)a\s+" + noun_pattern, 
                    "\(\g<0>\)", 
                    input_text, re.IGNORECASE)

print(repr(input_text)) # --> output

I need the regex to identify and replace a substring with no whitespace in it, r"((?:\w+))", when it is at the beginning of the line or preceded by "a", r"(?:^|\s+)a\s+", but only if that substring does not match any of the strings in the list list_verbs_in_this_input, or a dot ".". The exclusion should use a regex pattern similar to this one: re.compile(r"(?:" + rf"({'|'.join(list_verbs_in_this_input)})" + r"|[.;\n]|$)", flags = re.IGNORECASE)

And the correct output should look like this:

'(áshgdhSdah) saasas a corrEr, assasass a saltó sasass (sdssaa)'

Note that the substrings "a corrEr" and "a saltó" were not modified, since they contain words that are in the list_verbs_in_this_input list.

Asked By: Matt095

||

Answers:

To exclude some words, you can use a negative lookahead assertion at the start of the word you’re about to match.
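A minimal sketch of that idea, using a made-up two-word list (forbidden, loose and strict are illustrative names, not part of the question's code). It also shows why the lookahead should end with \b: without it, any word that merely starts with a forbidden word is rejected too.

```python
import re

# Hypothetical mini-list, only for illustration
forbidden = ["ser", "ver"]
alternatives = "|".join(map(re.escape, forbidden))

# Lookahead without a trailing \b: rejects every word that STARTS with a forbidden word
loose = re.compile(rf"\b(?!(?:{alternatives}))\w+")
# Lookahead with a trailing \b: rejects only the exact forbidden words
strict = re.compile(rf"\b(?!(?:{alternatives})\b)\w+")

words = "ser serpiente ver verde gato"
print(loose.findall(words))   # ['gato']
print(strict.findall(words))  # ['serpiente', 'verde', 'gato']
```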

A few things to correct:

  • re.sub takes the flags as its 5th argument, not the 4th (the 4th positional argument is count), so pass them by keyword: flags=re.IGNORECASE
  • "\(" is not an escape sequence, so you should just do "(\g<0>)" (ideally as a raw string, r"(\g<0>)") without "escaping" the parentheses; they have no special meaning in the replacement string.
  • r"(?:^|\s+)a\s+" will always require the a to be there. From your description I understood that the a could be optional when the word is at the start of a line, so r"(?:\ba\s|^)\s*"
  • In the regex that should match the forbidden words, make sure to require that the word ends right after the match, so add \b in the pattern.
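The first point is easy to miss because the call does not fail. A small sketch with a made-up string: re.IGNORECASE has the integer value 2, so when it is passed in the 4th position it is taken as count=2 and the match stays case-sensitive. Newer Python versions emit a DeprecationWarning for a positional count, which is silenced here only to keep the demonstration quiet.

```python
import re
import warnings

text = "A a A a"

# Flag passed positionally: it lands in the `count` parameter (count=2, no flags)
with warnings.catch_warnings():
    warnings.simplefilter("ignore", DeprecationWarning)
    wrong = re.sub(r"a", "x", text, re.IGNORECASE)

# Flag passed by keyword: every "a"/"A" is replaced
right = re.sub(r"a", "x", text, flags=re.IGNORECASE)

print(wrong)  # A x A x
print(right)  # x x x x
```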

Here is what you could do:

import re

input_text = "a áshgdhSdah saasas a corrEr, assasass a saltó sasass a sdssaa" #example

list_verbs_in_this_input = ["serías", "serían", "sería", "ser", "es", "corré", "corrió", "corría", "correr", "saltó", "salta", "salto", "circularías", "circularía", "circulando", "circula", "consiste", "consistían", "consistía", "consistió", "ladró", "ladrando", "ladra", "visualizar", "ver", "vieron", "vió"]

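noun_pattern = r"\w+"
# negative lookahead: the next word must not be one of the listed verbs
exclude = rf"(?!\b(?:{'|'.join(list_verbs_in_this_input)})\b)"
# the article "a" followed by whitespace, or the start of the string
article = r"(?:\ba\s|^)\s*"
regex = article + exclude + noun_pattern

input_text = re.sub(regex, r"(\g<0>)", input_text, flags=re.I|re.U)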

print(repr(input_text))
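
Since \g<0> refers to the whole match, the replacement wraps the article "a" together with the word. If the article should be dropped, as in the expected output shown in the question, one possible variation is to capture only the word and reference that group instead. This is a sketch under that assumption, with a shortened verb list (verbs) and an added re.escape call that are not in the code above:

```python
import re

input_text = "a áshgdhSdah saasas a corrEr, assasass a saltó sasass a sdssaa"

# Shortened, illustrative list; the full list from the question works the same way
verbs = ["ser", "correr", "saltó"]

# re.escape guards against regex metacharacters in the list entries
exclude = rf"(?!\b(?:{'|'.join(map(re.escape, verbs))})\b)"
article = r"(?:\ba\s|^)\s*"
noun_pattern = r"(\w+)"  # capture only the word, not the article

result = re.sub(article + exclude + noun_pattern, r"(\1)", input_text,
                flags=re.IGNORECASE)

print(repr(result))
# '(áshgdhSdah) saasas a corrEr, assasass a saltó sasass (sdssaa)'
```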
Answered By: trincot