How to remove consecutively repeated strings only if these strings are between "((VERB)" and ")"?
Question:
import re
input_text = "((VERB) saltar a nosotros a nosotros) a nosotros a nosotros a nosotros ((VERB)correr a nosotros) sdsdsd ((VERB) saltar a nosotros a nosotros)"
input_text = re.sub(r"\(\(VERB\)" + r"((?:\w\s*)+)" + r"\)",
                    lambda x: re.sub(r"(a nosotros)\s*\1+", r"\1", x.group()),
                    input_text)
print(input_text) # --> output
In this code I was trying to remove consecutively repeated "a nosotros" strings only when they appear between "((VERB)" and ")", that is, inside the text matched by the pattern r"\(\(VERB\)" + r"((?:\w\s*)+)" + r"\)"
This is the output you should be getting when running this script:
"((VERB) saltar a nosotros) a nosotros a nosotros a nosotros ((VERB)correr a nosotros) sdsdsd ((VERB) saltar a nosotros)"
However, the code in the question does not produce this output. What should I modify?
Answers:
You can use
input_text = re.sub(r"\(\(VERB\)[\w\s]*\)", lambda x: re.sub(r"\ba nosotros(?:\s+a nosotros)*\b", "a nosotros", x.group()), input_text)
The main pattern is \(\(VERB\)[\w\s]*\). It matches ((VERB), then zero or more word or whitespace chars, and then a ) char.
The re.sub(r"\ba nosotros(?:\s+a nosotros)*\b", "a nosotros", x.group()) part collapses every run of consecutive whole-word "a nosotros" occurrences inside the match into a single occurrence.
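As a rough illustration, the same two-step idea can be wrapped in a small helper. This is only a sketch: the function name collapse_repeats and its phrase parameter are made up for this example, and it assumes the phrase starts and ends with a word character so that \b behaves as intended.

```python
import re

def collapse_repeats(text, phrase="a nosotros"):
    """Collapse runs of `phrase` only inside '((VERB) ... )' groups.

    Hypothetical helper for illustration; assumes `phrase` begins and
    ends with a word character so the \\b anchors make sense.
    """
    p = re.escape(phrase)
    # Outer pattern: '((VERB)' + word/whitespace chars + ')'
    outer = re.compile(r"\(\(VERB\)[\w\s]*\)")
    # Inner pattern: one occurrence followed by any number of repeats
    inner = re.compile(rf"\b{p}(?:\s+{p})*\b")
    # Only the text matched by the outer pattern is rewritten
    return outer.sub(lambda m: inner.sub(lambda _: phrase, m.group()), text)

text = ("((VERB) saltar a nosotros a nosotros) a nosotros a nosotros a nosotros "
        "((VERB)correr a nosotros) sdsdsd ((VERB) saltar a nosotros a nosotros)")
print(collapse_repeats(text))
```

Repeats outside the parenthesised groups are left alone because the inner substitution only ever sees the text matched by the outer pattern.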
Python's third-party regex module (developed by Matthew Barnett) supports the \K directive, which resets the starting point of the reported match to the current position in the string and discards any previously consumed characters from the final match. With that directive, one can simply replace each match with an empty string.
The code for doing that is as follows.
import regex
rgx = r"\(\(VERB\)(?:(?!\ba nosotros\b|\)).)*\K\ba nosotros\b(?=[^)]*\ba nosotros\b)"
txt_in = "((VERB) saltar a nosotros a nosotros) a nosotros a nosotros a nosotros ((VERB)correr a nosotros) sdsdsd ((VERB) saltar a nosotros a nosotros)"
txt_out = regex.sub(rgx, '', txt_in)
print(txt_out)
-> ((VERB) saltar a nosotros) a nosotros a nosotros a nosotros ((VERB)correr a nosotros) sdsdsd ((VERB) saltar a nosotros)
The regular expression can be broken down as follows.
\(\(VERB\)          # match literal '((VERB)'
(?:                 # begin non-capture group
  (?!               # begin negative lookahead
    \ba nosotros\b  # match literal surrounded by word boundaries
    |               # or
    \)              # match literal ')'
  )                 # end of negative lookahead
  .                 # match any character other than a line terminator
)*                  # end non-capture group and execute zero or more times
\K                  # reset the match start; see the first paragraph of this answer
\ba nosotros\b      # match literal surrounded by word boundaries
(?=                 # begin positive lookahead
  [^)]*             # match any characters other than ')' zero or more times
  \ba nosotros\b    # match literal surrounded by word boundaries
)                   # end positive lookahead
The technique of matching one character at a time with a negative lookahead (here (?:(?!\ba nosotros\b|\)).)) is called the tempered greedy token solution.
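To see the tempered greedy token in isolation, here is a minimal sketch using only the standard re module, which supports lookaheads (though not \K). The names TEMPERED and text_before_phrase are invented for this example. It captures whatever follows "((VERB)" up to, but not including, the first whole-word "a nosotros" or the closing ")".

```python
import re

# Tempered greedy token: consume one character at a time, but only while
# the upcoming text is neither the whole-word phrase nor a ')'.
TEMPERED = re.compile(r"\(\(VERB\)((?:(?!\ba nosotros\b|\)).)*)")

def text_before_phrase(text):
    """Return, for each '((VERB)' group, the text consumed by the token."""
    return TEMPERED.findall(text)

sample = "((VERB) saltar a nosotros a nosotros) x ((VERB)correr a nosotros)"
print(text_before_phrase(sample))  # [' saltar ', 'correr ']
```

The "a" inside "saltar" does not stop the scan, because \b requires a word boundary before it.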