How to remove consecutively repeated strings only if these strings are between "((VERB)" and ")"?

Question:

import re

input_text = "((VERB) saltar a nosotros a nosotros) a nosotros a nosotros a nosotros ((VERB)correr a nosotros) sdsdsd ((VERB) saltar a nosotros a nosotros)"

input_text = re.sub(r"\(\(VERB\)" + r"((?:\w\s*)+)" + r"\)",
                    lambda x: re.sub(r"(a nosotros)\s*\1+", r"\1", x.group()),
                    input_text)

print(input_text) # --> output

In this code I was trying to remove consecutively repeated "a nosotros" strings only if these strings are between "((VERB)" and ")", that is, only inside the text matched by the pattern r"\(\(VERB\)" + r"((?:\w\s*)+)" + r"\)"

This is the output you should be getting when running this script:

"((VERB) saltar a nosotros) a nosotros a nosotros a nosotros ((VERB)correr a nosotros) sdsdsd ((VERB) saltar a nosotros)"

However, the code that I have placed in the question does not produce this output. What should I modify?

Asked By: Matt095

Answers:

You can use

input_text = re.sub(r"\(\(VERB\)[\w\s]*\)", lambda x: re.sub(r"\ba nosotros(?:\s+a nosotros)*\b", "a nosotros", x.group()), input_text)

The main pattern is \(\(VERB\)[\w\s]*\); it matches ((VERB), then zero or more word or whitespace chars, and then a ) char.

The re.sub(r"\ba nosotros(?:\s+a nosotros)*\b", "a nosotros", x.group()) part collapses every run of consecutive whole-word a nosotros occurrences inside the match into a single a nosotros.
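
As a self-contained check of the approach described above, here is a minimal sketch that wraps the two nested substitutions in a function and runs it on the sample input from the question. The function name collapse_in_verb_groups and its phrase parameter are illustrative assumptions, not part of the original answer.

```python
import re

def collapse_in_verb_groups(text, phrase="a nosotros"):
    """Collapse runs of `phrase` to one occurrence, only inside ((VERB) ... )."""
    p = re.escape(phrase)
    # One or more whitespace-separated repetitions of the phrase, as whole words.
    run = re.compile(rf"\b{p}(?:\s+{p})*\b")
    # "((VERB)" + word/whitespace characters + ")".
    group = re.compile(r"\(\(VERB\)[\w\s]*\)")
    # The inner substitution only ever sees the text of one ((VERB) ... ) match.
    return group.sub(lambda m: run.sub(lambda _: phrase, m.group()), text)

input_text = "((VERB) saltar a nosotros a nosotros) a nosotros a nosotros a nosotros ((VERB)correr a nosotros) sdsdsd ((VERB) saltar a nosotros a nosotros)"

result = collapse_in_verb_groups(input_text)
print(result)
```

The repetitions outside the parentheses are left alone because the outer pattern never matches them, so the inner substitution is never applied there.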

Answered By: Wiktor Stribiżew

Python’s third-party regex module (developed by Matthew Barnett) supports the \K directive, which resets the starting point of the reported match to the current position in the string and discards any previously consumed characters from the final match. With that directive, one can simply replace the matches with empty strings.

The code for doing that is as follows.

import regex

rgx = r"\(\(VERB\)(?:(?!\ba nosotros\b|\)).)*\K\ba nosotros\b(?=[^)]*\ba nosotros\b)"

txt_in = "((VERB) saltar a nosotros a nosotros) a nosotros a nosotros a nosotros ((VERB)correr a nosotros) sdsdsd ((VERB) saltar a nosotros a nosotros)"

txt_out = regex.sub(rgx, '', txt_in)

print(txt_out)
-> ((VERB) saltar  a nosotros) a nosotros a nosotros a nosotros ((VERB)correr a nosotros) sdsdsd ((VERB) saltar  a nosotros)

The regular expression can be broken down as follows.

\(\(VERB\)          # match literal
(?:                 # begin non-capture group
  (?!               # begin negative lookahead
    \ba nosotros\b  # match literal surrounded by word boundaries
    |               # or
    \)              # match literal
  )                 # end of negative lookahead
  .                 # match any character other than a line terminator
)*                  # end non-capture group and execute zero or more times
\K                  # see the first paragraph of this answer
\ba nosotros\b      # match literal surrounded by word boundaries
(?=                 # begin positive lookahead
  [^)]*             # match any characters other than ')' zero or more times
  \ba nosotros\b    # match literal surrounded by word boundaries
)                   # end positive lookahead
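
The standard-library re module does not support \K, so a reader without the regex module could approximate the same effect by capturing the prefix and putting it back in the replacement. The following is a sketch under that assumption, not the answer's own code; the names KEEP_PREFIX and drop_first_duplicate are hypothetical.

```python
import re

KEEP_PREFIX = re.compile(
    # Group 1: the text that \K would have dropped from the reported match.
    r"(\(\(VERB\)(?:(?!\ba nosotros\b|\)).)*)"
    # The occurrence to remove.
    r"\ba nosotros\b"
    # Only if another occurrence follows before the closing ')'.
    r"(?=[^)]*\ba nosotros\b)"
)

def drop_first_duplicate(text):
    # Re-insert group 1 so that only the matched "a nosotros" disappears.
    return KEEP_PREFIX.sub(r"\1", text)

txt_in = "((VERB) saltar a nosotros a nosotros) a nosotros a nosotros a nosotros ((VERB)correr a nosotros) sdsdsd ((VERB) saltar a nosotros a nosotros)"

txt_out = drop_first_duplicate(txt_in)
print(txt_out)
```

As with the \K version, the space that preceded the removed occurrence stays behind, so the result contains a double space. In this sketch each match must begin at "((VERB)", so a single pass removes at most one occurrence per group; a run of three or more would need repeated passes.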

The technique of matching one character at a time with a negative lookahead (here (?:(?!\ba nosotros\b|\)).)) is called the tempered greedy token solution.
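
To see the tempered greedy token in isolation, the following small sketch (using only the standard re module; the sample string and the names TEMPERED and prefixes are illustrative assumptions) captures what the token consumes after each "((VERB)". It stops right before the first whole-word a nosotros or the first ), whichever comes first.

```python
import re

# Consume one character at a time, but only while the current position
# is not the start of "a nosotros" (as a whole word) and not a ')'.
TEMPERED = re.compile(r"\(\(VERB\)((?:(?!\ba nosotros\b|\)).)*)")

sample = "((VERB) saltar a nosotros a nosotros) sdsdsd ((VERB)correr a nosotros) ((VERB) ir)"

prefixes = [m.group(1) for m in TEMPERED.finditer(sample)]
print(prefixes)  # [' saltar ', 'correr ', ' ir']
```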

Answered By: Cary Swoveland