How to find nested patterns within a string, and merge them into one using a regex reordering of the string?

Question:

import re

#example input string:
input_text = "here ((PERS)the ((PERS)Andys) ) ((PERS)assása asas) ((VERB)asas (asas)) ((PERS)saasas ((PERS)Asasas ((PERS)bbbg gg)))"

def remove_nested_pers(match):
    # match is a re.Match object representing the nested pattern, and I want to remove it
    nested_text = match.group(1)
    
    # recursively remove nested patterns
    nested_text = re.sub(r"\(\(PERS\)(.*?)\)", lambda m: m.group(1), nested_text)
    #nested_text = re.sub(r"\(\(PERS\)((?:\w\s*)+)\)", lambda m: m.group(1), nested_text)
    
    # replace nested pattern with cleaned text
    return nested_text


# recursively remove nested PERS patterns
input_text = re.sub(r"\(\(PERS\)(.*?)\)", remove_nested_pers, input_text)

print(input_text) # --> output

I need to remove every ((PERS)something_1) that is nested inside another ((PERS)something_2). For example, ((PERS)something_1 ((PERS)something_2)) should become ((PERS)something_1 something_2)

Or, for example, ((PERS)something_1 ((PERS)something_2 ((PERS)something_3)) ((PERS)something_4)) should become ((PERS)something_1 something_2 something_3 something_4)

In this way, encapsulations within other encapsulations would be avoided.

I’ve used the (.*?) capturing group, which lazily matches any characters (newlines too, but only if re.DOTALL is set) between the previous pattern and the next one. Perhaps a pattern like ((?:\w\s*)+) would be better, since it avoids capturing parts of the ((PERS) ) sequence itself. Either way, this code fails to join the content of the nested patterns correctly and eliminates necessary parts.
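A small check, on a shortened sample string chosen only for illustration, suggests where the lazy group goes wrong: the first ")" it meets is the one that closes the inner (PERS) tag, so the match ends in the middle of the nested opener.

```python
import re

# The pattern from the script above, with its backslash escapes.
pattern = r"\(\(PERS\)(.*?)\)"

# Shortened, made-up sample with one level of nesting.
sample = "((PERS)a ((PERS)b))"

m = re.search(pattern, sample)
print(m.group(0))  # ((PERS)a ((PERS)
print(m.group(1))  # a ((PERS

# Replacing each match with its group 1 therefore drops the outer
# opener and leaves a damaged inner one behind.
broken = re.sub(pattern, lambda m: m.group(1), sample)
print(broken)  # a ((PERSb))
```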

This is the output I expect to get when running this script:

"here ((PERS)the Andys ) ((PERS)assása asas) ((VERB)asas (asas)) ((PERS)saasas Asasas bbbg gg)"

So the nested ((PERS) ) wrappers should be removed from the input text, while the remaining patterns are left unmodified.

Asked By: Matt095

||

Answers:

In the absence of recursion support in the native re module, you could do this iteratively, from the inside out. Since the nesting is not expected to be very deep (say, 100 levels deep), this is a pragmatic solution:

import re

input_text = "here ((PERS)the ((PERS)Andys) ) ((PERS)assása asas) ((VERB)asas (asas)) ((PERS)saasas ((PERS)Asasas ((PERS)bbbg gg)))"

size = len(input_text) + 1
while size > len(input_text):
    size = len(input_text)
    input_text = re.sub(r"(\(\(PERS\)(?:(?!\(\()[^)])*)\(\(PERS\)((?:(?!\(\()[^)])*)\)", r"\1\2", input_text)

print(input_text)

Output:

here ((PERS)the Andys ) ((PERS)assása asas) ((VERB)asas (asas)) ((PERS)saasas Asasas bbbg gg)
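
As a rough sketch of the same inside-out idea, the loop could be wrapped in a function and the pattern written with re.VERBOSE so that each part carries a comment. The names flatten_nested_pers and NESTED_PERS are hypothetical, and re.subn is used here only to detect when a pass made no replacement:

```python
import re

# Same pattern as in the loop above, spread out with re.VERBOSE.
NESTED_PERS = re.compile(
    r"""
    ( \(\(PERS\)                # group 1: an outer PERS opener ...
      (?: (?!\(\() [^)] )*      # ... plus text with no closing paren and no new opener
    )
    \(\(PERS\)                  # the nested opener, dropped
    ( (?: (?!\(\() [^)] )* )    # group 2: text of an innermost PERS group
    \)                          # the nested closer, dropped
    """,
    re.VERBOSE,
)


def flatten_nested_pers(text: str) -> str:
    """Merge PERS groups nested inside other PERS groups (hypothetical helper)."""
    count = 1
    while count:
        # subn returns (new_text, number_of_substitutions_made)
        text, count = NESTED_PERS.subn(r"\1\2", text)
    return text


print(flatten_nested_pers(
    "((PERS)something_1 ((PERS)something_2 ((PERS)something_3)) ((PERS)something_4))"
))
# ((PERS)something_1 something_2 something_3 something_4)
```

Each pass only unwraps groups that contain no further opener, so one pass is needed per nesting level, plus one for every additional sibling that has to be merged into the same parent.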
Answered By: trincot