How to split and reorder the content inside the ((PERS)) tag by ' y ' or ' y)' using Python regular expressions?

Question:

import re

input_text = "((PERS) Marcos Sy y) ((PERS) Lucy) estuvieron ((VERB) jugando) sdds" #example 1
input_text = "ashsahghgsa ((PERS) María y Rosa ds) son alumnas de esa escuela y juegan juntas" #example 2

input_text = re.sub(
                    r"\(\(PERS\)" + r"((?:\w\s*)+(?:\sy\s(?:\w\s*)+)+)(?=\s*y\s*(?:\)|\())",
                    #lambda m: (f"((PERS)){m[1]}) y"),
                    lambda m: (f"((PERS)){m[1].replace(' y', ') y ((PERS)')}"),
                    input_text, re.IGNORECASE)

print(input_text) # --> output

I need to split the content inside a ((PERS) ) tag whenever it contains " y " or ends with " y)".
That is, move the " y" or " y " outside the ((PERS) ) tag and, when more content follows it (as in example 2), put that remaining content in its own ((PERS) ) tag. I tried with \s+y\s+? and with \s+y\s+

To get the desired output, I tried a regex that matches all the names inside the ((PERS) ) tag that are separated by " y " or " y)". I used a positive lookahead to check for " y " or " y)" after each name and then grouped all the names together, but this lookahead does not work well.

The expected output for each of the examples, respectively, is:

"((PERS) Marcos Sy) y ((PERS) Lucy) estuvieron ((VERB) jugando) sdds" #for example 1

"ashsahghgsa ((PERS) María) y ((PERS)Rosa ds) son alumnas de esa escuela y juegan juntas" #for example 2

This regex is for content that has to start with a capital letter, r"([A-Z][\wí]+\s*)", although I think that in this case it would be better to simply use r"((?:\w\s*)+)", since the content is already encapsulated.

Asked By: Matt095

||

Answers:

You could just use two regexes, which simplifies things a lot. First:

input_text = re.sub(
  r"\(\(PERS\)\s+([\w\s]+)\s+y\)\s+\(\(PERS\)\s+([\w\s]+)\)",
  lambda m: (f"((PERS) {m[1]}) y ((PERS) {m[2]})"),
  input_text,
  flags=re.IGNORECASE)  # pass flags by keyword; the 4th positional argument of re.sub is count

This one covers your first use case and matches:

  • ((PERS)
  • followed by some whitespace, \s+
  • a mix of word characters and whitespace that gets captured, ([\w\s]+) (as I understand it, no other characters such as - are expected)
  • more whitespace up to y)
  • then the same again, except without the y: \(\(PERS\)\s+([\w\s]+)\)

Then we format both captured groups as ((PERS) {m[1]}) y ((PERS) {m[2]}).

The second part of the solution is very similar, except that the second group is matched inside the same pair of parentheses as the first:

input_text = re.sub(
  r"\(\(PERS\)\s+([\w\s]+)\s+y\s+([\w\s]+)\)",
  lambda m: (f"((PERS) {m[1]}) y ((PERS) {m[2]})"),
  input_text,
  flags=re.IGNORECASE)  # pass flags by keyword; the 4th positional argument of re.sub is count
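
Putting the two passes together, a minimal sketch might look like the following. The names split_pers_tags, TRAILING_Y and INNER_Y are illustrative assumptions, not part of the original code, and each pass only handles a single y per tag:

```python
import re

# Pass 1: "((PERS) A y) ((PERS) B)", where a trailing " y" sits inside the first tag.
TRAILING_Y = re.compile(
    r"\(\(PERS\)\s+([\w\s]+)\s+y\)\s+\(\(PERS\)\s+([\w\s]+)\)",
    flags=re.IGNORECASE,
)
# Pass 2: "((PERS) A y B)", where two names share one tag.
INNER_Y = re.compile(
    r"\(\(PERS\)\s+([\w\s]+)\s+y\s+([\w\s]+)\)",
    flags=re.IGNORECASE,
)


def split_pers_tags(text):
    """Hypothetical helper: apply both substitutions in order."""
    def repl(m):
        return f"((PERS) {m[1]}) y ((PERS) {m[2]})"

    text = TRAILING_Y.sub(repl, text)
    return INNER_Y.sub(repl, text)


print(split_pers_tags("((PERS) Marcos Sy y) ((PERS) Lucy) estuvieron ((VERB) jugando) sdds"))
print(split_pers_tags("ashsahghgsa ((PERS) María y Rosa ds) son alumnas de esa escuela y juegan juntas"))
```

Because ([\w\s]+) is greedy, a tag with several y separators would only be split at the last one on each run, so this sketch is meant for inputs like the two examples in the question.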

You could of course do it with a single, much more convoluted regex and replacement lambda, but I see no point. This regex would work, for instance:
\(\(PERS\)\s+([\w\s]+)\s+(y|y\s+([\w\s]+))\)(\s+\(\(PERS\)\s+([\w\s]+)\))? but then you'd need to handle the case where groups 1 and 5 are set, and otherwise use groups 1 and 3.
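
For comparison, here is a rough sketch of what that single-regex variant could look like. The replacement function and its branching are my own assumptions about how the group logic might be written, not code from the answer:

```python
import re

# Group 1: first name(s); group 3: name after an inner " y ";
# group 4: optional following tag; group 5: its content.
SINGLE = re.compile(
    r"\(\(PERS\)\s+([\w\s]+)\s+(y|y\s+([\w\s]+))\)"
    r"(\s+\(\(PERS\)\s+([\w\s]+)\))?"
)


def repl(m):
    first = f"((PERS) {m[1]})"
    if m[3] is not None:
        # "A y B" inside one tag: split it and keep any following tag as it was.
        return f"{first} y ((PERS) {m[3]})" + (m[4] or "")
    if m[5] is not None:
        # "A y)" followed by another tag: move the y between the two tags.
        return f"{first} y ((PERS) {m[5]})"
    # "A y)" with nothing after it: just move the y outside.
    return f"{first} y"


def split_pers_single(text):
    """Hypothetical helper wrapping the single substitution."""
    return SINGLE.sub(repl, text)


print(split_pers_single("((PERS) Marcos Sy y) ((PERS) Lucy) estuvieron ((VERB) jugando) sdds"))
print(split_pers_single("ashsahghgsa ((PERS) María y Rosa ds) son alumnas de esa escuela y juegan juntas"))
```

The three branches are exactly the extra bookkeeping mentioned above, which is why the two-regex version is easier to read.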

Answered By: Destroy666

In my opinion, two separate regexes would be simpler and clearer. Tests: the simple examples first, then expanded ones (including partial cases).
Example 1 seems to be a bug, while example 2 needs to be split:

input_text = ''
input_text += "((PERS) Marcos Sy y) ((PERS) Lucy) estuvieron ((VERB) jugando) sdds" #example 1
input_text += "\n"
input_text += "ashsahghgsa ((PERS) María y Rosa ds) son alumnas de esa escuela y juegan juntas" #example 2

# example 1+2 expanded
input_text += (
    "\n\n"
    "((PERS) Marcos Sy y) ((PERS) Lucy) ((PERS) Marcos Sy y Ana) estuvieron ((VERB) jugando) sdds\n"
    "ashsahghgsa ((PERS) María y Isabel y Ana y Rosa ds) son alumnas de esa escuela y juegan juntas"
)


import re

# first: for example 2

# # for example 2 expanded
input_text = re.sub(
    r"\(\(PERS\)    (?P<multiple>    (?:  (?: \s [A-Zí][\wí]+(?:\s[a-zí]+)? )* \sy  )+    (?:\s[^)]+)    )    \)",
    lambda m: (f"((PERS){m['multiple'].replace(' y', ') y ((PERS)')})"),
    input_text, flags = re.IGNORECASE | re.VERBOSE # re.VERBOSE == re.X # extended (ignore white space)
)

# # for example 2 (simple)
# input_text = re.sub( 
#     r'(\(\(PERS\)(?:\s(?!y)(?:[\wí]+))*)\sy(\s[A-Zí][\wí]+(?:\s[a-zí]+)?)\)', 
#     r'\g<1>) y ((PERS)\g<2>)', 
#     input_text, 
#     flags = re.MULTILINE)

# second: for example 1

input_text = re.sub( 
    r'(\(\(PERS\)(?:\s(?:[\wí]+))*)\sy\)', 
    r'\1) y', 
    input_text, 
    flags = re.MULTILINE)

print(input_text)

result (original examples 1+2):

"((PERS) Marcos Sy) y ((PERS) Lucy) estuvieron ((VERB) jugando) sdds"
"ashsahghgsa ((PERS) María) y ((PERS) Rosa ds) son alumnas de esa escuela y juegan juntas"

result (expanded example 1+2):

"((PERS) Marcos Sy) y ((PERS) Lucy) ((PERS) Marcos Sy) y ((PERS) Ana) estuvieron ((VERB) jugando) sdds"
"ashsahghgsa ((PERS) María) y ((PERS) Isabel) y ((PERS) Ana) y ((PERS) Rosa ds) son alumnas de esa escuela y juegan juntas"

(Are you sure you expect ((PERS)Rosa ds), without the space? Also, it's not clear why you need "ds" after "Rosa". I don't speak Spanish, so maybe that's it 😉 but I dealt with it anyway 🙂 )

Answered By: msegit

If no other parentheses can occur inside the tag, you could use a pattern with 2 capture groups and then split the second group on y to get the separate parts, so that multiple names are handled as well.

Pattern to get the ((PERS)...) parts with y

(\(\(PERS\)\s*)((?:(?![()]|\sy\b).)* y\b[^()]*?)\s*\)


After these replacements, you can put y between all the remaining consecutive ((PERS)...) parts with another pattern:

(\(\(PERS\)[^()]*\))\s*(?=\(\(PERS\)[^()]*\))


import re

pattern = r"(\(\(PERS\)\s*)((?:(?![()]|\sy\b).)* y\b[^()]*?)\s*\)"
s = ("((PERS) Marcos Sy y) ((PERS) Lucy) estuvieron ((VERB) jugando) sdds\n"
            "ashsahghgsa ((PERS) María y Rosa ds) son alumnas de esa escuela y juegan juntas\n"
            "ashsahghgsa ((PERS) María y Rosa ds y Test Person 1 y test person 2) son alumnas de esa escuela y juegan juntas")


def custom_replacement(m):
    return m.group(1) + " y ((PERS) ".join([p + ")" for p in re.split(r"\s+y\b\s*", m.group(2)) if p])


replaced_names = re.sub(pattern, custom_replacement, s)
replaced_pers = re.sub(r"(\(\(PERS\)[^()]*\))\s*(?=\(\(PERS\)[^()]*\))", r"\1 y ", replaced_names)
print(replaced_pers)

Output

((PERS) Marcos Sy) y ((PERS) Lucy) estuvieron ((VERB) jugando) sdds
ashsahghgsa ((PERS) María) y ((PERS) Rosa ds) son alumnas de esa escuela y juegan juntas
ashsahghgsa ((PERS) María) y ((PERS) Rosa ds) y ((PERS) Test Person 1) y ((PERS) test person 2) son alumnas de esa escuela y juegan juntas
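
To see what the split step inside custom_replacement does on its own, here is a small sketch. The helper name split_names and the third input string are assumptions made for illustration only:

```python
import re

# The same split expression used inside custom_replacement.
SPLIT_ON_Y = r"\s+y\b\s*"


def split_names(group_text):
    """Hypothetical helper: split captured tag content on a standalone 'y'."""
    # A trailing " y" leaves an empty string at the end, which is filtered out.
    return [p for p in re.split(SPLIT_ON_Y, group_text) if p]


print(split_names("Marcos Sy y"))
print(split_names("María y Rosa ds y Test Person 1"))
print(split_names("Ana yolanda y Eva"))
```

The word boundary \b is what keeps a word that merely starts with y (such as "yolanda") from being treated as a separator, and the "if p" filter is what lets the trailing " y" of example 1 disappear before the second substitution puts a y back between the two tags.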


Answered By: The fourth bird