How to split and reorder the content inside the ((PERS)) tag by ' y ' or ' y)' using Python regular expressions?
Question:
import re
input_text = "((PERS) Marcos Sy y) ((PERS) Lucy) estuvieron ((VERB) jugando) sdds" #example 1
input_text = "ashsahghgsa ((PERS) María y Rosa ds) son alumnas de esa escuela y juegan juntas" #example 2
input_text = re.sub(
    r"\(\(PERS\)" + r"((?:\w\s*)+(?:\sy\s(?:\w\s*)+)+)(?=\s*y\s*(?:\)|\())",
    #lambda m: (f"((PERS)){m[1]}) y"),
    lambda m: (f"((PERS)){m[1].replace(' y', ') y ((PERS)')}"),
    input_text, re.IGNORECASE)
print(input_text) # --> output
I need to split the content inside a ((PERS) ) tag if there is a " y " or a " y)" in it. That is, the " y" or " y " should be moved out of the ((PERS) ) tag, and the rest of the content, if there is any (as in example 2), should be left in another ((PERS) ) tag. I tried with \s+y\s+? and with \s+y\s+
To achieve the desired output, I tried a regex that matches all the names inside the ((PERS) ) tag that are separated by " y " or " y)". For that I used a positive lookahead to check for " y " or " y)" after each name and then grouped all the names together, but this lookahead does not work well.
This is the output I need for each of the examples, respectively:
"((PERS) Marcos Sy) y ((PERS) Lucy) estuvieron ((VERB) jugando) sdds" #for example 1
"ashsahghgsa ((PERS) María) y ((PERS)Rosa ds) son alumnas de esa escuela y juegan juntas" #for example 2
This regex is for content that has to start with a capital letter: r"([A-Z][\wí]+\s*)", although I think that in this case it would be better to simply use r"((?:\w\s*)+)", since the content is already encapsulated.
Answers:
You could just use two regexes, which simplifies things a lot. First:
input_text = re.sub(
    r"\(\(PERS\)\s+([\w\s]+)\s+y\)\s+\(\(PERS\)\s+([\w\s]+)\)",
    lambda m: (f"((PERS) {m[1]}) y ((PERS) {m[2]})"),
    input_text,
    flags=re.IGNORECASE)
This one covers your first use case and matches:
- ((PERS) followed by some whitespace: \(\(PERS\)\s+
- some mixed word characters and whitespace, which get captured: ([\w\s]+) (as I understand it, without any other characters such as -)
- some more whitespace up to y): \s+y\)
- then the same again, except without the y: \(\(PERS\)\s+([\w\s]+)\)
Then we format both matched groups as ((PERS) {m[1]}) y ((PERS) {m[2]}).
The second part of the solution is very similar, except that it matches the second group inside the first pair of parentheses:
input_text = re.sub(
    r"\(\(PERS\)\s+([\w\s]+)\s+y\s+([\w\s]+)\)",
    lambda m: (f"((PERS) {m[1]}) y ((PERS) {m[2]})"),
    input_text,
    flags=re.IGNORECASE)
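As a rough sketch, the two substitutions above can be chained in one helper. The names TRAILING_Y, INNER_Y, format_pair and split_pers are made up for this illustration. The flags are passed by keyword here because the fourth positional parameter of re.sub is count, not flags, so the question's call with a positional re.IGNORECASE (integer value 2) behaves like count=2. The last line illustrates that effect.

```python
import re

# Case 1: "((PERS) <names> y) ((PERS) <names>)" -> move the trailing " y" out of the tag
TRAILING_Y = r"\(\(PERS\)\s+([\w\s]+)\s+y\)\s+\(\(PERS\)\s+([\w\s]+)\)"
# Case 2: "((PERS) <names> y <names>)" -> split one tag into two
INNER_Y = r"\(\(PERS\)\s+([\w\s]+)\s+y\s+([\w\s]+)\)"


def format_pair(m):
    # Both patterns capture the same two groups, so one formatter serves both.
    return f"((PERS) {m[1]}) y ((PERS) {m[2]})"


def split_pers(text):
    # flags= is a keyword argument; the 4th positional argument would be count.
    text = re.sub(TRAILING_Y, format_pair, text, flags=re.IGNORECASE)
    return re.sub(INNER_Y, format_pair, text, flags=re.IGNORECASE)


example_1 = "((PERS) Marcos Sy y) ((PERS) Lucy) estuvieron ((VERB) jugando) sdds"
example_2 = "ashsahghgsa ((PERS) María y Rosa ds) son alumnas de esa escuela y juegan juntas"

print(split_pers(example_1))
print(split_pers(example_2))

# What a positional re.IGNORECASE amounts to: only the first 2 matches are replaced.
print(re.sub("a", "b", "aaaa", count=int(re.IGNORECASE)))  # bbaa
```

This only handles a single " y " per tag, which is all the two original examples need.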
You could of course do it with a single, much more convoluted regex and replacement lambda, but I see no point. This regex would work, for instance:
\(\(PERS\)\s+([\w\s]+)\s+(y|y\s+([\w\s]+))\)(\s+\(\(PERS\)\s+([\w\s]+)\))?
but then you would need to handle the case where groups 1 and 5 are present, or otherwise use logic for groups 1 and 3.
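For comparison, here is a minimal sketch of what that group logic might look like with the single regex. The names COMBINED and split_pers_single_pass are hypothetical, and the handling of the optional groups is one possible reading of the remark above, not something spelled out in the answer.

```python
import re

# Group 1: first name(s)
# Group 3: name after " y " inside the same tag (None when the tag ends in " y)")
# Group 5: name in an immediately following ((PERS) ...) tag (None when absent)
COMBINED = re.compile(
    r"\(\(PERS\)\s+([\w\s]+)\s+(y|y\s+([\w\s]+))\)(\s+\(\(PERS\)\s+([\w\s]+)\))?",
    flags=re.IGNORECASE,
)


def split_pers_single_pass(m):
    parts = [f"((PERS) {m[1]}) y"]
    if m[3] is not None:  # "A y B" inside one tag
        parts.append(f"((PERS) {m[3]})")
    if m[5] is not None:  # a following tag was consumed by the match
        parts.append(f"((PERS) {m[5]})")
    return " ".join(parts)


example_1 = "((PERS) Marcos Sy y) ((PERS) Lucy) estuvieron ((VERB) jugando) sdds"
example_2 = "ashsahghgsa ((PERS) María y Rosa ds) son alumnas de esa escuela y juegan juntas"

print(COMBINED.sub(split_pers_single_pass, example_1))
print(COMBINED.sub(split_pers_single_pass, example_2))
```

The extra branching in the replacement function is the cost of collapsing both cases into one pattern.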
In my opinion, two separate regexes are simpler and clearer. Tests: the simple examples first, then expanded ones (with partial matches).
Example 1 looks like a bug in the input, while example 2 needs to be split:
import re

input_text = ''
input_text += "((PERS) Marcos Sy y) ((PERS) Lucy) estuvieron ((VERB) jugando) sdds"  # example 1
input_text += "\n"
input_text += "ashsahghgsa ((PERS) María y Rosa ds) son alumnas de esa escuela y juegan juntas"  # example 2
input_text += "\n\n"
# example 1+2 expanded
input_text += ("((PERS) Marcos Sy y) ((PERS) Lucy) ((PERS) Marcos Sy y Ana) estuvieron ((VERB) jugando) sdds\n"
               "ashsahghgsa ((PERS) María y Isabel y Ana y Rosa ds) son alumnas de esa escuela y juegan juntas")

# first: for example 2
# # for example 2 expanded
input_text = re.sub(
    r"\(\(PERS\) (?P<multiple> (?: (?: \s [A-Zí][\wí]+(?:\s[a-zí]+)? )* \sy )+ (?:\s[^)]+) ) \)",
    lambda m: (f"((PERS){m['multiple'].replace(' y', ') y ((PERS)')})"),
    input_text, flags = re.IGNORECASE | re.VERBOSE  # re.VERBOSE == re.X # extended (ignore white space)
)
# # for example 2 (simple)
# input_text = re.sub(
#     r'(\(\(PERS\)(?:\s(?!y)(?:[\wí]+))*)\sy(\s[A-Zí][\wí]+(?:\s[a-zí]+)?\))',
#     r'\g<1>) y ((PERS)\g<2>',
#     input_text,
#     flags = re.MULTILINE)

# second: for example 1
input_text = re.sub(
    r'(\(\(PERS\)(?:\s(?:[\wí]+))*)\sy\)',
    r'\1) y',
    input_text,
    flags = re.MULTILINE)
print(input_text)
result (original examples 1+2):
"((PERS) Marcos Sy) y ((PERS) Lucy) estuvieron ((VERB) jugando) sdds"
"ashsahghgsa ((PERS) María) y ((PERS) Rosa ds) son alumnas de esa escuela y juegan juntas"
result (expanded example 1+2):
"((PERS) Marcos Sy) y ((PERS) Lucy) ((PERS) Marcos Sy) y ((PERS) Ana) estuvieron ((VERB) jugando) sdds"
"ashsahghgsa ((PERS) María) y ((PERS) Isabel) y ((PERS) Ana) y ((PERS) Rosa ds) son alumnas de esa escuela y juegan juntas"
(Are you sure you expect ((PERS)Rosa ds), without a space? It is also not clear to me why you need "ds" after "Rosa". I don't speak Spanish, so maybe that is the reason 😉 but the pattern deals with it 🙂 )
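Because the pattern for the expanded example 2 is compiled with re.VERBOSE, it can be spread over several lines and commented. The following is only an annotated restatement of that pattern as I read it. The names MULTI_NAME and split_multiple are made up for this sketch.

```python
import re

MULTI_NAME = re.compile(
    r"""
    \(\(PERS\)                      # literal opening of the tag
    (?P<multiple>
        (?:
            (?: \s [A-Zí][\wí]+ (?:\s[a-zí]+)? )*   # names of one or two words
            \s y                                    # then a separating " y"
        )+
        (?: \s [^)]+ )              # last name: everything up to the closing parenthesis
    )
    \)                              # literal closing of the tag
    """,
    flags=re.IGNORECASE | re.VERBOSE,
)


def split_multiple(m):
    # Every " y" inside the tag closes the current tag and opens a new one.
    return "((PERS)" + m["multiple"].replace(" y", ") y ((PERS)") + ")"


text = "x ((PERS) María y Isabel y Rosa ds) z"
print(MULTI_NAME.sub(split_multiple, text))
```

In verbose mode literal spaces in the pattern are ignored, which is why whitespace has to be written as \s.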
If there cannot be any other occurrence of a parenthesis, you might use a pattern with 2 capture groups, and then split the second group on y to get the separate parts, so that there can also be multiple names.
Pattern to get the ((PERS)...) parts with y:
(\(\(PERS\)\s*)((?:(?![()]|\sy\b).)* y\b[^()]*?)\s*\)
After these replacements, you can put y between all the remaining consecutive ((PERS)...) parts with another pattern:
(\(\(PERS\)[^()]*\))\s*(?=\(\(PERS\)[^()]*\))
import re

pattern = r"(\(\(PERS\)\s*)((?:(?![()]|\sy\b).)* y\b[^()]*?)\s*\)"
s = ("((PERS) Marcos Sy y) ((PERS) Lucy) estuvieron ((VERB) jugando) sdds\n"
     "ashsahghgsa ((PERS) María y Rosa ds) son alumnas de esa escuela y juegan juntas\n"
     "ashsahghgsa ((PERS) María y Rosa ds y Test Person 1 y test person 2) son alumnas de esa escuela y juegan juntas")

def custom_replacement(m):
    return m.group(1) + " y ((PERS) ".join([p + ")" for p in re.split(r"\s+y\b\s*", m.group(2)) if p])

replaced_names = re.sub(pattern, custom_replacement, s)
replaced_pers = re.sub(r"(\(\(PERS\)[^()]*\))\s*(?=\(\(PERS\)[^()]*\))", r"\1 y ", replaced_names)
print(replaced_pers)
Output
((PERS) Marcos Sy) y ((PERS) Lucy) estuvieron ((VERB) jugando) sdds
ashsahghgsa ((PERS) María) y ((PERS) Rosa ds) son alumnas de esa escuela y juegan juntas
ashsahghgsa ((PERS) María) y ((PERS) Rosa ds) y ((PERS) Test Person 1) y ((PERS) test person 2) son alumnas de esa escuela y juegan juntas
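The split step inside the replacement function can be looked at on its own. This small sketch (split_names is a hypothetical helper name) shows why empty parts are filtered out: a trailing " y", as in example 1, leaves an empty string at the end of the split result.

```python
import re

Y_SEPARATOR = r"\s+y\b\s*"


def split_names(tag_content):
    # Drop empty strings, which appear when the content ends in " y".
    return [part for part in re.split(Y_SEPARATOR, tag_content) if part]


print(re.split(Y_SEPARATOR, "Marcos Sy y"))            # ['Marcos Sy', '']
print(split_names("Marcos Sy y"))                      # ['Marcos Sy']
print(split_names("María y Rosa ds y Test Person 1"))  # ['María', 'Rosa ds', 'Test Person 1']
```

The \b after y keeps names that merely start with "y" from being treated as separators.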