Why, when splitting a string into a list of substrings without removing the separators, are parts of the original string lost in the splitting process?

Question

import re
from itertools import chain

def identification_of_nominal_complements(input_text):

    pat_identifier_noun_with_modifiers = r"((?:l[oa]s|l[oa])\s+.+?)\s*(?=\(\(VERB\))"
    substrings_with_nouns_and_their_modifiers_list = re.findall(pat_identifier_noun_with_modifiers, input_text)
    separator_elements = r"\s*(?:,|(,|)\s*y)\s*"

    substrings_with_nouns_and_their_modifiers_list = [re.split(separator_elements, s) for s in substrings_with_nouns_and_their_modifiers_list]
    substrings_with_nouns_and_their_modifiers_list = list(chain.from_iterable(substrings_with_nouns_and_their_modifiers_list))
    substrings_with_nouns_and_their_modifiers_list = list(filter(lambda x: x is not None and x.strip() != '', substrings_with_nouns_and_their_modifiers_list))
    print(substrings_with_nouns_and_their_modifiers_list) # --> list output

    pat = re.compile(rf"(?<!\(PERS\))({'|'.join(substrings_with_nouns_and_their_modifiers_list)})(?!['\w)-])")
    input_text = re.sub(pat, r'((PERS)\1)', input_text)

    return input_text

# example 1, works correctly:
input_text = "He ((VERB)visto) la maceta de la señora de rojo ((VERB)es) grande. He ((VERB)visto) que la maceta de la señora de rojo y a ((PERS)Lucila) ((VERB)es) grande."

# example 2, fails with an error:
input_text = "((VERB)Creo) que ((PERS)los viejos gabinetes) ((VERB)estan) en desuso, hay que ((PERS)los viejos gabinetes) ((VERB)hacer) algo con ((PERS)los viejos gabinetes), ya que ((PERS)los viejos gabinetes) son importantes. ((PERS)los viejos gabinetes) ((VERB)quedaron) en el deposito. ((PERS)los candelabros) son brillantes los candelabros ((VERB)brillan). ((PERS)los candelabros) ((VERB)estan) ahi"


input_text = identification_of_nominal_complements(input_text)
print(input_text) # --> string output

Why does this function, given example 2, cut off the ((PERS) part of some elements of the substrings_with_nouns_and_their_modifiers_list list, while with example 1 it doesn't?
As a result, elements with unbalanced parentheses are generated, which later raises re.error: unbalanced parenthesis, specifically on the line where re.compile() is called.

For example 1 the output is correct: no ((PERS) is removed unnecessarily, so the unbalanced parenthesis error does not occur.

['la maceta de la señora de rojo', 'la maceta de la señora de rojo', 'a ((PERS)Lucila)']

'He ((VERB)visto) ((PERS)la maceta de la señora de rojo) ((VERB)es) grande. He ((VERB)visto) que ((PERS)la maceta de la señora de rojo) y a ((PERS)Lucila) ((VERB)es) grande.'

Example 2 is where the problem is. Although the same function processes the string, the substring ((PERS) is removed from some elements of the substrings_with_nouns_and_their_modifiers_list list. This triggers an unbalanced parenthesis error in re.compile() because some substrings contain ) but no matching (, since the ((PERS) was removed:

['los viejos gabinetes)', 'los viejos gabinetes)', 'los viejos gabinetes)', 'a que ((PERS)los viejos gabinetes) son importantes. ((PERS)los viejos gabinetes)', 'los candelabros) son brillantes los candelabros', 'los candelabros)']

Traceback (most recent call last):
pat = re.compile(rf"(?<!\(PERS\))({'|'.join(substrings_with_nouns_and_their_modifiers_list)})(?!['\w)-])")
raise source.error("unbalanced parenthesis")
re.error: unbalanced parenthesis at position 56
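
The error can be reproduced in isolation. The following is a minimal sketch, assuming a hypothetical two-element fragments list that stands in for the list printed above: joining fragments that contain an unmatched ) into a pattern makes re.compile() fail. For comparison only, it also shows that escaping each fragment with re.escape() lets the same construction compile; whether literal matching is what the function needs is an assumption, not something stated in the question.

```python
import re

# Hypothetical stand-in for the list produced inside the function.
fragments = ["los viejos gabinetes)", "los candelabros)"]

# Same construction as in the function: fragments are joined unescaped,
# so each stray ")" is read as regex syntax.
raw_pattern = rf"(?<!\(PERS\))({'|'.join(fragments)})(?!['\w)-])"

try:
    re.compile(raw_pattern)
    error_message = None
except re.error as exc:
    error_message = str(exc)

print(error_message)  # message mentions "unbalanced parenthesis"

# Assumption: the fragments are meant to be matched literally.
escaped_alternatives = "|".join(re.escape(f) for f in fragments)
escaped_pattern = re.compile(rf"(?<!\(PERS\))({escaped_alternatives})(?!['\w)-])")
print(escaped_pattern.pattern)
```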

If identification_of_nominal_complements() worked correctly, these are the outputs I should get for the string from example 2. Keeping each ((PERS) avoids the unbalanced parenthesis error in re.compile(). This is the correct output for the example 2 string:

['((PERS)los viejos gabinetes)', '((PERS)los viejos gabinetes)', '((PERS)los viejos gabinetes)', 'a que ((PERS)los viejos gabinetes) son importantes. ((PERS)los viejos gabinetes)', '((PERS)los candelabros) son brillantes los candelabros', '((PERS)los candelabros)']

'((VERB)Creo) que ((PERS)los viejos gabinetes) ((VERB)estan) en desuso, hay que ((PERS)los viejos gabinetes) ((VERB)hacer) algo con ((PERS)los viejos gabinetes), ya que ((PERS)los viejos gabinetes) son importantes. ((PERS)los viejos gabinetes) ((VERB)quedaron) en el deposito. ((PERS)los candelabros) son brillantes los candelabros ((VERB)brillan). ((PERS)los candelabros) ((VERB)estan) ahi'

What should I modify in the identification_of_nominal_complements() function so that, when I pass it the string from example 2, I don't get the unbalanced parenthesis error and I get this correct output?

Asked By: Matt095

Answer 1

"Why does this function with example 2 cut off the ((PERS) part of some of the elements…" Because of no pattern [^s]* at the beginning:

pat_identifier_noun_with_modifiers = r"([^\s]*(?:l[oa]s|l[oa])\s+.+?)\s*(?=\(\(VERB\))"

Now the result is:

['((PERS)los viejos gabinetes)', '((PERS)los viejos gabinetes)', '((PERS)los viejos gabinetes)', 'a que ((PERS)los viejos gabinetes) son importantes. ((PERS)los viejos gabinetes)', '((PERS)los candelabros) son brillantes los candelabros', '((PERS)los candelabros)']

'((VERB)Creo) que ((PERS)los viejos gabinetes) ((VERB)estan) en desuso, hay que ((PERS)los viejos gabinetes) ((VERB)hacer) algo con ((PERS)los viejos gabinetes), ya que ((PERS)los viejos gabinetes) son importantes. ((PERS)los viejos gabinetes) ((VERB)quedaron) en el deposito. ((PERS)los candelabros) son brillantes los candelabros ((VERB)brillan). ((PERS)los candelabros) ((VERB)estan) ahi'
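
The difference between the two patterns can be checked on a short fragment. This is a minimal sketch, assuming a shortened sample string taken from the start of example 2; the variable names original, fixed and sample are only illustrative.

```python
import re

# Pattern from the question, and the same pattern with [^\s]* prepended.
original = r"((?:l[oa]s|l[oa])\s+.+?)\s*(?=\(\(VERB\))"
fixed = r"([^\s]*(?:l[oa]s|l[oa])\s+.+?)\s*(?=\(\(VERB\))"

# Shortened sample from example 2.
sample = "((VERB)Creo) que ((PERS)los viejos gabinetes) ((VERB)estan) en desuso"

# Without the prefix the match can only begin at "los", so "((PERS)" is left out.
without_prefix = re.findall(original, sample)
# With the prefix the match also takes the non-space characters before "los".
with_prefix = re.findall(fixed, sample)

print(without_prefix)  # ['los viejos gabinetes)']
print(with_prefix)     # ['((PERS)los viejos gabinetes)']
```

Because [^\s]* cannot cross whitespace, it only pulls in characters glued to the article, which here is exactly the ((PERS) tag.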

Answered By: msegit
