Why does this regex pattern freeze and get stuck infinitely?

Question:

import re

input_text = "Creo que ((PERS)el viejo gabinete) estan en desuso, hay que hacer algo con él. ya que él aún es útil. Él sirve para tareas de ofimatica. ((PERS)el viejo mouse) es algo comodo, aunque el clic de él falla."


personal_article_with_subject = r"(?:[A-Z]|el\s*)(?:\w\s*)+"

#personal_pronoun = r"\b[ée]l\b"
personal_pronoun = r"\bél\b"

subject_of_this_part_of_the_sentence_pattern = r"((?:\w\s*)+)"
substring_in_middle_pattern = r"((?:[\w\s,;']+)+)"

separation_list_pattern = re.compile(r"(?=\(\(PERS\)\s*" + personal_article_with_subject + r"\s*)")
sentences_separated_by_subject_list = re.split(separation_list_pattern, input_text)

sentences_separated_by_subject_list_aux = []
for sentences_separated in sentences_separated_by_subject_list:
    substring_after_a_dot_pattern = r"((?:[\w\s,;.']+)+)"

    pattern_02 = r"((?:\.\s*\n|\n|\.|))\s*" + r"\(\(PERS\)\s*(" + personal_article_with_subject + r")\s*\)\s*" + substring_in_middle_pattern + r"(?:\.\s*\n|\n|\.)" + substring_after_a_dot_pattern
    match_02 = re.search(pattern_02, sentences_separated, flags = re.IGNORECASE)
    if match_02:
        separator_symbol, subject_of_this_part_of_the_sentence, substring_in_middle, substring_after_a_dot = match_02.group(1), match_02.group(2), match_02.group(3), match_02.group(4)

        substring_after_a_dot_aux = re.sub(r"(?<!\S)" + personal_pronoun, "((PERS)" + subject_of_this_part_of_the_sentence + ")", substring_after_a_dot, flags = re.IGNORECASE)

        sentences_separated = re.sub(substring_after_a_dot, separator_symbol + substring_after_a_dot_aux, sentences_separated, flags = re.IGNORECASE)
    sentences_separated_by_subject_list_aux.append(sentences_separated)

input_text = ''.join(sentences_separated_by_subject_list_aux)
print(repr(input_text)) # --> output

Why does this regex hang and cause the program to freeze for hours?

I think this code may be taking time to execute or may have gotten stuck due to the regular expression pattern used in the personal_article_with_subject variable.

The regex pattern used in personal_article_with_subject is quite broad and can match many character combinations, which can slow down execution. On a relatively modern PC that alone shouldn't be a problem, so I assume something is wrong with the script itself.

This is the expected output:

"Creo que ((PERS)el viejo gabinete) estan en desuso, hay que hacer algo con él. ya que ((PERS)el viejo gabinete) aún es útil. ((PERS)el viejo gabinete) sirve para tareas de ofimatica. ((PERS)el viejo mouse) es algo comodo, aunque el clic de ((PERS)el viejo mouse) falla."
Asked By: Matt095


Answers:

I think the main issue is with substring_after_a_dot_pattern. The search gets stuck on the last segment, ((PERS)el viejo mouse) es algo comodo, aunque el clic de él falla., where nothing follows the final . yet the pattern requires one or more characters there. I'm not sure why it doesn't simply fail the match, but if I change that pattern to substring_after_a_dot_pattern = r"((?:[\w\s,;.']+)*)", or append a " " to the end of the input string, it completes without an issue.
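In isolation, the difference between the two quantifiers can be sketched as follows. This is a reduced illustration, not the full script: the names text, after_plus, after_star, m_plus, m_star and m_padded are made up for the example, and only the tail of the pattern (a literal dot followed by the "after a dot" group) is exercised.

```python
import re

# The tail group from the question (+) and the relaxed variant (*).
after_plus = r"((?:[\w\s,;.']+)+)"
after_star = r"((?:[\w\s,;.']+)*)"

# A sentence that ends right at the final dot, like the last segment of the input.
text = "falla."

# '+' demands at least one character after the dot, so there is no match.
m_plus = re.search(r"\." + after_plus, text)
print(m_plus)                   # None

# '*' accepts an empty remainder, so the dot alone is enough.
m_star = re.search(r"\." + after_star, text)
print(repr(m_star.group(1)))    # ''

# Padding the input with a space also gives '+' something to consume.
m_padded = re.search(r"\." + after_plus, text + " ")
print(repr(m_padded.group(1)))  # ' '
```

On a string this short the failing case returns immediately; the slowdown only shows up once the rest of the pattern has a long stretch of text to retry.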

import re

input_text = "Creo que ((PERS)el viejo gabinete) estan en desuso, hay que hacer algo con él. ya que él aún es útil. Él sirve para tareas de ofimatica. ((PERS)el viejo mouse) es algo comodo, aunque el clic de él falla."


personal_article_with_subject = r"(?:[A-Z]|el\s*)(?:\w\s*)+"

#personal_pronoun = r"\b[ée]l\b"
personal_pronoun = r"\bél\b"

subject_of_this_part_of_the_sentence_pattern = r"((?:\w\s*)+)"
substring_in_middle_pattern = r"((?:[\w\s,;']+)+)"

separation_list_pattern = re.compile(r"(?=\(\(PERS\)\s*" + personal_article_with_subject + r"\s*)")
sentences_separated_by_subject_list = re.split(separation_list_pattern, input_text)

sentences_separated_by_subject_list_aux = []
for sentences_separated in sentences_separated_by_subject_list:
    print(sentences_separated)
    substring_after_a_dot_pattern = r"((?:[\w\s,;.']+)*)"

    pattern_02 = r"((?:\.\s*\n|\n|\.|))\s*" + r"\(\(PERS\)\s*(" + personal_article_with_subject + r")\s*\)\s*" + substring_in_middle_pattern + r"(?:\.\s*\n|\n|\.)" + substring_after_a_dot_pattern
    match_02 = re.search(pattern_02, sentences_separated, flags = re.IGNORECASE)
    if match_02:
        separator_symbol, subject_of_this_part_of_the_sentence, substring_in_middle, substring_after_a_dot = match_02.group(1), match_02.group(2), match_02.group(3), match_02.group(4)

        substring_after_a_dot_aux = re.sub(r"(?<!\S)" + personal_pronoun, "((PERS)" + subject_of_this_part_of_the_sentence + ")", substring_after_a_dot, flags = re.IGNORECASE)

        sentences_separated = re.sub(substring_after_a_dot, separator_symbol + substring_after_a_dot_aux, sentences_separated, flags = re.IGNORECASE)
        print(sentences_separated)
    sentences_separated_by_subject_list_aux.append(sentences_separated)

input_text = ''.join(sentences_separated_by_subject_list_aux)
print(repr(input_text)) # --> output

#output
'Creo que ((PERS)el viejo gabinete) estan en desuso, hay que hacer algo con él. ya que ((PERS)el viejo gabinete) aún es útil. ((PERS)el viejo gabinete) sirve para tareas de ofimatica. ((PERS)el viejo mouse) es algo comodo, aunque el clic de él falla.'

If you'd like to keep the regex as is, you could just append " " to the input string and then .strip() the output. I'm not familiar enough with the regex engine to determine why this happens instead of the match simply failing, so hopefully someone else can explain.
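One likely explanation, offered as an assumption rather than something verified against this exact script, is catastrophic backtracking. A quantified group wrapped in another quantifier, such as (?:[\w\s,;']+)+, can split the same run of characters in exponentially many ways, and a backtracking engine tries all of them before it reports a failed match. The sketch below compares the nested form with a flattened one; the names nested, flat, ok, no_dot_long and no_dot_short are made up for the example, and the trailing \. stands in for the rest of the original pattern.

```python
import re

# Nested form, as in the question: a '+' group wrapped in another '+'.
nested = re.compile(r"((?:[\w\s,;']+)+)\.")
# Flattened form: a single quantifier over the same character class.
flat = re.compile(r"([\w\s,;']+)\.")

ok = "es algo comodo, aunque el clic falla."
# When a match exists, both forms capture the same text.
print(nested.search(ok).group(1) == flat.search(ok).group(1))  # True

# With no dot anywhere, the match has to fail.
# The flat form gives up quickly even on a long input.
no_dot_long = "aunque el clic de él falla " * 20
print(flat.search(no_dot_long))      # None

# The nested form is only run on a short input here, because the number of
# ways to split the text grows exponentially with its length.
no_dot_short = "el clic falla"
print(nested.search(no_dot_short))   # None
```

If this is indeed the cause, replacing the nested groups in substring_in_middle_pattern and substring_after_a_dot_pattern with a single quantified character class should let a failing match return promptly without padding the input.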

Answered By: Shorn