Separate this string using these separator elements but without removing them from the resulting strings

Question

import re

input_string = "Sus cosas deben ser llevadas allí, ella tomo a sí, Lucy la hermana menor, esta muy entusiasmada. por verte hoy por la tarde\n sdsdsd"

#result_list = re.split(r"(?:\.\s*\n|\.|\n|;|,\s*[A-Z])", input_string)
result_list = re.split(r"(?=[.,;]|(?<=\s)[A-Z])", input_string)

print(result_list)

I want to split the string input_string on the separators r"(?:\.\s*\n|\.|\n|;|,\s*[A-Z])", but without removing them from the substrings of the resulting list.

I tried a positive lookahead assertion instead of a non-capturing group. It splits the input string at the positions immediately before the separators and keeps the separators in the substrings, but it gives me this wrong output list:

['Sus cosas deben ser llevadas allí', ', ella tomo a sí', ', ', 'Lucy la hermana menor', ', esta muy entusiasmada', '. por verte hoy por la tarde\n sdsdsd']

This is the correct output list that I want to obtain when printing:

["Sus cosas deben ser llevadas allí, ella tomo a sí,", " Lucy la hermana menor, esta muy entusiasmada.", " por verte hoy por la tarde\n", " sdsdsd"]

Asked By: Matt095

Answer 1

Condense your delimiter pattern to "([.;\n]|,(?=\s*[A-Z]))" and use itertools.zip_longest to rejoin each resulting substring with the delimiter that follows it:

import re
from itertools import zip_longest

input_string = "Sus cosas deben ser llevadas allí, ella tomo a sí, Lucy la hermana menor, esta muy entusiasmada. por verte hoy por la tarde\n sdsdsd"
res = re.split(r"([.;\n]|,(?=\s*[A-Z]))", input_string)
res = list(map(''.join, zip_longest(res[::2], res[1::2], fillvalue='')))
print(res)

['Sus cosas deben ser llevadas allí, ella tomo a sí,', ' Lucy la hermana menor, esta muy entusiasmada.', ' por verte hoy por la tarde\n', ' sdsdsd']
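
The same split-and-rejoin idea can be wrapped in a small helper. This is only a sketch: the names DELIMITERS and split_keep_delimiters and the sample string are made up for illustration, and the pattern assumes the backslash escapes (\n, \s) are written out as in a normal Python raw string.

```python
import re
from itertools import zip_longest

# Hypothetical names; the pattern is the condensed delimiter pattern from this answer.
DELIMITERS = re.compile(r"([.;\n]|,(?=\s*[A-Z]))")

def split_keep_delimiters(text):
    # Because the pattern has a capture group, re.split returns
    # [chunk, delimiter, chunk, delimiter, ..., chunk].
    parts = DELIMITERS.split(text)
    # Pair every chunk with the delimiter that follows it; the last chunk
    # has no delimiter, so zip_longest pads it with ''.
    pairs = zip_longest(parts[::2], parts[1::2], fillvalue="")
    return [chunk + delim for chunk, delim in pairs]

print(split_keep_delimiters("uno, Dos; tres. cuatro\ncinco"))
```

If the text ends with a delimiter, the final element is an empty string, which you may want to filter out.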

Answered By: RomanPerekhrest

Answer 2

Suppose the text when printed appeared as follows.

Sus cosas deben ser llevadas allí, ella tomo a sí, Lucy la hermana menor, esta muy entusiasmada. por verte hoy por la tarde
 sdsdsd
abc; def. ghi.   
jkl

That is, the original string is obtained by joining these lines with newline characters. This is just to make it easier to visualize the given string.

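Notice that the first part of the string is Spanish, which includes non-ASCII letters. We therefore need Unicode matching, which the re.U flag requests (in Python 3 this is already the default for str patterns, so the flag is redundant there but harmless).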

If we replace each match of the regular expression

r'([;.,])(?:(?<=,)(?=\s*[A-Z])|(?<!,))\s*'

with r'\1\n' (\1 being the content of capture group 1) we obtain the following string, as shown when printed.

Sus cosas deben ser llevadas allí, ella tomo a sí,
Lucy la hermana menor, esta muy entusiasmada.
por verte hoy por la tarde
 sdsdsd
abc;
def.
ghi.
jkl

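Note that \p{Lu} matches any uppercase Unicode letter. It is supported by the third-party regex module but not by Python's built-in re module, which is why the pattern shown above uses [A-Z].

It remains to simply split this string on r'\n\s*' to obtain the desired result: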

["Sus cosas deben ser llevadas allí, ella tomo a sí,",
 "Lucy la hermana menor, esta muy entusiasmada.",
 "por verte hoy por la tarde", "sdsdsd", "abc;", "def.", "ghi.", "jkl"]

In sum, with repl = r'\1\n' and str holding the input string, we can write

re.split(r'\n\s*', re.sub(r'([;.,])(?:(?<=,)(?=\s*[A-Z])|(?<!,))\s*', repl, str, flags=re.U))
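
For reference, here is a runnable sketch of those two steps on the sample text. The variable names text, marked and result are made up for illustration, and re.U is left out on the assumption that the code runs under Python 3, where str patterns are Unicode-aware by default.

```python
import re

# Sample text from this answer, joined with newline characters.
text = (
    "Sus cosas deben ser llevadas allí, ella tomo a sí, Lucy la hermana menor, "
    "esta muy entusiasmada. por verte hoy por la tarde\n sdsdsd\nabc; def. ghi.   \njkl"
)

# Step 1: put a newline after each separator and drop the whitespace after it.
marked = re.sub(r"([;.,])(?:(?<=,)(?=\s*[A-Z])|(?<!,))\s*", r"\1\n", text)

# Step 2: split on each newline plus any whitespace that follows it.
result = re.split(r"\n\s*", marked)
print(result)
```

Unlike the split-and-rejoin approach, this one strips the whitespace around each piece, so the elements come out without leading spaces or trailing newlines.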

The regular expression has the following elements.

([;.,])       Match a character in character class and save to capture group 1
(?:           Begin a non-capture group
  (?<=        Begin a positive lookbehind
    ,         Match ','
  )           End positive lookbehind
  (?=         Begin a positive lookahead
    \s*      Match zero or more whitespaces
    \p{Lu}   Match a Unicode uppercase letter
  )           End positive lookahead
|             Or
  (?<!        Begin a negative lookbehind
    ,         Match ','
  )           End negative lookbehind
)             End non-capture group
\s*           Match zero or more whitespaces
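
The breakdown above can also be kept next to the pattern itself by compiling it with re.VERBOSE. This is a sketch only: the name SEPARATOR and the sample string are made up, and [A-Z] stands in for \p{Lu} because the built-in re module does not support \p{...}.

```python
import re

# Assumption: [A-Z] replaces \p{Lu}, which the built-in re module cannot parse.
SEPARATOR = re.compile(
    r"""
    ([;.,])           # capture group 1: one separator character
    (?:               # begin non-capture group
        (?<=,)        #   the separator just matched is a comma ...
        (?=\s*[A-Z])  #   ... followed by optional whitespace and an uppercase letter
      |               # or
        (?<!,)        #   the separator just matched is not a comma
    )                 # end non-capture group
    \s*               # zero or more whitespace characters
    """,
    re.VERBOSE,
)

print(SEPARATOR.sub(r"\1\n", "uno, dos, Tres; cuatro. cinco"))
```

In verbose mode, whitespace and #-comments inside the pattern are ignored, so the layout does not change what is matched.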

Answered By: Cary Swoveland
