Separate this string using these separator elements but without removing them from the resulting strings
Question:
import re
input_string = "Sus cosas deben ser llevadas allí, ella tomo a sí, Lucy la hermana menor, esta muy entusiasmada. por verte hoy por la tarde\n sdsdsd"
# First attempt, which removes the separators:
# result_list = re.split(r"(?:\.\s*\n|\.|\n|;|,\s*[A-Z])", input_string)
result_list = re.split(r"(?=[.,;]|(?<=\s)[A-Z])", input_string)
print(result_list)
I want to split the string input_string using the separators r"(?:\.\s*\n|\.|\n|;|,\s*[A-Z])", but without removing them from the substrings of the resulting list.
I tried a positive lookahead assertion instead of a non-capturing group. This splits the input string at the positions immediately before the separators while keeping the separators in the substrings, but I get this wrong output list:
['Sus cosas deben ser llevadas allí', ', ella tomo a sí', ', ', 'Lucy la hermana menor', ', esta muy entusiasmada', '. por verte hoy por la tarde\n sdsdsd']
This is the correct output list I want to obtain when printing:
["Sus cosas deben ser llevadas allí, ella tomo a sí,", " Lucy la hermana menor, esta muy entusiasmada.", " por verte hoy por la tarde\n", " sdsdsd"]
Answers:
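Condense your delimiter pattern to "([.;\n]|,(?=\s*[A-Z]))" and use itertools.zip_longest to join each resulting substring with the delimiter that follows it. Because the whole pattern is a capturing group, re.split returns the text pieces at even indexes and the matched delimiters at odd indexes: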
import re
from itertools import zip_longest
input_string = "Sus cosas deben ser llevadas allí, ella tomo a sí, Lucy la hermana menor, esta muy entusiasmada. por verte hoy por la tarde\n sdsdsd"
# The capturing group keeps the delimiters in the split result
res = re.split(r"([.;\n]|,(?=\s*[A-Z]))", input_string)
# Glue each text piece (even index) to the delimiter after it (odd index)
res = list(map(''.join, zip_longest(res[::2], res[1::2], fillvalue='')))
print(res)
['Sus cosas deben ser llevadas allí, ella tomo a sí,', ' Lucy la hermana menor, esta muy entusiasmada.', ' por verte hoy por la tarde\n', ' sdsdsd']
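As a possible alternative to rejoining pieces, the same delimiters can be expressed as zero-width positions so that nothing is consumed by the split. This is only a sketch, not part of the answer above, and it assumes Python 3.7 or later, where re.split accepts patterns that match the empty string. The variable names are illustrative.

```python
import re

input_string = (
    "Sus cosas deben ser llevadas allí, ella tomo a sí, Lucy la hermana menor, "
    "esta muy entusiasmada. por verte hoy por la tarde\n sdsdsd"
)

# Split at the empty position right after '.', ';' or a newline, or right
# after a comma that is followed by optional whitespace and an ASCII capital.
pattern = r"(?<=[.;\n])|(?<=,)(?=\s*[A-Z])"
parts = re.split(pattern, input_string)
print(parts)
```

If the input ends with one of the delimiters, this variant leaves an empty string as the last element, which may need to be filtered out.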
Suppose the text when printed appeared as follows.
Sus cosas deben ser llevadas allí, ella tomo a sí, Lucy la hermana menor, esta muy entusiasmada. por verte hoy por la tarde
sdsdsd
abc; def. ghi.
jkl
That is, the original string is obtained by joining these lines with newline characters. This is just to make it easier to visualize the given string.
Notice that the first part of the string is Spanish, which includes non-ASCII letters. We therefore need to set the re.U flag to match full Unicode (in Python 3 this is already the default for str patterns, so the flag is redundant but harmless).
If we replace each match of the regular expression
r'([;.,])(?:(?<=,)(?=\s*[A-Z])|(?<!,))\s*'
with r'\1\n' (\1 being the content of capture group 1) we obtain the following string, as shown when printed.
Sus cosas deben ser llevadas allí, ella tomo a sí,
Lucy la hermana menor, esta muy entusiasmada.
por verte hoy por la tarde
sdsdsd
abc;
def.
ghi.
jkl
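Note that \p{Lu} matches any uppercase Unicode letter. Python's built-in re module does not support \p{...} classes, which is why the pattern above uses [A-Z]; the third-party regex module does support \p{Lu}.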
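Since [A-Z] only covers the 26 ASCII capitals, a comma followed by a name such as "Ángela" would not be treated as a separator. One way to stay within the standard library is to find candidate delimiters with re.finditer and check the following character with str.isupper(), which is Unicode-aware. The sketch below is an assumption-laden illustration, not the answer's method: the function name split_keep_delimiters and the sample sentence are hypothetical.

```python
import re

def split_keep_delimiters(text):
    """Hypothetical helper: cut after '.', ';', a newline, or a comma
    whose next non-space character is an uppercase letter."""
    parts = []
    start = 0
    # Group 1 captures the first non-space character after a comma.
    for m in re.finditer(r"[.;\n]|,(?=\s*(\S))", text):
        if m.group(0) == "," and not m.group(1).isupper():
            continue  # comma followed by a lowercase letter, digit, etc.
        parts.append(text[start:m.end()])
        start = m.end()
    if start < len(text):
        parts.append(text[start:])  # trailing text without a delimiter
    return parts

sample = "uno, Ángela vino, ella no. fin"
print(split_keep_delimiters(sample))
```

With the ASCII-only lookahead (?=\s*[A-Z]), the comma before "Ángela" would not produce a split.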
It remains to split this string on r'\n\s*' to obtain the desired result:
["Sus cosas deben ser llevadas allí, ella tomo a sí,",
"Lucy la hermana menor, esta muy entusiasmada.",
"por verte hoy por la tarde", "sdsdsd", "abc;", "def.", "ghi.", "jkl"]
In sum, with repl = r'\1\n' and s holding the input text, we can write
re.split(r'\n\s*', re.sub(r'([;.,])(?:(?<=,)(?=\s*[A-Z])|(?<!,))\s*', repl, s, flags=re.U))
The regular expression has the following elements.
([;.,])        Match a character in the character class and save it to capture group 1
(?:            Begin a non-capture group
  (?<=         Begin a positive lookbehind
    ,          Match ','
  )            End positive lookbehind
  (?=          Begin a positive lookahead
    \s*        Match zero or more whitespace characters
    [A-Z]      Match an ASCII uppercase letter (\p{Lu}, a Unicode uppercase letter, in engines that support it)
  )            End positive lookahead
  |            Or
  (?<!         Begin a negative lookbehind
    ,          Match ','
  )            End negative lookbehind
)              End non-capture group
\s*            Match zero or more whitespace characters
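The two steps of this answer (mark each break with a newline, then split on newlines) can be put together as a runnable sketch. It uses the multi-line sample text shown earlier in this answer; the variable names text, marked and pieces are illustrative, and re.U is omitted because it is the default for str patterns in Python 3.

```python
import re

# The sample text, built by joining the displayed lines with newlines.
text = "\n".join([
    "Sus cosas deben ser llevadas allí, ella tomo a sí, Lucy la hermana menor, "
    "esta muy entusiasmada. por verte hoy por la tarde",
    "sdsdsd",
    "abc; def. ghi.",
    "jkl",
])

pattern = r"([;.,])(?:(?<=,)(?=\s*[A-Z])|(?<!,))\s*"

# Step 1: keep the delimiter (\1) and replace the whitespace after it with "\n".
marked = re.sub(pattern, r"\1\n", text)

# Step 2: split on each newline plus any whitespace that follows it.
pieces = re.split(r"\n\s*", marked)
print(pieces)
```

Unlike the first answer, this approach strips the whitespace that follows each delimiter, so the pieces do not start with a space.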