Python split string with split character and escape character
Question:
In Python, how can I split a string with a regex by the following ruleset:
- Split by a split char (e.g. ;)
- Don’t split if that split char is escaped by an escape char (e.g. :).
- Do the split if the escape char is escaped by itself
So splitting
"foo;bar:;baz::;one:two;::three::::;four;;five:::;six;:seven;::eight"
should yield
["foo", "bar:;baz::", "one:two", "::three::::", "four", "", "five:::;six", ":seven", "::eight"]
My own attempt was:
re.split(r'(?<!:);', str)
Which cannot handle rule #3
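To make the failure concrete, here is a small sketch that runs the attempt above on the sample string. The variable names s and naive are only illustrative. The lookbehind inspects a single character, so a ; that follows an escaped escape char (::) is wrongly treated as escaped.

```python
import re

s = "foo;bar:;baz::;one:two;::three::::;four;;five:::;six;:seven;::eight"

# (?<!:) only looks at the one character directly before ';',
# so 'baz::;' and 'three::::;' are not split even though '::' is an escaped ':'.
naive = re.split(r"(?<!:);", s)
print(naive)
# ['foo', 'bar:;baz::;one:two', '::three::::;four', '', 'five:::;six', ':seven', '::eight']
```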
Answers:
You could use the third-party regex module (from PyPI) with the following pattern to split on:
(?<!:)(?:::)*\K;
(?<!:) – Negative lookbehind: the position is not preceded by a colon.
(?:::)* – A non-capturing group matching two literal colons, 0+ times.
\K – Reset the starting point of the reported match.
; – A literal semicolon.
For example:
import regex as re  # third-party module; the built-in re does not support \K
s = 'foo;bar:;baz::;one:two;::three::::;four;;five:::;six;:seven;::eight'
lst = re.split(r'(?<!:)(?:::)*\K;', s)
print(lst) # ['foo', 'bar:;baz::', 'one:two', '::three::::', 'four', '', 'five:::;six', ':seven', '::eight']
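The rule this pattern encodes is that a ; is a real separator only when it is preceded by an even number (including zero) of escape chars. If installing an extra module is not an option, the same rule can be sketched as a plain character scan. This is an illustration rather than part of the answer; the function name split_escaped and its parameters sep and esc are hypothetical.

```python
def split_escaped(text, sep=";", esc=":"):
    """Split text on sep unless sep follows an odd-length run of esc chars."""
    parts = []
    start = 0  # index where the current field begins
    run = 0    # length of the esc run directly before the current char
    for i, ch in enumerate(text):
        if ch == esc:
            run += 1
            continue
        # An even run means every esc char is itself escaped, so sep is live.
        if ch == sep and run % 2 == 0:
            parts.append(text[start:i])
            start = i + 1
        run = 0
    parts.append(text[start:])
    return parts


s = "foo;bar:;baz::;one:two;::three::::;four;;five:::;six;:seven;::eight"
result = split_escaped(s)
print(result)
# ['foo', 'bar:;baz::', 'one:two', '::three::::', 'four', '', 'five:::;six', ':seven', '::eight']
```

The scan keeps the escape chars in the output, just like the split above; unescaping the fields would be a separate step.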
If matching is also an option, and the empty match '' is not required:
(?::[:;]|[^;\n])+
(?: – Non-capturing group
:[:;] – Match : followed by either : or ;
| – Or
[^;\n] – Match any single char except ; or a newline
)+ – Close the non-capturing group and repeat 1+ times
import re
regex = r"(?::[:;]|[^;\n])+"
# Named s rather than str to avoid shadowing the built-in type
s = "foo;bar:;baz::;one:two;::three::::;four;;five:::;six;:seven;::eight"
print(re.findall(regex, s))
Output
['foo', 'bar:;baz::', 'one:two', '::three::::', 'four', 'five:::;six', ':seven', '::eight']
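If the empty field should be kept while staying with the standard re module, one possible variation is to match each field together with the separator in front of it and capture only the field. This is a sketch rather than one of the posted answers; the names field_pattern and fields are illustrative, and unlike the pattern above it does not treat a newline specially.

```python
import re

s = "foo;bar:;baz::;one:two;::three::::;four;;five:::;six;:seven;::eight"

# (?:^|;)          start of string or a live separator
# ((?::[:;]|[^;])*) the field: escaped pairs '::' / ':;' or any char except ';'
field_pattern = r"(?:^|;)((?::[:;]|[^;])*)"

# findall returns only the captured group, so the separators are dropped.
fields = re.findall(field_pattern, s)
print(fields)
# ['foo', 'bar:;baz::', 'one:two', '::three::::', 'four', '', 'five:::;six', ':seven', '::eight']
```

Because each match consumes a whole field, the next match always starts at an unescaped ; and an escaped one is never mistaken for a separator.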
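If you want an empty match, but not with an escaped delimiter like field:;;field, you can use the PyPI regex module (the variable-length lookbehind is not supported by the built-in re), asserting that to the left of the current position there is 1+ times a ; that is not preceded by a :, and that a ; follows to the right.
(?::[:;]|[^;\n]|(?<=(?<!:);+)(?=;))+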
Example
import regex as re
s = 'foo;bar:;baz::;one:two;::three::::;four;;five:::;six;:seven;::eight:;;field'
pattern = r'(?::[:;]|[^;\n]|(?<=(?<!:);+)(?=;))+'
res = re.findall(pattern, s)
print(res)
Output
['foo', 'bar:;baz::', 'one:two', '::three::::', 'four', '', 'five:::;six', ':seven', '::eight:;', 'field']