Python split string with split character and escape character

Question:

In Python, how can I split a string with a regex according to the following rules:

  1. Split by a split char (e.g. ;)
  2. Don’t split if that split char is escaped by an escape char (e.g. :).
  3. Do split if the escape char is itself escaped by another escape char (e.g. ::;).

So splitting

"foo;bar:;baz::;one:two;::three::::;four;;five:::;six;:seven;::eight"

should yield

["foo", "bar:;baz::", "one:two", "::three::::", "four", "", "five:::;six", ":seven", "::eight"]

My own attempt was:

re.split(r'(?<!:);', str)

which cannot handle rule #3.

Asked By: Kilian Röhner


Answers:

You could use the regex module (from PyPI) with the following pattern to split on:

(?<!:)(?:::)*\K;


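  • (?<!:) – Negative lookbehind; asserts that the preceding char is not a colon.
  • (?:::)* – A non-capturing group matching two literal colons, repeated 0+ times.
  • \K – Reset the starting point of the reported match, so the colons matched so far are not part of it.
  • ; – A literal semicolon.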

For example:

import regex as re  # third-party module (pip install regex); the stdlib re has no \K

s = 'foo;bar:;baz::;one:two;::three::::;four;;five:::;six;:seven;::eight'
lst = re.split(r'(?<!:)(?:::)*\K;', s)
print(lst)
# ['foo', 'bar:;baz::', 'one:two', '::three::::', 'four', '', 'five:::;six', ':seven', '::eight']
Answered By: JvdV
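
The standard-library re module does not support \K, so the pattern above needs the third-party regex package. As a rough stdlib-only sketch of the same idea (the helper name split_on_unescaped is hypothetical, and the split and escape chars are hard-coded to ; and :), one could match the same pattern without \K and cut the string at the final char of each match by hand:

```python
import re

# Same pattern as above, minus \K: an even-length run of colons
# (not preceded by another colon) followed by a semicolon.
SEPARATOR = re.compile(r'(?<!:)(?:::)*;')


def split_on_unescaped(text):
    """Split text on ';' unless the ';' is escaped by a single ':'."""
    parts = []
    start = 0
    for match in SEPARATOR.finditer(text):
        # The match also covers the colons in front of the semicolon.
        # Only its last char is the delimiter, which is what \K would report.
        sep_pos = match.end() - 1
        parts.append(text[start:sep_pos])
        start = match.end()
    parts.append(text[start:])
    return parts


s = 'foo;bar:;baz::;one:two;::three::::;four;;five:::;six;:seven;::eight'
print(split_on_unescaped(s))
```

For the sample string this should give the same list as the regex.split call above, including the empty field between four and five.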

If matching is also an option, and the empty match '' is not required:

(?::[:;]|[^;\n])+
  • (?: Non-capturing group
    • :[:;] Match : followed by either : or ;
    • | Or
    • [^;\n] Match any single char except ; or a newline
  • )+ Close the non-capturing group and repeat it 1+ times


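import re

# Match whole fields instead of splitting: a field is a run of
# escape pairs (:: or :;) and chars other than ; or a newline.
pattern = r"(?::[:;]|[^;\n])+"
s = "foo;bar:;baz::;one:two;::three::::;four;;five:::;six;:seven;::eight"
print(re.findall(pattern, s))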

Output

['foo', 'bar:;baz::', 'one:two', '::three::::', 'four', 'five:::;six', ':seven', '::eight']

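If you also want the empty match, but not one produced by an escaped delimiter as in field:;;field, you can use the PyPI regex module, which supports variable-length lookbehind (the stdlib re does not). The extra alternative matches the empty string at a position where the text to the left ends in one or more ; not preceded by :, and a ; follows directly to the right.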

(?::[:;]|[^;\n]|(?<=(?<!:);+)(?=;))+


Example

import regex as re  # third-party module; needed for the variable-length lookbehind

s = 'foo;bar:;baz::;one:two;::three::::;four;;five:::;six;:seven;::eight:;;field'
pattern = r'(?::[:;]|[^;\n]|(?<=(?<!:);+)(?=;))+'
res = re.findall(pattern, s)
print(res)

Output

['foo', 'bar:;baz::', 'one:two', '::three::::', 'four', '', 'five:::;six', ':seven', '::eight:;', 'field']
Answered By: The fourth bird
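
The three rules from the question can also be checked without any regex. The following is only a minimal sketch and not taken from either answer: the function name split_escaped and its sep and esc parameters are hypothetical, and it assumes the escape char always protects exactly the one char that follows it.

```python
def split_escaped(text, sep=";", esc=":"):
    """Split text on sep, treating esc + any char as an escaped pair."""
    parts = []
    current = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == esc and i + 1 < len(text):
            # Escape pair: keep both chars verbatim and skip past them,
            # so an escaped sep or an escaped esc never triggers a split.
            current.append(text[i:i + 2])
            i += 2
        elif ch == sep:
            # Unescaped split char: close the current field.
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    parts.append("".join(current))
    return parts


s = "foo;bar:;baz::;one:two;::three::::;four;;five:::;six;:seven;::eight"
print(split_escaped(s))
```

Scanning left to right pairs up the escape chars in the same way as the (?:::)* group in the first answer, so for the sample string the result should match the list expected in the question, empty field included.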