Is there a way to split a string on delimiters including colon(:) except when it involves time?

Question:

I am trying to split the string below on a number of delimiters including \n, comma (,), and colon (:), except when the colon is part of a time value. Below is my string:

values = 'City:hell\nCountry:rome\nUpdate date: 2022-09-26 00:00:00'

I have tried:

result = re.split(':|,|\n', values)

However, this ends up splitting the time, resulting in

['City','hell','Country','rome','Update date',' 2022-09-26 00','00','00']

Whereas the expected outcome is

['City','hell','Country','rome','Update date', '2022-09-26 00:00:00']

Any help/assistance will be appreciated

Answers:

Use a negative lookahead:

re.split(r':(?!\d)|,|\n', values)

This says: split on a colon only if it is not followed by a digit (\d).
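
As a quick check, here is a minimal sketch of the lookahead split applied to the string from the question. The `.strip()` step is an addition for illustration only; it drops the space left after `Update date:`.

```python
import re

values = 'City:hell\nCountry:rome\nUpdate date: 2022-09-26 00:00:00'

# Split on a colon only when the next character is not a digit,
# and on commas and newlines unconditionally.
parts = re.split(r':(?!\d)|,|\n', values)

# The colon after "Update date" is followed by a space, so that space
# stays at the start of the last field; strip() removes it.
result = [p.strip() for p in parts]
print(result)
# ['City', 'hell', 'Country', 'rome', 'Update date', '2022-09-26 00:00:00']
```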

Answered By: TheMaster

Solution without re:

values = "City:hell\nCountry:rome\nUpdate date: 2022-09-26 00:00:00"

out = [
    v.strip()
    for l in (line.split(":", maxsplit=1) for line in values.splitlines())
    for v in l
]
print(out)

Prints:

['City', 'hell', 'Country', 'rome', 'Update date', '2022-09-26 00:00:00']
Answered By: Andrej Kesely

You could use look-behind to ensure that what is before : is not a pair of digits

re.split(r'(?<![0-9]{2}):\s*|,|\n', values)

It separates by

  • colons with optional spaces, when they are not preceded by a pair of digits
  • ,
  • \n

So : is a separator (when not preceded by a pair of digits), but so is a colon followed by one or more spaces (again, only when not preceded by a pair of digits). The consequence is that if there is a space after a colon, as is the case in your string, that space is not included in the next field, since it is part of the separator and not of the field.

Or, you could also keep the first version of my answer (without \s*) and just .strip() the fields.
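
To compare the two variants side by side, a small sketch on the question's string could look like the following. The variable names `with_spaces` and `stripped` are made up for this illustration.

```python
import re

values = 'City:hell\nCountry:rome\nUpdate date: 2022-09-26 00:00:00'

# Variant 1: the separator absorbs any whitespace after the colon.
with_spaces = re.split(r'(?<![0-9]{2}):\s*|,|\n', values)

# Variant 2: plain look-behind, then strip each field afterwards.
stripped = [f.strip() for f in re.split(r'(?<![0-9]{2}):|,|\n', values)]

print(with_spaces)
# ['City', 'hell', 'Country', 'rome', 'Update date', '2022-09-26 00:00:00']
print(with_spaces == stripped)
# True
```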

Answered By: chrslg

What about doing a negative lookahead and look behind for digits?

something like..

re.split(r"(?<![0-9]):(?![0-9])|\n", values)

This will work as long as no key ends in a digit and no value starts with a digit directly after the colon; in either case that colon is not treated as a separator.
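
To see where this pattern stops splitting, a small sketch may help. The inputs `Zone1:north` and `Code:42` are hypothetical and do not come from the question.

```python
import re

pattern = r"(?<![0-9]):(?![0-9])|\n"

# The string from the question splits as intended
# (the last field keeps its leading space).
values = 'City:hell\nCountry:rome\nUpdate date: 2022-09-26 00:00:00'
ok = re.split(pattern, values)
print(ok)
# ['City', 'hell', 'Country', 'rome', 'Update date', ' 2022-09-26 00:00:00']

# Hypothetical inputs: a digit on either side of the colon blocks the split.
key_ends_in_digit = re.split(pattern, 'Zone1:north')
value_starts_with_digit = re.split(pattern, 'Code:42')
print(key_ends_in_digit)        # ['Zone1:north']
print(value_starts_with_digit)  # ['Code:42']
```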

Answered By: Jill HM

You can split with a regex where the colon is matched only when not in between digits:

re.split(r'[,\n]|:(?!(?<=\d.)\d)', text)

See the regex demo. Here, [,\n]|:(?!(?<=\d.)\d) matches a comma, a newline char, or a colon that is not immediately followed by a digit that is itself immediately preceded by a digit and any char (here, the colon).

You can match and extract the time pattern – \b(?:[01]?[0-9]|2[0-3]):[0-5]?\d:[0-5]?\d\b – or any char other than a newline, colon and comma one or more times:

re.findall(r'(?:\b(?:[01]?[0-9]|2[0-3]):[0-5]?\d:[0-5]?\d\b|[^:,\n])+', text)

See the regex demo.

Details:

  • (?: – start of a non-capturing group
    • \b(?:[01]?[0-9]|2[0-3]):[0-5]?\d:[0-5]?\d\b – word boundary, a number from 0 to 23, and then two occurrences of a : char and a number from 0 to 59, followed by a word boundary
    • | – or
    • [^:,\n] – any char other than :, , and a newline
  • )+ – end of the grouping, one or more times.
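
As an illustration of the extraction approach, here is a minimal sketch run against the string from the question. The `TIME` constant and the final `.strip()` are additions for readability and are not part of the pattern above.

```python
import re

text = 'City:hell\nCountry:rome\nUpdate date: 2022-09-26 00:00:00'

# Time sub-pattern, factored out only to keep the main regex readable.
TIME = r'\b(?:[01]?[0-9]|2[0-3]):[0-5]?\d:[0-5]?\d\b'

# Each match is a run of "whole time value" or "any non-delimiter char".
fields = re.findall(rf'(?:{TIME}|[^:,\n])+', text)

# The last field still starts with the space that followed the colon.
result = [f.strip() for f in fields]
print(result)
# ['City', 'hell', 'Country', 'rome', 'Update date', '2022-09-26 00:00:00']
```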
Answered By: Wiktor Stribiżew

If all your time values are ##:##:## you can be extra careful to only replace that particular pattern, by using a substitute delimiter temporarily (as per my comment to your original question):

import re

values = 'City:hell\nCountry:rome\nUpdate date: 2022-09-26 12:34:56'

# Temporarily replace the colons inside ##:##:## with a substitute delimiter
newvalues = re.sub(r"(\d\d):(\d\d):(\d\d)", r"\1&\2&\3", values)

splitvalues = re.split(':|,|\n', newvalues)

# Restore the colons in each field
splitsrepaired = [re.sub(r"(\d\d)&(\d\d)&(\d\d)", r"\1:\2:\3", x) for x in splitvalues]

print(f"{splitsrepaired=}")

Prints:

splitsrepaired=['City', 'hell', 'Country', 'rome', 'Update date', ' 2022-09-26 12:34:56']
Answered By: RufusVS