Is there a way to split a string on delimiters including colon(:) except when it involves time?
Question:
I am trying to split the string below on a number of delimiters, including newline (\n), comma (,), and colon (:), except when the colon is part of a time value. Below is my string:
values = 'City:hell\nCountry:rome\nUpdate date: 2022-09-26 00:00:00'
I have tried:
result = re.split(':|,|\n', values)
However, this ends up splitting the time as well, resulting in:
['City','hell','Country','rome','Update date',' 2022-09-26 00','00','00']
Whereas the expected outcome is
['City','hell','Country','rome','Update date', '2022-09-26 00:00:00']
Any help/assistance will be appreciated
Answers:
Use a negative lookahead:
re.split(r':(?!\d)|,|\n', values)
This says: split on a colon only if it is not followed by a digit (\d).
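As a quick check of the lookahead approach, the sketch below runs it on the sample string from the question, assuming the separators are real newline characters (\n) and the digit class is \d. The space after "Update date:" stays in the last field, so the fields are stripped afterwards; the names raw_fields and fields are only illustrative.

```python
import re

values = 'City:hell\nCountry:rome\nUpdate date: 2022-09-26 00:00:00'

# Split on a colon that is not followed by a digit, on a comma, or on a newline.
raw_fields = re.split(r':(?!\d)|,|\n', values)

# The space after "Update date:" is not part of the separator, so strip each field.
fields = [f.strip() for f in raw_fields]
print(fields)
```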
Solution without re:
values = "City:hell\nCountry:rome\nUpdate date: 2022-09-26 00:00:00"
# Split each line on its first colon only, then flatten and strip the parts
out = [
    v.strip()
    for parts in (line.split(":", maxsplit=1) for line in values.splitlines())
    for v in parts
]
print(out)
Prints:
['City', 'hell', 'Country', 'rome', 'Update date', '2022-09-26 00:00:00']
You could use a look-behind to ensure that what precedes the colon is not a pair of digits:
re.split(r'(?<![0-9]{2}):\s*|,|\n', values)
It separates on:
- colons followed by optional whitespace, when they are not preceded by a pair of digits
- commas (,)
- newlines (\n)
So ":" is a separator (when not preceded by a pair of digits), but so is ": " or ":  " (still, when not preceded by a pair of digits). The consequence is that if, as is the case in your string, there is a space after a colon, that space is not included in the next field (since it is part of the separator, not of the field).
Or you could keep the first version of my answer (without \s*) and just .strip() the fields.
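To compare the two look-behind variants side by side, here is a minimal sketch on the question's sample string, assuming \s and \n are the intended escapes. The names with_spaces and stripped are made up for illustration.

```python
import re

values = 'City:hell\nCountry:rome\nUpdate date: 2022-09-26 00:00:00'

# Variant 1: the separator swallows any whitespace that follows the colon.
with_spaces = re.split(r'(?<![0-9]{2}):\s*|,|\n', values)

# Variant 2: split on the bare colon, then strip each field afterwards.
stripped = [f.strip() for f in re.split(r'(?<![0-9]{2}):|,|\n', values)]

print(with_spaces)
print(stripped)
```

On this input both variants give the same list; they would differ only if a field needed to keep leading or trailing whitespace.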
What about doing a negative lookahead and a negative look-behind for digits? Something like:
re.split(r"(?<![0-9]):(?![0-9])|\n", values)
This will work as long as no key ends in a digit and no value starts with a digit right after the colon.
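The limitation is easier to see on concrete input. The sketch below uses the pattern with \n as the newline escape; the strings 'Zip:12345' and 'Room1:blue' are invented examples, not data from the question.

```python
import re

pattern = r"(?<![0-9]):(?![0-9])|\n"

# Works: the key ends in a letter and the value starts with a space.
ok = re.split(pattern, 'City:hell\nUpdate date: 2022-09-26 00:00:00')

# Not split: the value starts with a digit, so the lookahead rejects the colon.
value_starts_with_digit = re.split(pattern, 'Zip:12345')

# Not split: the key ends with a digit, so the look-behind rejects the colon.
key_ends_with_digit = re.split(pattern, 'Room1:blue')

print(ok)
print(value_starts_with_digit)
print(key_ends_with_digit)
```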
You can split with a regex where the colon is matched only when it is not between digits:
re.split(r'[,\n]|:(?!(?<=\d.)\d)', text)
See the regex demo. Here, [,\n]|:(?!(?<=\d.)\d) matches a comma, a newline char, or a colon that is not immediately followed by a digit that is itself immediately preceded by a digit and any one char (here, the colon).
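A short sketch of this "not between digits" pattern follows, assuming \d and \n are the intended escapes. The inputs 'Zip:12345' and 'Room1:blue' are hypothetical and only there to show that a digit on a single side of the colon does not block the split.

```python
import re

pattern = r'[,\n]|:(?!(?<=\d.)\d)'
text = 'City:hell\nCountry:rome\nUpdate date: 2022-09-26 00:00:00'

# Colons inside 00:00:00 sit between digits, so they are not separators.
fields = [f.strip() for f in re.split(pattern, text)]

# A digit on only one side of the colon still allows the split.
digit_after = re.split(pattern, 'Zip:12345')
digit_before = re.split(pattern, 'Room1:blue')

print(fields)
print(digit_after, digit_before)
```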
You can match and extract the time pattern, \b(?:[01]?[0-9]|2[0-3]):[0-5]?\d:[0-5]?\d\b, or any char other than a newline, colon and comma, one or more times:
re.findall(r'(?:\b(?:[01]?[0-9]|2[0-3]):[0-5]?\d:[0-5]?\d\b|[^:,\n])+', text)
See the regex demo.
Details:
- (?: starts a non-capturing group
- \b(?:[01]?[0-9]|2[0-3]):[0-5]?\d:[0-5]?\d\b matches a word boundary, a number from 0 to 23, then two occurrences of a : char followed by a number from 0 to 59, and a final word boundary
- | means "or"
- [^:,\n] matches any char other than a colon, a comma, and a newline
- )+ ends the group, which repeats one or more times.
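The matching approach can be tried on the question's string as follows. This is a minimal sketch assuming \b, \d and \n are the intended escapes; the names time_or_char, matches and fields are illustrative. Because the time alternative is tried first at each position, the colons inside 00:00:00 are consumed as part of the field instead of ending it.

```python
import re

text = 'City:hell\nCountry:rome\nUpdate date: 2022-09-26 00:00:00'

# Either a whole hh:mm:ss time, or any single char that is not a delimiter.
time_or_char = r'(?:\b(?:[01]?[0-9]|2[0-3]):[0-5]?\d:[0-5]?\d\b|[^:,\n])+'

matches = re.findall(time_or_char, text)

# The space after "Update date:" is an ordinary char, so strip the fields.
fields = [m.strip() for m in matches]
print(fields)
```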
If all your time values are ##:##:## you can be extra careful to only replace that particular pattern, by using a substitute delimiter temporarily (as per my comment to your original question):
import re
values = 'City:hell\nCountry:rome\nUpdate date: 2022-09-26 12:34:56'
# Temporarily swap the colons inside hh:mm:ss values for "&"
newvalues = re.sub(r"(\d\d):(\d\d):(\d\d)", r"\1&\2&\3", values)
splitvalues = re.split(':|,|\n', newvalues)
# Restore the colons in each field
splitsrepaired = [re.sub(r"(\d\d)&(\d\d)&(\d\d)", r"\1:\2:\3", x) for x in splitvalues]
print(f"{splitsrepaired=}")
Prints:
splitsrepaired=['City', 'hell', 'Country', 'rome', 'Update date', ' 2022-09-26 12:34:56']