Split string by comma, but ignore commas within brackets
Question:
I’m trying to split a string by commas using python:
s = "year:2020,concepts:[ab553,cd779],publisher:elsevier"
But I want to ignore any commas within brackets []. So the result for above would be:
["year:2020", "concepts:[ab553,cd779]", "publisher:elsevier"]
Anybody have advice on how to do this? I tried to use re.split like so:
params = re.split(",(?![wds])", param)
But it is not working properly.
Answers:
result = re.split(r",(?!(?:[^,[]]+,)*[^,[]]+])", subject, 0)
, # Match the character “,” literally
(?! # Assert that it is impossible to match the regex below starting at this position (negative lookahead)
(?: # Match the regular expression below
[^,[]] # Match any single character NOT present in the list below
# The literal character “,”
# The literal character “[”
# The literal character “]”
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
, # Match the character “,” literally
)
* # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
[^,[]] # Match any single character NOT present in the list below
# The literal character “,”
# The literal character “[”
# The literal character “]”
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
] # Match the character “]” literally
)
Updated to support more than 2 items in brackets. E.g.
year:2020,concepts:[ab553,cd779],publisher:elsevier,year:2020,concepts:[ab553,cd779,xx345],publisher:elsevier
You can work this out using a user-defined function instead of split:
s = "year:2020,concepts:[ab553,cd779],publisher:elsevier"
def split_by_commas(s):
lst = list()
last_bracket = ''
word = ""
for c in s:
if c == '[' or c == ']':
last_bracket = c
if c == ',' and last_bracket == ']':
lst.append(word)
word = ""
continue
elif c == ',' and last_bracket == '[':
word += c
continue
elif c == ',':
lst.append(word)
word = ""
continue
word += c
lst.append(word)
return lst
main_lst = split_by_commas(s)
print(main_lst)
The result of the run of above code:
['year:2020', 'concepts:[ab553,cd779]', 'publisher:elsevier']
This regex works on your example:
,(?=[^,]+?:)
Here, we use a positive lookahead to look for commas followed by non-comma and colon characters, then a colon. This correctly finds the <comma><key>
pattern you are searching for. Of course, if the keys are allowed to have commas, this would have to be adapted a little further.
You can check out the regexr here
Using a pattern with only a lookahead to assert a character to the right, will not assert if there is an accompanying character on the left.
Instead of using split, you could either match 1 or more repetitions of values between square brackets, or match any character except a comma.
(?:[^,]*[[^][]*])+[^,]*|[^,]+
s = "year:2020,concepts:[ab553,cd779],publisher:elsevier"
params = re.findall(r"(?:[^,]*[[^][]*])+[^,]*|[^,]+", s)
print(params)
Output
['year:2020', 'concepts:[ab553,cd779]', 'publisher:elsevier']
I adapted @Bemwa’s solution (which didn’t work for my use-case)
def split_by_commas(s):
lst = list()
brackets = 0
word = ""
for c in s:
if c == "[":
brackets += 1
elif c == "]":
if brackets > 0:
brackets -= 1
elif c == "," and not brackets:
lst.append(word)
word = ""
continue
word += c
lst.append(word)
return lst
I’m trying to split a string by commas using python:
s = "year:2020,concepts:[ab553,cd779],publisher:elsevier"
But I want to ignore any commas within brackets []. So the result for above would be:
["year:2020", "concepts:[ab553,cd779]", "publisher:elsevier"]
Anybody have advice on how to do this? I tried to use re.split like so:
params = re.split(",(?![wds])", param)
But it is not working properly.
result = re.split(r",(?!(?:[^,[]]+,)*[^,[]]+])", subject, 0)
, # Match the character “,” literally
(?! # Assert that it is impossible to match the regex below starting at this position (negative lookahead)
(?: # Match the regular expression below
[^,[]] # Match any single character NOT present in the list below
# The literal character “,”
# The literal character “[”
# The literal character “]”
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
, # Match the character “,” literally
)
* # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
[^,[]] # Match any single character NOT present in the list below
# The literal character “,”
# The literal character “[”
# The literal character “]”
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
] # Match the character “]” literally
)
Updated to support more than 2 items in brackets. E.g.
year:2020,concepts:[ab553,cd779],publisher:elsevier,year:2020,concepts:[ab553,cd779,xx345],publisher:elsevier
You can work this out using a user-defined function instead of split:
s = "year:2020,concepts:[ab553,cd779],publisher:elsevier"
def split_by_commas(s):
lst = list()
last_bracket = ''
word = ""
for c in s:
if c == '[' or c == ']':
last_bracket = c
if c == ',' and last_bracket == ']':
lst.append(word)
word = ""
continue
elif c == ',' and last_bracket == '[':
word += c
continue
elif c == ',':
lst.append(word)
word = ""
continue
word += c
lst.append(word)
return lst
main_lst = split_by_commas(s)
print(main_lst)
The result of the run of above code:
['year:2020', 'concepts:[ab553,cd779]', 'publisher:elsevier']
This regex works on your example:
,(?=[^,]+?:)
Here, we use a positive lookahead to look for commas followed by non-comma and colon characters, then a colon. This correctly finds the <comma><key>
pattern you are searching for. Of course, if the keys are allowed to have commas, this would have to be adapted a little further.
You can check out the regexr here
Using a pattern with only a lookahead to assert a character to the right, will not assert if there is an accompanying character on the left.
Instead of using split, you could either match 1 or more repetitions of values between square brackets, or match any character except a comma.
(?:[^,]*[[^][]*])+[^,]*|[^,]+
s = "year:2020,concepts:[ab553,cd779],publisher:elsevier"
params = re.findall(r"(?:[^,]*[[^][]*])+[^,]*|[^,]+", s)
print(params)
Output
['year:2020', 'concepts:[ab553,cd779]', 'publisher:elsevier']
I adapted @Bemwa’s solution (which didn’t work for my use-case)
def split_by_commas(s):
lst = list()
brackets = 0
word = ""
for c in s:
if c == "[":
brackets += 1
elif c == "]":
if brackets > 0:
brackets -= 1
elif c == "," and not brackets:
lst.append(word)
word = ""
continue
word += c
lst.append(word)
return lst