Split string by comma, but ignore commas within brackets

Question

I’m trying to split a string by commas using Python:

s = "year:2020,concepts:[ab553,cd779],publisher:elsevier"

But I want to ignore any commas within brackets []. For the string above, the result would be:

["year:2020", "concepts:[ab553,cd779]", "publisher:elsevier"]

Anybody have advice on how to do this? I tried to use re.split like so:

params = re.split(r",(?![\w\d\s])", param)

But it is not working properly.

Asked By: Casey

Answer 1

result = re.split(r",(?!(?:[^,\[\]]+,)*[^,\[\]]+\])", subject)

,                    # Match the character “,” literally
(?!                  # Assert that it is impossible to match the regex below starting at this position (negative lookahead)
   (?:               # Match the regular expression below
      [^,\[\]]       # Match any single character NOT present in the list below
                     #    The literal character “,”
                     #    The literal character “[”
                     #    The literal character “]”
         +           # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
      ,              # Match the character “,” literally
   )
      *              # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
   [^,\[\]]          # Match any single character NOT present in the list below
                     #    The literal character “,”
                     #    The literal character “[”
                     #    The literal character “]”
      +              # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
   \]                # Match the character “]” literally
)

Updated to support more than 2 items in brackets. E.g.

year:2020,concepts:[ab553,cd779],publisher:elsevier,year:2020,concepts:[ab553,cd779,xx345],publisher:elsevier
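
One way to check the pattern against the longer input above is a short script like the following. This is only a sketch: the names `pattern`, `subject` and `parts` are illustrative, and the pattern assumes the brackets are balanced and not nested.

```python
import re

# Split on a comma unless it is followed by non-bracket, comma-separated
# items and then a closing "]" (i.e. the comma sits inside brackets).
pattern = re.compile(r",(?!(?:[^,\[\]]+,)*[^,\[\]]+\])")

subject = (
    "year:2020,concepts:[ab553,cd779],publisher:elsevier,"
    "year:2020,concepts:[ab553,cd779,xx345],publisher:elsevier"
)

parts = pattern.split(subject)
for part in parts:
    print(part)
```

Each bracketed list, including the one with three items, should come out as a single piece.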

Answered By: Dean Taylor

Answer 2

You can work this out using a user-defined function instead of split:

s = "year:2020,concepts:[ab553,cd779],publisher:elsevier"


def split_by_commas(s):
    lst = list()
    last_bracket = ''
    word = ""
    for c in s:
        if c == '[' or c == ']':
            last_bracket = c
        if c == ',' and last_bracket == ']':
            lst.append(word)
            word = ""
            continue
        elif c == ',' and last_bracket == '[':
            word += c
            continue
        elif c == ',':
            lst.append(word)
            word = ""
            continue
        word += c
    lst.append(word)
    return lst
main_lst = split_by_commas(s)

print(main_lst)

Running the code above prints:

['year:2020', 'concepts:[ab553,cd779]', 'publisher:elsevier']

Answered By: Bemwa Malak

Answer 3

This regex works on your example:

,(?=[^,]+?:)

Here, a positive lookahead matches only commas that are followed by one or more non-comma characters and then a colon. This finds the <comma><key> pattern you are looking for. If keys are allowed to contain commas, the pattern would have to be adapted further.
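
In Python, the lookahead can be passed straight to re.split. The snippet below is a minimal sketch; the names `key_comma`, `parts` and `broken` are made up for illustration, and it assumes that values inside the brackets never contain a colon. The second call shows what happens when that assumption does not hold.

```python
import re

s = "year:2020,concepts:[ab553,cd779],publisher:elsevier"

# Split only at commas that are followed by "<key>:", meaning some
# non-comma characters and then a colon.
key_comma = re.compile(r",(?=[^,]+?:)")

parts = key_comma.split(s)
print(parts)
# ['year:2020', 'concepts:[ab553,cd779]', 'publisher:elsevier']

# Caveat: a colon inside the brackets looks like the start of a new key,
# so the comma before it is treated as a separator.
broken = key_comma.split("k:[a,b:c],z:1")
print(broken)
# ['k:[a', 'b:c]', 'z:1']
```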


Answered By: pvandyken

Answer 4

A pattern that uses only a lookahead to assert a closing bracket to the right does not check whether there is a matching opening bracket to the left.

Instead of using split, you could either match 1 or more repetitions of values between square brackets, or match any character except a comma.

(?:[^,]*\[[^][]*])+[^,]*|[^,]+


import re

s = "year:2020,concepts:[ab553,cd779],publisher:elsevier"
params = re.findall(r"(?:[^,]*\[[^][]*])+[^,]*|[^,]+", s)
print(params)

Output

['year:2020', 'concepts:[ab553,cd779]', 'publisher:elsevier']

Answered By: The fourth bird

Answer 5

I adapted @Bemwa’s solution, which didn’t work for my use case. This version tracks the bracket nesting depth instead of only the last bracket seen:

def split_by_commas(s):
    lst = list()
    brackets = 0
    word = ""
    for c in s:
        if c == "[":
            brackets += 1
        elif c == "]":
            if brackets > 0:
                brackets -= 1
        elif c == "," and not brackets:
            lst.append(word)
            word = ""
            continue
        word += c
    lst.append(word)
    return lst
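
To illustrate where depth counting makes a difference, here is a small self-contained sketch of the same idea using string slicing. The function name `split_top_level` and the `sep`, `opener` and `closer` parameters are hypothetical and not part of the answer above; the nested input is a made-up example.

```python
def split_top_level(s, sep=",", opener="[", closer="]"):
    """Split s on sep, skipping separators nested inside opener/closer pairs."""
    parts = []
    depth = 0  # current bracket nesting depth
    start = 0  # index where the current piece begins
    for i, c in enumerate(s):
        if c == opener:
            depth += 1
        elif c == closer and depth > 0:
            depth -= 1
        elif c == sep and depth == 0:
            # A separator outside all brackets ends the current piece.
            parts.append(s[start:i])
            start = i + 1
    parts.append(s[start:])
    return parts


flat = split_top_level("year:2020,concepts:[ab553,cd779],publisher:elsevier")
nested = split_top_level("a:[x,[y,z],w],b:1")
print(flat)    # ['year:2020', 'concepts:[ab553,cd779]', 'publisher:elsevier']
print(nested)  # ['a:[x,[y,z],w]', 'b:1']
```

A version that only remembers the last bracket seen would split the nested input after the inner "]", because at that point the last bracket is a closing one even though the outer list is still open.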

Answered By: mnieber
