Remove All Commas Between Quotes
Question:
I’m trying to remove all commas that are inside quotes (") with Python:
'please,remove all the commas between quotes,"like in here, here, here!"'
                                                          ^     ^
I tried this, but it only removes the first comma inside the quotes:
re.sub(r'(".*?),(.*?")', r'\1\2', 'please,remove all the commas between quotes,"like in here, here, here!"')
Output:
'please,remove all the commas between quotes,"like in here here, here!"'
How can I make it remove all the commas inside the quotes?
Answers:
Assuming you don’t have unbalanced or escaped quotes, you can use this regex based on negative lookahead:
>>> import re
>>> s = r'foo,bar,"foobar, barfoo, foobarfoobar"'
>>> re.sub(r'(?!(([^"]*"){2})*[^"]*$),', '', s)
'foo,bar,"foobar barfoo foobarfoobar"'
This regex matches a comma only if it is inside double quotes, using a negative lookahead to assert that the comma is NOT followed by an even number of quotes.
Notes about the lookahead (?!...):
- ([^"]*"){2} matches a pair of quotes
- (([^"]*"){2})* matches 0 or more pairs of quotes
- [^"]*$ makes sure there are no more quotes after the last matched quote
- So (?!...) asserts that there is not an even number of quotes ahead, and thus matches only commas inside a quoted string.
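To see the quote counting in action, here is a small sketch that wraps the same regex in a function and runs it on a string with two quoted sections. The names INSIDE_QUOTES_COMMA and strip_quoted_commas are just made up for this example. Note that the lookahead scans to the end of the string for every comma, so this may get slow on very long inputs.

```python
import re

# Same pattern as above: match a comma that is NOT followed by an even
# number of quotes, i.e. a comma with an odd number of quotes ahead of it.
INSIDE_QUOTES_COMMA = re.compile(r'(?!(([^"]*"){2})*[^"]*$),')


def strip_quoted_commas(text):
    """Remove commas that sit between a pair of double quotes."""
    return INSIDE_QUOTES_COMMA.sub('', text)


# Commas outside the quotes have 4, 2 or 0 quotes ahead and are kept;
# commas inside have 3 or 1 quotes ahead and are removed.
print(strip_quoted_commas('a,"b,c",d,"e,f,g"'))  # a,"bc",d,"efg"
```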
You can pass a function as the repl argument instead of a replacement string. Just match the entire quoted string and do a simple string replace on the commas.
>>> s = 'foo,bar,"foobar, barfoo, foobarfoobar"'
>>> re.sub(r'"[^"]*"', lambda m: m.group(0).replace(',', ''), s)
'foo,bar,"foobar barfoo foobarfoobar"'
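If you need this in more than one place, the same idea can be written with a precompiled pattern and a named callback instead of the lambda. This is only a sketch; QUOTED, _drop_commas and remove_commas_in_quotes are names invented here. Because the pattern only matches a complete "..." pair, a stray unmatched quote at the end leaves the text after it untouched.

```python
import re

# Matches one complete quoted section: a quote, any non-quote chars, a quote.
QUOTED = re.compile(r'"[^"]*"')


def _drop_commas(match):
    # re.sub calls this once per match and uses the return value
    # as the replacement for that match.
    return match.group(0).replace(',', '')


def remove_commas_in_quotes(text):
    return QUOTED.sub(_drop_commas, text)


print(remove_commas_in_quotes('foo,bar,"foobar, barfoo, foobarfoobar"'))
# foo,bar,"foobar barfoo foobarfoobar"
```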
What about doing it without regex?
input_str = '...'
parts = input_str.split('"')
# After splitting on quotes, the odd-indexed parts are the quoted sections.
# Remove commas only there, then put the quotes back when joining.
parts[1::2] = [part.replace(',', '') for part in parts[1::2]]
result = '"'.join(parts)
Here is another option I came up with if you don’t want to use regex.
input_str = 'please,remove all the commas between quotes,"like in here, here, here!"'

def noCommas(string):
    quotes = False  # True while we are inside a quoted section
    output = ''
    for char in string:
        if char == '"':
            quotes = not quotes  # toggle on every opening or closing quote
        # Keep every character except a comma inside quotes.
        if char != ',' or not quotes:
            output += char
    return output

print(noCommas(input_str))
The answer above that loops through the string character by character is very slow if you want to apply it to something like a 5 MB CSV file.
This seems to be reasonably fast and handles the quoted parts the same way as the for loop (this example replaces the semicolons inside quotes with commas):
#!/usr/bin/env python3
data = 'hoko foko; moko soko; "aaa mo; bia"; "ee mo"; "eka koka"; "koni; masa"; "co co"; ehe mo; "bi; ko"; ko ma\n "ka ku"; "ki; ko"\n "ko;ma"; "ki ma"\n"ehe;";koko'
first_split = data.split('"')
split01 = []  # parts outside the quotes (even indexes)
split02 = []  # parts inside the quotes (odd indexes)
for slc in first_split[0::2]:
    split01.append(slc)
for slc in first_split[1::2]:
    slc_new = ",".join(slc.split(";"))
    split02.append(slc_new)
# Interleave the two lists again; zip stops at the shorter one,
# so append whichever last element was left over.
resultlist = [item for sublist in zip(split01, split02) for item in sublist]
if len(split01) > len(split02):
    resultlist.append(split01[-1])
if len(split01) < len(split02):
    resultlist.append(split02[-1])
result = '"'.join(resultlist)
print(data)
print(split01)
print(split02)
print(result)
Results in:
hoko foko; moko soko; "aaa mo; bia"; "ee mo"; "eka koka"; "koni; masa"; "co co"; ehe mo; "bi; ko"; ko ma
 "ka ku"; "ki; ko"
 "ko;ma"; "ki ma"
"ehe;";koko
['hoko foko; moko soko; ', '; ', '; ', '; ', '; ', '; ehe mo; ', '; ko ma\n ', '; ', '\n ', '; ', '\n', ';koko']
['aaa mo, bia', 'ee mo', 'eka koka', 'koni, masa', 'co co', 'bi, ko', 'ka ku', 'ki, ko', 'ko,ma', 'ki ma', 'ehe,']
hoko foko; moko soko; "aaa mo, bia"; "ee mo"; "eka koka"; "koni, masa"; "co co"; ehe mo; "bi, ko"; ko ma
 "ka ku"; "ki, ko"
 "ko,ma"; "ki ma"
"ehe,";koko
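If you want to check the speed claim on your own data, something like the following could be used. It is only a rough sketch: the three functions are my own compact versions of the loop, split and regex ideas from this thread (all removing commas inside quotes), the sample data is made up, and the timings will depend on your machine and input.

```python
import re
import timeit


def with_loop(text):
    # Character-by-character scan, toggling a flag on every quote.
    in_quotes = False
    out = []
    for ch in text:
        if ch == '"':
            in_quotes = not in_quotes
        if ch == ',' and in_quotes:
            continue
        out.append(ch)
    return ''.join(out)


def with_split(text):
    # Odd-indexed parts after splitting on quotes are the quoted sections.
    parts = text.split('"')
    parts[1::2] = [p.replace(',', '') for p in parts[1::2]]
    return '"'.join(parts)


def with_regex(text):
    return re.sub(r'"[^"]*"', lambda m: m.group(0).replace(',', ''), text)


# Made-up sample with balanced quotes on every line.
sample = 'foo,bar,"foobar, barfoo, foobarfoobar"\n' * 5000

for func in (with_loop, with_split, with_regex):
    seconds = timeit.timeit(lambda: func(sample), number=3)
    print(f'{func.__name__}: {seconds:.3f}s')
```

All three give the same result as long as the quotes are balanced; with an unclosed quote the regex version leaves the tail untouched while the other two treat it as quoted.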