How to replace commas with semicolons, except commas inside quotes, in Apache Beam (Python)
Question:
I want to replace the commas in each line of text with semicolons, except for commas that are inside quotation marks. The lines look like this:
1001,838,"Calabash, Water Spinach",2000-01-01
I tried writing a DoFn class and calling it from ParDo as follows:
class fixFormat(beam.DoFn):
    def process(self, element):
        orderl = element.split('"')
        leftp = orderl[0].split(',')
        rightp = orderl[2].split(',')
        middlepart = orderl[1]
        finalp = leftp + [middlepart] + rightp
        new_line = ''
        for part in finalp:
            # skip empty strings so they are not added to the output
            if part:
                new_line += part + ';'
        yield new_line

class Transform(beam.DoFn):
    def process(self, element):
        yield element

Create_2 = (p | 'Read lines2' >> beam.io.ReadFromText('orders_v.csv', skip_header_lines=1)
            | 'format line2' >> beam.ParDo(Transform())
            | 'fix' >> beam.ParDo(fixFormat()))
ib.show(Create_2, n=5, duration=5)
I get the following error:
IndexError Traceback (most recent call last)
<ipython-input-5-b79548323b98> in process(self, element)
3 orderl = element.split('"')
4 leftp = orderl[0].split(',')
----> 5 rightp = orderl[2].split(',')
6 middlepart = orderl[1]
7 finalp = leftp + [middlepart] + rightp
IndexError: list index out of range [while running '[7]: fix']
It seems the element is being treated as unsplit text, even though fixFormat splits it on the first line of process. I'm not sure what I am missing. Please assist.
Answers:
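A note on the error itself: element.split('"') only yields three parts when the line contains a quoted field. A line with no double quotes comes back as a one-element list, so orderl[2] does not exist, which is the likely cause of the IndexError. The small check below illustrates this; the unquoted sample line is made up for illustration and is not taken from the asker's file.

```python
quoted = '1001,838,"Calabash, Water Spinach",2000-01-01'
unquoted = '1002,839,Spinach,2000-01-02'  # hypothetical line with no quoted field

# Splitting on the double quote gives: text before, quoted text, text after.
parts_quoted = quoted.split('"')
# With no double quote in the line, split returns the whole line as one item.
parts_unquoted = unquoted.split('"')

print(len(parts_quoted))    # 3
print(len(parts_unquoted))  # 1
```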
Here is a one-liner approach using a regex with re.findall. The pattern first tries to match a double-quoted term; only if that fails does it match a plain term containing no commas. This preserves the commas inside double quotes. The list returned by re.findall is then joined with semicolons.
import re
inp = '1001,838,"Calabash, Water Spinach",2000-01-01'
terms = ';'.join(re.findall(r'".*?"|[^,]+', inp))
print(terms) # 1001;838;"Calabash, Water Spinach";2000-01-01
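To apply this per line in a pipeline, the regex can be wrapped in a plain function. This is a minimal sketch, not a tested Beam transform: to_semicolons is a hypothetical name, and because the pattern matches only non-empty terms, empty fields are dropped (a,,b becomes a;b), so it fits only data without blank columns.

```python
import re

# Match a double-quoted term first; otherwise match a run of non-comma characters.
TERM = re.compile(r'".*?"|[^,]+')

def to_semicolons(line):
    """Join the CSV terms of one line with semicolons, keeping quoted commas."""
    return ';'.join(TERM.findall(line))

# Works for lines with and without a quoted field (second line is made up).
print(to_semicolons('1001,838,"Calabash, Water Spinach",2000-01-01'))
print(to_semicolons('1002,839,Spinach,2000-01-02'))
```

In the pipeline from the question, a function like this could stand in for the fixFormat step, for example via beam.Map(to_semicolons).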
You could also consider using Beam's DataFrame API if you are trying to manipulate CSV files.
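If a dataframe layer is more than the job needs, Python's standard csv module is another option for handling the quoting rules. The sketch below is a swapped-in alternative, not the method from either answer, and commas_to_semicolons is a hypothetical name. Note one difference in behavior: the writer only quotes fields that need it under the new delimiter, so the quotes around a field containing just a comma are dropped, while empty fields are preserved.

```python
import csv
import io

def commas_to_semicolons(line):
    """Re-delimit one CSV line with semicolons using the standard csv module."""
    # csv.reader understands quoted fields, so an embedded comma stays in one field.
    fields = next(csv.reader([line]))
    out = io.StringIO()
    # QUOTE_MINIMAL quotes only fields that contain the delimiter or a quote character.
    writer = csv.writer(out, delimiter=';', quoting=csv.QUOTE_MINIMAL)
    writer.writerow(fields)
    # Drop the line terminator the writer appends.
    return out.getvalue().rstrip('\r\n')

print(commas_to_semicolons('1001,838,"Calabash, Water Spinach",2000-01-01'))
```

A field that itself contains a semicolon is written back in quotes, so the output stays parseable as semicolon-delimited CSV.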