How to replace commas with semicolons, except commas inside quotes, in Apache Beam (Python)

Question:

I want to replace the commas in each line of text with semicolons, except for commas that are inside quotation marks. The lines look like this:
1001,838,"Calabash, Water Spinach",2000-01-01

I tried writing a DoFn class and applying it with ParDo as follows:

class fixFormat(beam.DoFn):
  def process(self, element):
    orderl = element.split('"')
    leftp = orderl[0].split(',')
    rightp = orderl[2].split(',')
    middlepart = orderl[1]
    finalp = leftp + [middlepart] + rightp
    new_line = ''
    for part in finalp:
      # Skip empty strings so they are not added to the output.
      if part:
        new_line += part + ';'
    yield new_line

class Transform(beam.DoFn):
  def process(self, element):
    yield element

Create_2 = (p | 'Read lines2' >> beam.io.ReadFromText('orders_v.csv', skip_header_lines=1)
          | 'format line2' >> beam.ParDo(Transform())
          | 'fix' >> beam.ParDo(fixFormat()))

ib.show(Create_2, n =5, duration = 5)

I get the following error:

IndexError                                Traceback (most recent call last)
<ipython-input-5-b79548323b98> in process(self, element)
      3     orderl = element.split('"')
      4     leftp = orderl[0].split(',')
----> 5     rightp = orderl[2].split(',')
      6     middlepart = orderl[1]
      7     finalp = leftp + [middlepart] + rightp
IndexError: list index out of range [while running '[7]: fix']

It seems the element is being treated as unsplit text, even though the first line of fixFormat.process splits it on the double quote. I am not sure what I am missing. Please assist.

Asked By: Katlego_mich


Answers:

Here is a one-liner approach using a regex with re.findall. The alternation in the pattern first tries to match a double-quoted term; only if that fails does it match a single CSV term, meaning a run of non-comma characters. This preserves the commas inside double quotes. We then join the list returned by re.findall with semicolons.

import re

inp = '1001,838,"Calabash, Water Spinach",2000-01-01'
terms = ';'.join(re.findall(r'".*?"|[^,]+', inp))
print(terms)  # 1001;838;"Calabash, Water Spinach";2000-01-01
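
As for the error in the question: split('"') returns fewer than three pieces for any line with fewer than two double quotes, so orderl[2] does not exist and the DoFn raises IndexError. The regex does not depend on quotes being present. Here is a minimal sketch that wraps it in a function; the name replace_commas and the unquoted sample line are made up for illustration, and it assumes fields never contain escaped quotes.

```python
import re

# Either a double-quoted field (commas allowed inside) or a run of
# non-comma characters; same pattern as the one-liner above.
FIELD_PATTERN = re.compile(r'".*?"|[^,]+')


def replace_commas(line):
    """Join the fields of one CSV line with semicolons (hypothetical helper)."""
    return ';'.join(FIELD_PATTERN.findall(line))


quoted = '1001,838,"Calabash, Water Spinach",2000-01-01'
unquoted = '1002,839,Okra,2000-01-02'  # made-up line without any quotes

# Without quotes, split('"') yields a single piece, so index 2 is out of range.
print(len(unquoted.split('"')))  # 1

print(replace_commas(quoted))    # 1001;838;"Calabash, Water Spinach";2000-01-01
print(replace_commas(unquoted))  # 1002;839;Okra;2000-01-02

# Caveat: [^,]+ needs at least one character, so empty fields are dropped.
print(replace_commas('1003,,Okra'))  # 1003;Okra
```

In the pipeline, a function like this could presumably take the place of fixFormat, for example via beam.Map(replace_commas). Dropping empty fields matches what the original loop did with its `if part:` check, but it does shift the column positions.
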
Answered By: Tim Biegeleisen

You could also consider using the Beam DataFrames API if you’re trying to manipulate CSV files.
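
If a full DataFrame is more than you need, a similar parser-based route is Python's standard csv module. Note that this sketch is a standard-library substitute, not the DataFrames API itself; to_semicolons is a hypothetical helper name, and it assumes each element is a single well-formed CSV line.

```python
import csv
import io


def to_semicolons(line):
    """Re-emit one comma-separated line with ';' as delimiter (hypothetical helper)."""
    # csv.reader understands quoted fields, so a comma inside quotes
    # stays part of the field instead of acting as a separator.
    fields = next(csv.reader([line]))
    out = io.StringIO()
    csv.writer(out, delimiter=';').writerow(fields)
    # The writer appends a line terminator; strip it for a single line.
    return out.getvalue().rstrip('\r\n')


print(to_semicolons('1001,838,"Calabash, Water Spinach",2000-01-01'))
# 1001;838;Calabash, Water Spinach;2000-01-01

print(to_semicolons('1003,,Okra'))
# 1003;;Okra
```

Unlike the regex, this keeps empty fields. It also drops the double quotes around the quoted field, since that field no longer contains the delimiter and so no longer needs quoting.
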

Answered By: robertwb