pandas read csv with extra commas in column

Question:

I’m reading a basic csv file where the columns are separated by commas with these column names:

userid, username, body

However, the body column is a string that may contain commas. This breaks parsing, and pandas raises an error:

CParserError: Error tokenizing data. C error: Expected 3 fields in line 3, saw 8

Is there a way to tell pandas to ignore commas in a specific column, or some other way to work around this problem?

Asked By: David


Answers:

Suppose your file is called comma.csv and looks like this:

userid, username, body
01, n1, 'string1, string2'

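One thing you can do is specify the quote character that wraps the strings in that column:

df = pd.read_csv('comma.csv', quotechar="'", skipinitialspace=True)

Here, any string enclosed in ' is treated as a single field, regardless of the commas inside it. skipinitialspace=True is needed for this sample because each quote follows a space; a quote that is not at the very start of a field is not treated as an opening quote.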
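
As a quick self-contained check of the quotechar approach, here is a minimal sketch that reads the sample rows from an in-memory string instead of comma.csv (io.StringIO and the variable names are only for illustration). It assumes the spaces after each comma are really part of the file, which is why skipinitialspace=True is included.

```python
import io

import pandas as pd

# Same content as the sample comma.csv above, held in memory for the demo.
raw = "userid, username, body\n01, n1, 'string1, string2'\n"

df = pd.read_csv(
    io.StringIO(raw),
    quotechar="'",          # fields wrapped in single quotes are kept whole
    skipinitialspace=True,  # drop the space after each comma so the quote starts the field
)

# The comma inside the quotes stays part of the body value.
print(df.loc[0, "body"])
```

If your file uses double quotes around the body column, the default quotechar already handles it and no extra argument is needed.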

Answered By: Fabio Lamanna

Add usecols and lineterminator to your read_csv() call, where n is the number of columns in your file.

In my case:

n = 5  # number of columns; set this to match your file
df = pd.read_csv(file,
                 usecols=range(n),
                 lineterminator='\n',
                 header=None)

Answered By: Ilyas

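You can also read the file with Python's built-in csv module, which handles quoted fields that contain commas. Does this help?

import csv
with open("csv_with_commas.csv", newline='', encoding='utf8') as f: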
    csvread = csv.reader(f)
    batch_data = list(csvread)
    print(batch_data)
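
If the body field is not quoted at all, csv.reader will still return extra fields for those rows. A minimal standard-library sketch of one workaround, assuming only the last column can contain commas, is to split each line at most n_cols - 1 times. The sample data and the names raw, n_cols, header and records are hypothetical.

```python
import io

# Hypothetical sample: the third column (body) is unquoted and may contain commas.
raw = (
    "userid,username,body\n"
    "1,n1,hello, world, again\n"
    "2,n2,no commas here\n"
)
n_cols = 3  # expected number of columns

# Split each line at most n_cols - 1 times, so everything after the
# second comma stays together in the last field.
rows = [line.split(",", n_cols - 1) for line in io.StringIO(raw).read().splitlines()]
header, records = rows[0], rows[1:]

print(header)
print(records)
```

The resulting list of rows can then be passed to pd.DataFrame(records, columns=header) if a dataframe is needed.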

Reference:

[1] https://stackoverflow.com/a/40477760/6907424

[2] To combat "UnicodeDecodeError: ‘charmap’ codec can’t decode byte 0x8f in position 157: character maps to undefined": https://stackoverflow.com/a/9233174/6907424

Answered By: hafiz031

None of the code samples above worked for me (I was working with the Netflix Prize dataset on Kaggle), but pandas has a handy feature for this: the on_bad_lines parameter, added in pandas 1.3.0, accepts a callback function as of pandas 1.4.0. Here is what I did:

import pandas as pd

def manual_separation(bad_line):
    # All the "bad lines" came from the last column (Name), which contained ",".
    # Keep the first two fields and re-join everything after them into one field.
    right_split = bad_line[:2] + [",".join(bad_line[2:])]
    return right_split

filename = "netflix_movie_titles.csv"
df = pd.read_csv(
        filename,
        header=None,
        encoding="ISO-8859-1",
        names=['Movie_Id', 'Year', 'Name'],
        on_bad_lines=manual_separation,
        engine="python",
    )

Works like a charm! The only requirement is that you pass engine="python". Hope that helps!
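
To see what the callback receives and returns, here is a small sketch on in-memory data rather than the Kaggle file. The two sample rows and the function name merge_trailing_fields are made up for illustration, and it assumes a pandas version in which on_bad_lines accepts a callable (1.4.0 or later).

```python
import io

import pandas as pd

# Hypothetical rows: the second one has commas inside the title.
raw = "1,2003,Dinosaur Planet\n2,1997,Character, The Movie, Part 2\n"

def merge_trailing_fields(bad_line):
    # bad_line is the list of string fields pandas split the offending row into.
    # Keep the first two fields and re-join the rest into a single Name value.
    return bad_line[:2] + [",".join(bad_line[2:])]

df = pd.read_csv(
    io.StringIO(raw),
    header=None,
    names=["Movie_Id", "Year", "Name"],
    on_bad_lines=merge_trailing_fields,
    engine="python",  # callables are only supported by the python engine
)

print(df)
```

Returning None from the callback instead of a list would skip that row.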

Answered By: Antoine Krajnc