Error in Reading a csv file in pandas[CParserError: Error tokenizing data. C error: Buffer overflow caught – possible malformed input file.]

Question:

I tried reading all the CSV files from a folder, concatenating them into one big CSV (all the files had the same structure), saving it, and reading it back. All of this was done with pandas. The error occurs on the final read. I am attaching the code and the error below.

import pandas as pd
import numpy as np
import glob

path =r'somePath' # use your path
allFiles = glob.glob(path + "/*.csv")
frame = pd.DataFrame()
list_ = []
for file_ in allFiles:
    df = pd.read_csv(file_,index_col=None, header=0)
    list_.append(df)
store = pd.concat(list_)
store.to_csv(r"C:\work\DATA\Raw_data\store.csv", sep=',', index=False)
store1 = pd.read_csv(r"C:\work\DATA\Raw_data\store.csv", sep=',')

Error:-

CParserError                              Traceback (most recent call last)
<ipython-input-48-2983d97ccca6> in <module>()
----> 1 store1 = pd.read_csv(r"C:\work\DATA\Raw_data\store.csv", sep=',')

C:\Users\armsharm\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\io\parsers.pyc in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, na_fvalues, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, float_precision, nrows, iterator, chunksize, verbose, encoding, squeeze, mangle_dupe_cols, tupleize_cols, infer_datetime_format, skip_blank_lines)
    472                     skip_blank_lines=skip_blank_lines)
    473 
--> 474         return _read(filepath_or_buffer, kwds)
    475 
    476     parser_f.__name__ = name

C:\Users\armsharm\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\io\parsers.pyc in _read(filepath_or_buffer, kwds)
    258         return parser
    259 
--> 260     return parser.read()
    261 
    262 _parser_defaults = {

C:\Users\armsharm\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
    719                 raise ValueError('skip_footer not supported for iteration')
    720 
--> 721         ret = self._engine.read(nrows)
    722 
    723         if self.options.get('as_recarray'):

C:\Users\armsharm\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
   1168 
   1169         try:
-> 1170             data = self._reader.read(nrows)
   1171         except StopIteration:
   1172             if nrows is None:

pandas\parser.pyx in pandas.parser.TextReader.read (pandas\parser.c:7544)()

pandas\parser.pyx in pandas.parser.TextReader._read_low_memory (pandas\parser.c:7784)()

pandas\parser.pyx in pandas.parser.TextReader._read_rows (pandas\parser.c:8401)()

pandas\parser.pyx in pandas.parser.TextReader._tokenize_rows (pandas\parser.c:8275)()

pandas\parser.pyx in pandas.parser.raise_parser_error (pandas\parser.c:20691)()

CParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.

I tried using the csv reader as well:

import csv
with open(r"C:\work\DATA\Raw_data\store.csv", 'rb') as f:
    reader = csv.reader(f)
    l = list(reader)

Error:-

Error                                     Traceback (most recent call last)
<ipython-input-36-9249469f31a6> in <module>()
      1 with open(r'C:\work\DATA\Raw_data\store.csv', 'rb') as f:
      2     reader = csv.reader(f)
----> 3     l = list(reader)

Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
Asked By: Arman Sharma


Answers:

Not an answer, but too long for a comment (to say nothing of code formatting).

Since it also breaks when you read it with the csv module, you can at least locate the line where the error occurs:

import csv
with open(r"C:\work\DATA\Raw_data\store.csv", 'rb') as f:
    reader = csv.reader(f)
    linenumber = 1
    try:
        for row in reader:
            linenumber += 1
    except Exception as e:
        print("Error line %d: %s %s" % (linenumber, type(e), e))

Then look in store.csv what happens at that line.
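A runnable sketch of the same idea on hypothetical in-memory data: count the fields in each row and flag any row whose width differs from the header, since that is usually the line the tokenizer chokes on.

```python
import csv
import io

# Hypothetical CSV: the row on line 3 is missing a field.
raw = "a,b,c\n1,2,3\n4,5\n6,7,8\n"

reader = csv.reader(io.StringIO(raw))
header = next(reader)

# Collect (line number, row) for every row whose field count
# disagrees with the header's.
bad = [(lineno, row)
       for lineno, row in enumerate(reader, start=2)
       if len(row) != len(header)]

print(bad)  # [(3, ['4', '5'])]
```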

Answered By: Serge Ballesta

I found this error too; the cause was that there were some carriage returns "\r" in the data, which pandas was treating as line terminators as if they were "\n". I thought I'd post here, as that might be a common reason this error comes up.

The solution I found was to add lineterminator='\n' to the read_csv call, like this:

df_clean = pd.read_csv('test_error.csv',
                 lineterminator='\n')
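To see the effect, here is a small sketch on hypothetical in-memory data containing a stray carriage return inside a field; forcing lineterminator='\n' keeps the C parser from treating the bare \r as a row break.

```python
import io
import pandas as pd

# Hypothetical data: a bare \r inside the first field of row one.
raw = "col1,col2\nfoo\rbar,1\nbaz,2\n"

# With lineterminator='\n', only \n ends a row; the \r stays in the field.
df = pd.read_csv(io.StringIO(raw), lineterminator='\n')

print(df.shape)  # (2, 2)
```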
Answered By: Louise Fallon

If you are using Python and it is a big file, you can pass
engine='python' as below, and it should work.

df = pd.read_csv(file_, index_col=None, header=0, engine='python')
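For illustration, a self-contained sketch on hypothetical in-memory data showing the option applied; the Python engine is slower but avoids the C tokenizer's fixed-size buffers.

```python
import io
import pandas as pd

# Hypothetical stand-in for one of the files read in the loop.
raw = "a,b\n1,2\n3,4\n"

df = pd.read_csv(io.StringIO(raw), index_col=None, header=0,
                 engine='python')

print(len(df))  # 2
```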

Answered By: Firas Aswad

Pass the full path to the CSV:

Corpus = pd.read_csv(r"C:\Users\Dell\Desktop\Dataset.csv", encoding='latin-1')
Answered By: Maham syed

The problem comes from the format of the Excel file.
Select "Save As" from the menu and change the format from xls to csv; then
it will work.

Answered By: trylearning2022

In my case, the solution was to specify encoding to utf-16 as per the following answer:
https://stackoverflow.com/a/64516600/9836333

pd.read_csv(r"C:\work\DATA\Raw_data\store.csv", sep=',', encoding='utf-16')
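A quick self-contained check of the idea on hypothetical bytes: a small CSV encoded as UTF-16 reads cleanly once the encoding is stated.

```python
import io
import pandas as pd

# Hypothetical UTF-16 bytes (Python's codec prepends a BOM).
raw = "a,b\n1,2\n".encode('utf-16')

df = pd.read_csv(io.BytesIO(raw), sep=',', encoding='utf-16')

print(df.shape)  # (1, 2)
```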
Answered By: Mike