Ignoring bad rows of data in pandas.read_csv() that break header= keyword

Question:

I have a series of very messy *.csv files that are being read in by pandas. An example csv is:

Instrument 35392
"Log File Name : station"
"Setup Date (MMDDYY) : 031114"
"Setup Time (HHMMSS) : 073648"
"Starting Date (MMDDYY) : 031114"
"Starting Time (HHMMSS) : 090000"
"Stopping Date (MMDDYY) : 031115"
"Stopping Time (HHMMSS) : 235959"
"Interval (HHMMSS) : 010000"
"Sensor warmup (HHMMSS) : 000200"
"Circltr warmup (HHMMSS) : 000200" 


"Date","Time","","Temp","","SpCond","","Sal","","IBatt",""
"MMDDYY","HHMMSS","","øC","","mS/cm","","ppt","","Volts",""

"Random message here 031114 073721 to 031114 083200"
03/11/14,09:00:00,"",15.85,"",1.408,"",.74,"",6.2,""
03/11/14,10:00:00,"",15.99,"",1.96,"",1.05,"",6.3,""
03/11/14,11:00:00,"",14.2,"",40.8,"",26.12,"",6.2,""
03/11/14,12:00:01,"",14.2,"",41.7,"",26.77,"",6.2,""
03/11/14,13:00:00,"",14.5,"",41.3,"",26.52,"",6.2,""
03/11/14,14:00:00,"",14.96,"",41,"",26.29,"",6.2,""
"message 3"
"message 4"**

I have been using this code to import the *.csv file, process the double header rows, pull out the empty columns, and then strip the offending rows with bad data:

import pandas as pd

DF = pd.read_csv(BADFILE, parse_dates={'Datetime_(ascii)': [0, 1]}, sep=",",
                 header=[10, 11], na_values=['', 'na', 'nan nan'],
                 skiprows=[10], encoding='cp1252')

DF = DF.dropna(how="all", axis=1)
DF = DF.dropna(thresh=2)
droplist = ['message', 'Random']
DF = DF[~DF['Datetime_(ascii)'].str.contains('|'.join(droplist))]

DF.head()

Datetime_(ascii)    (Temp, øC)  (SpCond, mS/cm) (Sal, ppt)  (IBatt, Volts)
0   03/11/14 09:00:00   15.85   1.408   0.74    6.2
1   03/11/14 10:00:00   15.99   1.960   1.05    6.3
2   03/11/14 11:00:00   14.20   40.800  26.12   6.2
3   03/11/14 12:00:01   14.20   41.700  26.77   6.2
4   03/11/14 13:00:00   14.50   41.300  26.52   6.2

This was working fine and dandy until I hit a file that has an erroneous one-row message after the header: "Random message here 031114 073721 to 031114 083200"

The error I receive is:

    C:\Users\USER\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\parsers.py in _do_date_conversions(self, names, data)
       1554             data, names = _process_date_conversion(
       1555                 data, self._date_conv, self.parse_dates, self.index_col,
    -> 1556                 self.index_names, names, keep_date_col=self.keep_date_col)
       1557 
       1558         return names, data

    C:\Users\USER\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\parsers.py in _process_date_conversion(data_dict, converter, parse_spec, index_col, index_names, columns, keep_date_col)
       2975     if not keep_date_col:
       2976         for c in list(date_cols):
    -> 2977             data_dict.pop(c)
       2978             new_cols.remove(c)
       2979 

    KeyError: ('Time', 'HHMMSS')

If I remove that line, the code works fine. Similarly, if I remove the header= argument, the code works fine. However, I want to be able to preserve the header handling because I am reading in hundreds of these files.

Difficulty: I would prefer not to open and edit each file before the call to pandas.read_csv(), as these files can be rather large, and I don't want to read and save them multiple times. I would also prefer a real pandas/pythonic solution that doesn't involve first opening the file as a StringIO buffer to remove the offending lines.

Asked By: name goes here


Answers:

Here's one approach, making use of the fact that skiprows accepts a callable. The callable receives only the row index being considered, which is a built-in limitation of that parameter.

As such, the callable function skip_test() first checks whether the current index is in the set of known indices to skip. If not, then it opens the actual file and checks the corresponding row to see if its contents match.

The skip_test() function is a little hacky in the sense that it does inspect the actual file, although it only inspects up until the row index it's currently evaluating. It also assumes that the bad line always begins with the same string (in the example case, "foo"), but that seems to be a safe assumption given the OP's data.

# example data
""" foo.csv
uid,a,b,c
0,1,2,3
skip me
1,11,22,33
foo
2,111,222,333 
"""

import pandas as pd

def skip_test(r, fn, fail_on, known):
    if r in known:  # we know we always want to skip these
        return True
    # check if the row index matches a problem line in the file;
    # for efficiency, quit once we pass the row index being evaluated
    with open(fn, "r") as f:
        for i, line in enumerate(f):
            if i == r and line.startswith(fail_on):
                return True
            elif i > r:
                break
    return False

fname = "foo.csv"
fail_str = "foo"
known_skip = [2]
pd.read_csv(fname, sep=",", header=0, 
            skiprows=lambda x: skip_test(x, fname, fail_str, known_skip))
# output
   uid    a    b    c
0    0    1    2    3
1    1   11   22   33
2    2  111  222  333

If you know exactly which line the random message will appear on when it does appear, then this will be much faster, as you can just tell it not to inspect the file contents for any index past the potential offending line.
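For example, here is a minimal sketch of that shortcut, assuming the stray message can never appear past the first handful of rows of the file (the max_check cutoff and its value are hypothetical, added only for illustration):

def skip_test_bounded(r, fn, fail_on, known, max_check=10):
    if r in known:      # always skip the known offenders
        return True
    if r > max_check:   # past any row where a stray message could appear,
        return False    # so don't bother inspecting the file at all
    with open(fn, "r") as f:
        for i, line in enumerate(f):
            if i == r and line.startswith(fail_on):
                return True
            elif i > r:
                break
    return False

pd.read_csv(fname, sep=",", header=0,
            skiprows=lambda x: skip_test_bounded(x, fname, fail_str, known_skip))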

Answered By: andrew_reece

After some tinkering yesterday, I found a solution and an idea of what the potential issue may be.

I tried the skip_test() function answer above, but I was still getting errors with the size of the table:

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader.read (pandas\_libs\parsers.c:10862)()

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory (pandas\_libs\parsers.c:11138)()

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._read_rows (pandas\_libs\parsers.c:11884)()

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._tokenize_rows (pandas\_libs\parsers.c:11755)()

pandas\_libs\parsers.pyx in pandas._libs.parsers.raise_parser_error (pandas\_libs\parsers.c:28765)()

ParserError: Error tokenizing data. C error: Expected 1 fields in line 14, saw 11

So after playing around with skiprows=, I discovered that I was just not getting the behavior I wanted when using engine='c'. read_csv() was still determining the size of the table from those first few rows, and some of those single-column rows were still being passed through. It may be that I have a few more bad single-column rows in my csv set that I did not plan on.

Instead, I create an arbitrarily sized DataFrame as a template. I pull in the entire .csv file, then use logic to strip out the NaN rows.

For example, I know that the widest table I will encounter in my data will be 10 columns across. So my call to pandas is:

DF = pd.read_csv(csv_file, sep=',',
                 parse_dates={'Datetime_(ascii)': [0, 1]},
                 na_values=['', 'na', '999999', '#'], engine='c',
                 encoding='cp1252', names=list(range(0, 10)))

I then use these two lines to drop the NaN rows and columns from the DataFrame:

# drop the null columns created by the double delimiters
DF = DF.dropna(how="all", axis=1)
DF = DF.dropna(thresh=2)  # drop rows that don't have at least 2 cells with real values
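The non-data rows that survive that pass (the embedded column-header rows and the quoted message lines) can then be stripped with the same droplist idea used in the question. A sketch along those lines, with illustrative droplist entries, assuming the combined date/time column keeps the 'Datetime_(ascii)' name from parse_dates:

# strip leftover header/message rows, reusing the droplist filter from the question;
# astype(str) guards against rows where the date/time pair parsed to a Timestamp
droplist = ['message', 'Random', 'MMDDYY', 'Date']
DF = DF[~DF['Datetime_(ascii)'].astype(str).str.contains('|'.join(droplist))]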
Answered By: name goes here

If anyone in the future comes across this question: pandas has since implemented the on_bad_lines argument. You can now solve this problem by using on_bad_lines="skip".
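For reference, a minimal sketch of what that looks like applied to a call shaped like the one in the question (the file name here is just a placeholder):

import pandas as pd

# on_bad_lines is available in pandas >= 1.3 (it replaces the deprecated
# error_bad_lines/warn_bad_lines flags); "skip" drops lines the parser
# flags as bad instead of raising a ParserError
DF = pd.read_csv("station_log.csv",            # placeholder file name
                 sep=",", header=[10, 11],
                 parse_dates={'Datetime_(ascii)': [0, 1]},
                 na_values=['', 'na', 'nan nan'],
                 encoding="cp1252",
                 on_bad_lines="skip")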

Answered By: clementzach