pandas failing with variable columns

Question:

my file is this

    4 7 a a
    s g 6 8 0 d
    g 6 2 1 f 7 9 
    f g 3 
    1 2 4 6 8 9 0

I was using pandas to save it in form of pandas object. But I am getting the following error
pandas.parser.CParserError: Error tokenizing data. C error: Expected 6 fields in line 3, saw 8

The code I used was
file = pd.read_csv("a.txt",dtype = None,delimiter = " ")

Can anyone suggest an idea to include the file as such ?

Asked By: Vinodini Natrajan

||

Answers:

Using pandas this will raise an error because the function expects there to be a certain number of columns, in this case 6, but when it got to the third row it encountered 8. One way to handle this is to not read the the rows that have more columns than the first row of the dataframe. This could be done using error_bad_lines parameter. This is what the docs say about error_bad_lines:

error_bad_lines : boolean, default True Lines with too many fields
(e.g. a csv line with too many commas) will by default cause an
exception to be raised, and no DataFrame will be returned. If False,
then these “bad lines” will dropped from the DataFrame that is
returned. (Only valid with C parser)

So you could do this:

>>> file = pd.read_csv("a.txt",dtype = None,delimiter = " ",error_bad_lines=False)
Skipping line 3: expected 6 fields, saw 8
Skipping line 5: expected 6 fields, saw 7

>>> file
     4    7    a  a.1
s g  6  8.0  0.0    d
f g  3  NaN  NaN  NaN

Or you could use skiprows parameter to skip the rows that you would like, this is what the docs have to say about skiprows:

skiprows : list-like or integer, default None Line numbers to skip
(0-indexed) or number of lines to skip (int) at the start of the file

Answered By: Peter Peluso

Here’s one way.

In [50]: !type temp.csv
4,7,a,a
s,g,6,8,0,d
g,6,2,1,f,7,9
f,g,3
1,2,4,6,8,9,0

Read the csv to list of lists and then convert to DataFrame.

In [51]: pd.DataFrame([line.strip().split(',') for line in open('temp.csv', 'r')])
Out[51]:
   0  1  2     3     4     5     6
0  4  7  a     a  None  None  None
1  s  g  6     8     0     d  None
2  g  6  2     1     f     7     9
3  f  g  3  None  None  None  None
4  1  2  4     6     8     9     0
Answered By: Zero

This is a variation of @Zero’s answer (https://stackoverflow.com/a/40881292/19007114), but without the simplistic str.strip().split(), which can be error prone with some CSV content (e.g. strings that contain commas).

csv_data = """
4,7,a,a
s,g,6,8,0,d
g,6,2,1,f,7,9
f,g,3
1,2,4,6,8,9,0
"""

pd.DataFrame( [ pd.read_csv( StringIO(line), header=None ).squeeze().tolist() for line in StringIO(csv_data) ] )

   0  1  2     3     4     5    6
0  4  7  a     a  None  None  NaN
1  s  g  6     8     0     d  NaN
2  g  6  2     1     f     7  9.0
3  f  g  3  None  None  None  NaN
4  1  2  4     6     8     9  0.0

NOTE: for some reason the last column shows NaN instead of None. I presume it’s a bug in the Debian 10 Buster python3-pandas (0.23.3) package?

I use this technique in a slightly different way – to get values as a list of lists.

[ pd.read_csv( StringIO(line), header=None ).squeeze().tolist() for line in StringIO(s1) ]

[[4, 7, 'a', 'a'], ['s', 'g', 6, 8, 0, 'd'], ['g', 6, 2, 1, 'f', 7, 9], ['f', 'g', 3], [1, 2, 4, 6, 8, 9, 0]]
Answered By: user19007114