pandas.read_csv ParserError: Expected x fields, saw y

Question

I’m trying to read a txt file that as different number of columns per row. Here’s the beginning of my file:

60381 6
1 0.270 0.30 0.30 0.70 0.70
4.988 4.988 4.988 4.988 4.988 4.988 4.988 4.988 4.988 4.988 4.988 4.988
2 0.078 0.30 0.30 0.70 0.70
5.387 5.312 5.338 4.463 4.675 4.275 4.238 3.562 3.175 3.925 4.950 4.762
6 0.241 0.30 0.60 0.70 0.40
3.700 3.200 2.738 2.325 1.250 0.975 1.175 1.950 2.488 3.613 3.987 3.950
7 0.357 0.30 0.60 0.70 0.40
1.212 1.125 1.050 0.950 0.663 0.488 0.425 0.512 0.637 0.900 1.112 1.188
8 0.031 0.30 0.70 0.70 0.30
0.225 0.213 0.200 0.175 0.200 0.213 0.375 0.887 0.975 0.512 0.262 0.262
10 0.022 0.30 0.80 0.70 0.20
0.712 0.700 0.738 0.550 0.513 0.688 0.613 0.600 0.850 0.812 0.800 0.775
60382 5
6 0.197 0.30 0.60 0.70 0.40
3.700 3.200 2.738 2.325 1.250 0.975 1.175 1.950 2.488 3.613 3.987 3.950
7 0.413 0.30 0.60 0.70 0.40
1.212 1.125 1.050 0.950 0.663 0.488 0.425 0.512 0.637 0.900 1.112 1.188
8 0.016 0.30 0.70 0.70 0.30
0.225 0.213 0.200 0.175 0.200 0.213 0.375 0.887 0.975 0.512 0.262 0.262
10 0.111 0.30 0.80 0.70 0.20
0.712 0.700 0.738 0.550 0.513 0.688 0.613 0.600 0.850 0.812 0.800 0.775
11 0.263 0.30 0.50 0.70 0.50
1.812 1.388 1.087 0.825 0.538 0.400 0.338 0.400 0.500 0.925 0.962 1.100

I’ve tried using pandas read_csv to read it:

import pandas as pd
data = pd.read_csv('./myfile.txt',header=None,sep='s')

Which gives the following error:

ParserError: Expected 6 fields in line 3, saw 12. Error could possibly be due to quotes being ignored when a multi-char delimiter is used.

So my file doesn’t have a multi-char delimiter or quotation marks. I’ve tried a solution for this I found in this forum, which suggested using:

data = pd.read_csv(open('./myfile.txt','r'), header=None,encoding='utf-8', engine='c')

Although this solves the error above, the array I’m presented with does not use space as a delimiter of columns, and the output has only 1 column:

How should I read the file in order to get a column for each value? I don’t mind if there are nan values that fill the rest.

Asked By: lanadaquenada

||

Source

Answer 1

If you’ve managed to get the data in a single column, you can use Series.str.split() to workaround this issue.

Here is an example with some sample data you provided (you can use any string or a regex as a delimiter in split()) :

df[0].str.split(' ', expand=True)

     0      1      2      3      4      5      6      7      8      9   
0  0.270   0.30   0.30   0.70   0.70   None   None   None   None   None
1  4.988  4.988  4.988  4.988  4.988  4.988  4.988  4.988  4.988  4.988

If you do this, then you might as well create the dataframe with pd.DataFrame(open(...).readlines()) or something like that, since you don’t benefit at all from read_csv(), and your file isn’t exactly a standard csv file.

# f is a StringIO with some of your sample data to simulate a file
df = pd.DataFrame(line.strip().split(' ') for line in f)

       0      1      2      3      4      5      6      7      8      9   
0   60381      6   None   None   None   None   None   None   None   None
1       1  0.270   0.30   0.30   0.70   0.70   None   None   None   None
2   4.988  4.988  4.988  4.988  4.988  4.988  4.988  4.988  4.988  4.988
3       2  0.078   0.30   0.30   0.70   0.70   None   None   None   None
4   5.387  5.312  5.338  4.463  4.675  4.275  4.238  3.562  3.175  3.925

Of course, you can also fix the input file by making sure every line contains the same number of columns, which will solve the ParserError issue.

Answered By: Unatiel

Answer 2

Consider iteratively rebuilding the text file and then pd.read_table() on cleaned text version:

from io import StringIO

txt="""
60381 6
1 0.270 0.30 0.30 0.70 0.70
4.988 4.988 4.988 4.988 4.988 4.988 4.988 4.988 4.988 4.988 4.988 4.988
2 0.078 0.30 0.30 0.70 0.70
5.387 5.312 5.338 4.463 4.675 4.275 4.238 3.562 3.175 3.925 4.950 4.762
6 0.241 0.30 0.60 0.70 0.40
3.700 3.200 2.738 2.325 1.250 0.975 1.175 1.950 2.488 3.613 3.987 3.950
7 0.357 0.30 0.60 0.70 0.40
1.212 1.125 1.050 0.950 0.663 0.488 0.425 0.512 0.637 0.900 1.112 1.188
8 0.031 0.30 0.70 0.70 0.30
0.225 0.213 0.200 0.175 0.200 0.213 0.375 0.887 0.975 0.512 0.262 0.262
10 0.022 0.30 0.80 0.70 0.20
0.712 0.700 0.738 0.550 0.513 0.688 0.613 0.600 0.850 0.812 0.800 0.775
60382 5
6 0.197 0.30 0.60 0.70 0.40
3.700 3.200 2.738 2.325 1.250 0.975 1.175 1.950 2.488 3.613 3.987 3.950
7 0.413 0.30 0.60 0.70 0.40
1.212 1.125 1.050 0.950 0.663 0.488 0.425 0.512 0.637 0.900 1.112 1.188
8 0.016 0.30 0.70 0.70 0.30
0.225 0.213 0.200 0.175 0.200 0.213 0.375 0.887 0.975 0.512 0.262 0.262
10 0.111 0.30 0.80 0.70 0.20
0.712 0.700 0.738 0.550 0.513 0.688 0.613 0.600 0.850 0.812 0.800 0.775
11 0.263 0.30 0.50 0.70 0.50
1.812 1.388 1.087 0.825 0.538 0.400 0.338 0.400 0.500 0.925 0.962 1.100
"""

for line in StringIO(txt):

    if len(line) == 8:
        header = line
        next
    if len(line) == 28 or len(line) == 29:
        firstdata = line
        next
    if len(line) == 72:
        seconddata = header + ' ' + firstdata + ' ' + line
        print(seconddata.replace("n", ""))

Output

# 60381 6 1 0.270 0.30 0.30 0.70 0.70 4.988 4.988 4.988 4.988 4.988 4.988 4.988 4.988 4.988 4.988 4.988 4.988
# 60381 6 2 0.078 0.30 0.30 0.70 0.70 5.387 5.312 5.338 4.463 4.675 4.275 4.238 3.562 3.175 3.925 4.950 4.762
# 60381 6 6 0.241 0.30 0.60 0.70 0.40 3.700 3.200 2.738 2.325 1.250 0.975 1.175 1.950 2.488 3.613 3.987 3.950
# 60381 6 7 0.357 0.30 0.60 0.70 0.40 1.212 1.125 1.050 0.950 0.663 0.488 0.425 0.512 0.637 0.900 1.112 1.188
# 60381 6 8 0.031 0.30 0.70 0.70 0.30 0.225 0.213 0.200 0.175 0.200 0.213 0.375 0.887 0.975 0.512 0.262 0.262
# 60381 6 10 0.022 0.30 0.80 0.70 0.20 0.712 0.700 0.738 0.550 0.513 0.688 0.613 0.600 0.850 0.812 0.800 0.775
# 60382 5 6 0.197 0.30 0.60 0.70 0.40 3.700 3.200 2.738 2.325 1.250 0.975 1.175 1.950 2.488 3.613 3.987 3.950
# 60382 5 7 0.413 0.30 0.60 0.70 0.40 1.212 1.125 1.050 0.950 0.663 0.488 0.425 0.512 0.637 0.900 1.112 1.188
# 60382 5 8 0.016 0.30 0.70 0.70 0.30 0.225 0.213 0.200 0.175 0.200 0.213 0.375 0.887 0.975 0.512 0.262 0.262
# 60382 5 10 0.111 0.30 0.80 0.70 0.20 0.712 0.700 0.738 0.550 0.513 0.688 0.613 0.600 0.850 0.812 0.800 0.775
# 60382 5 11 0.263 0.30 0.50 0.70 0.50 1.812 1.388 1.087 0.825 0.538 0.400 0.338 0.400 0.500 0.925 0.962 1.100

Now for file import into pandas, consider writing new output to text file:

with open('input.txt', 'r') as f1, open('output.txt', 'w') as f2:
   for line in f1:
      if len(line) == 8:
          header = line
          next
      if len(line) == 28 or len(line) == 29:
          firstdata = line
          next
      if len(line) == 72:
          seconddata = header + ' ' + firstdata + ' ' + line
          f2.write(seconddata.replace("n", ""))

df = read.table('output.txt', header=None, sep="\s+")

Answered By: Parfait

pandas.read_csv ParserError: Expected x fields, saw y

Question:

Answers: