Read a CSV file in Python, skipping rows until a row with a certain number of columns is found

Question:

I want to read a CSV file in Python and skip rows dynamically until a condition is met.

Condition: start reading from the first row that has 6 columns, or from the row whose column names match the sequence of those 6 columns.

File.csv

Col1,col2,col3

1,2,3

13,u,u

,,,

,,,

Col1,col2,col3,col4

1,2,3,4

13,u,u,y

,,,

,,,

Col1,col2,col3,col4,col5,col6

1,2,3,4,5,6

qw,ers,hh,yj,df,ji

Now I’m reading this file using pandas.read_csv()

I know that the required columns start at the 10th row.

pandas.read_csv("file.csv", skiprows=10, header=None)

I want to do this dynamically instead: skip rows until a row has 6 columns or matches the sequence col1,col2,col3,col4,col5,col6.

start =  df.loc[df.FILE-START == 'col1,col2,col3,col4,col5,col6'].index[0]
df = pd.read_csv(filename, skiprows = start + 1)

I tried this, but it's not working.

Asked By: Keshav


Answers:

Update

A more robust version using csv module:

import pandas as pd
import csv
import io

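with open('File.csv') as fp:
    while True:
        pos = fp.tell()  # remember where the current line starts
        line = fp.readline()
        if not line:  # end of file reached without a match
            raise ValueError('no row with 6 columns found')
        row = next(csv.reader(io.StringIO(line)), [])
        if len(row) == 6:
            break
    fp.seek(pos)  # rewind to the start of the 6-column header line
    df = pd.read_csv(fp)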

Old answer

You can read the file line by line until you find a line with 6 columns, i.e. 5 commas. Take care if fields are quoted and contain commas, because those commas are counted too. For a simple CSV file it's fine:

import pandas as pd

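with open('File.csv') as fp:
    while True:
        pos = fp.tell()  # remember where the current line starts
        row = fp.readline()
        if not row:  # end of file: avoid looping forever
            raise ValueError('no line with 5 commas found')
        if row.count(',') == 5:
            break
    fp.seek(pos)  # rewind to the start of the header line
    df = pd.read_csv(fp)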

Output:

>>> df
  Col1 col2 col3 col4 col5 col6
0    1    2    3    4    5    6
1   qw  ers   hh   yj   df   ji
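
To illustrate the quoting caveat mentioned above, here is a minimal sketch with a made-up line (not taken from the question's file) in which the quoted fields contain commas. Counting commas sees a 6-column row, while csv.reader sees 3 fields:

```python
import csv
import io

# Hypothetical line: three fields, two of them quoted and containing commas.
line = '"Smith, John","a,b,c",42\n'

# Naive count: includes the commas inside the quotes.
comma_count = line.count(',')

# csv.reader respects quoting, so it reports the real number of fields.
field_count = len(next(csv.reader(io.StringIO(line))))

print(comma_count)  # 5
print(field_count)  # 3
```

So the comma-counting version would wrongly stop at this line, which is why the csv module version is the safer choice for files with quoted fields.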
Answered By: Corralien

You can use the approach as follows:

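import csv
import pandas as pd

EXPECTED = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6']

def check_num_or_colseq(row):
    # True if the row has 6 fields or matches the expected names (case-insensitive)
    return len(row) == 6 or [c.strip().lower() for c in row] == EXPECTED

skip = None  # stays None (nothing skipped) if no row matches
with open(filename, newline='') as file:
    for i, row in enumerate(csv.reader(file)):
        if check_num_or_colseq(row):
            skip = i  # 0-based index of the header row
            break

# skiprows=skip drops everything before the header row and keeps the header itself
df = pd.read_csv(filename, skiprows=skip)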

I think all of the code above is self-explanatory. Hope this helps.
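
As a quick check of the row offset, here is a small sketch that runs on the question's sample data held in memory (the `sample` string and the `header_index` name are only for illustration). Passing the header row's 0-based index as skiprows keeps the header as the first row pandas reads:

```python
import csv
import io

import pandas as pd

# The question's sample file, held in memory for the demonstration.
sample = (
    "Col1,col2,col3\n"
    "1,2,3\n"
    "13,u,u\n"
    ",,,\n"
    ",,,\n"
    "Col1,col2,col3,col4\n"
    "1,2,3,4\n"
    "13,u,u,y\n"
    ",,,\n"
    ",,,\n"
    "Col1,col2,col3,col4,col5,col6\n"
    "1,2,3,4,5,6\n"
    "qw,ers,hh,yj,df,ji\n"
)

# 0-based index of the first row that has exactly 6 fields.
header_index = next(
    i for i, row in enumerate(csv.reader(io.StringIO(sample))) if len(row) == 6
)

# Skip every line before the header; the header line itself is kept.
df = pd.read_csv(io.StringIO(sample), skiprows=header_index)

print(header_index)      # 10
print(list(df.columns))  # ['Col1', 'col2', 'col3', 'col4', 'col5', 'col6']
```

Note that this assumes no quoted field spans several lines; otherwise the row index from csv.reader and the line number used by skiprows can differ.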

Answered By: Suchandra T

Another option, using the pandas DataFrame constructor:

import csv
import pandas as pd

with open("file.csv") as csv_file:
    csv_reader = csv.reader(csv_file)
    rows = [row for row in csv_reader if len(row) == 6]
    data_six = {"columns": rows[0], "data": rows[1:]}
    df = pd.DataFrame(**data_six)

As explained by @Corralien, with this approach pandas loses the ability to infer the data type of each column, since csv.reader always returns each row as a list of strings.

csv.reader(csvfile, dialect='excel', **fmtparams)
Return a reader object which will iterate over lines in the given csvfile. csvfile can
be any object which supports the iterator protocol and returns a
string each time its __next__() method is called; file objects and
list objects are both suitable. Each row read from the csv file is
returned as a list of strings.

Source : [docs.python]

Output :

print(df)

  Col1 col2 col3 col4 col5 col6
0    1    2    3    4    5    6
1   qw  ers   hh   yj   df   ji

Note: this assumes that the six-column rows in your CSV file all belong to a single table with one header row.
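
If type inference matters, one possible workaround (a sketch, not part of the approach above) is to write the kept rows back into an in-memory CSV and let pd.read_csv parse them. The sample data and the names `df_str`, `df_typed` and `buffer` below are made up for illustration:

```python
import csv
import io

import pandas as pd

# Hypothetical sample: a narrow preamble followed by a six-column table.
sample = (
    "a,b,c\n"
    "1,2,3\n"
    "id,name,x,y,z,flag\n"
    "1,foo,0.5,10,20,True\n"
    "2,bar,1.5,30,40,False\n"
)

# Keep only the rows that have exactly six fields.
rows = [row for row in csv.reader(io.StringIO(sample)) if len(row) == 6]

# Built directly from lists of strings: every value stays a string.
df_str = pd.DataFrame(rows[1:], columns=rows[0])

# Re-serialize the kept rows and let read_csv infer the column types.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
buffer.seek(0)
df_typed = pd.read_csv(buffer)

print(df_str.dtypes)
print(df_typed.dtypes)
```

This costs a second parsing pass, so for large files seeking back to the header line and calling pd.read_csv once is likely cheaper.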

Answered By: Timeless