Read a CSV file in Python, skipping rows until a certain number of columns is found
Question:
I want to read a CSV file in Python and choose the number of rows to skip dynamically, based on a condition.
Condition: start reading at the first row that has 6 columns, or at the row whose column names match the expected sequence of 6 names.
File.csv
Col1,col2,col3
1,2,3
13,u,u
,,,
,,,
Col1,col2,col3,col4
1,2,3,4
13,u,u,y
,,,
,,,
Col1,col2,col3,col4,col5,col6
1,2,3,4,5,6
qw,ers,hh,yj,df,ji
Right now I'm reading this file with pandas.read_csv().
I know the required columns start after the first 10 rows.
pandas.read_csv("file.csv", skiprows=10, header=None)
I want to work out the number of rows to skip dynamically: skip rows until one has 6 columns, or until one matches the sequence col1,col2,col3,col4,col5,col6.
start = df.loc[df.FILE-START == 'col1,col2,col3,col4,col5,col6'].index[0]
df = pd.read_csv(filename, skiprows = start + 1)
I tried this, but it does not work.
Answers:
Update
A more robust version using the csv module:
import pandas as pd
import csv
import io

with open('File.csv') as fp:
    while True:
        pos = fp.tell()
        reader = csv.reader(io.StringIO(fp.readline()))
        row = next(reader)
        if len(row) == 6:
            break
    fp.seek(pos)
    df = pd.read_csv(fp)
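If no line in the file has six fields, the loop above reaches the end of the file and next(reader) raises StopIteration. Here is a minimal sketch of the same idea wrapped in a function with an explicit end-of-file check. The names read_from_first_wide_row and n_cols are only illustrative, and an io.StringIO stands in for the opened file so the example runs on its own. It still assumes that no quoted field spans several lines.

```python
import csv
import io

import pandas as pd


def read_from_first_wide_row(fp, n_cols=6):
    """Read a DataFrame starting at the first line that parses to n_cols fields.

    The function name and the n_cols parameter are hypothetical helpers,
    not part of the answer above.
    """
    while True:
        pos = fp.tell()
        line = fp.readline()
        if not line:  # end of file reached without finding a match
            raise ValueError(f"no line with {n_cols} columns found")
        # Parse the single line with the csv module; a blank line yields [].
        row = next(csv.reader([line]), [])
        if len(row) == n_cols:
            break
    fp.seek(pos)  # rewind to the start of the header line
    return pd.read_csv(fp)


# Shortened stand-in for File.csv (assumed sample data).
sample = io.StringIO(
    "Col1,col2,col3\n"
    "1,2,3\n"
    ",,,\n"
    "Col1,col2,col3,col4,col5,col6\n"
    "1,2,3,4,5,6\n"
    "qw,ers,hh,yj,df,ji\n"
)
df = read_from_first_wide_row(sample)
print(df)
```

With a real file, pass the object returned by open('File.csv') instead of the StringIO.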
Old answer
You can read the file line by line until you find a line with 6 columns, i.e. 5 commas. Take care if a quoted field contains a comma, because the comma count is then wrong; for a simple CSV file this is fine:
import pandas as pd

with open('File.csv') as fp:
    while True:
        pos = fp.tell()
        row = fp.readline()
        if row.count(',') == 5:
            break
    fp.seek(pos)
    df = pd.read_csv(fp)
Output:
>>> df
  Col1 col2 col3 col4 col5 col6
0    1    2    3    4    5    6
1   qw  ers   hh   yj   df   ji
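To illustrate the caveat about quotes in the old answer, here is a small sketch with a made-up line (not taken from File.csv) in which one quoted field contains a comma. Counting commas suggests six columns, while the csv module finds only five fields.

```python
import csv

# Hypothetical row: five fields, one of which contains a comma inside quotes.
line = '1,"Smith, John",3,4,5\n'

n_commas = line.count(",")         # naive check used in the old answer
fields = next(csv.reader([line]))  # proper CSV parsing of the same line

print(n_commas)     # 5, which the comma test would treat as 6 columns
print(len(fields))  # 5 fields in reality
print(fields)
```

This is why the updated answer parses each line with csv.reader rather than counting commas.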
You can use the approach as follows:
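import csv
import pandas as pd

EXPECTED = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6']

def check_num_or_colseq(row):
    # True if the row has 6 fields or starts with the expected names (case-insensitive)
    return len(row) == 6 or [c.lower() for c in row[:6]] == EXPECTED

filename = 'File.csv'
skip = 0
with open(filename, newline='') as file:
    for i, row in enumerate(csv.reader(file)):
        if check_num_or_colseq(row):
            skip = i  # zero-based index of the header row
            break
# skiprows=skip keeps the header row; skip + 1 would drop it
df = pd.read_csv(filename, skiprows=skip)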
I think all of the code above is self-explanatory. Hope this helps.
Another option with the pandas DataFrame constructor:
import csv
import pandas as pd

with open("file.csv") as csv_file:
    csv_reader = csv.reader(csv_file)
    rows = [row for row in csv_reader if len(row) == 6]

data_six = {"columns": rows[0], "data": rows[1:]}
df = pd.DataFrame(**data_six)
As explained by @Corralien, with this approach pandas loses the ability to infer the data type of each column, since csv.reader always returns a list of strings.
csv.reader(csvfile, dialect='excel', **fmtparams)
Return a reader object which will iterate over lines in the given csvfile. csvfile can be any object which supports the iterator protocol and returns a string each time its __next__() method is called; file objects and list objects are both suitable. Each row read from the csv file is returned as a list of strings.
Source: [docs.python]
Output:
print(df)
  Col1 col2 col3 col4 col5 col6
0    1    2    3    4    5    6
1   qw  ers   hh   yj   df   ji
Note: this assumes that the only six-column rows in your CSV file are the final table, with a single header row.
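One possible way to keep type inference with this filtering approach is to write the kept rows back into an in-memory buffer and let pandas.read_csv parse them. The sketch below is only an illustration: the sample data is made up (with numeric columns so the inferred types are visible), and an io.StringIO stands in for the opened file.

```python
import csv
import io

import pandas as pd

# Assumed sample data, different from File.csv: numeric columns plus one text column.
raw = io.StringIO(
    "Col1,col2,col3\n"
    "1,2,3\n"
    ",,,\n"
    "Col1,col2,col3,col4,col5,col6\n"
    "1,2,3,4,5,x\n"
    "7,8,9,10,11,y\n"
)

# Keep only the rows that have exactly six fields (header included).
rows = [row for row in csv.reader(raw) if len(row) == 6]

# Serialize the kept rows again so that read_csv can infer the dtypes.
buffer = io.StringIO(newline="")
csv.writer(buffer).writerows(rows)
buffer.seek(0)

df = pd.read_csv(buffer)
print(df.dtypes)
```

Here the first five columns come back as integers and col6 stays as text, whereas building the DataFrame directly from the list of rows would leave every column as strings.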