How to read bad lines in csv files using Pandas in Python?
Question:
The csv file has the following structure:
a,b,c
a,b,c,d,e,f,g
a,b,c,d
a,b,c
If I use file = pd.read_csv('Desktop/export.csv', delimiter=','), it throws a tokenizing error like this:
pandas.errors.ParserError: Error tokenizing data. C error: Expected 9 fields in line 3, saw 10
I do NOT want to skip bad lines. I want to read the csv with all columns and create a dataframe that looks like:
unnamed column1, unnamed column2, ....... unnamed column 7
a,b,c
a,b,c,d,e,f,g
a,b,c,d
a,b,c
How can I load the bad lines in the csv files?
Answers:
You can load the csv file into a dataframe by telling the read_csv() function the maximum number of columns it should expect, using the names parameter together with engine='python'.
Here is an example:
import pandas as pd
df = pd.read_csv('Desktop/export.csv', delimiter=',', names=range(7), engine='python')
print(df)
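This will create a dataframe with 7 columns labelled 0 to 6 and will read all the lines of the csv file, including the lines that have fewer fields than that. The values for the missing columns will be filled with NaN. A line with more than 7 fields would still not fit, so the number passed to range() must be at least the width of the longest line.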
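Please note that without the names parameter, read_csv treats the first line of the file as a header. When names is passed explicitly, the file is read as if header=None had been given, so the first line is kept as data. If the file does have a header row that you want to replace, pass header=0 to the read_csv function.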
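If the maximum number of fields is not known in advance, one option is to scan the file once to find it and then pass that width to names. The following is a minimal sketch under that assumption, not part of the original answer: the helper name read_ragged_csv is hypothetical, and the sample text stands in for Desktop/export.csv so the example runs on its own.

```python
import csv
import io

import pandas as pd

# Sample data mirroring the structure shown in the question
# (stands in for the contents of the real file).
raw = "a,b,c\na,b,c,d,e,f,g\na,b,c,d\na,b,c\n"


def read_ragged_csv(text, delimiter=","):
    """Read CSV text whose rows have differing numbers of fields (hypothetical helper)."""
    # First pass: find the widest row using the standard csv module.
    widths = [len(row) for row in csv.reader(io.StringIO(text), delimiter=delimiter)]
    max_cols = max(widths)
    # Second pass: give pandas that many column labels so every row fits;
    # shorter rows are padded with NaN.
    return pd.read_csv(
        io.StringIO(text),
        delimiter=delimiter,
        header=None,
        names=range(max_cols),
    )


df = read_ragged_csv(raw)
print(df)
```

For a file on disk, the same two passes can be made by opening the path twice instead of wrapping a string in io.StringIO. The cost is reading the file twice, which is usually acceptable unless the file is very large.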
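You can set error_bad_lines to False. Note that this skips the lines with too many fields instead of loading them, so it does not do what the question asks for. The error_bad_lines parameter was deprecated in pandas 1.3 and removed in pandas 2.0; on those versions the equivalent is on_bad_lines='skip'.
import pandas as pd
# On pandas older than 1.3, use error_bad_lines=False instead of on_bad_lines='skip'.
file = pd.read_csv('Desktop/export.csv', delimiter=',', on_bad_lines='skip')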