Pandas: read_csv ignore rows after a blank line
Question:
There is a weird .csv file that looks something like this:
header1,header2,header3
val11,val12,val13
val21,val22,val23
val31,val32,val33
Pretty fine, but after these lines there is always a blank line followed by lots of useless lines. The whole file looks something like:
header1,header2,header3
val11,val12,val13
val21,val22,val23
val31,val32,val33
dhjsakfjkldsa
fasdfggfhjhgsdfgds
gsdgffsdgfdgsdfgs
gsdfdgsg
The number of lines at the bottom is totally random; the only marker is the empty line before them.
Pandas has a skipfooter parameter for ignoring a known number of rows in the footer.
Any idea how to ignore these rows without actually opening (open()…) the file and removing them first?
Answers:
If you're using the csv module, it's fairly trivial to detect an empty row.
import csv

with open(filename, newline='') as f:
    r = csv.reader(f)
    for row in r:
        if not row:
            break
        # Otherwise, process the row
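If the goal is still to end up with a DataFrame, one sketch is to combine this idea with itertools.takewhile: read lines only until the first blank one, then hand the kept text to pandas. (The takewhile/StringIO combination is not from the original answer; the file contents are inlined here so the snippet runs as-is.)

```python
import io
import itertools

import pandas as pd

# Sample contents matching the question, inlined for a runnable demo.
csv_text = (
    "header1,header2,header3\n"
    "val11,val12,val13\n"
    "val21,val22,val23\n"
    "val31,val32,val33\n"
    "\n"
    "dhjsakfjkldsa\n"
    "fasdfggfhjhgsdfgds\n"
)

# For a real file, replace io.StringIO(csv_text) with open('file.csv').
with io.StringIO(csv_text) as f:
    # takewhile stops at the first blank line, so the junk is never read.
    good = itertools.takewhile(lambda line: line.strip() != "", f)
    df = pd.read_csv(io.StringIO("".join(good)))

print(df.shape)  # (3, 3)
```

This keeps the whole junk section out of pandas entirely, so no cleanup pass on the DataFrame is needed afterwards.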
read_csv has no option to stop reading at the first blank line. It cannot accept or reject lines based on arbitrary conditions; it can only skip blank lines (optionally) or reject rows that break the expected shape of the data (rows with too many separators).
You can normalize the data with the approaches below (without parsing the file yourself, pure pandas):
-
Knowing the number of desired/trash data rows. [Manual]
pd.read_csv('file.csv', nrows=3)
or pd.read_csv('file.csv', skipfooter=4) (note that skipfooter is only supported by the slower Python parsing engine)
-
Preserving the desired data by eliminating the others in the DataFrame. [Automatic]
df.dropna(axis=0, how='any', inplace=True)
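A minimal runnable sketch of the second approach (the file contents are inlined here; real code would read 'file.csv' instead):

```python
import io

import pandas as pd

csv_text = (
    "header1,header2,header3\n"
    "val11,val12,val13\n"
    "val21,val22,val23\n"
    "val31,val32,val33\n"
    "\n"
    "dhjsakfjkldsa\n"
    "fasdfggfhjhgsdfgds\n"
)

# The blank line is skipped by default (skip_blank_lines=True); the junk
# lines parse as rows with NaN in header2/header3, so how='any' drops them.
df = pd.read_csv(io.StringIO(csv_text))
df.dropna(axis=0, how='any', inplace=True)
print(df)
```

Note that this also drops any legitimate data row that happens to contain a missing value, so it is only safe when the real data is always complete.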
The results will be:
header1 header2 header3
0 val11 val12 val13
1 val21 val22 val23
2 val31 val32 val33
Solution:
df = pd.read_csv(<filepath>, skip_blank_lines=False)
blank_df = df.loc[df.isnull().all(axis=1)]
if len(blank_df) > 0:
    first_blank_index = blank_df.index[0]
    df = df[:first_blank_index]
Explanation:
The best way to do this with pandas' native functions is a combination of arguments and function calls. It is a bit messy, but definitely possible!
First, call read_csv with skip_blank_lines=False, since the default is True.
df = pd.read_csv(<filepath>, skip_blank_lines=False)
Then, create a dataframe that contains only the blank rows, using the isnull (or isna) method. This works by locating (.loc) the indices where all values in the row are null/blank.
blank_df = df.loc[df.isnull().all(axis=1)]
By utilizing the fact that this dataframe preserves the original indices, you can get the index of the first blank row.
Because this uses indexing, you will also want to check that there actually is a blank line in the csv. And finally, you simply slice the original dataframe in order to remove the unwanted lines.
if len(blank_df) > 0:
    first_blank_index = blank_df.index[0]
    df = df[:first_blank_index]
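Putting the three steps together (with the file contents inlined so the sketch runs as-is; real code would pass a file path):

```python
import io

import pandas as pd

csv_text = (
    "header1,header2,header3\n"
    "val11,val12,val13\n"
    "val21,val22,val23\n"
    "val31,val32,val33\n"
    "\n"
    "dhjsakfjkldsa\n"
    "fasdfggfhjhgsdfgds\n"
)

# Step 1: keep blank lines so the separator row survives as an all-NaN row.
df = pd.read_csv(io.StringIO(csv_text), skip_blank_lines=False)

# Step 2: find the rows where every column is null (only the blank line
# qualifies; the junk rows still have text in the first column).
blank_df = df.loc[df.isnull().all(axis=1)]

# Step 3: if a blank row exists, keep only what comes before it.
if len(blank_df) > 0:
    first_blank_index = blank_df.index[0]
    df = df[:first_blank_index]

print(df)
```

Because the junk rows have a value in header1, they are not all-null and do not match in step 2; only the truly blank separator row marks the cut point.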