Pandas: read_csv ignore rows after a blank line

Question:

There is a weird .csv file, something like:

header1,header2,header3
val11,val12,val13
val21,val22,val23
val31,val32,val33

so far so good, but after these lines there is always a blank line followed by lots of useless lines. The whole file looks something like:


header1,header2,header3
val11,val12,val13
val21,val22,val23
val31,val32,val33

dhjsakfjkldsa
fasdfggfhjhgsdfgds
gsdgffsdgfdgsdfgs
gsdfdgsg

The number of lines at the bottom is totally random; the only marker is the empty line before them.

Pandas has a parameter “skipfooter” for ignoring a known number of rows in the footer.

Any idea how to ignore these rows without actually opening (open()…) the file and removing them?

Asked By: Thiago Melo


Answers:

If you’re using the csv module, it’s fairly trivial to detect an empty row.

import csv

with open(filename, newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        if not row:  # an empty list means a blank line
            break
        # otherwise, process the row
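If the goal is still a DataFrame, the rows collected before the blank line can be handed to pandas afterwards. A minimal sketch of that idea, using an in-memory string in place of the real file:

```python
import csv
import io

import pandas as pd

# sample data standing in for the file: a blank line, then junk
raw = """header1,header2,header3
val11,val12,val13
val21,val22,val23
val31,val32,val33

dhjsakfjkldsa
fasdfggfhjhgsdfgds
"""

rows = []
reader = csv.reader(io.StringIO(raw))
for row in reader:
    if not row:          # blank line -> stop collecting
        break
    rows.append(row)

# the first collected row is the header, the rest are data
df = pd.DataFrame(rows[1:], columns=rows[0])
print(df.shape)  # (3, 3)
```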
Answered By: Patrick Haugh

There is no option to make read_csv stop at the first blank line. The parser cannot accept or reject lines based on arbitrary conditions; it can only skip blank lines (optionally) or discard rows that break the expected shape of the data (rows with too many separators).

You can normalize the data with the approaches below (without parsing the file yourself – pure pandas):

  1. Knowing the number of desired/trash data rows. [Manual]

    pd.read_csv('file.csv', nrows=3) or pd.read_csv('file.csv', skipfooter=4, engine='python')

  2. Preserving the desired data by eliminating the others from the DataFrame. [Automatic]

    df.dropna(axis=0, how='any', inplace=True)

The results will be:

  header1 header2 header3
0   val11   val12   val13
1   val21   val22   val23
2   val31   val32   val33
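For the sample file above, the dropna approach works because the trash lines have fewer fields than the header, so pandas pads the missing columns with NaN; note that it would also drop any legitimate row containing a NaN. A quick check with an in-memory file:

```python
import io

import pandas as pd

raw = """header1,header2,header3
val11,val12,val13
val21,val22,val23
val31,val32,val33

dhjsakfjkldsa
fasdfggfhjhgsdfgds
gsdgffsdgfdgsdfgs
gsdfdgsg
"""

df = pd.read_csv(io.StringIO(raw))  # blank line is skipped by default
df.dropna(axis=0, how='any', inplace=True)
print(df)
#   header1 header2 header3
# 0   val11   val12   val13
# 1   val21   val22   val23
# 2   val31   val32   val33
```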
Answered By: amin

Solution:

df = pd.read_csv(<filepath>, skip_blank_lines=False)
blank_df = df.loc[df.isnull().all(axis=1)]
if len(blank_df) > 0:
    first_blank_index = blank_df.index[0]
    df = df[:first_blank_index]

Explanation:

The best way to do this using pandas native functions is a combination of arguments and function calls – a bit messy, but definitely possible!

First, call read_csv with skip_blank_lines=False, since the default is True.

df = pd.read_csv(<filepath>, skip_blank_lines=False)

Then, create a dataframe that contains only the blank rows, using the isnull (or isna) method. This works by locating (.loc) the indices where all values in the row are null/blank.

blank_df = df.loc[df.isnull().all(axis=1)]

By utilizing the fact that this dataframe preserves the original indices, you can get the index of the first blank row.

Because this uses indexing, you will also want to check that there actually is a blank line in the csv. And finally, you simply slice the original dataframe in order to remove the unwanted lines.

if len(blank_df) > 0:
    first_blank_index = blank_df.index[0]
    df = df[:first_blank_index]
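Putting the accepted solution together as a runnable sketch, with an in-memory file standing in for the real path:

```python
import io

import pandas as pd

raw = """header1,header2,header3
val11,val12,val13
val21,val22,val23
val31,val32,val33

dhjsakfjkldsa
fasdfggfhjhgsdfgds
"""

# keep the blank line so it shows up as an all-NaN row
df = pd.read_csv(io.StringIO(raw), skip_blank_lines=False)

blank_df = df.loc[df.isnull().all(axis=1)]
if len(blank_df) > 0:
    first_blank_index = blank_df.index[0]
    df = df[:first_blank_index]  # keep everything before the blank row

print(df.shape)  # (3, 3)
```

With the default RangeIndex, the label of the first blank row equals its position, so the plain slice df[:first_blank_index] cuts the frame just before it.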
Answered By: Andrew Pye