How to eliminate "blank" rows that show up after importing an Excel file using pd.read_excel()

Question

I read in an Excel file from an external source:

import pandas as pd
df = pd.read_excel('https://www.sharkattackfile.net/spreadsheets/GSAF5.xls')

When I call df.tail(), I see that there are 25,841 rows in this dataframe.

Also, notably, I see that there is a value of ‘xx’ in the Case Number column. This is not valid data.

But, looking at the file itself, I see that there are only 6807 rows of valid data:

How do I get a dataframe that only has the valid data (i.e. rows 1-6807), noting that as cases are added to this file, the range would need to be dynamic?

Thanks for your help!

Asked By: equanimity

||

Source

Answer 1

You could use pandas DataFrame’s replace function, then do dropna to drop every np.nan values.

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html

DataFrame.replace('', np.nan)
DataFrame.replace('xx', np.nan)
DataFrame.dropna()

Answered By: Quoding

How to eliminate "blank" rows that show up after importing an Excel file using pd.read_excel()

Question:

Answers: