Better way to drop NaN rows in pandas
Question:
On my own I found a way to drop NaN rows from a pandas DataFrame. Given a DataFrame dat with a column x that contains NaN values, is there a more elegant way to drop each row of dat that has a NaN value in the x column?
dat = dat[np.logical_not(np.isnan(dat.x))]
dat = dat.reset_index(drop=True)
Answers:
Use dropna:
dat.dropna()
You can pass the how parameter to choose whether a row is dropped when all of its values are NaN or when any of them is:
dat.dropna(how='any')  # drop a row if any value in it is NaN
dat.dropna(how='all')  # drop a row only if all values in it are NaN
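For example, on a small illustrative frame (the column names and values are made up for the demo):

```python
import numpy as np
import pandas as pd

dat = pd.DataFrame({'x': [1.0, np.nan, 3.0],
                    'y': [np.nan, np.nan, 6.0]})

# how='any': drop a row if any value in it is NaN -> only row 2 survives
any_dropped = dat.dropna(how='any')

# how='all': drop a row only if every value in it is NaN -> only row 1 goes
all_dropped = dat.dropna(how='all')
```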
Hope that answers your question!
Edit 1:
In case you want to drop rows containing NaN values only in particular column(s), as suggested by J. Doe in his answer below, you can use the following:
dat.dropna(subset=col_list)  # col_list is a list of column names to consider for NaN values
To expand on Hitesh's answer: if you want to drop rows where 'x' specifically is NaN, use the subset parameter. His plain dropna() will also drop rows where other columns have NaNs:
dat.dropna(subset=['x'])
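A small sketch of the difference (the frame is illustrative):

```python
import numpy as np
import pandas as pd

dat = pd.DataFrame({'x': [1.0, np.nan, 3.0],
                    'y': [np.nan, 5.0, 6.0]})

# Only NaNs in 'x' count; the NaN in 'y' on row 0 is ignored
subset_result = dat.dropna(subset=['x'])

# A plain dropna() also drops row 0 because of the NaN in 'y'
plain_result = dat.dropna()
```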
In case the commands in the previous answers don't work, try this:
dat.dropna(subset=['x'], inplace=True)
or build the boolean mask explicitly:
bool_series = pd.notnull(dat["x"])
dat = dat[bool_series]
To remove rows based on NaN values in a particular column:
import numpy as np
import pandas as pd

d = pd.DataFrame([[2, 3], [4, None]])  # create the DataFrame
d
Output:
0 1
0 2 3.0
1 4 NaN
d = d[np.isfinite(d[1])]  # keep rows where column 1 is not NaN
d
Output:
0 1
0 2 3.0
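Note that np.isfinite only works on numeric (float) columns; a more idiomatic filter, which also handles object dtypes, is notna(), sketched here on the same frame:

```python
import pandas as pd

d = pd.DataFrame([[2, 3], [4, None]])

# Keep rows where the column labelled 1 is not NaN
d_clean = d[d[1].notna()]
```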
dropna() is probably all you need here, but building a custom filter can also help, or be easier to reason about:
import pandas as pd
import numpy as np

df = pd.DataFrame(
    [[4, 7, np.nan, np.nan],
     [5, np.nan, 11, 2],
     [6, 9, 12, np.nan]],
    index=[1, 2, 3],
    columns=['a', 'b', 'c', 'd'])
print(f'starting matrix:\n{df}')
# build a boolean matrix marking the NaNs
null_matrix = df.isnull()
# count the NaNs in each row
sum_null_matrix = null_matrix.sum(axis=1)
# keep rows with fewer than 2 NaNs
query_null = sum_null_matrix < 2
# apply the mask to the frame
applied_df = df[query_null]
print(f'query matrix:\n{query_null}')
print(f'applied matrix:\n{applied_df}')
and you get the result:
starting matrix:
a b c d
1 4 7.0 NaN NaN
2 5 NaN 11.0 2.0
3 6 9.0 12.0 NaN
query matrix:
1 False
2 True
3 True
dtype: bool
applied matrix:
a b c d
2 5 NaN 11.0 2.0
3 6 9.0 12.0 NaN
More info is available in the NaN-checking answer:
How to check if any value is NaN in a Pandas DataFrame
Edit: dropna() has a thresh parameter, but not a 'min NaNs' one. This answer is for when someone needs to build a 'min NaNs' filter or some other custom rule.
This answer introduces the thresh parameter, which is very useful in some use cases.
Note: I added this answer because some questions have been marked as duplicates pointing to this page, yet none of the approaches here addresses use cases such as the df format below.
Example:
This approach addresses:
- Dropping rows/columns where all values are NaN
- Keeping rows/columns with a desired number of non-NaN values (i.e. valid data)
# Approaching rows
------------------
# Sample df
df = pd.DataFrame({'Names': ['Name1', 'Name2', 'Name3', 'Name4'],
'Sunday': [2, None, 3, 3],
'Tuesday': [0, None, 3, None],
'Wednesday': [None, None, 4, None],
'Friday': [1, None, 7, None]})
print(df)
Names Sunday Tuesday Wednesday Friday
0 Name1 2.0 0.0 NaN 1.0
1 Name2 NaN NaN NaN NaN
2 Name3 3.0 3.0 4.0 7.0
3 Name4 3.0 NaN NaN NaN
# Keep only the rows with at least 2 non-NA values.
df = df.dropna(thresh=2)
print(df)
Names Sunday Tuesday Wednesday Friday
0 Name1 2.0 0.0 NaN 1.0
2 Name3 3.0 3.0 4.0 7.0
3 Name4 3.0 NaN NaN NaN
# Keep only the rows with at least 3 non-NA values.
df = df.dropna(thresh=3)
print(df)
Names Sunday Tuesday Wednesday Friday
0 Name1 2.0 0.0 NaN 1.0
2 Name3 3.0 3.0 4.0 7.0
# Approaching columns: we need axis=1 to direct the drop to columns
------------------------------------------------------------------
# With axis=0 (the default), the drop applies to rows, as in the examples above
# original df
print(df)
Names Sunday Tuesday Wednesday Friday
0 Name1 2.0 0.0 NaN 1.0
1 Name2 NaN NaN NaN NaN
2 Name3 3.0 3.0 4.0 7.0
3 Name4 3.0 NaN NaN NaN
# Keep only the columns with at least 2 non-NA values.
df = df.dropna(axis=1, thresh=2)
print(df)
Names Sunday Tuesday Friday
0 Name1 2.0 0.0 1.0
1 Name2 NaN NaN NaN
2 Name3 3.0 3.0 7.0
3 Name4 3.0 NaN NaN
# Keep only the columns with at least 3 non-NA values.
df = df.dropna(axis=1, thresh=3)
print(df)
Names Sunday
0 Name1 2.0
1 Name2 NaN
2 Name3 3.0
3 Name4 3.0
Conclusion:
- The thresh parameter (see the pd.dropna() doc) gives you the flexibility to decide how many non-NA values you want to keep in a row/column.
- The thresh parameter handles a dataframe of the above structure, which df.dropna(how='all') does not.
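To make the conclusion concrete, here is a small sketch (reusing the sample frame from above) showing that how='all' keeps the mostly-NaN Name2 row while thresh=2 drops it:

```python
import pandas as pd

df = pd.DataFrame({'Names': ['Name1', 'Name2', 'Name3', 'Name4'],
                   'Sunday': [2, None, 3, 3],
                   'Tuesday': [0, None, 3, None],
                   'Wednesday': [None, None, 4, None],
                   'Friday': [1, None, 7, None]})

# how='all' drops only rows that are entirely NaN; 'Names' is never NaN,
# so nothing is removed
kept_all = df.dropna(how='all')

# thresh=2 requires at least 2 non-NA values per row, so Name2 (only 1) goes
kept_thresh = df.dropna(thresh=2)
```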