pandas dataframe drop columns by number of nan
Question:
I have a dataframe with some columns containing nan. I’d like to drop those columns with certain number of nan. For example, in the following code, I’d like to drop any column with 2 or more nan. In this case, column ‘C’ will be dropped and only ‘A’ and ‘B’ will be kept. How can I implement it?
import pandas as pd
import numpy as np
dff = pd.DataFrame(np.random.randn(10,3), columns=list('ABC'))
dff.iloc[3,0] = np.nan
dff.iloc[6,1] = np.nan
dff.iloc[5:8,2] = np.nan
print dff
Answers:
Here is a possible solution:
s = dff.isnull().apply(sum, axis=0) # count the number of nan in each column
print s
A 1
B 1
C 3
dtype: int64
for col in dff:
if s[col] >= 2:
del dff[col]
Or
for c in dff:
if sum(dff[c].isnull()) >= 2:
dff.drop(c, axis=1, inplace=True)
You can use a conditional list comprehension:
>>> dff[[c for c in dff if dff[c].isnull().sum() < 2]]
A B
0 -0.819004 0.919190
1 0.922164 0.088111
2 0.188150 0.847099
3 NaN -0.053563
4 1.327250 -0.376076
5 3.724980 0.292757
6 -0.319342 NaN
7 -1.051529 0.389843
8 -0.805542 -0.018347
9 -0.816261 -1.627026
There is a thresh
param for dropna
, you just need to pass the length of your df – the number of NaN
values you want as your threshold:
In [13]:
dff.dropna(thresh=len(dff) - 2, axis=1)
Out[13]:
A B
0 0.517199 -0.806304
1 -0.643074 0.229602
2 0.656728 0.535155
3 NaN -0.162345
4 -0.309663 -0.783539
5 1.244725 -0.274514
6 -0.254232 NaN
7 -1.242430 0.228660
8 -0.311874 -0.448886
9 -0.984453 -0.755416
So the above will drop any column that does not meet the criteria of the length of the df (number of rows) – 2 as the number of non-Na values.
I recommend the drop
-method. This is an alternative solution:
dff.drop(dff.loc[:,len(dff) - dff.isnull().sum() <2], axis=1)
Say you have to drop columns having more than 70% null values.
data.drop(data.loc[:,list((100*(data.isnull().sum()/len(data.index))>70))].columns, 1)
You can do this through another approach as well like below for dropping columns having certain number of na values:
df = df.drop( columns= [x for x in df if df[x].isna().sum() > 5 ])
For dropping columns having certain percentage of na values :
df = df.drop(columns= [x for x in df if round((df[x].isna().sum()/len(df)*100),2) > 20 ])
I have a dataframe with some columns containing nan. I’d like to drop those columns with certain number of nan. For example, in the following code, I’d like to drop any column with 2 or more nan. In this case, column ‘C’ will be dropped and only ‘A’ and ‘B’ will be kept. How can I implement it?
import pandas as pd
import numpy as np
dff = pd.DataFrame(np.random.randn(10,3), columns=list('ABC'))
dff.iloc[3,0] = np.nan
dff.iloc[6,1] = np.nan
dff.iloc[5:8,2] = np.nan
print dff
Here is a possible solution:
s = dff.isnull().apply(sum, axis=0) # count the number of nan in each column
print s
A 1
B 1
C 3
dtype: int64
for col in dff:
if s[col] >= 2:
del dff[col]
Or
for c in dff:
if sum(dff[c].isnull()) >= 2:
dff.drop(c, axis=1, inplace=True)
You can use a conditional list comprehension:
>>> dff[[c for c in dff if dff[c].isnull().sum() < 2]]
A B
0 -0.819004 0.919190
1 0.922164 0.088111
2 0.188150 0.847099
3 NaN -0.053563
4 1.327250 -0.376076
5 3.724980 0.292757
6 -0.319342 NaN
7 -1.051529 0.389843
8 -0.805542 -0.018347
9 -0.816261 -1.627026
There is a thresh
param for dropna
, you just need to pass the length of your df – the number of NaN
values you want as your threshold:
In [13]:
dff.dropna(thresh=len(dff) - 2, axis=1)
Out[13]:
A B
0 0.517199 -0.806304
1 -0.643074 0.229602
2 0.656728 0.535155
3 NaN -0.162345
4 -0.309663 -0.783539
5 1.244725 -0.274514
6 -0.254232 NaN
7 -1.242430 0.228660
8 -0.311874 -0.448886
9 -0.984453 -0.755416
So the above will drop any column that does not meet the criteria of the length of the df (number of rows) – 2 as the number of non-Na values.
I recommend the drop
-method. This is an alternative solution:
dff.drop(dff.loc[:,len(dff) - dff.isnull().sum() <2], axis=1)
Say you have to drop columns having more than 70% null values.
data.drop(data.loc[:,list((100*(data.isnull().sum()/len(data.index))>70))].columns, 1)
You can do this through another approach as well like below for dropping columns having certain number of na values:
df = df.drop( columns= [x for x in df if df[x].isna().sum() > 5 ])
For dropping columns having certain percentage of na values :
df = df.drop(columns= [x for x in df if round((df[x].isna().sum()/len(df)*100),2) > 20 ])