Remove columns with missing values above a threshold in pandas
Question:
I am doing data preprocessing and want to remove features/columns that have more than, say, 10% missing values.
I wrote the code below:
df_missing=df.isna()
result=df_missing.sum()/len(df)
result
Default 0.010066
Income 0.142857
Age 0.109090
Name 0.047000
Gender 0.000000
Type of job 0.200000
Amt of credit 0.850090
Years employed 0.009003
dtype: float64
I want df to keep only the columns in which no more than 10% of the values are missing.
Expected output:
df
Default Name Gender Years employed
(Columns with more than 10% missing values are removed.)
I have tried:
result.iloc[:,0]
but it raises:
IndexingError: Too many indexers
Please help.
Answers:
Because dividing the sum by the length gives the mean, you can replace df_missing.sum()/len(df) with df_missing.mean(), or compute it directly from df:
result = df.isna().mean()
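As a quick check of that equivalence, here is a minimal sketch on a small made-up frame (the column values are illustrative, not the asker's data). It also shows that the result is a one-dimensional Series indexed by column name, which is why the two-axis indexer result.iloc[:,0] from the question is rejected.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the asker's DataFrame.
df = pd.DataFrame({
    "Default": [0, 1, 0, 1],
    "Income": [50.0, np.nan, np.nan, 70.0],
    "Gender": ["F", "M", None, "F"],
})

by_sum = df.isna().sum() / len(df)   # the question's approach
by_mean = df.isna().mean()           # same numbers in one call

print(by_mean)

# Both are Series with one entry per column of df.
same = np.allclose(by_sum, by_mean)

# A Series has a single axis, so a two-axis indexer fails.
try:
    by_mean.iloc[:, 0]
    error_name = None
except Exception as exc:  # pandas raises IndexingError here
    error_name = type(exc).__name__

print(same, error_name)
```

To look up a single column's share of missing values, index the Series by label instead, for example by_mean["Income"].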
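Then filter with DataFrame.loc, using : to select all rows and the boolean mask to select the columns to keep:
df = df.loc[:, result <= .1]
Note that the mask selects the columns to keep, so the comparison must be result <= .1 (or result < .1 if columns at exactly 10% should also be dropped), not result > .1. The latter would keep exactly the columns that should be removed, since the user only wants the columns that have no more than 10% of their rows missing.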
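The mask approach can be sketched end to end as follows. The frame, the max_missing name, and the 10% value are assumptions chosen for illustration; only the kept columns should survive the filter.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with 20 rows so the shares are easy to read.
df = pd.DataFrame({
    "Default": list(range(20)),                # 0% missing
    "Name": ["a"] * 19 + [None],               # 5% missing
    "Income": [1.0] * 16 + [np.nan] * 4,       # 20% missing
    "Amt of credit": [np.nan] * 17 + [5.0] * 3,  # 85% missing
})

max_missing = 0.1  # assumed threshold: at most 10% missing per column

missing_share = df.isna().mean()     # fraction of missing values per column
keep = missing_share <= max_missing  # boolean mask, True for columns to keep
filtered = df.loc[:, keep]           # all rows, only the kept columns

print(list(filtered.columns))
```

Building the mask as a separate variable makes it easy to inspect which columns will be dropped (missing_share[~keep]) before committing to the filter.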
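pandas also has a built-in method for this, DataFrame.dropna. Its thresh parameter is the minimum number of non-missing values a column must have to be kept, so allowing at most 10% missing values means requiring at least 90% non-missing values:
import math
df_clean = df.dropna(axis=1, thresh=math.ceil(len(df) * .9))
Or if you don’t want to create an extra DataFrame object, you can do it in place:
df.dropna(axis=1, thresh=math.ceil(len(df) * .9), inplace=True)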
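Because thresh counts non-missing values rather than missing ones, it is easy to get the direction wrong. The sketch below, on made-up data with an assumed 10% limit, converts the limit into a minimum non-missing count and checks that dropna then agrees with the boolean-mask filter. The names max_missing and min_non_missing are illustrative.

```python
import math

import numpy as np
import pandas as pd

# Hypothetical toy data with 20 rows.
df = pd.DataFrame({
    "Default": list(range(20)),                # 20 non-missing values
    "Name": ["a"] * 19 + [None],               # 19 non-missing values
    "Income": [1.0] * 16 + [np.nan] * 4,       # 16 non-missing values
    "Amt of credit": [np.nan] * 17 + [5.0] * 3,  # 3 non-missing values
})

max_missing = 0.1  # assumed limit on the share of missing values

# thresh is a count of NON-missing values, so convert the limit accordingly.
min_non_missing = math.ceil(len(df) * (1 - max_missing))

dropped = df.dropna(axis=1, thresh=min_non_missing)
masked = df.loc[:, df.isna().mean() <= max_missing]

print(min_non_missing, list(dropped.columns))
```

By contrast, a thresh of len(df) * .1 would keep any column that is at least 10% filled, which here would retain every column, including "Amt of credit".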