Remove Columns with missing values above a threshold pandas

Question

I am doing data preprocessing and want to remove features/columns which have more than say 10% missing values.

I have made the below code:

df_missing=df.isna()
result=df_missing.sum()/len(df)
result

Default           0.010066
Income            0.142857
Age               0.109090
Name              0.047000
Gender            0.000000
Type of job       0.200000
Amt of credit     0.850090
Years employed    0.009003
dtype: float64

I want df to keep only the columns where no more than 10% of the values are missing.

Expected output:

df

Default   Name   Gender   Years employed

(columns where there were missing values greater than 10% are removed.)

I have tried

result.iloc[:,0] 
IndexingError: Too many indexers

Please help

Asked By: noob

Answer 1

Because dividing the sum by the length gives the mean, you can use df.isna().mean() instead of df_missing.sum()/len(df):

result = df.isna().mean()
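
A quick check of that equivalence, as a minimal sketch on a made-up frame (the name toy and its columns and values are hypothetical, not the asker's data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, only for illustration.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, 1.0, 2.0],
    "c": [1, 2, 3, 4],
})

by_sum = toy.isna().sum() / len(toy)   # fraction missing, the long way
by_mean = toy.isna().mean()            # mean of the boolean mask per column

same = bool(np.allclose(by_sum, by_mean))
```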

Then filter with DataFrame.loc, using : to select all rows and a boolean mask to select the columns:

df = df.loc[:,result > .1]

Answered By: jezrael

Answer 2

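The mask in the answer above is inverted: result > .1 keeps the columns with more than 10% missing. It should be df = df.loc[:,result < .1], since the user only wants to keep the columns that have less than 10% of their rows missing.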
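
To see which direction of the comparison keeps which columns, here is a small sketch on invented data (df_demo, its column values, and the threshold variable are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

# Hypothetical data: 20 rows, with 0, 1 and 6 missing values per column.
df_demo = pd.DataFrame({
    "Gender": ["F", "M"] * 10,
    "Name": ["n%d" % i for i in range(19)] + [None],
    "Income": [np.nan] * 6 + [float(i) for i in range(14)],
})

missing_share = df_demo.isna().mean()   # fraction of missing values per column

# Keep columns whose missing share does not exceed the threshold.
# Use < instead of <= if a column at exactly 10% should be dropped as well.
threshold = 0.1
kept = df_demo.loc[:, missing_share <= threshold]
kept_columns = list(kept.columns)
```

Here Income (30% missing) is dropped, while Gender and Name stay. Flipping the comparison to > would keep only Income.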

Answered By: Unknown

Answer 3

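pandas has a built-in method for this, DataFrame.dropna. Note that thresh is the minimum number of non-missing values a column must have to be kept, so dropping columns with more than 10% missing means requiring at least 90% non-missing (math.ceil needs import math):

df_clean = df.dropna(axis=1, thresh=math.ceil(len(df) * .9))

Or, if you don't want to create an extra DataFrame object, you can do it in place:

df.dropna(axis=1, thresh=math.ceil(len(df) * .9), inplace=True)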
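
Because thresh counts non-missing values rather than missing ones, it is easy to get backwards. The sketch below checks the behaviour on invented data (demo, max_missing and min_non_na are hypothetical names, not part of the original answer):

```python
import math
import numpy as np
import pandas as pd

# Hypothetical data: 20 rows; "mostly_full" has 1 NaN, "sparse" has 6.
demo = pd.DataFrame({
    "full": range(20),
    "mostly_full": [np.nan] + [1.0] * 19,
    "sparse": [np.nan] * 6 + [1.0] * 14,
})

max_missing = 0.1
# thresh = minimum number of NON-missing values a column needs to survive.
min_non_na = math.ceil(len(demo) * (1 - max_missing))

cleaned = demo.dropna(axis=1, thresh=min_non_na)
cleaned_columns = list(cleaned.columns)
```

With 20 rows, a column needs at least 18 non-missing values to be kept, so sparse is dropped and the other two columns remain.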

Answered By: ghost_in_the
