Imputation of missing values for categories in pandas
Question:
How can I fill NaNs with the most frequent level of a categorical column in a pandas DataFrame?
R's randomForest package has na.roughfix, which returns:
A completed data matrix or data frame. For numeric variables, NAs are replaced with column medians. For factor variables, NAs are replaced with the most frequent levels (breaking ties at random). If object contains no NAs, it is returned unaltered.
In pandas, for numeric variables I can fill NaN values with:
df = df.fillna(df.median())
Answers:
You can use
df = df.fillna(df['Label'].value_counts().index[0])
to fill NaNs with the most frequent value from one column (here 'Label'). Note that this fills the NaNs in every column of df with that single value.
If you want to fill every column with its own most frequent value, you can use
df = df.apply(lambda x: x.fillna(x.value_counts().index[0]))
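As a minimal sketch of the two calls above, here is a made-up frame (the column names `Label` and `Size` and their values are illustrative assumptions, not from the question). `value_counts()` skips NaN by default, so its first index entry is the most frequent non-missing value. Passing a dict to `fillna` is one way to restrict the fill to a single column.

```python
import numpy as np
import pandas as pd

# Hypothetical example data
df = pd.DataFrame({
    "Label": ["a", "b", "a", np.nan],
    "Size": ["s", np.nan, "m", "m"],
})

# Most frequent non-missing value of one column
top_label = df["Label"].value_counts().index[0]

# Fill only that column; other columns keep their NaNs
one_col = df.fillna({"Label": top_label})

# Fill every column with its own most frequent value
all_cols = df.apply(lambda s: s.fillna(s.value_counts().index[0]))
```

Both calls return new objects, so `df` itself is left unchanged.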
UPDATE 2018-25-10 ⬇
Starting from 0.13.1, pandas includes a mode method for Series and DataFrames.
You can use it to fill missing values for each column (using its own most frequent value) like this:
df = df.fillna(df.mode().iloc[0])
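One detail worth knowing about the `mode` approach is how ties are handled. R's na.roughfix breaks ties at random, whereas `DataFrame.mode()` returns every tied value in sorted order, one per row, so `.iloc[0]` picks the first in sort order. The sketch below uses made-up columns `c1` and `c2` to show this.

```python
import numpy as np
import pandas as pd

# Hypothetical example data
df = pd.DataFrame({
    "c1": ["x", "y", "x", "y", np.nan],  # tie between "x" and "y"
    "c2": ["p", "q", "q", np.nan, "q"],  # single mode "q"
})

modes = df.mode()            # one row per tied mode, shorter columns padded with NaN
fill_values = modes.iloc[0]  # first row: the first mode (in sort order) of each column
filled = df.fillna(fill_values)
```

If random tie-breaking matters for your use case, you would need to sample from the rows of `modes` yourself rather than take the first row.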
def fillna(col):
    # return a filled copy instead of relying on inplace=True inside apply
    return col.fillna(col.value_counts().index[0])

df = df.apply(fillna)
In more recent versions of scikit-learn you can use SimpleImputer to impute both numerics and categoricals:
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

arr = [[1., 'x'], [np.nan, 'y'], [7., 'z'], [7., 'y'], [4., np.nan]]
df1 = pd.DataFrame({'x1': [x[0] for x in arr],
                    'x2': [x[1] for x in arr]},
                   index=[l for l in 'abcde'])
imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
print(pd.DataFrame(imp.fit_transform(df1),
                   columns=df1.columns,
                   index=df1.index))
#   x1 x2
# a  1  x
# b  7  y
# c  7  z
# d  7  y
# e  4  y
Most of the time, you wouldn’t want the same imputing strategy for all the columns. For example, you may want column mode for categorical variables and column mean or median for numeric columns.
For example:
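>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({'num': [1.,2.,4.,np.nan],'cate1':['a','a','b',np.nan],'cate2':['a','b','b',np.nan]})
>>> # numeric columns: pass the Series of column means so each column gets its own mean
>>> # (a scalar such as .mean().iloc[0] would be written into every column, categorical ones included)
>>> df.fillna(df.select_dtypes(include='number').mean(), inplace=True)
>>> # categorical columns: first row of mode() holds one mode per column
>>> df.fillna(df.select_dtypes(include='object').mode().iloc[0], inplace=True)
>>> print(df)
     num cate1 cate2
0  1.000     a     a
1  2.000     a     b
2  4.000     b     b
3  2.333     a     b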
You can choose the imputing strategy for each column based on its dtype. For example, you may want the column mode for categorical variables and the column mean for numeric columns.
For example:
import numpy as np
import pandas as pd

df = pd.DataFrame({'num': [1.,2.,4.,np.nan],'cate1':['a','a','b',np.nan],'cate2':['a','b','b',np.nan]})
# numeric columns
for col in df.select_dtypes(include=['number']):
    # assign the result back; inplace=True on df[col] may not update df
    df[col] = df[col].fillna(df[col].mean())
# categorical columns
for col in df.select_dtypes(include=['object']):
    df[col] = df[col].fillna(df[col].mode()[0])
print(df)
Output:
        num cate1 cate2
0  1.000000     a     a
1  2.000000     a     b
2  4.000000     b     b
3  2.333333     a     b
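The dtype-based loops above can be wrapped into one helper that comes closer to what the question describes for na.roughfix: median for numeric columns, most frequent value otherwise. This is only a sketch under stated assumptions: the name `rough_fix` is hypothetical, ties are resolved by taking the first mode in sort order (not at random as in R), and columns that are entirely NaN are left untouched.

```python
import numpy as np
import pandas as pd

def rough_fix(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with NaNs filled: median for numeric columns, mode otherwise."""
    out = df.copy()
    for col in out.columns:
        if out[col].isna().all():
            continue  # no observed values to derive a fill value from
        if pd.api.types.is_numeric_dtype(out[col]):
            fill = out[col].median()
        else:
            fill = out[col].mode().iloc[0]  # first mode in sort order
        out[col] = out[col].fillna(fill)
    return out

df = pd.DataFrame({'num': [1., 2., 4., np.nan],
                   'cate1': ['a', 'a', 'b', np.nan],
                   'cate2': ['a', 'b', 'b', np.nan]})
fixed = rough_fix(df)
```

Because the helper works on a copy and assigns each filled column back explicitly, it does not depend on `inplace=True` behaviour.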
If you want to fill a single column:
from sklearn.impute import SimpleImputer

# create a SimpleImputer object with the most frequent strategy
imputer = SimpleImputer(strategy='most_frequent')
# select the column to impute
column_to_impute = 'customer type'
# impute missing values in the selected column (the double brackets pass a 2D frame, as the imputer expects)
imputed_column = imputer.fit_transform(df[[column_to_impute]])
# replace the original column with the imputed column, flattened from shape (n, 1) to (n,)
df[column_to_impute] = imputed_column.ravel()
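If the column has the pandas `category` dtype, as the question's wording suggests, a pandas-only alternative is to fill it with its own mode. The mode is always one of the existing categories, so `fillna` accepts it and the dtype is preserved. The values below are made up for illustration; only the column name `customer type` is borrowed from the answer above.

```python
import numpy as np
import pandas as pd

# Hypothetical example data with category dtype
s = pd.Series(["retail", "corporate", np.nan, "retail"],
              dtype="category", name="customer type")

most_frequent = s.mode().iloc[0]  # first mode if there are ties
filled = s.fillna(most_frequent)  # stays a categorical Series
```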