How to handle missing values (NaN) in categorical data when using scikit-learn OneHotEncoder?

Question:

I have recently started learning Python to develop a predictive model for a research project using machine learning methods. I have a large dataset comprising both numerical and categorical data, with many missing values. I am currently trying to encode the categorical features using OneHotEncoder. When I read about OneHotEncoder, my understanding was that for a missing value (NaN), OneHotEncoder would assign 0s to all of the feature’s categories, like this:

0     Male 
1     Female
2     NaN

After applying OneHotEncoder:

0     10 
1     01
2     00

However, when running the following code:

    # Encoding categorical data
    import numpy as np
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder


    ct = ColumnTransformer([('encoder', OneHotEncoder(handle_unknown='ignore'), [1])],
                           remainder='passthrough')
    obj_df = np.array(ct.fit_transform(obj_df))
    print(obj_df)

I am getting the error ValueError: Input contains NaN

So I am guessing my previous understanding of how OneHotEncoder handles missing values is wrong.
Is there a way for me to get the functionality described above? I know imputing the missing values before encoding will resolve this issue, but I am reluctant to do this as I am dealing with medical data and fear that imputation may decrease the predictive accuracy of my model.

I found this question that is similar but the answer doesn’t offer a detailed enough solution on how to deal with the NaN values.

Let me know what your thoughts are, thanks.

Asked By: sums22


Answers:

  1. Replace the NaN values with a placeholder category such as “Others”.
  2. Then proceed with one-hot encoding.
  3. You can then remove the “Others” column, which leaves the rows that were NaN as all zeros.
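
The three steps above can be sketched as follows. This is only a minimal illustration: the column name `sex`, the toy data, and the placeholder string are assumptions, not taken from the asker's dataset.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical); dtype=object keeps NaN as a plain float NaN.
df = pd.DataFrame({"sex": ["Male", "Female", np.nan]}, dtype=object)

# Step 1: replace NaN with a placeholder category.
filled = df.fillna("Others")

# Step 2: one-hot encode; the placeholder becomes its own column.
enc = OneHotEncoder(handle_unknown="ignore")
encoded = enc.fit_transform(filled).toarray()

# Step 3: drop the placeholder column, so former NaN rows are all zeros.
keep = enc.categories_[0] != "Others"
encoded = encoded[:, keep]
print(encoded)
```

Looking the column up through `categories_` rather than by position avoids depending on where the placeholder happens to sort among the real categories.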
Answered By: Om Rastogi

You will need to impute the missing values before encoding. You can define a Pipeline with an imputation step that uses SimpleImputer with the constant strategy to fill null fields with a new category, prior to the one-hot encoding:

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Fill NaN with the constant 'missing', then one-hot encode the result
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))])

# Apply the categorical pipeline to column 0
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, [0])
    ])

df = pd.DataFrame(['Male', 'Female', np.nan])
preprocessor.fit_transform(df)
# Output (columns: Female, Male, missing):
# array([[0., 1., 0.],
#        [1., 0., 0.],
#        [0., 0., 1.]])
Answered By: yatu

As far as I understand, after a long time dealing with this problem, it occurs because the train/test split leaves some columns with the same value for all data samples. If rows that are stored close to each other are more similar, this is more likely to happen. Shuffling the data can help in this case. This seems to be a bug in scikit-learn; I’m using version 0.23.2.

Answered By: ElhamMotamedi

As of version 0.24, OneHotEncoder treats missing values as their own category; see the What’s New entry.
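
A minimal sketch of this behaviour, assuming scikit-learn 0.24 or later is installed. The column name `sex` and the toy data are made up for illustration. The last two lines show one possible way to get back to the all-zeros representation from the question by dropping the NaN column.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical) with one missing value.
df = pd.DataFrame({"sex": ["Male", "Female", np.nan]}, dtype=object)

# No imputation step: NaN is learned as a category of its own.
enc = OneHotEncoder(handle_unknown="ignore")
encoded = enc.fit_transform(df).toarray()
print(enc.categories_)  # the learned categories include NaN
print(encoded)          # one extra column for the NaN category

# Optional: drop the NaN column so missing rows become all zeros.
is_nan_category = pd.isna(enc.categories_[0])
without_nan = encoded[:, ~is_nan_category]
print(without_nan)
```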

Answered By: Ben Reiniger

One option here would be to use the pandas get_dummies() function (see the pandas documentation). Its dummy_na parameter can be set to True to include NaN as a separate category. Based on your desired output, it seems the default value (dummy_na=False) will do.

obj_df_encoded = pd.get_dummies(obj_df)
print(obj_df_encoded)
>> 1 0
>> 0 1
>> 0 0
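
For a self-contained version of the snippet above, here is a small sketch on made-up data (the column name `sex` is an assumption). It also shows the `dummy_na=True` variant for comparison; `dtype=int` is passed only so that the result prints as 0/1 regardless of the pandas version.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical) with one missing value.
obj_df = pd.DataFrame({"sex": ["Male", "Female", np.nan]}, dtype=object)

# Default (dummy_na=False): the NaN row is encoded as all zeros.
obj_df_encoded = pd.get_dummies(obj_df, dtype=int)
print(obj_df_encoded)

# dummy_na=True: NaN gets its own indicator column instead.
with_na = pd.get_dummies(obj_df, dummy_na=True, dtype=int)
print(with_na)
```

Note that get_dummies does not remember the categories it saw, so the columns can differ between training and new data; that is one reason to prefer an encoder fitted inside a scikit-learn pipeline.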

If using scikit-learn’s OneHotEncoder is necessary, you can fill NaN values with pandas fillna('something'), one-hot encode the result so that the placeholder becomes a new category, then remove that column afterwards.

Answered By: Jamie

OneHotEncoder encodes missing values as an additional column. You can prevent the creation of this potentially useless column by setting the categories manually (as shown below) or by using the ‘drop’ parameter of OneHotEncoder. This encoder will give you the outputs you illustrated:

enc = OneHotEncoder(categories = [[0, 1]], handle_unknown='ignore')
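
The line above presumably assumes the feature is already coded as 0/1. A sketch of the same idea for string labels follows, assuming scikit-learn 0.24 or later; the column name `sex` and the toy data are illustrative. Because NaN is not in the manual category list, `handle_unknown='ignore'` should encode it as all zeros.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical) with one missing value.
df = pd.DataFrame({"sex": ["Male", "Female", np.nan]}, dtype=object)

# List only the real categories; NaN is deliberately left out,
# so it is handled as an unknown value and ignored.
enc = OneHotEncoder(categories=[["Female", "Male"]], handle_unknown="ignore")
encoded = enc.fit_transform(df).toarray()
print(encoded)
```

One trade-off of this approach is that any other unexpected label is also silently encoded as all zeros, so it becomes indistinguishable from a missing value.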
Answered By: A. John Callegari