One-hot encoding that preserves the NAs for imputation

Question:

I am trying to use KNN for imputing categorical variables in python.

In order to do so, the typical approach is to one-hot encode the variables first. However, sklearn's OneHotEncoder() doesn't handle NAs, so you need to rename them to something else, which creates a separate category.

Small reproducible example:

import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer

# Create a small DataFrame with categories to impute
data0 = pd.DataFrame(columns=["1", "2"],
                     data=[["A", np.nan], ["B", "A"], [np.nan, "A"], ["A", "B"]])

original data frame:

data0
     1    2
0    A  NaN
1    B    A
2  NaN    A
3    A    B

Proceed with one hot encoding:

# Replace NaNs with a placeholder string so sklearn's OneHotEncoder accepts the input
enc_missing = SimpleImputer(strategy="constant", fill_value="missing")
data1 = enc_missing.fit_transform(data0)
# Perform the one-hot encoding
OHE = OneHotEncoder(sparse=False)
data_OHE = OHE.fit_transform(data1)

data_OHE is now one-hot encoded:

data_OHE
array([[1., 0., 0., 0., 0., 1.],
       [0., 1., 0., 1., 0., 0.],
       [0., 0., 1., 1., 0., 0.],
       [1., 0., 0., 0., 1., 0.]])

But because of the separate "missing" category, I don't have any NaNs left to impute.

My desired output of the one-hot encoding:

array([[1,      0,      np.nan, np.nan],
       [0,      1,      1,      0     ],
       [np.nan, np.nan, 1,      0     ],
       [1,      0,      0,      1     ]])

That way I keep the NaNs for later imputation.

Do you know any way to do this?

From my understanding this is something that has been discussed in the scikit-learn GitHub repo here and here, i.e. making OneHotEncoder handle this automatically via a handle_missing argument, but I am unsure of the status of that work.

Asked By: Kasper Einarson


Answers:

Handling of missing values in OneHotEncoder ended up getting merged in PR17317, but it operates by just treating the missing values as a new category (no option for other treatments, if I understand correctly).
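With a scikit-learn release that includes that PR, the fill step becomes unnecessary: NaN can be passed straight to the encoder and comes out as one extra category per column. A minimal sketch (which minimum version is required is an assumption; check your installed release):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

data0 = pd.DataFrame(columns=["1", "2"],
                     data=[["A", np.nan], ["B", "A"], [np.nan, "A"], ["A", "B"]])

# NaN is accepted directly and treated as one more category per column
ohe = OneHotEncoder()
encoded = ohe.fit_transform(data0).toarray()  # .toarray() keeps this version-agnostic

print(ohe.categories_)   # each column's category list includes nan
print(encoded.shape)     # (4, 6): three categories per column
```

Note this is exactly the behaviour the asker wanted to avoid: the NaN becomes a category of its own rather than a hole to impute.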

One manual approach is described in this answer. The first step isn’t strictly necessary now because of the above PR, but maybe filling with custom text will make it easier to find the column?

Answered By: Ben Reiniger

Create a Pipeline:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge

model = make_pipeline(
    OneHotEncoder(),
    SimpleImputer(),
    Ridge()
)
model.fit(X_train, y_train)
Answered By: abir
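Once the NaNs survive the encoding, the asker's original goal of KNN imputation can be finished on the dummy matrix. A minimal sketch using the desired output from the question (`n_neighbors=2` is an arbitrary choice):

```python
import numpy as np
from sklearn.impute import KNNImputer

# The NaN-preserving one-hot matrix from the question
X = np.array([[1,      0,      np.nan, np.nan],
              [0,      1,      1,      0],
              [np.nan, np.nan, 1,      0],
              [1,      0,      0,      1]])

# Each NaN is replaced by the mean of that feature over the nearest rows,
# with distances computed on the features both rows have observed
X_imputed = KNNImputer(n_neighbors=2).fit_transform(X)
print(X_imputed)
```

Note the imputed values are fractional, so if hard 0/1 dummies are needed, a final step (e.g. an argmax per dummy block) would have to be applied.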