Easy way to apply transformation from `pandas.get_dummies` to new data?

Question:

Suppose I have a data frame `data` with string columns that I want converted to indicators. I use `pandas.get_dummies(data)` to convert it to a dataset that I can use for building a model.

Now I have a single new observation that I want to run through my model. Obviously I can’t use pandas.get_dummies(new_data) because it doesn’t contain all of the classes and won’t make the same indicator matrices. Is there a good way to do this?

Asked By: Ellis Valentiner


Answers:

You can create the dummies from the single new observation, and then reindex that frame's columns using the columns from the original indicator matrix:

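import pandas as pd

# Training data and its indicator matrix
df = pd.DataFrame({'cat': ['a', 'b', 'c', 'd'], 'val': [1, 2, 5, 10]})
dummies_frame = pd.get_dummies(df)

# Encode the new observation, then align its columns with the training matrix
df1 = pd.get_dummies(pd.DataFrame({'cat': ['a'], 'val': [1]}))
df1.reindex(columns=dummies_frame.columns, fill_value=0)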

returns:

        val     cat_a   cat_b   cat_c   cat_d
  0     1       1       0       0       0
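The same reindexing also covers the opposite case, where the new observation holds a category that never appeared in training. A minimal sketch, assuming a made-up value 'z' (the names `train`, `new` and `aligned` are illustrative, not from the answer above):

```python
import pandas as pd

train = pd.DataFrame({'cat': ['a', 'b', 'c', 'd'], 'val': [1, 2, 5, 10]})
train_dummies = pd.get_dummies(train)

# Hypothetical new observation whose category 'z' was never seen in training.
new = pd.DataFrame({'cat': ['z'], 'val': [3]})
new_dummies = pd.get_dummies(new)  # only has the columns 'val' and 'cat_z'

# reindex drops the unknown 'cat_z' column and adds the missing training
# columns, filled with 0.
aligned = new_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(aligned)
```

The unseen category ends up as a row of zeros across all `cat_*` columns. Note that recent pandas versions return boolean indicator columns from `get_dummies` by default; passing `dtype=int` should give 0/1 values like the output shown above.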
Answered By: JAB

Fleshing out JAB’s answer so that it can be used in, for example, sklearn pipelines, this code may help you:

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class GetDummies(BaseEstimator, TransformerMixin):
    def __init__(self, dummy_columns):
        self.columns = None
        self.dummy_columns = dummy_columns

    def fit(self, X, y=None):
        self.columns = pd.get_dummies(X, columns=self.dummy_columns).columns
        return self

    def transform(self, X):
        X_new = pd.get_dummies(X, columns=self.dummy_columns)
        return X_new.reindex(columns=self.columns, fill_value=0)
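A short usage sketch of the transformer may make the fit/transform split clearer. It repeats the class so the snippet runs on its own; the sample frames and the variable names `encoder`, `train_encoded` and `new_encoded` are assumptions for illustration:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class GetDummies(BaseEstimator, TransformerMixin):
    def __init__(self, dummy_columns):
        self.columns = None
        self.dummy_columns = dummy_columns

    def fit(self, X, y=None):
        # Remember the indicator columns produced by the training data.
        self.columns = pd.get_dummies(X, columns=self.dummy_columns).columns
        return self

    def transform(self, X):
        # Encode, then align to the columns remembered during fit.
        X_new = pd.get_dummies(X, columns=self.dummy_columns)
        return X_new.reindex(columns=self.columns, fill_value=0)


train = pd.DataFrame({'cat': ['a', 'b', 'c', 'd'], 'val': [1, 2, 5, 10]})
new = pd.DataFrame({'cat': ['a'], 'val': [1]})

encoder = GetDummies(dummy_columns=['cat'])
train_encoded = encoder.fit_transform(train)  # learns the column layout
new_encoded = encoder.transform(new)          # reuses that layout

print(list(new_encoded.columns))
```

Because the column layout is learned in `fit` and only applied in `transform`, the same object should be usable as a step in an sklearn `Pipeline` ahead of the model.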
Answered By: Guido

It seems you can take advantage of the `category` dtype.

import pandas as pd


train = pd.DataFrame({'feature':['a', 'b', 'c', 'd']})
test = pd.DataFrame({'feature':['a']})

train['feature'] = train['feature'].astype('category')
dummies_type = train['feature'].dtype
test['feature'] = test['feature'].astype(dummies_type)

training data:

pd.get_dummies(train)

feature_a   feature_b   feature_c   feature_d
1           0           0           0
0           1           0           0
0           0           1           0
0           0           0           1

testing data:

pd.get_dummies(test)

feature_a   feature_b   feature_c   feature_d
1           0           0           0

a feature value not seen in training (encoded as all zeros):

test_oov = pd.DataFrame({'feature':['z']})
test_oov['feature'] = test_oov['feature'].astype(dummies_type)
pd.get_dummies(test_oov)

feature_a   feature_b   feature_c   feature_d
0           0           0           0
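If the training frame is no longer around at prediction time, the category list can also be spelled out explicitly. A minimal sketch, assuming the four training categories are known in advance (the names `feature_dtype`, `new` and `encoded` are illustrative):

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Assumed to match the categories seen in training.
feature_dtype = CategoricalDtype(categories=['a', 'b', 'c', 'd'])

# One known value and one hypothetical unseen value.
new = pd.DataFrame({'feature': ['a', 'z']})
new['feature'] = new['feature'].astype(feature_dtype)

# The unseen value 'z' is not a valid category, so it becomes NaN here.
print(new['feature'].isna().tolist())

# get_dummies emits one column per declared category, seen or not.
encoded = pd.get_dummies(new)
print(encoded)
```

The cast silently turns unseen values into NaN, which is why they come out as all-zero rows; checking `isna()` after the cast is one way to detect them if that matters for the model.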
Answered By: yidatongshui