How to do OneHotEncoding in a sklearn Pipeline

Question:

I am trying to one-hot encode the categorical variables of my Pandas dataframe, which includes both categorical and continuous variables. I realise this can be done easily with the pandas get_dummies() function, but I need to use a pipeline so I can generate a PMML file later on.
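
For reference, the pandas one-liner works outside a pipeline like this (a sketch, assuming features is the dataframe and dummies is the list of categorical column names):

import pandas as pd

# One-hot encodes the listed columns in place of the originals;
# fine standalone, but it is not a fitted transformer, so it cannot
# be exported as part of a sklearn/PMML pipeline
encoded = pd.get_dummies(features, columns=dummies)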

This is the code to create a mapper. The categorical variables I would like to encode are stored in a list called ‘dummies’.

from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder

mapper = DataFrameMapper(
    [(d, LabelEncoder()) for d in dummies] +
    [(d, OneHotEncoder()) for d in dummies]
)

And this is the code to create a pipeline, including the mapper and linear regression.

from sklearn2pmml import PMMLPipeline
from sklearn.linear_model import LinearRegression

lm = PMMLPipeline([("mapper", mapper),
                   ("regressor", LinearRegression())])

When I now try to fit (with ‘features’ being a dataframe, and ‘targets’ a series), it gives an error ‘could not convert string to float’.

lm.fit(features, targets)
Asked By: Desiré De Waele


Answers:

In older versions of scikit-learn (before 0.20), OneHotEncoder doesn’t support string features, and with [(d, OneHotEncoder()) for d in dummies] you are applying it to all the dummies columns, which still contain raw strings. Use LabelBinarizer instead:

from sklearn.preprocessing import LabelBinarizer

mapper = DataFrameMapper(
    [(d, LabelBinarizer()) for d in dummies]
)

An alternative would be to keep LabelEncoder in the mapper and add a separate OneHotEncoder step to the pipeline:

mapper = DataFrameMapper(
    [(d, LabelEncoder()) for d in dummies]
)

lm = PMMLPipeline([("mapper", mapper),
                   ("onehot", OneHotEncoder()),
                   ("regressor", LinearRegression())])
Answered By: dukebody

LabelEncoder and LabelBinarizer are intended for encoding/binarizing the target (label) of your data, i.e. the y vector. They do more or less the same thing as OneHotEncoder; the main difference is that these Label* preprocessing steps accept only 1-D vectors, not matrices.
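
For instance, the intended use of LabelBinarizer is on a target vector (a minimal sketch):

from sklearn.preprocessing import LabelBinarizer

# A binary target is binarized to a single 0/1 column
LabelBinarizer().fit_transform(['yes', 'no', 'yes', 'no'])
# array([[1], [0], [1], [0]])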

Now set up an example frame with one numeric and two categorical columns:

import numpy as np
import pandas as pd

example = pd.DataFrame({'x': np.arange(2, 14, 2),
                        'cat1': ['A', 'B', 'A', 'B', 'C', 'A'],
                        'cat2': ['p', 'q', 'w', 'p', 'q', 'w']})
dummies = ['cat1', 'cat2']

which gives:

    x cat1 cat2
0   2    A    p
1   4    B    q
2   6    A    w
3   8    B    p
4  10    C    q
5  12    A    w

As an example, LabelEncoder().fit_transform(example['cat1']) works, but LabelEncoder().fit_transform(example[dummies]) raises a ValueError.
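
A minimal sketch of that difference, using the example frame above:

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

# 1-D input is fine: array([0, 1, 0, 1, 2, 0])
le.fit_transform(example['cat1'])

# 2-D input is not: ValueError: y should be a 1d array
le.fit_transform(example[dummies])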

In contrast, OneHotEncoder accepts multiple columns:

from sklearn.preprocessing import OneHotEncoder

OneHotEncoder().fit_transform(example[dummies])
<6x6 sparse matrix of type '<class 'numpy.float64'>'
        with 12 stored elements in Compressed Sparse Row format>
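
To see what the encoder produced, densify the matrix and ask the fitted encoder for its output column names (a sketch; note that the sparse/dense switch is the sparse_output parameter in scikit-learn >= 1.2, sparse in older versions):

enc = OneHotEncoder()
dense = enc.fit_transform(example[dummies]).toarray()

# Feature names, in the column order of `dense`:
# ['cat1_A', 'cat1_B', 'cat1_C', 'cat2_p', 'cat2_q', 'cat2_w']
enc.get_feature_names_out(dummies)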

This can be incorporated into a pipeline using a ColumnTransformer, passing through (or alternatively applying different transformations to) the other columns:

from sklearn.compose import ColumnTransformer

ct = ColumnTransformer([('encode_cats', OneHotEncoder(), dummies)],
                       remainder='passthrough')
pd.DataFrame(ct.fit_transform(example), columns=ct.get_feature_names_out())
   encode_cats__cat1_A  encode_cats__cat1_B  ...  encode_cats__cat2_w  remainder__x
0                  1.0                  0.0  ...                  0.0           2.0
1                  0.0                  1.0  ...                  0.0           4.0
2                  1.0                  0.0  ...                  1.0           6.0
3                  0.0                  1.0  ...                  0.0           8.0
4                  0.0                  0.0  ...                  0.0          10.0
5                  1.0                  0.0  ...                  1.0          12.0

Finally, slot this into a pipeline:

from sklearn.pipeline import Pipeline

pipe = Pipeline([('preprocessing', ct),
                 ('regressor', LinearRegression())])
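
And fit it (y below is a made-up target, purely to illustrate the call):

# Hypothetical target vector, for illustration only
y = np.arange(6)
pipe.fit(example, y)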
Answered By: njp