Using make_column_transformer with OneHotEncoder and StandardScaler + passthrough

Question:

I am unable to use remainder='passthrough' when I use StandardScaler and OneHotEncoder at the same time. Whichever way I write it, I run into a problem: either a keyword-before-argument error, a problem with fit_transform, and so on. Here is what I am doing:

trans_cols= make_column_transformer((OneHotEncoder(),['job', 'marital', 'education', 
 'default','housing','loan','contact','month','poutcome']),remainder='passthrough')

trans_cols.fit_transform(X)

Here are my columns:
Index(['age', 'job', 'marital', 'education', 'default', 'balance', 'housing',
   'loan', 'contact', 'month', 'duration', 'campaign', 'pdays', 'previous',
   'poutcome', 'y'],
  dtype='object')

The code above works. I am just not able to combine the two estimators when using the remainder keyword argument. Here is what I am trying:

trans_cols= make_column_transformer((OneHotEncoder(),['job', 'marital', 'education', 'default','housing','loan',
                                                  'contact','month','poutcome']),remainder='passthrough',

(StandardScaler(),['age', 'job', 'marital', 'education', 'default', 'balance',
                  'housing','loan', 'contact', 'month', 'duration',
                  'campaign', 'pdays', 'previous','poutcome']))

However, the above does not work until I remove remainder and keep the two tuples, which is understandable. But when I do that, it tries to encode some of my numeric columns, and I get a message saying that it encountered some columns that contain floats. Plus, my accuracy drops severely.

Asked By: Herc01

Answers:

The preferred practice is not to use StandardScaler on one-hot-encoded columns. The first example below applies OHE to the categorical variables and StandardScaler to the numeric columns. The second example shows the sequential application of OHE to selected columns and then StandardScaler to all columns, but this is not recommended.

Example 1:

import numpy as np
import pandas as pd

from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_transformer
from sklearn.pipeline import Pipeline

df = pd.DataFrame({'Cat_Var': np.random.choice(['a', 'b'], size=5),
                   'Num_Var': np.arange(5)})

cat_cols = ['Cat_Var']
num_cols = ['Num_Var']

col_transformer = make_column_transformer(
        (OneHotEncoder(), cat_cols),
        remainder=StandardScaler())

X = col_transformer.fit_transform(df)

Output:

df
Out[57]: 
  Cat_Var  Num_Var
0       b        0
1       a        1
2       b        2
3       a        3
4       a        4

X
Out[58]: 
array([[ 0.        ,  1.        , -1.41421356],
       [ 1.        ,  0.        , -0.70710678],
       [ 0.        ,  1.        ,  0.        ],
       [ 1.        ,  0.        ,  0.70710678],
       [ 1.        ,  0.        ,  1.41421356]])

Example 2:

col_transformer_2 = ColumnTransformer(
        [('cat_transform', OneHotEncoder(), cat_cols)],
        remainder='passthrough'
        )

pipe = Pipeline(
        [
         ('col_tranform', col_transformer_2),
         ('standard_scaler', StandardScaler())
         ])

X_2 = pipe.fit_transform(df)

Output:

X_2
Out[62]: 
array([[-1.22474487,  1.22474487, -1.41421356],
       [ 0.81649658, -0.81649658, -0.70710678],
       [-1.22474487,  1.22474487,  0.        ],
       [ 0.81649658, -0.81649658,  0.70710678],
       [ 0.81649658, -0.81649658,  1.41421356]])
Answered By: KRKirov

Making some additions to KRKirov’s answer as it might be useful as well.

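make_column_transformer takes the (transformer, columns) tuples as positional arguments and remainder as a keyword argument. In Python, keyword arguments must come after all positional arguments, so remainder (which decides what happens to all columns not named in any tuple) has to come at the end of the call.

The problem with your code is that you pass the remainder parameter in the middle, before the StandardScaler tuple, and Python rejects a positional argument that follows a keyword argument. So list all the specific transformers first, then use the remainder parameter to handle the columns that none of them touch. In the output, the columns appear in the order of the transformers, with the remainder columns last.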
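The ordering constraint can be checked without scikit-learn at all, since it is a matter of Python syntax. The minimal sketch below compiles two call expressions that mirror the argument layout from the question; the names f, a and b are placeholders for illustration, not real objects.

```python
# Placeholder call layouts: `a` and `b` stand in for the two
# (transformer, columns) tuples; `f` stands in for make_column_transformer.
bad_call = "f(a, remainder='passthrough', b)"   # keyword in the middle
good_call = "f(a, b, remainder='passthrough')"  # keyword at the end


def parses(source):
    """Return True if `source` is a syntactically valid Python expression."""
    try:
        compile(source, "<example>", "eval")
        return True
    except SyntaxError:
        return False


bad_ok = parses(bad_call)
good_ok = parses(good_call)
print(bad_ok, good_ok)  # False True
```

Because the first layout fails at compile time, the call never reaches scikit-learn; moving remainder to the end is enough to get past that error.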

Here is a walk-through with code.

(1) Imports

import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
onhe = OneHotEncoder()
scaler = StandardScaler()

(2) Create a basic DataFrame

df = pd.DataFrame({'sex': ['m', 'f', 'f', 'm'],
                   'age': [45, 25, 10, 31],
                   'married': ['y', 'y', 'n', 'y'],
                   'salary': [1000, 300, 370, 500],
                   'child': [5, 1, 0, 3]})
print(df)

(3) Say we want to one-hot encode the sex and married columns, standard-scale the age and salary columns, and leave the child column as it is.

transforming = make_column_transformer(
    (onhe, ['sex', 'married']),
    (scaler, ['age', 'salary']),
    remainder='passthrough')
     
processed_df = transforming.fit_transform(df)
print(processed_df)

Note that remainder is passed last, after all the transformer tuples.
What is more, if you want to apply the scaler to all the remaining features ('age', 'salary', 'child'), you can pass it as the remainder:

transforming_1 = make_column_transformer((onhe, ['sex', 'married']), remainder = scaler)
processed_df_1 = transforming_1.fit_transform(df)
print(processed_df_1)

This encodes the two given columns and then applies StandardScaler to all the remaining columns.

For your situation, the code that raised the error should put both tuples first and remainder last. StandardScaler should also receive only the numeric columns (assumed here to be age, balance, duration, campaign, pdays and previous), since it cannot scale string columns such as 'job':

trans_cols = make_column_transformer(
    (OneHotEncoder(), ['job', 'marital', 'education', 'default', 'housing',
                       'loan', 'contact', 'month', 'poutcome']),
    (StandardScaler(), ['age', 'balance', 'duration', 'campaign', 'pdays',
                        'previous']),
    remainder='passthrough')
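As an alternative to typing out the column lists, the columns can be picked by dtype with make_column_selector from sklearn.compose. The sketch below is only an illustration on a made-up four-row frame that borrows a few column names from the question; the values are invented, and sparse_threshold=0 is set just to force a dense array for easy inspection.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Invented toy data; only the column names echo the question.
X = pd.DataFrame({'age': [30, 40, 50, 60],
                  'job': ['admin', 'tech', 'admin', 'services'],
                  'marital': ['single', 'married', 'married', 'single'],
                  'balance': [100.0, 250.0, -50.0, 900.0]})

trans_cols = make_column_transformer(
    # Non-numeric columns ('job', 'marital') are one-hot encoded
    (OneHotEncoder(), make_column_selector(dtype_exclude=np.number)),
    # Numeric columns ('age', 'balance') are standard-scaled
    (StandardScaler(), make_column_selector(dtype_include=np.number)),
    remainder='passthrough',
    sparse_threshold=0)  # always return a dense array

X_out = trans_cols.fit_transform(X)

print(X_out.shape)  # (4, 7): 3 job + 2 marital one-hot columns, 2 scaled
```

With this layout the scaler never sees a string column, and any column matched by neither selector would be passed through unchanged.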
Answered By: PivotAl