How to apply multiple transforms to the same columns using ColumnTransformer in scikit-learn

Question:

I have a data frame that looks like this:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'x': range(0, 5),
    'y': [1, 2, 3, np.nan, np.nan]
})

I want to impute the missing values in y and then standardize both variables, using the following code:

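from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

columnPreprocess = ColumnTransformer([
    ('imputer', SimpleImputer(strategy='median'), ['x', 'y']),
    ('scaler', StandardScaler(), ['x', 'y'])])
columnPreprocess.fit_transform(df)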

However, ColumnTransformer seems to create a separate set of output columns for each step, so the imputed values and the scaled values end up in different columns. This is not what I intended.

Is there a way to apply several transformations to the same columns so that the output array has the same number of columns as the input?

Asked By: PingPong

||

Answers:

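ColumnTransformer applies each transformer to the original columns independently and concatenates the results side by side, which is why you get extra columns. To apply the steps one after another to the same columns, use a Pipeline instead: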

import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    'x': range(0, 5),
    'y': [1, 2, 3, np.nan, np.nan]
})

pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

pipeline.fit_transform(df)
# array([[-1.41421356, -1.58113883],
#        [-0.70710678,  0.        ],
#        [ 0.        ,  1.58113883],
#        [ 0.70710678,  0.        ],
#        [ 1.41421356,  0.        ]])
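
If only some columns need this treatment, the Pipeline can itself be passed as a transformer inside a ColumnTransformer. The sketch below assumes a hypothetical extra column 'z' (not part of the original question) that should be left untouched; the names impute_then_scale and column_preprocess are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    'x': range(0, 5),
    'y': [1, 2, 3, np.nan, np.nan],
    'z': [10, 20, 30, 40, 50],  # hypothetical extra column, not in the question
})

# Chain the two steps so the scaler receives the already-imputed values.
impute_then_scale = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
])

# Run the chained steps on x and y only; pass z through unchanged.
column_preprocess = ColumnTransformer(
    [('num', impute_then_scale, ['x', 'y'])],
    remainder='passthrough',
)

result = column_preprocess.fit_transform(df)
print(result.shape)  # (5, 3)
```

The first two output columns hold the imputed and scaled x and y, and the passed-through z is appended after them, so the column count matches the input.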
Answered By: Flavia Giammarino