How to build a custom scaler based on StandardScaler?

Question:

I am trying to build a custom scaler to scale only the continuous variables on a dataset (the US Adult Income: https://www.kaggle.com/uciml/adult-census-income), using StandardScaler as a base.
Here is my Python code that I used:


import numpy as np
import pandas as pd

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler

class CustomScaler(BaseEstimator, TransformerMixin):

    def __init__(self, columns, copy=True, with_mean=True, with_std=True):
        self.scaler = StandardScaler(copy, with_mean, with_std)
        self.columns = columns
        self.mean_ = None
        self.var_ = None

    def fit(self, X, y=None):
        self.scaler.fit(X[self.columns], y)
        self.mean_ = np.mean(X[self.columns])
        self.var_ = np.var(X[self.columns])
        return self

    def transform(self, X, y=None, copy=None):
        init_col_order = X.columns
        X_scaled = pd.DataFrame(self.scaler.transform(X[self.columns]), columns=self.columns)
        X_not_scaled = X.loc[:, ~X.columns.isin(self.columns)]
        return pd.concat([X_not_scaled, X_scaled], axis=1)[init_col_order]

X = new_df_upsampled.copy()
X.drop('income', axis=1, inplace=True)

# independent variables that I consider continuous
continuous = df.iloc[:, np.r_[0, 2, 10:13]]

columns_to_scale = continuous

scaler = CustomScaler(columns_to_scale)
scaler.fit(X)

However, when I tried to run the scaler, I got an error (the traceback was posted as a screenshot, which is not reproduced here).

What is the error in my scaler, and how would you build a custom scaler for this dataset?

Thank you!

Asked By: Hoang Cuong Nguyen


Answers:

There is no need to create a custom transformer for this problem, as the operation can be performed with ColumnTransformer. This transformer allows different columns of the input to be transformed separately.

The example below scales the columns ['A', 'B'] while leaving column 'C' unchanged.

import numpy as np
import pandas as pd

from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({'A': np.arange(10),
                   'B': np.arange(10),
                   'C': np.arange(10)})


transformer = make_column_transformer(
    (StandardScaler(), ['A', 'B']),
    remainder='passthrough'
)

pd.DataFrame(transformer.fit_transform(df), columns=df.columns)

This outputs the following result:

          A         B    C
0 -1.566699 -1.566699  0.0
1 -1.218544 -1.218544  1.0
2 -0.870388 -0.870388  2.0
3 -0.522233 -0.522233  3.0
4 -0.174078 -0.174078  4.0
5  0.174078  0.174078  5.0
6  0.522233  0.522233  6.0
7  0.870388  0.870388  7.0
8  1.218544  1.218544  8.0
9  1.566699  1.566699  9.0
Answered By: Antoine Dubuis

I agree with @AntoineDubuis, that ColumnTransformer is a better (builtin!) way to do this. That said, I’d like to address where your code goes wrong.

In fit, you have self.scaler.fit(X[self.columns], y); this indicates that self.columns should be a list of column names (or one of a few other supported indexers). But you’ve set it to continuous = df.iloc[:, np.r_[0,2,10:13]], which is a dataframe, not a list of names.
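One way to fix that is to keep the same np.r_ indexer but pull out the column names instead of the sub-frame. A small sketch (using placeholder column names, not the real Adult columns):

```python
import numpy as np
import pandas as pd

# stand-in frame with 15 columns, mimicking the shape of the Adult data
df = pd.DataFrame(np.arange(30).reshape(2, 15),
                  columns=[f'col{i}' for i in range(15)])

# select the *names* of the continuous columns, not a sub-DataFrame
columns_to_scale = df.columns[np.r_[0, 2, 10:13]].tolist()
print(columns_to_scale)  # ['col0', 'col2', 'col10', 'col11', 'col12']
```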

A couple other issues:

  1. you should only set attributes in __init__ that come from its signature, or cloning will fail. Move the creation of self.scaler into fit, and store its parameters copy etc. directly in __init__. Don’t initialize mean_ or var_ there.
  2. you never actually use mean_ or var_. You can keep them if you want, but the relevant statistics are already stored on the fitted scaler object.
Answered By: Ben Reiniger
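Putting those points together, a corrected version of the class might look like the sketch below. It assumes columns is passed as a list of column names; it also passes the index to the scaled frame, since the original transform built a DataFrame with a fresh 0..n index, which would misalign rows in pd.concat whenever X has a non-default index (likely after upsampling):

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler

class CustomScaler(BaseEstimator, TransformerMixin):

    def __init__(self, columns, copy=True, with_mean=True, with_std=True):
        # only store the constructor parameters, so cloning works
        self.columns = columns
        self.copy = copy
        self.with_mean = with_mean
        self.with_std = with_std

    def fit(self, X, y=None):
        # build and fit the inner scaler here, not in __init__
        self.scaler_ = StandardScaler(copy=self.copy,
                                      with_mean=self.with_mean,
                                      with_std=self.with_std)
        self.scaler_.fit(X[self.columns], y)
        return self

    def transform(self, X, y=None):
        init_col_order = X.columns
        # keep X's index so concat aligns rows correctly
        X_scaled = pd.DataFrame(self.scaler_.transform(X[self.columns]),
                                columns=self.columns, index=X.index)
        X_not_scaled = X.loc[:, ~X.columns.isin(self.columns)]
        return pd.concat([X_not_scaled, X_scaled], axis=1)[init_col_order]
```

Usage would then be e.g. CustomScaler(columns=['age', 'hours.per.week']) with whatever continuous column names your frame actually has.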