How to preserve column order after applying sklearn.compose.ColumnTransformer on numpy array

Question:

I want to use the Pipeline and ColumnTransformer modules from the sklearn library to apply scaling to a numpy array. The scaler is applied to only some of the columns, and I want the output to keep the same column order as the input.

Example:

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler


X = np.array([(25, 1, 2, 0),
              (30, 1, 5, 0),
              (25, 10, 2, 1),
              (25, 1, 2, 0),
              (np.nan, 10, 4, 1),
              (40, 1, 2, 1)])

column_trans = ColumnTransformer(
    [('scaler', MinMaxScaler(), [0, 2])],
    remainder='passthrough')

X_scaled = column_trans.fit_transform(X)

The problem is that ColumnTransformer changes the order of the columns: the scaled columns come first, followed by the passthrough columns. How can I preserve the original column order?

I am aware of this post, but it is for pandas DataFrames. For some reason, I cannot use a DataFrame and have to use a numpy array in my code.

Thanks.

Asked By: Mohammad

||

Answers:

Here is a solution that adds a transformer which applies the inverse column permutation after the column transform:

import re

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class ReorderColumnTransformer(BaseEstimator, TransformerMixin):
    # Matches the trailing digits of a feature name such as 'scaler__x0'
    index_pattern = re.compile(r'\d+$')

    def __init__(self, column_transformer):
        self.column_transformer = column_transformer

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # Original column index of each output column, read from the name suffix
        names = self.column_transformer.get_feature_names_out()
        order_after_column_transform = [int(self.index_pattern.search(col).group()) for col in names]
        # Invert the permutation and apply it to restore the input order
        order_inverse = np.zeros(len(order_after_column_transform), dtype=int)
        order_inverse[order_after_column_transform] = np.arange(len(order_after_column_transform))
        return X[:, order_inverse]

It relies on parsing

column_trans.get_feature_names_out()
# = array(['scaler__x0', 'scaler__x2', 'remainder__x1', 'remainder__x3'],
#      dtype=object)

to read the initial column order from the numeric suffix of each name. It then computes and applies the inverse permutation.
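
The inverse permutation step can be checked in isolation with plain NumPy. This is a minimal sketch: the order [2, 0, 3, 1] and the variable names are made up for illustration (they are not taken from the example above), and np.argsort is shown only as an equivalent way to invert a permutation.

```python
import numpy as np

# Hypothetical order: original column index of each column after a transform.
order_after = np.array([2, 0, 3, 1])

# Inverse permutation: for each original column, its position in the output.
order_inverse = np.zeros(len(order_after), dtype=int)
order_inverse[order_after] = np.arange(len(order_after))

# Toy row whose values make the original column positions easy to follow.
X = np.array([[10, 11, 12, 13]])
X_shuffled = X[:, order_after]             # columns in the shuffled order
X_restored = X_shuffled[:, order_inverse]  # columns back in the input order

# For a permutation, argsort yields the same inverse.
same_as_argsort = np.array_equal(order_inverse, np.argsort(order_after))
print(X_shuffled, X_restored, same_as_argsort)
```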

To be used as:

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import make_pipeline

X = np.array([(25, 1, 2, 0),
              (30, 1, 5, 0),
              (25, 10, 2, 1),
              (25, 1, 2, 0),
              (np.nan, 10, 4, 1),
              (40, 1, 2, 1)])

column_trans = ColumnTransformer(
    [('scaler', MinMaxScaler(), [0, 2])],
    remainder='passthrough')

pipeline = make_pipeline(column_trans, ReorderColumnTransformer(column_transformer=column_trans))
X_scaled = pipeline.fit_transform(X)
# X_scaled has the same column order as X

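Alternative solution that does not rely on string parsing and instead reads the input columns assigned to each transformer from the fitted column transformer's transformers_ attribute. (output_indices_ only gives each transformer's slice in the output, so it cannot recover the input order on its own.) This assumes the columns are specified as lists of integer indices and that no transformer drops or adds columns:

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class ReorderColumnTransformer(BaseEstimator, TransformerMixin):

    def __init__(self, column_transformer):
        self.column_transformer = column_transformer

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # transformers_ lists (name, fitted_transformer, columns) in output order,
        # with the remainder columns last
        fitted = self.column_transformer.transformers_
        order_after_column_transform = [col for _, _, cols in fitted for col in cols]
        n_cols = len(order_after_column_transform)

        # Invert the permutation and apply it to restore the input order
        order_inverse = np.zeros(n_cols, dtype=int)
        order_inverse[order_after_column_transform] = np.arange(n_cols)
        return X[:, order_inverse]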
Answered By: Learning is a mess

ColumnTransformer can be used to reorder columns however you like by passing it the column indices in the desired order. Pairing ColumnTransformer with an identity FunctionTransformer makes it do nothing but reorder the columns. (You can create an identity FunctionTransformer by not assigning func when initializing FunctionTransformer, in which case the data will be passed through without being transformed.)

import numpy as np
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import FunctionTransformer

X = np.array([[30, 20, 10]])
new_column_order = [2, 1, 0]
column_reorder_transformer = make_column_transformer((FunctionTransformer(), new_column_order))
Xt = column_reorder_transformer.fit_transform(X)
print(f"Xt = {Xt}")
# Xt = [[10 20 30]]
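
Applied to the setup in the question, the same idea might look like the sketch below. It assumes the scaled and passthrough column indices are known up front (scaled_cols and remainder_cols are illustrative names) and uses a shortened copy of the question's data without the NaN row. The reordering step is an identity FunctionTransformer fed the inverse of the post-scaling order.

```python
import numpy as np
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, MinMaxScaler

X = np.array([(25, 1, 2, 0),
              (30, 1, 5, 0),
              (25, 10, 2, 1),
              (40, 1, 2, 1)], dtype=float)

scaled_cols = [0, 2]      # columns sent to the scaler
remainder_cols = [1, 3]   # columns passed through unchanged

column_trans = ColumnTransformer(
    [("scaler", MinMaxScaler(), scaled_cols)],
    remainder="passthrough")

# column_trans emits the scaled columns first, then the remainder.
order_after = scaled_cols + remainder_cols
# argsort of a permutation is its inverse: the order that undoes the shuffle.
restore_order = np.argsort(order_after).tolist()

# Identity FunctionTransformer that only selects columns in restore_order.
reorder = make_column_transformer((FunctionTransformer(), restore_order))

pipeline = make_pipeline(column_trans, reorder)
X_scaled = pipeline.fit_transform(X)
print(X_scaled)
```

Columns 1 and 3 of X_scaled should then match the input, with columns 0 and 2 scaled in place.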
Answered By: marcusaurelius