Is there a way to use mutual information as part of a pipeline in scikit learn?

Question:

I’m creating a model with scikit-learn. The pipeline that seems to be working best is:

  1. mutual_info_classif with a threshold – i.e. only include fields whose mutual information score is above a given threshold.
  2. PCA
  3. LogisticRegression

I’d like to do them all using sklearn’s pipeline object, but I’m not sure how to get the mutual info classification in. For the second and third steps I do:

pca = PCA(random_state=100)
lr = LogisticRegression(random_state=200)
pipe = Pipeline(
    [
        ('dim_red', pca),
        ('pred', lr)
    ]
)

But I don’t see a way to include the first step. I know I can create my own class to do this, and I will if I have to, but is there a way to do this within sklearn?

Asked By: roundsquare


Answers:

You can implement your own estimator by subclassing BaseEstimator. Then pass it as the estimator to a SelectFromModel instance, which can be used in a Pipeline:

from sklearn.feature_selection import SelectFromModel, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.base import BaseEstimator
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA

X = [[ 0.87, -1.34,  0.31 ],
     [-2.79, -0.02, -0.85 ],
     [-1.34, -0.48, -2.55 ],
     [ 1.92,  1.48,  0.65 ]]
y = [0, 1, 0, 1]


class MutualInfoEstimator(BaseEstimator):
    def __init__(self, discrete_features='auto', n_neighbors=3, copy=True, random_state=None):
        self.discrete_features = discrete_features
        self.n_neighbors = n_neighbors
        self.copy = copy
        self.random_state = random_state

    def fit(self, X, y):
        # SelectFromModel reads feature_importances_ (or coef_) from the
        # fitted estimator and keeps the features whose importance exceeds
        # its threshold (the mean importance, by default).
        self.feature_importances_ = mutual_info_classif(
            X, y, discrete_features=self.discrete_features,
            n_neighbors=self.n_neighbors,
            copy=self.copy, random_state=self.random_state)
        return self


feat_sel = SelectFromModel(estimator=MutualInfoEstimator(random_state=0))
pca = PCA(random_state=100)
lr = LogisticRegression(random_state=200)

pipe = Pipeline(
    [
        ('feat_sel', feat_sel),
        ('pca', pca),
        ('pred', lr)
    ]
)

print(pipe)
Pipeline(steps=[('feat_sel',
                 SelectFromModel(estimator=MutualInfoEstimator(random_state=0))),
                ('pca', PCA(random_state=100)),
                ('pred', LogisticRegression(random_state=200))])

Note that the new estimator should, of course, expose via its constructor any parameters you want to tweak during optimisation. Here I just exposed all of them.
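Because those parameters are exposed, they can be tuned through the pipeline. A minimal sketch on synthetic data (make_classification here is purely illustrative; the double-underscore path reaches the estimator nested inside SelectFromModel):

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV

X_demo, y_demo = make_classification(n_samples=100, n_features=10, random_state=0)

# Tune the number of neighbors used by the mutual information estimate.
param_grid = {'feat_sel__estimator__n_neighbors': [3, 5, 7]}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_demo, y_demo)
print(search.best_params_)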

Yeah, I do not think there is another way to do it. At least not that I know!

Answered By: user2246849

How about SelectKBest or SelectPercentile:

from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

mi_best = SelectKBest(score_func=mutual_info_classif, k=10)
pca = PCA(random_state=100)
lr = LogisticRegression(random_state=200)
pipe = Pipeline(
    [
        ('select', mi_best),
        ('dim_red', pca),
        ('pred', lr),
    ]
)
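
If you also want to pin down mutual_info_classif's own parameters (for example random_state, for reproducibility), a functools.partial can be passed as the score function. A minimal sketch, where n_neighbors=5 is just an illustrative choice:

from functools import partial
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Bind keyword arguments so the selector's MI scores are reproducible.
mi_score = partial(mutual_info_classif, n_neighbors=5, random_state=0)
mi_best = SelectKBest(score_func=mi_score, k=10)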
Answered By: Sanjar Adilov

It is possible to do this, but its utility will vary depending on where in the ML workflow you are. I will describe how I got something similar to work.

High Level:

A selector in a ColumnTransformer is just a callable that returns a list of column names when it is passed the dataframe. We can use this to do what you're trying to do, defining the selector as follows:

import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge                                          
from sklearn.model_selection import ShuffleSplit                                
from sklearn.model_selection import cross_validate                              
from sklearn.dummy import DummyRegressor                                        
from sklearn.feature_selection import mutual_info_classif                       
from sklearn.feature_selection import mutual_info_regression                    
from sklearn.preprocessing import StandardScaler                                
from sklearn.preprocessing import OneHotEncoder                                 
from sklearn.compose import ColumnTransformer                                   
from sklearn.compose import make_column_selector as selector                    
from sklearn.impute import SimpleImputer                                        
from sklearn.pipeline import Pipeline

def mi_selector(mi_threshold=0.05, target_label=None, random_state=None):
    def selector_to_return(df):
        mi_df = compute_mutual_information(
            df=df,
            target_label=target_label,
            random_state=random_state)
        matching_variables = mi_df[mi_df.loc[:, 'mutual_information'] > mi_threshold].index.tolist()
        matching_features = []
        # Remove target
        if target_label in matching_variables:
            matching_variables.remove(target_label)
        # Only return features that were in the original df,
        # since we compute more than that as we impute, encode etc.
        for feature_name in df.columns.tolist():
            if feature_name in matching_variables:
                matching_features.append(feature_name)
        return matching_features
    return selector_to_return

This uses the mutual information computed by compute_mutual_information to create a selector that can be plugged into a ColumnTransformer within a Pipeline.

Since we’re computing mutual information, we need to know what the target is. We also need to know whether it’s a classification or regression problem. This is the part that compute_mutual_information solves using the following approach:

  • Figure out the metadata for the dataframe: physical datatypes, whether each feature is really numerical, and some other stats (second-to-last code snippet)
  • Figure out whether it’s a regression or a classification problem to set up the right function for computing mutual information
  • Impute and encode/scale the data and reconstruct the processed dataframe
  • Compute mutual information on the processed dataframe
def compute_mutual_information(df, target_label, random_state):                 
    # Analyze data frame                                                        
    meta_df = df_metadata(df, numerical_threshold=50)                           
    target_is_numerical = meta_df.loc[meta_df.variable == target_label][        
        'is_numerical'].iloc[0]                                                 
                                                                                
    # Determine problem type                                                    
    if target_is_numerical:                                                     
        problem_type = 'regression'                                             
        mutual_information_function = mutual_info_regression                    
    else:                                                                       
        problem_type = 'classification'                                         
        mutual_information_function = mutual_info_classif                       
                                                                                
    # Select feature types                                                      
    my_numerical_selector = feature_type_selector(dtype_include='numerical')    
    my_categorical_selector = feature_type_selector(dtype_include='categorical')
    numerical_features = my_numerical_selector(df)                              
    categorical_features = my_categorical_selector(df)                          
                                                                                
    # Remove target label from features                                         
    for feature_list in [numerical_features, categorical_features]:             
        if target_label in feature_list:                                        
            feature_list.remove(target_label)                                   
                                                                                
    # Transform df                                                              
    imputation_preprocessor = ColumnTransformer(                                
        [('numerical_imputer',                                                  
          SimpleImputer(strategy='median', add_indicator=True),                 
          numerical_features),                                                  
         ('categorical_imputer',                                                
          SimpleImputer(strategy='most_frequent', add_indicator=True),          
          categorical_features)],                                               
        remainder='passthrough')                                                
                                                                                
    # We need to figure out the indices to the features that are supposed to be scaled and encoded by the next
    # step                                                                      
                                                                                
    post_imputation_np = imputation_preprocessor.fit_transform(df)              
    feature_name_np_array = imputation_preprocessor.get_feature_names_out()     
    categorical_feature_indices = np.zeros(len(categorical_features))           
    numerical_feature_indices = np.zeros(len(numerical_features))               
                                                                                
    for position, feature in enumerate(categorical_features):
        categorical_feature_indices[position] = np.where(
            feature_name_np_array == 'categorical_imputer__' + feature)[0][0]

    for position, feature in enumerate(numerical_features):
        numerical_feature_indices[position] = np.where(
            feature_name_np_array == 'numerical_imputer__' + feature)[0][0]
                                                                                
    categorical_feature_indices = categorical_feature_indices.astype(           
        int).tolist()                                                           
    numerical_feature_indices = numerical_feature_indices.astype(int).tolist()  
                                                                                
    numeric_and_categorical_transformer = ColumnTransformer(                    
        [('OneHotEncoder', OneHotEncoder(),                                     
          categorical_feature_indices),                                         
         ('StandardScaler', StandardScaler(),                                   
          numerical_feature_indices)],                                          
        remainder='passthrough')                                                
    preprocessor = Pipeline(                                                    
        [('imputation_preprocessor', imputation_preprocessor),                  
         ('numeric_and_categorical_transformer',                                
          numeric_and_categorical_transformer)])                                
    df_transformed_np = preprocessor.fit_transform(df)                          
    preprocessed_feature_names = list(preprocessor.get_feature_names_out())     
    df_transformed = pd.DataFrame(
        df_transformed_np.todense(),  # OneHotEncoder output is sparse by default
        columns=preprocessed_feature_names)
    df_transformed = df_transformed.rename(shorten_param, axis=1)               
    estimated_mutual_information = mutual_information_function(                 
        X=df_transformed, y=df[target_label], random_state=random_state)        
    estimated_mutual_information_df = pd.DataFrame(                             
        estimated_mutual_information.T.reshape(                                 
            1, -1), columns=preprocessed_feature_names)                         
    estimated_mutual_information_df = estimated_mutual_information_df.rename(   
        shorten_param,                                                          
        axis=1)                                                                 
    estimated_mutual_information_df = estimated_mutual_information_df.T         
    estimated_mutual_information_df.columns = ['mutual_information']            
    estimated_mutual_information_df = estimated_mutual_information_df.sort_values(
        by=['mutual_information'])                                              
                                                                                
    return estimated_mutual_information_df
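
As a hypothetical usage sketch (assuming the ames_data dataframe used further below, with 'SalePrice' assumed to be its target column):

# Hypothetical call; 'SalePrice' is assumed to be the target column.
mi_df = compute_mutual_information(df=ames_data, target_label='SalePrice',
                                   random_state=0)
print(mi_df.tail())  # frame is sorted ascending, so tail() shows the top features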

The above used a feature_type_selector, which is defined as follows:

def feature_type_selector(dtype_include=None):
    def nested_function(df):
        meta_df = df_metadata(df)
        if dtype_include == 'numerical':
            return meta_df.loc[meta_df.is_numerical, 'variable'].tolist()
        else:
            return meta_df.loc[~meta_df.is_numerical, 'variable'].tolist()
    return nested_function

The metadata analysis of the dataframe does the following:

  • Determine variable types
  • Figure out, with some threshold, which features are really categoricals encoded as numericals
  • Percentage missing data etc.
def df_metadata(df, numerical_threshold=50):                                    
    list_of_variables = list(df.dtypes.index)                                   
    list_of_dtypes = [df.dtypes[variable] for variable in list_of_variables]    
    categorical_selector = selector(dtype_include=object)                       
    numerical_selector = selector(dtype_exclude=object)                         
    unique_value_counts = [df[variable].nunique()                               
                           for variable in list_of_variables]                   
    categorical_features = categorical_selector(df)                             
    numerical_features = numerical_selector(df)                                 
    is_numerical_init = [True] * len(list_of_variables)                         
    metadata_frame = pd.DataFrame(                                              
        {'variable': list_of_variables, 'dtype': list_of_dtypes,                
         'is_numerical': is_numerical_init,                                     
         'unique_value_counts': unique_value_counts})                           
    null_sum = df.isnull().sum()                                                
    null_sum.name = 'null_sum'                                                  
    metadata_frame = pd.merge(                                                  
        metadata_frame,                                                         
        null_sum,                                                               
        left_on='variable',                                                     
        right_index=True)                                                       
    metadata_frame['samples_missing'] = metadata_frame['null_sum'] > 0          
    total_samples = len(df)                                                     
    metadata_frame['percent_missing'] = metadata_frame['null_sum'] / total_samples
    for feature in categorical_features:                                        
        metadata_frame.loc[metadata_frame.variable ==                           
                           feature, ['is_numerical']] = False                   
    for feature in numerical_features:                                          
        if df[feature].nunique() < numerical_threshold:                                                       
            metadata_frame.loc[metadata_frame.variable ==                       
                               feature, ['is_numerical']] = False               
    return metadata_frame
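
For example, a hypothetical inspection (again assuming ames_data) of which columns were classified as numerical and how much data is missing:

meta = df_metadata(ames_data, numerical_threshold=50)
print(meta[['variable', 'is_numerical', 'percent_missing']].head())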

The shorten_param function is defined as follows:

def shorten_param(param_name):                                                  
    if "__" in param_name:                                                      
        if len(param_name.rsplit(" ", 1)) < 2:                                  
            return param_name.rsplit("__", 1)[1]                                
        else:                                                                   
            return str(shorten_param(param_name.rsplit(" ", 1)[                 
                       0])) + " " + shorten_param(' '.join(param_name.rsplit(" ", 1)[1:]))
    return param_name 
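
It strips the ColumnTransformer prefixes from generated feature names, for example (LotArea is just an illustrative Ames column name):

print(shorten_param('numerical_imputer__LotArea'))  # -> LotArea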

With all this in place, you can do something like the following to run your model.

standard_scaler_transformer = StandardScaler()
# The identity transformer scales nothing; it only passes the selected
# columns through unchanged.
identity_transformer = ColumnTransformer(
    [('unused_scaler', standard_scaler_transformer, [])],
    remainder='passthrough')
my_mi_selector = mi_selector(mi_threshold=0.2, target_label=target_label)
# Keep only the columns chosen by the selector; drop everything else.
mi_filter = ColumnTransformer(
    [('identity_transformer', identity_transformer, my_mi_selector)],
    remainder='drop')
brute_imputer = SimpleImputer(strategy='most_frequent')
ames_target = ames_data[ames_target_name]
my_model = Pipeline([('mi_filter', mi_filter), ('brute_imputer', brute_imputer), ('Ridge', Ridge())])
my_model.fit(X=ames_data, y=ames_target)

To my original point about the utility of this, I think it’s useful very early on in the process when you’re trying to figure out which features are important, and which are not. Here are some challenges that I ran into:

  • Since feature selection happens at runtime, you are limited in what feature engineering you can do. For example, for imputation I had to use most_frequent, because that strategy works for either feature type.
  • It would be really cool if there were a way to track variable locations as they enter numpy land. For example, in compute_mutual_information, the transformed ndarray has to be put back into a DataFrame before mutual information is computed, just to keep traceability between each feature and its mutual_information value.
  • You cannot treat mi_threshold as a hyperparameter, because the selector is a callable, so model.get_params() sees essentially a memory address.

I look forward to the day when mutual_information is a hyperparameter that you can tune, and when you can set policies for the types of feature engineering you want to apply.

Answered By: Pritam Dodeja