Is there a way to use mutual information as part of a pipeline in scikit-learn?
Question:
I’m creating a model with scikit-learn. The pipeline that seems to be working best is:
- mutual_info_classif with a threshold – i.e. only include fields whose mutual information score is above a given threshold.
- PCA
- LogisticRegression
I’d like to do them all using sklearn’s pipeline object, but I’m not sure how to get the mutual info classification in. For the second and third steps I do:
pca = PCA(random_state=100)
lr = LogisticRegression(random_state=200)
pipe = Pipeline(
    [
        ('dim_red', pca),
        ('pred', lr)
    ]
)
But I don’t see a way to include the first step. I know I can create my own class to do this, and I will if I have to, but is there a way to do this within sklearn?
Answers:
You can implement your own estimator by subclassing BaseEstimator. Then, you can pass it as the estimator to a SelectFromModel instance, which can be used in a Pipeline:
from sklearn.feature_selection import SelectFromModel, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.base import BaseEstimator
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA

X = [[ 0.87, -1.34,  0.31],
     [-2.79, -0.02, -0.85],
     [-1.34, -0.48, -2.55],
     [ 1.92,  1.48,  0.65]]
y = [0, 1, 0, 1]

class MutualInfoEstimator(BaseEstimator):
    def __init__(self, discrete_features='auto', n_neighbors=3, copy=True, random_state=None):
        self.discrete_features = discrete_features
        self.n_neighbors = n_neighbors
        self.copy = copy
        self.random_state = random_state

    def fit(self, X, y):
        # SelectFromModel reads the per-feature scores from feature_importances_
        self.feature_importances_ = mutual_info_classif(X, y, discrete_features=self.discrete_features,
                                                        n_neighbors=self.n_neighbors,
                                                        copy=self.copy, random_state=self.random_state)
        # Return self, as scikit-learn estimators are expected to
        return self

feat_sel = SelectFromModel(estimator=MutualInfoEstimator(random_state=0))
pca = PCA(random_state=100)
lr = LogisticRegression(random_state=200)
pipe = Pipeline(
    [
        ('feat_sel', feat_sel),
        ('pca', pca),
        ('pred', lr)
    ]
)
print(pipe)
Pipeline(steps=[('feat_sel',
                 SelectFromModel(estimator=MutualInfoEstimator(random_state=0))),
                ('pca', PCA(random_state=100)),
                ('pred', LogisticRegression(random_state=200))])
Note that of course the new estimator should expose the parameters you want to tweak during optimisation. Here I just exposed all of them.
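As a rough sketch of how such a pipeline might be fitted with an explicit cut-off: SelectFromModel accepts a threshold argument and keeps the features whose importance is at or above it. The synthetic data, the 0.1 cut-off and the names X_demo, y_demo and pipe_demo are assumptions made for illustration, not part of the answer above.

```python
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectFromModel, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


class MutualInfoEstimator(BaseEstimator):
    """Minimal estimator exposing mutual information as feature_importances_."""

    def __init__(self, n_neighbors=3, random_state=None):
        self.n_neighbors = n_neighbors
        self.random_state = random_state

    def fit(self, X, y):
        self.feature_importances_ = mutual_info_classif(
            X, y, n_neighbors=self.n_neighbors, random_state=self.random_state)
        return self


# Synthetic data (assumption): only column 0 depends on the label.
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=200)
X_demo = rng.normal(size=(200, 5))
X_demo[:, 0] += 3 * y_demo

pipe_demo = Pipeline([
    # threshold=0.1 is an illustrative cut-off on the mutual information score
    ('feat_sel', SelectFromModel(MutualInfoEstimator(random_state=0), threshold=0.1)),
    ('pca', PCA(random_state=100)),
    ('pred', LogisticRegression(random_state=200)),
])
pipe_demo.fit(X_demo, y_demo)

# Boolean mask of the columns that survived the selection step
support = pipe_demo.named_steps['feat_sel'].get_support()
print(support)
```

Because threshold is a constructor parameter of SelectFromModel, it can be addressed as feat_sel__threshold in a grid search, and the wrapped estimator's parameters as feat_sel__estimator__n_neighbors and so on.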
Yeah, I do not think there is another way to do it. At least not that I know!
How about SelectKBest or SelectPercentile:
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# k=10 assumes the data has at least 10 features
mi_best = SelectKBest(score_func=mutual_info_classif, k=10)
pca = PCA(random_state=100)
lr = LogisticRegression(random_state=200)
pipe = Pipeline(
    [
        ('select', mi_best),
        ('dim_red', pca),
        ('pred', lr),
    ]
)
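One caveat: mutual_info_classif adds a small amount of random noise to continuous features, so the scores can vary slightly between runs unless random_state is fixed. One possible way to pin it inside SelectKBest is functools.partial, sketched here on synthetic data. The dataset, k=5 and the names mi_score and pipe_kbest are illustrative assumptions.

```python
from functools import partial

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Synthetic data (assumption): 20 features, 5 of them informative
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0)

# partial() pins random_state so the scores are reproducible
mi_score = partial(mutual_info_classif, random_state=0)

pipe_kbest = Pipeline([
    ('select', SelectKBest(score_func=mi_score, k=5)),
    ('dim_red', PCA(random_state=100)),
    ('pred', LogisticRegression(random_state=200)),
])
pipe_kbest.fit(X_demo, y_demo)

# Number of columns kept by the selection step
n_kept = int(pipe_kbest.named_steps['select'].get_support().sum())
print(n_kept)
```

Note that SelectKBest and SelectPercentile rank features and keep the top ones. They do not apply a cut-off on the score itself, so they only approximate the threshold behaviour asked for in the question.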
It is possible to do this, but the utility of doing this will vary depending on where in the ML workflow you are. I will describe how I got something similar to work.
High Level:
A selector in a ColumnTransformer is just a callable that returns a list of columns when it is passed the dataframe. We can use this to do what you’re trying to do. We can define it as follows:
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import cross_validate
from sklearn.dummy import DummyRegressor
from sklearn.feature_selection import mutual_info_classif
from sklearn.feature_selection import mutual_info_regression
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_selector as selector
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
def mi_selector(mi_threshold=0.05, target_label=None, random_state=None):
    # random_state is passed through to compute_mutual_information
    def selector_to_return(df):
        mi_df = compute_mutual_information(
            df=df,
            target_label=target_label,
            random_state=random_state)
        matching_variables = mi_df[mi_df.loc[:, 'mutual_information'] > mi_threshold].index.tolist()
        matching_features = []
        # Remove target
        if target_label in matching_variables:
            matching_variables.remove(target_label)
        # Only return from features that were in original df
        # since we compute more than that as we impute, encode etc.
        for feature_name in df.columns.tolist():
            if feature_name in matching_variables:
                matching_features.append(feature_name)
        return matching_features
    return selector_to_return
What this does is use the mutual_information computed by compute_mutual_information to create a selector which can be plugged into a Pipeline.
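To make the callable-selector mechanism concrete, here is a minimal sketch that is independent of mutual information. The helper non_constant_columns and the toy dataframe are hypothetical; the only library behaviour relied on is that ColumnTransformer accepts a callable in place of a column list and calls it with the input data at fit time.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer


def non_constant_columns(df):
    """Hypothetical selector: names of columns with more than one distinct value."""
    return [name for name in df.columns if df[name].nunique() > 1]


toy_df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [0, 0, 0, 0], 'c': [5, 5, 5, 6]})

# The callable decides which columns pass through; the rest are dropped
keep_filter = ColumnTransformer(
    [('keep', 'passthrough', non_constant_columns)], remainder='drop')
kept = keep_filter.fit_transform(toy_df)
print(kept.shape)
```

Swapping non_constant_columns for a function that scores columns against the target gives the mutual information filter described here.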
Since we’re computing mutual information, we need to know what the target is. We also need to know whether it’s a classification or regression problem. This is the part that compute_mutual_information solves using the following approach:
- Figure out the metadata for the dataframe, physical datatypes as well as whether or not the feature is numeric or not etc., along with some other stats (second last code snippet)
- Figure out whether it’s a regression or a classification problem to set up the right function for computing mutual information
- Impute and encode/scale the data and reconstruct the processed dataframe
- Compute mutual information on the processed dataframe
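The second step, choosing between the two mutual information functions, can be sketched in isolation. This version swaps in scikit-learn's type_of_target utility rather than the unique-value heuristic used in this answer, so it is an alternative under that assumption and not the same rule. The helper name pick_mi_function is hypothetical.

```python
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression
from sklearn.utils.multiclass import type_of_target


def pick_mi_function(y):
    """Hypothetical helper: choose the MI estimator from the target's type."""
    # type_of_target returns strings such as 'continuous', 'binary', 'multiclass'
    if type_of_target(y).startswith('continuous'):
        return mutual_info_regression
    return mutual_info_classif


chosen_for_floats = pick_mi_function([0.3, 1.7, 2.2, 0.9])
chosen_for_labels = pick_mi_function(['cat', 'dog', 'cat', 'dog'])
print(chosen_for_floats.__name__, chosen_for_labels.__name__)
```

One difference to keep in mind: a target stored as integers, such as a price, is reported as 'multiclass' by type_of_target and would be treated as classification here, whereas the unique-value threshold would treat it as regression once it has enough distinct values.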
def compute_mutual_information(df, target_label, random_state):
    # Analyze data frame
    meta_df = df_metadata(df, numerical_threshold=50)
    target_is_numerical = meta_df.loc[meta_df.variable == target_label][
        'is_numerical'].iloc[0]
    # Determine problem type
    if target_is_numerical:
        problem_type = 'regression'
        mutual_information_function = mutual_info_regression
    else:
        problem_type = 'classification'
        mutual_information_function = mutual_info_classif
    # Select feature types
    my_numerical_selector = feature_type_selector(dtype_include='numerical')
    my_categorical_selector = feature_type_selector(dtype_include='categorical')
    numerical_features = my_numerical_selector(df)
    categorical_features = my_categorical_selector(df)
    # Remove target label from features
    for feature_list in [numerical_features, categorical_features]:
        if target_label in feature_list:
            feature_list.remove(target_label)
    # Transform df
    imputation_preprocessor = ColumnTransformer(
        [('numerical_imputer',
          SimpleImputer(strategy='median', add_indicator=True),
          numerical_features),
         ('categorical_imputer',
          SimpleImputer(strategy='most_frequent', add_indicator=True),
          categorical_features)],
        remainder='passthrough')
    # We need to figure out the indices to the features that are supposed to be
    # scaled and encoded by the next step
    post_imputation_np = imputation_preprocessor.fit_transform(df)
    feature_name_np_array = imputation_preprocessor.get_feature_names_out()
    categorical_feature_indices = np.zeros(len(categorical_features))
    numerical_feature_indices = np.zeros(len(numerical_features))
    for position, feature in enumerate(categorical_features):
        # [0][0] takes the single matching position as a scalar
        categorical_feature_indices[position] = np.where(
            feature_name_np_array == 'categorical_imputer__' + feature)[0][0]
    for position, feature in enumerate(numerical_features):
        numerical_feature_indices[position] = np.where(
            feature_name_np_array == 'numerical_imputer__' + feature)[0][0]
    categorical_feature_indices = categorical_feature_indices.astype(
        int).tolist()
    numerical_feature_indices = numerical_feature_indices.astype(int).tolist()
    numeric_and_categorical_transformer = ColumnTransformer(
        [('OneHotEncoder', OneHotEncoder(),
          categorical_feature_indices),
         ('StandardScaler', StandardScaler(),
          numerical_feature_indices)],
        remainder='passthrough')
    preprocessor = Pipeline(
        [('imputation_preprocessor', imputation_preprocessor),
         ('numeric_and_categorical_transformer',
          numeric_and_categorical_transformer)])
    df_transformed_np = preprocessor.fit_transform(df)
    preprocessed_feature_names = list(preprocessor.get_feature_names_out())
    # ColumnTransformer may return a sparse matrix or a dense array
    if hasattr(df_transformed_np, 'toarray'):
        df_transformed_np = df_transformed_np.toarray()
    df_transformed = pd.DataFrame(
        df_transformed_np,
        columns=preprocessed_feature_names)
    df_transformed = df_transformed.rename(shorten_param, axis=1)
    estimated_mutual_information = mutual_information_function(
        X=df_transformed, y=df[target_label], random_state=random_state)
    estimated_mutual_information_df = pd.DataFrame(
        estimated_mutual_information.T.reshape(
            1, -1), columns=preprocessed_feature_names)
    estimated_mutual_information_df = estimated_mutual_information_df.rename(
        shorten_param,
        axis=1)
    estimated_mutual_information_df = estimated_mutual_information_df.T
    estimated_mutual_information_df.columns = ['mutual_information']
    estimated_mutual_information_df = estimated_mutual_information_df.sort_values(
        by=['mutual_information'])
    return estimated_mutual_information_df
The above used a feature_type_selector, which is defined as follows:
def feature_type_selector(dtype_include=None):
    def nested_function(df):
        meta_df = df_metadata(df)
        if dtype_include == 'numerical':
            return meta_df.loc[meta_df.is_numerical, 'variable'].tolist()
        else:
            return meta_df.loc[meta_df.is_numerical ==
                               False, 'variable'].tolist()
    return nested_function
The metadata analysis of the dataframe does the following:
- Determine variable types
- Figure out, with some threshold, which features are really categoricals encoded as numericals
- Percentage missing data etc.
def df_metadata(df, numerical_threshold=50):
    list_of_variables = list(df.dtypes.index)
    list_of_dtypes = [df.dtypes[variable] for variable in list_of_variables]
    categorical_selector = selector(dtype_include=object)
    numerical_selector = selector(dtype_exclude=object)
    unique_value_counts = [df[variable].nunique()
                           for variable in list_of_variables]
    categorical_features = categorical_selector(df)
    numerical_features = numerical_selector(df)
    is_numerical_init = [True] * len(list_of_variables)
    metadata_frame = pd.DataFrame(
        {'variable': list_of_variables, 'dtype': list_of_dtypes,
         'is_numerical': is_numerical_init,
         'unique_value_counts': unique_value_counts})
    null_sum = df.isnull().sum()
    null_sum.name = 'null_sum'
    metadata_frame = pd.merge(
        metadata_frame,
        null_sum,
        left_on='variable',
        right_index=True)
    metadata_frame['samples_missing'] = metadata_frame['null_sum'] > 0
    total_samples = len(df)
    metadata_frame['percent_missing'] = metadata_frame['null_sum'] / total_samples
    for feature in categorical_features:
        metadata_frame.loc[metadata_frame.variable ==
                           feature, ['is_numerical']] = False
    for feature in numerical_features:
        if df[feature].nunique() < numerical_threshold:
            metadata_frame.loc[metadata_frame.variable ==
                               feature, ['is_numerical']] = False
    return metadata_frame
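The core rule in df_metadata (object columns are categorical, and so are numeric columns with fewer distinct values than numerical_threshold) can be condensed into a few lines. This is a simplified sketch that leaves out the missing-data statistics. The helper name split_feature_types and the toy dataframe are assumptions.

```python
import pandas as pd


def split_feature_types(df, numerical_threshold=50):
    """Hypothetical condensed version of the numerical/categorical split."""
    non_object_columns = df.select_dtypes(exclude=object).columns
    # A non-object column counts as numerical only with enough distinct values
    numerical = [name for name in non_object_columns
                 if df[name].nunique() >= numerical_threshold]
    categorical = [name for name in df.columns if name not in numerical]
    return numerical, categorical


toy_df = pd.DataFrame({
    'area': [float(i) for i in range(100)],   # 100 distinct values
    'rooms': [i % 4 for i in range(100)],     # only 4 distinct values
    'street': ['paved', 'gravel'] * 50,       # strings
})
numerical, categorical = split_feature_types(toy_df)
print(numerical, categorical)
```

Here rooms ends up categorical even though it is stored as integers, which is the "categoricals encoded as numericals" case the threshold is meant to catch.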
The shorten_param function is defined as follows:
def shorten_param(param_name):
    if "__" in param_name:
        if len(param_name.rsplit(" ", 1)) < 2:
            return param_name.rsplit("__", 1)[1]
        else:
            return str(shorten_param(param_name.rsplit(" ", 1)[
                0])) + " " + shorten_param(' '.join(param_name.rsplit(" ", 1)[1:]))
    return param_name
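A quick check of what shorten_param does to prefixed names of the kind ColumnTransformer produces. The function is reproduced so the snippet runs on its own, and the feature names are made up for illustration.

```python
def shorten_param(param_name):
    # Same logic as the definition above, repeated to keep this self-contained
    if "__" in param_name:
        if len(param_name.rsplit(" ", 1)) < 2:
            # No space in the name: keep only what follows the last "__"
            return param_name.rsplit("__", 1)[1]
        else:
            # Space-separated name: shorten each part separately
            return str(shorten_param(param_name.rsplit(" ", 1)[0])) + " " + \
                shorten_param(' '.join(param_name.rsplit(" ", 1)[1:]))
    return param_name


examples = [
    'numerical_imputer__LotArea',
    'OneHotEncoder__categorical_imputer__Street',
    'LotArea',
]
shortened = [shorten_param(name) for name in examples]
print(shortened)  # ['LotArea', 'Street', 'LotArea']
```

Different full names can collapse to the same short name, as the first and last examples do, so the renamed dataframe may contain duplicate column labels.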
With all this in place, you can do something like the following to run your model.
# ames_data (a DataFrame that includes the target column) and ames_target_name
# (the name of that column) are assumed to be defined already
standard_scaler_transformer = StandardScaler()
identity_transformer = ColumnTransformer([('unused_scaler', standard_scaler_transformer, [])], remainder='passthrough')
my_mi_selector = mi_selector(mi_threshold=0.2, target_label=ames_target_name)
mi_filter = ColumnTransformer([('identity_transformer', identity_transformer, my_mi_selector)], remainder='drop')
brute_imputer = SimpleImputer(strategy='most_frequent')
ames_target = ames_data[ames_target_name]
my_model = Pipeline([('mi_filter', mi_filter), ('brute_imputer', brute_imputer), ('Ridge', Ridge())])
my_model.fit(X=ames_data, y=ames_target)
To my original point about the utility of this, I think it’s useful very early on in the process when you’re trying to figure out which features are important, and which are not. Here are some challenges that I ran into:
- Since feature selection happens at runtime, you are limited in what feature engineering you can do. For example, in imputation, I had to use most_frequent because that works for whichever feature type.
- It would be really cool if there was a way to track variable locations as they enter numpy land. For example, in compute_mutual_information, post transformation, the ndarray needs to be put back into a DataFrame and then the mutual_information computed in order to have traceability of the mutual_information values for each feature.
- You cannot treat mi_threshold as a hyperparameter, because the selector is a callable, and model.get_params() has essentially a memory address.
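One possible way around the last limitation is to move the threshold into the constructor of a small transformer, so that get_params() reports it as a plain number and a grid search can vary it. The class below is a hypothetical sketch under narrow assumptions: numeric input, a classification target, and no imputation or encoding. It is not a drop-in replacement for the ColumnTransformer approach.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_selection import mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


class MutualInfoThreshold(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: keep columns whose MI with y exceeds mi_threshold."""

    def __init__(self, mi_threshold=0.05, random_state=None):
        self.mi_threshold = mi_threshold
        self.random_state = random_state

    def fit(self, X, y):
        scores = mutual_info_classif(
            np.asarray(X), y, random_state=self.random_state)
        # Boolean mask of the columns to keep
        self.support_ = scores > self.mi_threshold
        return self

    def transform(self, X):
        return np.asarray(X)[:, self.support_]


# Synthetic data (assumption): only column 0 depends on the label.
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=200)
X_demo = rng.normal(size=(200, 5))
X_demo[:, 0] += 3 * y_demo

search = GridSearchCV(
    Pipeline([('mi_filter', MutualInfoThreshold(random_state=0)),
              ('pred', LogisticRegression())]),
    param_grid={'mi_filter__mi_threshold': [0.05, 0.2]},
    cv=3)
search.fit(X_demo, y_demo)
print(search.best_params_)
```

If a threshold is set so high that no column survives, the downstream estimator will fail to fit, so the grid should stay within the range of scores the data actually produces.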
I look forward to the day when mutual_information is a hyperparameter that you can tune, and you set policies for the types of feature engineering you want to apply.