Using Sklearn's GridSearchCV for finding best imputation method without estimator

Question:

I’d like to find the best imputation method for missing data in Scikit-learn.

I have a dataset X and I have created an artificially corrupted version of it in X_na, so I can measure the quality of different imputations. I'm wondering whether I can use sklearn's GridSearchCV to search over the candidate imputers like this:

imputer_pipe = Pipeline([("imputer", SimpleImputer())])

params = [{"imputer":[SimpleImputer()]},
          {"imputer":[IterativeImputer()]},
          {"imputer":[KNNImputer()], "imputer__n_neighbors": [3, 5, 7]}]

imputer_grid = GridSearchCV(imputer_pipe, param_grid=params, scoring="mse", cv=5)
imputer_grid.fit(X_na, X)

But the problem is that imputer_grid.fit doesn't channel X_na and X to the imputer pipeline, so I cannot instruct it to compare the imputed X_na with X by scoring (MSE). It seems the pipeline must end in some object whose .fit() accepts both X and y.

Asked By: Fredrik

Answers:

The imputers have no predict method, so GridSearchCV has nothing to score. You can add a custom final step that simply returns its input, i.e. the imputed matrix passed on by the imputer. Below is something I adapted from DummyRegressor:

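from sklearn.base import BaseEstimator, MultiOutputMixin, RegressorMixin
from sklearn.utils import check_array, check_consistent_length

class IdentityFunction(MultiOutputMixin, RegressorMixin, BaseEstimator):
    """Final pipeline step whose predict() returns its input unchanged."""

    def __init__(self):
        pass

    def fit(self, X, y):
        # Nothing is learned; only validate y (here, the complete matrix).
        y = check_array(y, ensure_2d=False)
        if len(y) == 0:
            raise ValueError("y must not be empty.")

        check_consistent_length(X, y)
        # Set a fitted attribute so check_is_fitted-style checks pass.
        self.is_fitted_ = True
        return self

    def predict(self, X):
        # X is the imputer's output, so the scorer compares it with y.
        return X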
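
To see why this works: Pipeline.predict runs the data through the imputer's transform and then calls predict on the last step, so the "prediction" is just the imputed matrix. Here is a small sketch of that, not the exact class above. PassThrough is a hypothetical trimmed-down stand-in for IdentityFunction, and the tiny arrays are made up for illustration.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline


class PassThrough(RegressorMixin, BaseEstimator):
    """Hypothetical minimal stand-in for IdentityFunction."""

    def fit(self, X, y=None):
        self.is_fitted_ = True  # fitted marker, nothing is learned
        return self

    def predict(self, X):
        return X  # hand back whatever the previous step produced


# Made-up data: X_true is the complete matrix, X_na has two cells removed.
X_true = np.array([[1.0, 6.0], [3.0, 4.0], [2.0, 8.0]])
X_na = np.array([[1.0, np.nan], [3.0, 4.0], [np.nan, 8.0]])

pipe = Pipeline([("imputer", SimpleImputer(strategy="mean")),
                 ("identity", PassThrough())])
pipe.fit(X_na, X_true)

# predict() returns the mean-imputed matrix; in this toy case the column
# means of the observed values (2.0 and 6.0) happen to equal the true values.
X_pred = pipe.predict(X_na)
print(X_pred)
```

Because predict returns the imputed matrix, any regression scorer given y = X ends up comparing the imputation with the complete data.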

Then we define the pipeline and the parameter grid:

from sklearn.pipeline import Pipeline
from sklearn.experimental import enable_iterative_imputer  # noqa
from sklearn.impute import SimpleImputer, IterativeImputer, KNNImputer
from sklearn.model_selection import GridSearchCV
import numpy as np

imputer_pipe = Pipeline([("imputer", SimpleImputer()),
                         ("identity", IdentityFunction())])

params = [{"imputer":[SimpleImputer()]},
          {"imputer":[IterativeImputer()]},
          {"imputer":[KNNImputer()], "imputer__n_neighbors": [3, 5, 7]}]

Use a dummy dataset and fit:

X = np.random.uniform(0, 1, (100, 3))
# Mark every value below 0.3 as missing (roughly 30% of the cells)
X_na = np.where(X < 0.3, np.nan, X)

imputer_grid = GridSearchCV(imputer_pipe, param_grid=params,
                            scoring="neg_mean_squared_error", cv=5)
imputer_grid.fit(X_na, X)
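
One thing to keep in mind with "neg_mean_squared_error": it averages over every cell, including the ones that were never missing and are therefore reproduced exactly. If you would rather score only the cells that were actually imputed, GridSearchCV also accepts a callable with the signature scorer(estimator, X, y). The following is a sketch under assumptions: masked_neg_mse and PassThrough are hypothetical names of mine (PassThrough is a trimmed-down stand-in for IdentityFunction), and the grid is shortened to two imputers.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


class PassThrough(RegressorMixin, BaseEstimator):
    """Hypothetical minimal stand-in for IdentityFunction."""

    def fit(self, X, y=None):
        self.is_fitted_ = True
        return self

    def predict(self, X):
        return X


def masked_neg_mse(estimator, X, y):
    """Negative MSE computed only over the cells that are NaN in X."""
    missing = np.isnan(X)
    if not missing.any():
        return 0.0  # nothing was imputed in this fold
    imputed = estimator.predict(X)
    return -float(np.mean((imputed[missing] - np.asarray(y)[missing]) ** 2))


rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (100, 3))
X_na = np.where(X < 0.3, np.nan, X)

pipe = Pipeline([("imputer", SimpleImputer()),
                 ("identity", PassThrough())])
params = [{"imputer": [SimpleImputer()]},
          {"imputer": [KNNImputer()], "imputer__n_neighbors": [3, 5]}]

# Pass the callable instead of a scorer name; higher (closer to 0) is better.
grid = GridSearchCV(pipe, param_grid=params, scoring=masked_neg_mse, cv=5)
grid.fit(X_na, X)
print(grid.best_params_, grid.best_score_)
```

With the same fraction of missing cells in every fold the ranking will usually agree with the plain MSE, but the scores become errors per imputed cell, which is easier to interpret.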

The selected pipeline is shown below. It is not informative here, because the columns of the dummy matrix are independent random values, so there is nothing for an imputer to learn from:

Pipeline(steps=[('imputer', IterativeImputer()),
                ('identity', IdentityFunction())])
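
To compare all candidates rather than only the winner, you can look at cv_results_. The sketch below reruns the search under my own assumptions: PassThrough is a hypothetical trimmed-down stand-in for IdentityFunction, and the synthetic data has deliberately correlated columns (my choice, not part of the setup above) so that the imputers have some structure to exploit.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


class PassThrough(RegressorMixin, BaseEstimator):
    """Hypothetical minimal stand-in for IdentityFunction."""

    def fit(self, X, y=None):
        self.is_fitted_ = True
        return self

    def predict(self, X):
        return X


# Assumed toy data: three noisy copies of one latent variable.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X = np.hstack([base, 2 * base, -base]) + 0.1 * rng.normal(size=(200, 3))

# Remove about 20% of the cells, but never a whole row.
mask = rng.uniform(size=X.shape) < 0.2
mask[mask.all(axis=1), 0] = False
X_na = np.where(mask, np.nan, X)

pipe = Pipeline([("imputer", SimpleImputer()),
                 ("identity", PassThrough())])
params = [{"imputer": [SimpleImputer()]},
          {"imputer": [IterativeImputer()]},
          {"imputer": [KNNImputer()], "imputer__n_neighbors": [3, 5, 7]}]

grid = GridSearchCV(pipe, param_grid=params,
                    scoring="neg_mean_squared_error", cv=5)
grid.fit(X_na, X)

# Sort candidates from best to worst and print the MSE (sign flipped back).
ranked = sorted(zip(grid.cv_results_["mean_test_score"],
                    grid.cv_results_["params"]),
                key=lambda pair: pair[0], reverse=True)
for score, candidate in ranked:
    print(f"MSE={-score:.4f}  {candidate}")
```

The first printed row corresponds to best_estimator_; the remaining rows show how far behind the other imputers are on this particular data.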
Answered By: StupidWolf