MAE using Pipeline and GridSearchCV

Question:

I am facing a challenge finding Mean Average Error (MAE) using Pipeline and GridSearchCV

Background:

I have worked on a Data Science project (MWE as below) where a MAE value would be returned of a classifier as it’s performance metric.

#Library
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

#Data import and preparation
data = pd.read_csv("data.csv")
data_features = ['location','event_type_count','log_feature_count','total_volume','resource_type_count','severity_type']
X = data[data_features]
y = data.fault_severity

#Train Validation Split for Cross Validation
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)

#RandomForest Modeling
RF_model = RandomForestClassifier(n_estimators=100, random_state=0)
RF_model.fit(X_train, y_train)

#RandomForest Prediction
y_predict = RF_model.predict(X_valid)

#MAE 
print(mean_absolute_error(y_valid, y_predict))
#Output:
#   0.38727149627623564

Challenge:

Now I am trying to implement the same using Pipeline and GridSearchCV (MWE as below). The expectation is the same MAE value would be returned as above. Unfortunately I could not get it right using the 3 approaches below.

#Library
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

#Data import and preparation
data = pd.read_csv("data.csv")
data_features = ['location','event_type_count','log_feature_count','total_volume','resource_type_count','severity_type']
X = data[data_features]
y = data.fault_severity

#Train Validation Split for Cross Validation
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)

#RandomForest Modeling via Pipeline and Hyper-parameter tuning
steps = [('rf', RandomForestClassifier(random_state=0))]
pipeline = Pipeline(steps) # define the pipeline object.
parameters = {'rf__n_estimators':[100]}
grid = GridSearchCV(pipeline, param_grid=parameters, scoring='neg_mean_squared_error', cv=None, refit=True)
grid.fit(X_train, y_train)

#Approach 1:
print(grid.best_score_)
# Output:
#    -0.508130081300813

#Approach 2:
y_predict=grid.predict(X_valid)
print("score = %3.2f"%(grid.score(y_predict, y_valid)))
# Output:
#    ValueError: Expected 2D array, got 1D array instead:
#    array=[0. 0. 0. ... 0. 1. 0.].
#    Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

#Approach 3:
y_predict_df = pd.DataFrame(y_predict.reshape(len(y_predict), -1),columns=['fault_severity'])
print("score = %3.2f"%(grid.score(y_predict_df, y_valid)))
# Output: 
#    ValueError: Number of features of the model must match the input. Model n_features is 6 and input n_features is 1 

Discussion:

Approach 1:
As in GridSearchCV() the scoring variable is set to neg_mean_squared_error, tried to read the grid.best_score_. But it did not get the same MAE result.

Approach 2:
Tried to get the y_predict values using grid.predict(X_valid). Then tried to get the MAE using grid.score(y_predict, y_valid) as the scoring variable in GridSearchCV() is set to neg_mean_squared_error. It returned a ValueError complaining “Expected 2D array, got 1D array instead”.

Approach 3:
Tried to reshape y_predict and it did not work either. This time it returned “ValueError: Number of features of the model must match the input.”

It would be helpful if you can assist to point where I could have made the error?

If you need, the data.csv is available at https://www.dropbox.com/s/t1h53jg1hy4x33b/data.csv

Thank you very much

Asked By: Saravanan K

||

Answers:

You are trying to compare mean_absolute_error with neg_mean_squared_error which is very different refer here for more details. You should have used neg_mean_absolute_error in your GridSearchCV object creation like shown below:

grid = GridSearchCV(pipeline, param_grid=parameters,scoring='neg_mean_absolute_error', cv=None, refit=True)

Also, the score method in sklearn takes (X,y) as inputs, where x is your input feature of shape (n_samples, n_features) and y is the target labels, you need to change your grid.score(y_predict, y_valid) into grid.score(X_valid, y_valid).