Retrieve cross validation performance (AUC) on h2o AutoML for holdout dataset

Question:

I am training a binary classification model with h2o AutoML using the default cross-validation (nfolds=5). I need to obtain the AUC score for each holdout fold in order to compute the variability.

This is the code I am using:

import h2o
from h2o.automl import H2OAutoML

h2o.init()

prostate = h2o.import_file("https://h2o-public-test-data.s3.amazonaws.com/smalldata/prostate/prostate.csv")
# convert columns to factors
prostate['CAPSULE'] = prostate['CAPSULE'].asfactor()
prostate['RACE'] = prostate['RACE'].asfactor()
prostate['DCAPS'] = prostate['DCAPS'].asfactor()
prostate['DPROS'] = prostate['DPROS'].asfactor()

# set the predictor and response columns
predictors = ["AGE", "RACE", "VOL", "GLEASON"]
response_col = "CAPSULE"

# split into train and testing sets
train, test = prostate.split_frame(ratios = [0.8], seed = 1234)


aml = H2OAutoML(seed=1, max_runtime_secs=100, exclude_algos=["DeepLearning", "GLM"],
                    nfolds=5, keep_cross_validation_predictions=True)

aml.train(predictors, response_col, training_frame=prostate)

leader = aml.leader

I checked that the leader is not a StackedEnsemble model (for which the validation metrics are not available). Even so, I am not able to retrieve the five AUC scores.

Any idea on how to do so?

Asked By: A1010


Answers:

Here’s how it’s done:

import h2o
from h2o.automl import H2OAutoML

h2o.init()

# import prostate dataset
prostate = h2o.import_file("https://h2o-public-test-data.s3.amazonaws.com/smalldata/prostate/prostate.csv")
# convert columns to factors
prostate['CAPSULE'] = prostate['CAPSULE'].asfactor()
prostate['RACE'] = prostate['RACE'].asfactor()
prostate['DCAPS'] = prostate['DCAPS'].asfactor()
prostate['DPROS'] = prostate['DPROS'].asfactor()

# set the predictor and response columns
predictors = ["AGE", "RACE", "VOL", "GLEASON"]
response_col = "CAPSULE"

# split into train and testing sets
train, test = prostate.split_frame(ratios = [0.8], seed = 1234)

# run AutoML for 100 seconds
aml = H2OAutoML(seed=1, max_runtime_secs=100, exclude_algos=["DeepLearning", "GLM"],
                    nfolds=5, keep_cross_validation_predictions=True)
aml.train(x=predictors, y=response_col, training_frame=prostate)

# Get the leader model
leader = aml.leader

There is a caveat to mention here about cross-validated AUC — H2O currently stores two computations of CV AUC. One is an aggregated version (take the AUC of aggregated CV predictions), and the other is the "true" definition of cross-validated AUC (an average of the k AUCs from k-fold cross-validation). The latter is stored in an object which also contains the individual fold AUCs, as well as the standard deviation across the folds.

If you’re wondering why we do this, there are some historical and technical reasons why we have two versions, as well as an open ticket to only ever report the latter.

The first one is what you get when you do this (and also what appears on the AutoML Leaderboard).

# print CV AUC for leader model
print(leader.model_performance(xval=True).auc())

If you want the fold-wise AUCs so you can compute or view their mean and variability (standard deviation), you can do that by looking here:

# print CV metrics summary
leader.cross_validation_metrics_summary()

Output:

Cross-Validation Metrics Summary:
             mean        sd           cv_1_valid    cv_2_valid    cv_3_valid    cv_4_valid    cv_5_valid
-----------  ----------  -----------  ------------  ------------  ------------  ------------  ------------
accuracy     0.71842104  0.06419111   0.7631579     0.6447368     0.7368421     0.7894737     0.65789473
auc          0.7767409   0.053587236  0.8206676     0.70905924    0.7982079     0.82538515    0.7303846
aucpr        0.6907578   0.0834025    0.78737605    0.7141305     0.7147677     0.67790955    0.55960524
err          0.28157896  0.06419111   0.23684211    0.35526314    0.2631579     0.21052632    0.34210527
err_count    21.4        4.8785243    18.0          27.0          20.0          16.0          26.0
---          ---         ---          ---           ---           ---           ---           ---
precision    0.61751753  0.08747421   0.675         0.5714286     0.61702126    0.7241379     0.5
r2           0.20118153  0.10781976   0.3014902     0.09386432    0.25050205    0.28393403    0.07611712
recall       0.84506994  0.08513061   0.84375       0.9142857     0.9354839     0.7241379     0.8076923
rmse         0.435928    0.028099842  0.41264254    0.47447023    0.42546       0.41106534    0.4560018
specificity  0.62579334  0.15424488   0.70454544    0.41463414    0.6           0.82978725    0.58

See the whole table with leader.cross_validation_metrics_summary().as_data_frame().
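If you need the individual fold AUCs as plain numbers (for example, to compute the mean and standard deviation yourself), one option is to convert that summary to a pandas DataFrame and pull out the 'auc' row. A minimal sketch, assuming the metric row is labelled 'auc' and the fold columns follow the cv_*_valid naming shown above:

summary = leader.cross_validation_metrics_summary().as_data_frame()
# keep only the 'auc' row (the first column of the table holds the metric names)
auc_row = summary[summary.iloc[:, 0] == 'auc']
# the per-fold columns are named cv_1_valid ... cv_5_valid
fold_cols = [c for c in summary.columns if c.startswith('cv_')]
fold_aucs = auc_row[fold_cols].astype(float).values.ravel()

print(fold_aucs)                                 # the five holdout AUCs
print(fold_aucs.mean(), fold_aucs.std(ddof=1))   # their mean and standard deviation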

Here’s what the leaderboard looks like (it stores the aggregated CV AUCs). In this case, because the dataset is so small (300 rows), there’s a noticeable difference between the two reported CV AUC values; for larger datasets, they should be much closer estimates. A short sketch comparing the two values follows the leaderboard output below.

# print the whole Leaderboard (all CV metrics for all models)
lb = aml.leaderboard
print(lb)

That will print the top of the leaderboard:

model_id                                                  auc    logloss     aucpr    mean_per_class_error      rmse       mse
---------------------------------------------------  --------  ---------  --------  ----------------------  --------  --------
XGBoost_grid__1_AutoML_20200924_200634_model_2       0.769716   0.565326  0.668827                0.290806  0.436652  0.190665
GBM_grid__1_AutoML_20200924_200634_model_4           0.762993   0.56685   0.666984                0.279145  0.437634  0.191524
XGBoost_grid__1_AutoML_20200924_200634_model_9       0.762417   0.570041  0.645664                0.300121  0.440255  0.193824
GBM_grid__1_AutoML_20200924_200634_model_6           0.759912   0.572651  0.636713                0.30097   0.440755  0.194265
StackedEnsemble_BestOfFamily_AutoML_20200924_200634  0.756486   0.574461  0.646087                0.294002  0.441413  0.194845
GBM_grid__1_AutoML_20200924_200634_model_7           0.754153   0.576821  0.641462                0.286041  0.442533  0.195836
XGBoost_1_AutoML_20200924_200634                     0.75411    0.584216  0.626074                0.289237  0.443911  0.197057
XGBoost_grid__1_AutoML_20200924_200634_model_3       0.753347   0.57999   0.629876                0.312056  0.4428    0.196072
GBM_grid__1_AutoML_20200924_200634_model_1           0.751706   0.577175  0.628564                0.273603  0.442751  0.196029
XGBoost_grid__1_AutoML_20200924_200634_model_8       0.749446   0.576686  0.610544                0.27844   0.442314  0.195642

[28 rows x 7 columns]
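To see the gap concretely, here is a small sketch (reusing the aml and leader objects from above) that contrasts the aggregated CV AUC reported on the leaderboard with the mean of the five fold AUCs; the exact numbers will vary from run to run:

lb_df = aml.leaderboard.as_data_frame()
# aggregated CV AUC, i.e. the value shown on the leaderboard for the leader model
agg_auc = lb_df.loc[lb_df['model_id'] == leader.model_id, 'auc'].iloc[0]
# "true" cross-validated AUC: the mean of the five fold AUCs
cv_summary = leader.cross_validation_metrics_summary().as_data_frame()
fold_mean_auc = float(cv_summary[cv_summary.iloc[:, 0] == 'auc']['mean'].iloc[0])
print(agg_auc, fold_mean_auc)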
Answered By: Erin LeDell

I submitted the following task:
https://h2oai.atlassian.net/browse/PUBDEV-8984

This is for when you want to sort the models in a grid search by a specific cross-validation metric.

import h2o
import pandas as pd

def sort_grid(grid, metric):
    # input: a trained H2OGridSearch object and the name of a metric as it
    #        appears in cross_validation_metrics_summary() (e.g. 'auc', 'logloss')
    # output: the grid's models ordered by the cross-validated mean of that
    #         metric (as a pandas DataFrame), and the best model
    model_ids = []
    cross_val_values = []

    for model_id in grid.model_ids:
        model = h2o.get_model(model_id)
        # cross_validation_metrics_summary() returns a TwoDimTable; convert it to
        # pandas and look the metric up by name rather than by a hard-coded row
        # index, since the row order and naming differ between H2O versions
        summary = model.cross_validation_metrics_summary().as_data_frame()
        row = summary[summary.iloc[:, 0] == metric]
        if row.empty:
            raise ValueError("Metric '%s' not found for model %s" % (metric, model_id))
        model_ids.append(model_id)
        cross_val_values.append(float(row['mean'].iloc[0]))

    df = pd.DataFrame(
        {'Model_IDs': model_ids, metric: cross_val_values}
    )
    # descending order is appropriate for metrics where higher is better (e.g. AUC)
    df = df.sort_values([metric], ascending=False)
    best_model = h2o.get_model(df.iloc[0, 0])
    return df, best_model

I used this for a binary classification model.
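For instance, here is a hypothetical usage sketch, assuming a small GBM grid search trained with cross-validation on the same prostate data and the predictors/response defined earlier (the grid setup itself is illustrative, not from the original post):

from h2o.grid.grid_search import H2OGridSearch
from h2o.estimators.gbm import H2OGradientBoostingEstimator

# a small illustrative grid with cross-validation enabled
gbm_grid = H2OGridSearch(
    model=H2OGradientBoostingEstimator(nfolds=5, seed=1),
    hyper_params={'max_depth': [3, 5, 7]},
)
gbm_grid.train(x=predictors, y=response_col, training_frame=prostate)

# sort the grid's models by their cross-validated mean AUC
df, best_model = sort_grid(gbm_grid, 'auc')
print(df)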

Answered By: Tomás Araujo