sklearn cross_val_score() returns NaN values
Question:
I'm trying to predict customers' next purchases for my job. I followed a guide, but when I tried to use the cross_val_score() function, it returned NaN values.
Variables:
- X_train is a dataframe
- X_test is a dataframe
- y_train is a list
- y_test is a list
Code:
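# Imports needed by the snippet below
import xgboost as xgb
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=50)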
X_train = X_train.reset_index(drop=True)
X_train
X_test = X_test.reset_index(drop=True)
y_train = y_train.astype('float')
y_test = y_test.astype('float')
models = []
models.append(("LR",LogisticRegression()))
models.append(("NB",GaussianNB()))
models.append(("RF",RandomForestClassifier()))
models.append(("SVC",SVC()))
models.append(("Dtree",DecisionTreeClassifier()))
models.append(("XGB",xgb.XGBClassifier()))
models.append(("KNN",KNeighborsClassifier()))
for name,model in models:
    # Newer scikit-learn versions require shuffle=True when random_state is set
    kfold = KFold(n_splits=2, shuffle=True, random_state=22)
    cv_result = cross_val_score(model,X_train,y_train, cv = kfold,scoring = "accuracy")
    print(name, cv_result)
>>
LR [nan nan]
NB [nan nan]
RF [nan nan]
SVC [nan nan]
Dtree [nan nan]
XGB [nan nan]
KNN [nan nan]
Please help!
Answers:
Thanks, everyone, for your answers. Anna's answer helped me a lot, but I didn't use X_train.values. Instead, I assigned a unique ID to each customer, then dropped the Customers column, and it works!
Now the models have this output 🙂
LR [0.73958333 0.74736842]
NB [0.60416667 0.71578947]
RF [0.80208333 0.82105263]
SVC [0.79166667 0.77894737]
Dtree [0.82291667 0.83157895]
XGB [0.85416667 0.85263158]
KNN [0.79166667 0.75789474]
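A minimal sketch of that fix, assuming the Customers column held customer names as strings (the thread does not show the data, so the toy frame and the recency, frequency and CustomerID names are made up for illustration):

```python
import pandas as pd

# Hypothetical toy data standing in for the asker's X_train
X_train = pd.DataFrame({
    "Customers": ["alice", "bob", "alice", "carol"],
    "recency": [12, 40, 3, 7],
    "frequency": [5, 1, 9, 2],
})

# List the columns the estimators cannot convert to float
non_numeric = X_train.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)

# Give each distinct customer an integer ID, then drop the text column
X_train["CustomerID"] = pd.factorize(X_train["Customers"])[0]
X_train = X_train.drop(columns=["Customers"])
print(X_train.dtypes)
```

Once every remaining column is numeric, the estimators can fit on each fold, so cross_val_score has real scores to report.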
The cross_val_score method returns NaN when there are null values in your dataset.
Either use a model that can deal with missing values, or remove all the null values from your dataset and try again.
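As a rough illustration of the null-value case, the sketch below counts missing values per column and then imputes them inside a pipeline instead of dropping rows. The toy data and the median strategy are assumptions for the example, not something taken from the question:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Toy data (assumed for illustration only)
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])
y = (X["a"] + X["b"] > 0).astype(int)
X.loc[::10, "c"] = np.nan  # inject missing values into one column

# Step 1: find out which columns contain nulls
null_counts = X.isna().sum()
print(null_counts[null_counts > 0])

# Step 2: impute inside the pipeline so each fold is imputed from its own training part
model = make_pipeline(SimpleImputer(strategy="median"), LogisticRegression())
kfold = KFold(n_splits=2, shuffle=True, random_state=22)
scores = cross_val_score(model, X, y, cv=kfold, scoring="accuracy")
print(scores)
```

Dropping the affected rows with X.dropna() (and the matching entries of y) is the other option mentioned above.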
In my case, I had a timedelta data type inside my NumPy array, and that caused the error.
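One way to spot and convert such a column, sketched on a made-up frame (the column names and the choice of days as the unit are assumptions):

```python
import pandas as pd

# Hypothetical features, one of which is a timedelta
df = pd.DataFrame({
    "amount": [10.0, 20.0, 15.0],
    "since_last_purchase": pd.to_timedelta([3, 10, 1], unit="D"),
})
print(df.dtypes)

# Find timedelta columns and turn them into plain floats (days)
td_cols = df.select_dtypes(include=["timedelta64"]).columns
for col in td_cols:
    df[col] = df[col].dt.total_seconds() / 86400.0
print(df.dtypes)
```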
For me, using xtrain.values and ytrain.values worked, as the cross-validation needed the input to be an array and not a DataFrame.
My case is a bit different. I was using cross_validate instead of cross_val_score, with a list of performance metrics. Doing 5-fold CV, I kept getting NaNs for all performance metrics for a RandomForestRegressor:
scorers = ['neg_mean_absolute_error', 'neg_root_mean_squared_error', 'r2', 'accuracy']
results = cross_validate(forest, X, y, cv=5, scoring=scorers, return_estimator=True)
results
It turns out I had mistakenly included the 'accuracy' metric, which is only used in classification. Instead of throwing an error, it looks like sklearn just returns NaNs in such cases.
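For comparison, here is a small sketch of the same call with only regression metrics. The synthetic data and the forest settings are assumptions, since the answer does not show how forest, X and y were built:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_validate

# Synthetic regression data (assumed for illustration)
X, y = make_regression(n_samples=120, n_features=5, noise=0.1, random_state=0)
forest = RandomForestRegressor(n_estimators=20, random_state=0)

# Only metrics that are defined for continuous targets
scorers = ["neg_mean_absolute_error", "neg_root_mean_squared_error", "r2"]
results = cross_validate(forest, X, y, cv=5, scoring=scorers)

for name in scorers:
    print(name, results[f"test_{name}"])
```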
I ran into the same problem. I solved it by converting X_train and y_train to DataFrames.
cross_val_score(model,X_train,y_train, cv = kfold,scoring = "accuracy")
I know this has already been answered, but for others who still cannot figure out the problem, this is for you…
Check whether the data type of y is int. It will return nan if the data type of y is object.
How to check:
y.dtype
How to change the data type:
y = y.astype(int)
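Put together, the check and the conversion might look like the sketch below. The synthetic data is an assumption, and astype(int) only works when every label can actually be converted to an integer:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic labels stored with object dtype, as can happen after loading or merging data
X, labels = make_classification(n_samples=100, n_features=5, random_state=0)
y = pd.Series(labels, dtype=object)
print(y.dtype)

# Convert to an integer dtype before cross-validating
if y.dtype == object:
    y = y.astype(int)
print(y.dtype)

scores = cross_val_score(LogisticRegression(), X, y, cv=3, scoring="accuracy")
print(scores)
```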
I fixed the issue on my side. I was using a custom metric (Area Under Curve Precision-Recall (AUCPR))
def pr_auc_score(y, y_pred, **kwargs):
    classes = list(range(y_pred.shape[1]))
    if len(classes) == 2:
        precision, recall, _ = precision_recall_curve(y, y_pred[:,1],
                                                      **kwargs)
    else:
        Y = label_binarize(y, classes=classes)
        precision, recall, _ = precision_recall_curve(Y.ravel(), y_pred.ravel(),
                                                      **kwargs)
    return auc(recall, precision)
The problem is that, for a binary problem, y_pred contains only the predicted probability of label 1, so y_pred's shape is (n_samples,).
When I try to access y_pred.shape[1], it raises an error.
The solution: inside cross_validate, use the parameter error_score="raise". This will allow you to detect the error.
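A small sketch of how error_score="raise" surfaces the hidden exception. The broken_metric function is a deliberately buggy stand-in written for this example (it is not the AUCPR scorer above), and the data is synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score

def broken_metric(y_true, y_pred):
    # Deliberately buggy: y_pred is 1-D here, so indexing shape[1] fails
    n_classes = y_pred.shape[1]
    return float(n_classes)

X, y = make_classification(n_samples=100, n_features=5, random_state=0)
scorer = make_scorer(broken_metric)

caught = None
try:
    # error_score="raise" re-raises the original exception instead of hiding it
    cross_val_score(LogisticRegression(), X, y, cv=3,
                    scoring=scorer, error_score="raise")
except Exception as exc:
    caught = exc

print(type(caught).__name__, caught)
```

The exception type and message usually point straight at the real cause, whether that is a non-numeric column, a null value, or a bug in a custom scorer.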
I had the same problem, and I solved it in the same way as the author of this topic.