sklearn cross_val_score() returns NaN values

Question:

I'm trying to predict the next customer purchase for my job. I followed a guide, but when I tried to use the cross_val_score() function, it returned NaN values. [Google Colab notebook screenshot]

Variables:

  • X_train is a dataframe
  • X_test is a dataframe
  • y_train is a list
  • y_test is a list

Code:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=50)
X_train = X_train.reset_index(drop=True)
X_train
X_test = X_test.reset_index(drop=True)

y_train = y_train.astype('float')
y_test = y_test.astype('float')

models = []
models.append(("LR",LogisticRegression()))
models.append(("NB",GaussianNB()))
models.append(("RF",RandomForestClassifier()))
models.append(("SVC",SVC()))
models.append(("Dtree",DecisionTreeClassifier()))
models.append(("XGB",xgb.XGBClassifier()))
models.append(("KNN",KNeighborsClassifier()))´

for name,model in models:
   kfold = KFold(n_splits=2, random_state=22)
   cv_result = cross_val_score(model,X_train,y_train, cv = kfold,scoring = "accuracy")
   print(name, cv_result)
>>
LR [nan nan]
NB [nan nan]
RF [nan nan]
SVC [nan nan]
Dtree [nan nan]
XGB [nan nan]
KNN [nan nan]

help me please!

Asked By: Tomás Ortiz


Answers:

Well, thanks everyone for your answers. Anna's answer helped me a lot, but I didn't use X_train.values; instead, I assigned a unique ID to each customer and then dropped the Customers column, and it works (a sketch of that kind of fix is below).
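A minimal sketch of that approach, assuming the non-numeric identifier column is called "Customer" (the column names here are illustrative):

import pandas as pd

# Hypothetical column names; replace "Customer" with your actual identifier column.
X["Customer_ID"] = pd.factorize(X["Customer"])[0]   # unique integer per customer
X = X.drop(columns=["Customer"])                     # drop the non-numeric column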

Now the models have this output 🙂

LR [0.73958333 0.74736842]
NB [0.60416667 0.71578947]
RF [0.80208333 0.82105263]
SVC [0.79166667 0.77894737]
Dtree [0.82291667 0.83157895]
XGB [0.85416667 0.85263158]
KNN [0.79166667 0.75789474]
Answered By: Tomás Ortiz

The cross_val_score method returns NaN when there are null values in your dataset.

Either use a model which can deal with missing values or remove all the null values from your dataset and try again.
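One quick way to check for and handle missing values (a sketch, assuming X_train is a pandas DataFrame; SimpleImputer with the column mean is just one option):

import pandas as pd
from sklearn.impute import SimpleImputer

print(X_train.isnull().sum())   # count of missing values per column

# Option 1: drop the rows containing nulls
# X_train = X_train.dropna()

# Option 2: impute them, e.g. with each column's mean
imputer = SimpleImputer(strategy="mean")
X_train = pd.DataFrame(imputer.fit_transform(X_train), columns=X_train.columns)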

Answered By: Ayush Srivastava

In my case, I had a timedelta data type inside my NumPy array, which caused the NaN results.
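If you run into the same thing, one way to make such a column numeric is to convert the timedeltas to seconds first (a sketch; the column name is made up):

# Hypothetical timedelta column, e.g. time since the previous purchase
X_train["time_since_last"] = X_train["time_since_last"].dt.total_seconds()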

Answered By: jay_the_superman

For me, using xtrain.values and ytrain.values worked, as the cross-validation needed the input to be an array and not a DataFrame.
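In other words, something like this (a sketch, reusing the variable names from the question and assuming both are pandas objects):

cv_result = cross_val_score(model, X_train.values, y_train.values,
                            cv=kfold, scoring="accuracy")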

Answered By: Manoranjan Kumar

My case is a bit different. I was using cross_validate instead of cross_val_score, with a list of performance metrics. Doing 5-fold CV, I kept getting NaNs for all performance metrics for a RandomForestRegressor:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_validate

forest = RandomForestRegressor()
scorers = ['neg_mean_absolute_error', 'neg_root_mean_squared_error', 'r2', 'accuracy']

results = cross_validate(forest, X, y, cv=5, scoring=scorers, return_estimator=True)
results

Turns out, I had stupidly included the 'accuracy' metric, which is only used in classification. Instead of throwing an error, it looks like sklearn just returns NaNs in such cases.
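So for a regressor the fix is simply to drop the classification-only metric from the list (sketch):

scorers = ['neg_mean_absolute_error', 'neg_root_mean_squared_error', 'r2']
results = cross_validate(forest, X, y, cv=5, scoring=scorers, return_estimator=True)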

Answered By: Bex T.

I also faced this problem. I solved it this way: I converted X_train and y_train to DataFrames.
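A sketch of that conversion (assuming X_train and y_train start out as NumPy arrays or lists):

import pandas as pd

X_train = pd.DataFrame(X_train)   # wrap the features in a DataFrame
y_train = pd.DataFrame(y_train)   # same for the target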

cross_val_score(model,X_train,y_train, cv = kfold,scoring = "accuracy")
Answered By: Göktuğ Ozleyen

I know this is answered already, but for others who still cannot figure out the problem, this is for you…

Check whether your y data type is int or not. cross_val_score will return NaN if the data type of your y values is object.

How to check

y.dtype

How to change the data type

y = y.astype(int)

Answered By: theEconCsEngineer

I fixed the issue on my side. I was using a custom metric: the area under the precision-recall curve (AUCPR).

from sklearn.metrics import auc, precision_recall_curve
from sklearn.preprocessing import label_binarize

def pr_auc_score(y, y_pred, **kwargs):
    classes = list(range(y_pred.shape[1]))
    if len(classes) == 2:
        precision, recall, _ = precision_recall_curve(y, y_pred[:, 1], **kwargs)
    else:
        Y = label_binarize(y, classes=classes)
        precision, recall, _ = precision_recall_curve(Y.ravel(), y_pred.ravel(), **kwargs)
    return auc(recall, precision)

The problem is that, for a binary problem, y_pred contains only the predicted probability of label 1, so y_pred's shape is (n_samples,).
When I called y_pred.shape[1], it raised an error.

The solution: inside cross_validate, use the parameter error_score="raise". This will allow you to detect the error.
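For example, something along these lines (a sketch; model, X, and y are placeholders, and needs_proba=True is what makes sklearn pass predicted probabilities to the custom scorer):

from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_validate

pr_auc = make_scorer(pr_auc_score, needs_proba=True)

# error_score="raise" surfaces the real exception instead of silently
# recording NaN for the failing folds
results = cross_validate(model, X, y, cv=5, scoring=pr_auc, error_score="raise")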

Answered By: Adrien

I had the same problem, and I solved it in the same way as the author of this topic.

Answered By: Morksil