NLP neural net validation accuracy increases too much (?) between folds in cross validation


I’m training a BERT model for classification with two labels. I’d like to use cross validation because I want an out-of-sample prediction for each observation in the data set, to use later in linear regressions. I train for 5 epochs (EPOCHS = 5).

The behavior of the first fold is as expected: the validation accuracy increases across EPOCHS and converges to the accuracy I get when running the neural net with the usual 80-10-10 split and the whole sample (about .86).

The strange part is that for the subsequent folds (2 to 5), the validation accuracy keeps increasing from fold to fold: to .90, .95, .98 and 1.0.

I believe the code is right, as I re-run the whole model from scratch for each fold. I’ve also manually checked the split and it seems to be OK: each fold’s validation set is random, unique, and does not overlap with its corresponding training set. A possible explanation could be that the weights are not reinitialized between folds? That would mean the validation observations of the later folds had already been used to compute the weights in earlier folds. But that looks strange to me.

My code is below. Any help or ideas would be much appreciated. Thank you!


# Create K train/test folds
n_folds = 5
kf = KFold(n_splits=n_folds, random_state=RANDOM_SEED, shuffle=True)

# Initialize matrix to store results across folds
train_acc_mat = [[0 for _ in range(EPOCHS)] for _ in range(n_folds)]
train_loss_mat = [[0 for _ in range(EPOCHS)] for _ in range(n_folds)]
val_acc_mat = [[0 for _ in range(EPOCHS)] for _ in range(n_folds)]
val_loss_mat = [[0 for _ in range(EPOCHS)] for _ in range(n_folds)]

# Create fold index to store results in matrix
fold_index = 1

# Loop across folds
for train_index, val_index in kf.split(df):
  df_train = df.iloc[train_index]
  df_val = df.iloc[val_index]
  # Run Data Loader function for each training data set - fold
  train_data_loader = create_data_loader(df_train, tokenizer, MAX_LEN, BATCH_SIZE)
  # Run Data Loader function for each validation data set - fold
  val_data_loader = create_data_loader(df_val, tokenizer, MAX_LEN, BATCH_SIZE)

  # Train data loader
  data = next(iter(train_data_loader))

  # Create input_ids and attention_mask
  input_ids = data['input_ids'].to(device)
  attention_mask = data['attention_mask'].to(device)

  # Set last layer classification function, other config
  F.sigmoid(model(input_ids, attention_mask))
  optimizer = AdamW(model.parameters(), lr=2e-5, correct_bias=False)
  total_steps = len(train_data_loader) * EPOCHS

  scheduler = get_linear_schedule_with_warmup(

  loss_fn = nn.CrossEntropyLoss().to(device)

  ###### TRAIN MODEL

  # Set accuracy to 0 to store fold best model results
  best_accuracy = 0

  # Create epoch index to store results in matrix
  EPOCH_index = 1

  # Iterate over EPOCHS, TRAIN MODEL
  for epoch in range(EPOCHS):

    print(f'Epoch {epoch + 1}/{EPOCHS}')
    print('-' * 10)

    train_acc, train_loss = train_epoch(

    print(f'Train loss {train_loss} accuracy {train_acc}')

    val_acc, val_loss = eval_model(

    print(f'Val   loss {val_loss} accuracy {val_acc}')

    # For fold_index, store results from EPOCH_index iteration
    train_acc_mat[fold_index-1][EPOCH_index-1] = train_acc.item()
    train_loss_mat[fold_index-1][EPOCH_index-1] = train_loss.item()
    val_acc_mat[fold_index-1][EPOCH_index-1] = val_acc.item()
    val_loss_mat[fold_index-1][EPOCH_index-1] = val_loss.item()

    # Save fold_index best model
    if val_acc > best_accuracy:
      torch.save(model.state_dict(), 'best_model_state_%s.bin' % fold_index)
      best_accuracy = val_acc
    # Update index for next EPOCH iteration
    EPOCH_index = EPOCH_index + 1

  # Store fold results
  globals()['train_fold_%s' % fold_index] = np.asmatrix(train_index)  
  y_review_texts, y_pred, y_pred_probs, y_test = get_predictions(

  globals()['results_fold_%s' % fold_index] = np.asmatrix(val_index)
  globals()['results_fold_%s' % fold_index] = np.vstack([globals()['results_fold_%s' % fold_index], y_review_texts])
  globals()['results_fold_%s' % fold_index] = np.vstack([globals()['results_fold_%s' % fold_index], y_pred.detach().numpy()])
  globals()['results_fold_%s' % fold_index] = np.vstack([globals()['results_fold_%s' % fold_index], y_test.detach().numpy()])
  globals()['results_fold_%s' % fold_index] = np.vstack([globals()['results_fold_%s' % fold_index], y_pred_probs.detach().numpy().T[1]])

  # Update index for next FOLD iteration
  fold_index = fold_index + 1

Asked By: roma



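You’re re-creating the optimizer on every fold but never the model, so the weights trained in one fold carry over into the next. With K folds you should have K models, each created from scratch at the start of its fold.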
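As a rough sketch of the "K folds, K models" idea, the loop below builds a new model at the top of every fold and collects one out-of-fold prediction per observation. `MajorityClassifier`, `k_fold_indices` and `oof_pred` are hypothetical names used only for illustration; in the question’s code the equivalent step would presumably be constructing the BERT classifier inside the `for train_index, val_index in kf.split(df)` loop, before the optimizer is created (the model-construction code is not shown in the question, so that part is an assumption).

```python
from collections import Counter


class MajorityClassifier:
    """Hypothetical stand-in for the BERT classifier (illustration only)."""

    def __init__(self):
        self.label = None  # nothing learned yet

    def fit(self, labels):
        # "Training": remember the most common label in the training set.
        self.label = Counter(labels).most_common(1)[0][0]

    def predict(self, n):
        return [self.label] * n


def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, val_idx); every index is validated exactly once."""
    folds = [list(range(i, n_samples, n_folds)) for i in range(n_folds)]
    for i, val_idx in enumerate(folds):
        train_idx = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train_idx, val_idx


labels = [0, 1, 1, 0, 1, 1, 0, 1, 1, 1]
n_folds = 5
models = []
oof_pred = [None] * len(labels)  # one out-of-fold prediction per observation

for train_idx, val_idx in k_fold_indices(len(labels), n_folds):
    model = MajorityClassifier()  # fresh model: nothing carries over between folds
    model.fit([labels[j] for j in train_idx])
    for j, pred in zip(val_idx, model.predict(len(val_idx))):
        oof_pred[j] = pred
    models.append(model)

print(len(models), oof_pred)
```

Because each model only ever sees its own fold’s training rows, every entry of `oof_pred` comes from a model that was never trained on that observation, which is what the later regressions need.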

Imagine you have 5 pieces of data: [1, 2, 3, 4, 5].

On the first fold you train on [1, 2, 3, 4] and test on [5].
On the second fold you train on [2, 3, 4, 5] and test on [1].
On the third fold you train on [3, 4, 5, 1] and test on [2].
On the fourth fold you train on [4, 5, 1, 2] and test on [3].
On the fifth fold you train on [5, 1, 2, 3] and test on [4].

By the time you get to the fifth fold, because you’re not resetting the model, it has already been trained on piece 4 in all four previous folds, so it’s going to perform very well on it.
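The five-piece example above can be replayed in a few lines to count how often a model that is never reset has already trained on each fold’s test piece. This is only bookkeeping over the folds listed above; `times_trained_on` and `leaks` are illustrative names.

```python
pieces = [1, 2, 3, 4, 5]

# Same folds as in the example: train on four pieces, test on the fifth.
folds = [
    ([1, 2, 3, 4], 5),
    ([2, 3, 4, 5], 1),
    ([3, 4, 5, 1], 2),
    ([4, 5, 1, 2], 3),
    ([5, 1, 2, 3], 4),
]

# Exposure counter for a single model that is shared across all folds.
times_trained_on = {p: 0 for p in pieces}
leaks = []

for fold_no, (train, test) in enumerate(folds, start=1):
    # How often has the shared model already trained on this fold's test piece?
    leaks.append((fold_no, test, times_trained_on[test]))
    for p in train:
        times_trained_on[p] += 1

for fold_no, test, seen in leaks:
    print(f"fold {fold_no}: test piece {test} already trained on {seen} time(s)")
```

The counts come out as 0, 1, 2, 3 and 4 for folds one to five, so the test piece of each later fold has had more prior exposure than the one before it.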

This is also why your accuracy keeps increasing with each fold: from the second fold onward, you’re leaking data to the model by testing on observations it was already trained on in earlier folds. The purpose of K-fold cross validation is to see how the model would perform on unseen data.
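To see the effect on the reported accuracy, here is a toy comparison, assuming a deliberately extreme model that memorizes every training example and labels that are pure noise. `Memorizer` and `run_cv` are hypothetical helpers, not part of the question’s code. A real network memorizes gradually rather than instantly, so the climb in the question is smoother than in this sketch.

```python
import random


class Memorizer:
    """Hypothetical stand-in for an over-parameterized model: it memorizes
    every (example id -> label) pair it is trained on."""

    def __init__(self):
        self.memory = {}

    def fit(self, ids, labels):
        for i in ids:
            self.memory[i] = labels[i]

    def accuracy(self, ids, labels):
        # Examples never seen in training get a default guess of 0.
        hits = sum(self.memory.get(i, 0) == labels[i] for i in ids)
        return hits / len(ids)


def run_cv(labels, folds, reset_each_fold):
    model = Memorizer()
    scores = []
    for i, val_ids in enumerate(folds):
        if reset_each_fold:
            model = Memorizer()  # discard everything learned in earlier folds
        train_ids = [j for k, fold in enumerate(folds) if k != i for j in fold]
        model.fit(train_ids, labels)
        scores.append(model.accuracy(val_ids, labels))
    return scores


rng = random.Random(0)
labels = [rng.randint(0, 1) for _ in range(100)]  # pure noise: nothing to learn
ids = list(range(100))
rng.shuffle(ids)
folds = [ids[i::5] for i in range(5)]

shared = run_cv(labels, folds, reset_each_fold=False)
fresh = run_cv(labels, folds, reset_each_fold=True)
print("shared model:", shared)
print("fresh models:", fresh)
```

With the shared model, folds two to five score a perfect 1.0 even though the labels are random, because their validation examples were already memorized during the first fold. With a fresh model per fold, each score is just the share of validation labels that happen to match the default guess.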

Answered By: Chrispresso