Need for both CV and train-test-split to assess a model's performance


This is a theoretical question.

Suppose I am working on a supervised learning model to predict employee attrition from a mix of categorical and numerical features. I have a small dataset of 500 employees. I want to try a few models, such as a Decision Tree, KNN, a Support Vector Machine, a Random Forest and XGBoost, and compare their accuracy. I use cross-validation (CV) to estimate the average performance of each model on unseen data (which is what CV is for, by definition).
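The comparison described above can be sketched with scikit-learn. This is a minimal sketch, not the asker's actual code: the data is synthetic (`make_classification`) standing in for the 500-employee attrition table, and XGBoost is omitted because it lives in a separate package.

```python
# Hedged sketch: compare several classifiers with 5-fold CV.
# Synthetic data stands in for the real attrition dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

models = {
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "knn": KNeighborsClassifier(),
    "svm": SVC(),
    "random_forest": RandomForestClassifier(random_state=0),
}

cv_results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    # Report mean AND standard deviation: with only 500 rows,
    # the spread across folds matters as much as the average.
    cv_results[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Note that this only ranks the models; the question of whether a separate hold-out test set is still needed on top of it is exactly what is being asked below.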

Question: In this scenario, is it still relevant that I use train-test-split to create a hold-out dataset to test my model afterwards? Why so?

Follow-up question: if it is relevant to use train-test-split, should I run CV after the split, using the train set only, or run the CV with the entire dataset and then split and test?

Thank you

I've seen different tutorials from what I think are good sources online, and they use both: a train-test split, then CV on the train set only. In my opinion, using the complete dataset for CV would give even better results.

Asked By: brauliopf



It is recommended to split your initial dataset in 3 parts:

  1. Train
  2. Validation (to tune hyperparameters)
  3. Test (for the final evaluation)

Even with a small dataset, I would suggest setting aside part of the data as a test set (the same test set for every algorithm). Use all remaining samples for k-fold CV for every algorithm (and pay close attention to the standard deviation of your metric across folds, not just its mean). After finding the best hyperparameters and choosing the best model on the test set, retrain that model on the whole dataset at once (train + validation + test); that will be your final model for production.
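The workflow above can be sketched with scikit-learn. This is a minimal sketch under stated assumptions: synthetic data stands in for the attrition table, a Random Forest stands in for whichever model wins the comparison, and the hyperparameter grid is illustrative only.

```python
# Hedged sketch of the answer's workflow:
# 1) hold out a test set, 2) k-fold CV on the rest to tune,
# 3) one final evaluation on the test set, 4) refit on all data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# 1. Hold out a common test set; stratify because attrition
#    labels are usually imbalanced.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# 2. CV on the remaining 400 rows plays the "validation" role:
#    each fold takes a turn as the validation set while tuning.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"max_depth": [3, 5, None]},  # illustrative grid
    cv=5,
    scoring="accuracy",
)
search.fit(X_trainval, y_trainval)

# 3. One estimate of generalization error on data that was never
#    touched during model selection.
test_acc = accuracy_score(y_test, search.predict(X_test))
print(f"best params: {search.best_params_}, test accuracy: {test_acc:.3f}")

# 4. Refit the chosen configuration on ALL the data (train + val + test)
#    to get the final production model.
final_model = RandomForestClassifier(
    random_state=0, **search.best_params_
).fit(X, y)
```

The key point the code makes concrete: the test set is consulted exactly once, after all tuning, so the accuracy it reports is not inflated by the selection process.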

Answered By: anon1453092865