Cross-Validation - Testing Your Model on Data It Has Never Seen // Megha Bose

You have 1000 labeled examples and you want to choose between two model architectures. You set aside 100 examples as a validation set and train each model on the remaining 900. Model A gets 84% validation accuracy. Model B gets 81%. You pick Model A. But here is the uncomfortable truth: with only 100 validation examples, the standard error of each accuracy estimate is about $\sqrt{0.84 \cdot 0.16 / 100} \approx 3.7\%$. The difference between 84% and 81% is less than one standard error. You might just as well have flipped a coin.
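The arithmetic above can be checked in a few lines. This is a minimal sketch that assumes each validation prediction is an independent Bernoulli trial, so the accuracy has a binomial standard error; the function name `accuracy_standard_error` is chosen here for illustration.

```python
import math

def accuracy_standard_error(accuracy, n):
    """Binomial standard error of an accuracy measured on n independent examples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)

se_a = accuracy_standard_error(0.84, 100)  # Model A
se_b = accuracy_standard_error(0.81, 100)  # Model B
gap = 0.84 - 0.81

print(f"SE(A) = {se_a:.3f}, SE(B) = {se_b:.3f}, gap = {gap:.3f}")
print("gap smaller than one standard error:", gap < se_a)
```

Both standard errors come out larger than the 3-point gap, which is the sense in which the comparison is inconclusive.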

The fundamental problem is that a single holdout validation set gives a noisy estimate of a model’s true performance. The estimate depends heavily on which examples happened to end up in the validation set. If those 100 examples are unusually easy or unusually hard, the estimate is biased in that direction. With limited data, you cannot afford to dedicate a large chunk exclusively to validation - but a small validation set gives unreliable estimates. This tension is the problem that cross-validation resolves.

The key idea is to use all of the data for both training and validation, just not at the same time. Instead of committing to one fixed split, you perform multiple splits and average the results. Each example lands in the validation set exactly once. The resulting estimate has lower variance than the estimate from any single split and makes efficient use of every labeled example you have.

K-Fold Cross-Validation

K-fold CV works as follows. Randomly partition the training data into $K$ folds of (nearly) equal size. For each fold $k = 1, \ldots, K$:

1. Train the model on all folds except fold $k$ (that is, on $(K-1)/K$ of the data).
2. Evaluate the trained model on fold $k$.
3. Record the validation score for this fold.

After all $K$ iterations, you have $K$ validation scores. The cross-validation estimate is their average:

$$\text{CV score} = \frac{1}{K} \sum_{k=1}^{K} \text{score}_k$$

Every example appears in the validation set exactly once and in the training set $K-1$ times. No data is wasted. The estimate of model performance is an average over $K$ different data splits, which substantially reduces variance compared to a single split.
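The procedure can be sketched in plain Python. Everything here is illustrative rather than taken from a particular library: the toy threshold classifier, the synthetic one-dimensional data, and the helper names (`k_fold_indices`, `fit_threshold`, `cross_val_scores`) are assumptions made for the example.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k folds (sizes differ by at most one)."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[fold::k] for fold in range(k)]

def fit_threshold(xs, ys):
    """Toy 1-D classifier: threshold halfway between the two class means."""
    class0 = [x for x, y in zip(xs, ys) if y == 0]
    class1 = [x for x, y in zip(xs, ys) if y == 1]
    return (sum(class0) / len(class0) + sum(class1) / len(class1)) / 2

def accuracy(threshold, xs, ys):
    """Fraction of examples where 'x > threshold' agrees with the label."""
    hits = sum((x > threshold) == (y == 1) for x, y in zip(xs, ys))
    return hits / len(xs)

def cross_val_scores(xs, ys, k=5, seed=0):
    """Return the k per-fold validation accuracies."""
    scores = []
    for fold in k_fold_indices(len(xs), k, seed):
        held_out = set(fold)
        train = [i for i in range(len(xs)) if i not in held_out]
        model = fit_threshold([xs[i] for i in train], [ys[i] for i in train])
        scores.append(accuracy(model, [xs[i] for i in fold], [ys[i] for i in fold]))
    return scores

# Synthetic data (an assumption for the demo): class 0 centred at 0, class 1 at 2.
rng = random.Random(1)
ys = [i % 2 for i in range(200)]
xs = [rng.gauss(2.0 * y, 1.0) for y in ys]

scores = cross_val_scores(xs, ys, k=5)
cv_score = sum(scores) / len(scores)  # the CV score is the mean of the fold scores
print("fold scores:", scores)
print("CV score:", cv_score)
```

Dealing the shuffled indices round-robin (`indices[fold::k]`) is one simple way to get folds whose sizes differ by at most one; slicing contiguous blocks would work equally well.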

A concrete example with $K=5$. With 1000 examples, each fold contains 200 examples. You train 5 models, each on 800 examples (80% of the data), and validate each on a different set of 200. The five validation scores are averaged. If the scores are 83%, 86%, 82%, 85%, 84%, the CV estimate is 84% with a standard error of roughly $\sigma / \sqrt{5}$, where $\sigma$ is the sample standard deviation of the fold scores - in this case about $1.6\% / \sqrt{5} \approx 0.7\%$. Because the five training sets overlap, the fold scores are not independent, so treat this figure as a rough guide rather than an exact standard error. Even so, it is far more informative than a single 200-example holdout.
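The numbers in this example can be reproduced with the standard library. The sketch uses the sample standard deviation (denominator $K-1$), which is what gives the 1.6% figure; the variable names are illustrative.

```python
import math
from statistics import mean, stdev

fold_scores = [0.83, 0.86, 0.82, 0.85, 0.84]

cv_estimate = mean(fold_scores)
sigma = stdev(fold_scores)                            # sample standard deviation
standard_error = sigma / math.sqrt(len(fold_scores))  # treats folds as independent

print(f"CV estimate    = {cv_estimate:.3f}")     # about 0.840
print(f"sigma          = {sigma:.4f}")           # about 0.0158
print(f"standard error = {standard_error:.4f}")  # about 0.0071
```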

The Bias-Variance Tradeoff in K

The choice of $K$ involves a bias-variance tradeoff, just like the rest of machine learning.

Large $K$ (leave-one-out, $K = n$): Each training set has $n-1$ examples, almost the same size as the full dataset. The model trained in each fold is nearly identical to a model trained on all the data, so the performance estimate has low bias - it closely reflects how a model trained on the full dataset would perform. However, the $n$ validation scores are each based on a single example and are highly correlated (the $n$ training sets overlap almost entirely). This correlation means averaging them does not reduce variance as much as one might hope. Leave-one-out CV is also computationally expensive: you train $n$ separate models.

Small $K$ ($K = 2$): Each training set uses only half the data. A model trained on half the data may perform substantially worse than a model trained on all the data. The CV estimate is now pessimistically biased - it underestimates how well your model will perform when trained on the full training set. On the positive side, the two training sets do not overlap at all, so the two fold estimates are less correlated than in the large-$K$ case.

$K = 5$ or $K = 10$ hits a practical sweet spot. The training sets use 80-90% of the data (low bias), and they overlap less than in leave-one-out, so averaging the fold scores still reduces variance. In practice, $K = 10$ is the most common choice. $K = 5$ is preferred when computational cost is a concern.
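Two quantities make the tradeoff concrete: the fraction of data each model trains on, $(K-1)/K$, and the fraction of one training set that is shared with another fold's training set, $(K-2)/(K-1)$ (two training sets each leave out one fold, so they share the remaining $K-2$ folds). The sketch below simply tabulates both, assuming equal-sized folds and $n = 1000$ for the leave-one-out row.

```python
def training_fraction(k):
    """Fraction of the data each fold's model is trained on."""
    return (k - 1) / k

def training_overlap(k):
    """Fraction of one training set that also appears in another fold's training set."""
    return (k - 2) / (k - 1)

n = 1000  # dataset size used for the leave-one-out row
for label, k in [("K=2", 2), ("K=5", 5), ("K=10", 10), ("LOO", n)]:
    print(f"{label:>5}: trains on {training_fraction(k):.1%}, "
          f"overlap with another fold's training set {training_overlap(k):.1%}")
```

Larger $K$ pushes the training fraction toward 100% (less bias) and the overlap toward 100% as well (more correlated fold scores).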

An empirical study (Kohavi 1995) suggests that $K = 10$ often gives better estimates than leave-one-out in practice: despite the lower bias of LOO, its variance can be high enough to outweigh the gain.

What Cross-Validation Is For

Cross-validation is a tool for model selection and hyperparameter tuning. It is not a substitute for a held-out test set.

This distinction is critical. Suppose you use 5-fold CV to compare 20 hyperparameter settings and pick the best one. You then report the CV score of the winning configuration as your model’s performance. This is wrong. You have selected the hyperparameter that happened to do best on your CV splits. Out of 20 candidates, the winner has some lucky positive bias from the selection process. The reported CV score overestimates how the model will perform in deployment.
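A small simulation shows the size of this effect. It is a deliberately simplified sketch: all 20 configurations are assumed to have the same true accuracy (80%), and each CV score is modelled as an independent binomial draw over 200 validation predictions. Real configurations share data and are correlated, so the numbers are illustrative only.

```python
import random

def simulate_selection_bias(true_accuracy=0.80, n_configs=20,
                            n_validation=200, n_trials=100, seed=0):
    """Average score of the 'winning' configuration when all are equally good."""
    rng = random.Random(seed)
    winner_scores = []
    for _ in range(n_trials):
        scores = []
        for _ in range(n_configs):
            # Noisy validation score: count of correct predictions out of n_validation.
            correct = sum(rng.random() < true_accuracy for _ in range(n_validation))
            scores.append(correct / n_validation)
        winner_scores.append(max(scores))  # report the best-looking configuration
    return sum(winner_scores) / n_trials

average_winner = simulate_selection_bias()
print(f"true accuracy of every configuration: 0.800")
print(f"average score of the selected winner: {average_winner:.3f}")
```

The winner's average score sits above 0.80 even though no configuration is actually better than any other; that gap is pure selection bias.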

The correct workflow:

1. Keep a test set completely aside before you start.
2. Use cross-validation on the training data to select hyperparameters and compare architectures.
3. Retrain the final model (with the chosen hyperparameters) on all training data.
4. Evaluate once on the held-out test set to get your final performance estimate.

Cross-validation tells you which model configuration is best. The test set tells you how good “best” actually is.
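The four steps of the workflow can be sketched end to end with a toy model. The example assumes a one-dimensional ridge regression without intercept, whose single hyperparameter `lam` is tuned by 5-fold CV; the data, candidate values, and function names are all made up for illustration.

```python
import random

def fit_ridge(xs, ys, lam):
    """1-D ridge regression without intercept: w = sum(x*y) / (sum(x*x) + lam)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def mse(w, xs, ys):
    """Mean squared error of the prediction w * x."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)

def cv_mse(xs, ys, lam, k=5):
    """Average validation MSE over k folds (data assumed to be in random order)."""
    n = len(xs)
    total = 0.0
    for fold in range(k):
        val = set(range(fold, n, k))
        train = [i for i in range(n) if i not in val]
        w = fit_ridge([xs[i] for i in train], [ys[i] for i in train], lam)
        total += mse(w, [xs[i] for i in val], [ys[i] for i in val])
    return total / k

# Synthetic data (an assumption for the demo): y = 2x + noise.
rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(120)]
ys = [2.0 * x + rng.gauss(0, 0.5) for x in xs]

# 1. Set the test set aside before doing anything else.
train_x, test_x = xs[:100], xs[100:]
train_y, test_y = ys[:100], ys[100:]

# 2. Cross-validate on the training data only to choose the hyperparameter.
candidates = [0.0, 0.1, 1.0, 10.0, 100.0]
best_lam = min(candidates, key=lambda lam: cv_mse(train_x, train_y, lam))

# 3. Retrain on all training data with the chosen hyperparameter.
final_w = fit_ridge(train_x, train_y, best_lam)

# 4. Evaluate exactly once on the held-out test set.
test_mse = mse(final_w, test_x, test_y)
print("chosen lam:", best_lam, "test MSE:", test_mse)
```

The test examples never enter `cv_mse`, so `test_mse` is not contaminated by the hyperparameter search.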

Stratified K-Fold

For classification with imbalanced classes, a random partition into $K$ folds may produce folds with very different class proportions. A fold that happens to contain no positive examples gives a degenerate validation score and can cause numerical issues (e.g., undefined AUC if there are no positives).

Stratified K-fold preserves the class proportions in each fold. If the full dataset is 90% negative and 10% positive, each fold is also approximately 90/10. This ensures that each validation estimate is computed on a representative sample of the label distribution, which makes the CV estimate more stable.
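One simple way to build stratified folds is to shuffle each class separately and deal its members round-robin across the folds. The sketch below does that; the function name `stratified_k_fold_indices` is chosen for this example, and the round-robin dealing is a simplification (when class sizes are not multiples of $K$, the first folds receive the extra examples).

```python
import random
from collections import defaultdict

def stratified_k_fold_indices(labels, k, seed=0):
    """Split indices into k folds that each keep roughly the overall class mix."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    folds = [[] for _ in range(k)]
    for members in by_class.values():
        rng.shuffle(members)
        # Deal this class's examples across the folds one at a time.
        for position, i in enumerate(members):
            folds[position % k].append(i)
    return folds

# 90% negative, 10% positive, as in the example above.
labels = [1] * 10 + [0] * 90
folds = stratified_k_fold_indices(labels, k=5)
for fold in folds:
    positives = sum(labels[i] for i in fold)
    print(f"fold size {len(fold)}, positives {positives}")
```

With 10 positives and 5 folds, every fold ends up with exactly 2 positives and 18 negatives, so no fold is left without positive examples.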

For regression, stratification can be approximated by binning the target variable and stratifying on the bins.

Time-Series CV: Forward Chaining

Standard K-fold CV assumes examples are exchangeable - that is, their order doesn’t matter. For time-series data, this assumption is violated. Randomly shuffling and splitting a time series means the model is often trained on future data and validated on past data, which is temporal leakage.

The correct approach for time-series is forward chaining (also called expanding window or walk-forward validation). The idea is always to train on the past and validate on the future:

Fold 1: train on months 1-6, validate on month 7.
Fold 2: train on months 1-7, validate on month 8.
Fold 3: train on months 1-8, validate on month 9.
…

Each subsequent training set is strictly in the past relative to its validation set. No temporal leakage. The average performance across folds estimates how the model performs when deployed to predict a future period it hasn’t seen.
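The fold layout above can be generated programmatically. This is a minimal sketch that works on period numbers (months 1, 2, ...) rather than real data; `forward_chaining_splits`, `min_train`, and `horizon` are names introduced for the example.

```python
def forward_chaining_splits(n_periods, min_train, horizon=1):
    """Return (train_periods, validation_periods) pairs with an expanding training window."""
    splits = []
    train_end = min_train
    while train_end + horizon <= n_periods:
        train = list(range(1, train_end + 1))                           # everything so far
        validation = list(range(train_end + 1, train_end + horizon + 1))  # the next period(s)
        splits.append((train, validation))
        train_end += horizon
    return splits

splits = forward_chaining_splits(n_periods=9, min_train=6)
for number, (train, validation) in enumerate(splits, start=1):
    print(f"Fold {number}: train on months {train[0]}-{train[-1]}, validate on {validation}")
```

With 9 months and a 6-month minimum training window this reproduces folds 1 to 3 listed above.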

An alternative is a sliding window: use a fixed-size training window instead of an ever-expanding one. This is appropriate when older data is less relevant (e.g., fashion trends from 10 years ago may not be informative today).
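A sliding-window variant differs only in that the start of the training window moves forward too. The sketch below again works on period numbers; `sliding_window_splits` and its parameters are illustrative names, and the 6-month window is an arbitrary choice.

```python
def sliding_window_splits(n_periods, window, horizon=1):
    """Return (train_periods, validation_periods) pairs with a fixed-size training window."""
    splits = []
    start = 1
    while start + window + horizon - 1 <= n_periods:
        train = list(range(start, start + window))                         # fixed length
        validation = list(range(start + window, start + window + horizon))  # right after it
        splits.append((train, validation))
        start += horizon
    return splits

splits = sliding_window_splits(n_periods=9, window=6)
for number, (train, validation) in enumerate(splits, start=1):
    print(f"Fold {number}: train on months {train[0]}-{train[-1]}, validate on {validation}")
```

Every training set has the same length, so older months drop out as newer ones enter.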

Nested Cross-Validation

Here is a subtle problem. Suppose you use 10-fold CV to tune hyperparameters and pick the best configuration. You then report the CV score of that best configuration. As noted above, this is optimistically biased because the hyperparameter selection used the same CV splits that generated the reported score.

Nested cross-validation separates these two concerns by using two loops:

Outer loop (for performance estimation): Split data into $K_\text{out}$ folds. For each outer fold, hold out one fold as the outer validation set. On the remaining data, run the inner loop.

Inner loop (for hyperparameter selection): On the training portion of the current outer fold, run $K_\text{in}$-fold CV over all hyperparameter settings. Pick the best hyperparameter setting according to inner CV. Retrain on the entire training portion of the outer fold with those hyperparameters. Evaluate on the outer validation fold.

After all outer folds complete, average the outer validation scores. This is an approximately unbiased estimate of the performance of the model selection procedure as a whole, not of any single model.
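The two loops can be sketched compactly with a toy model. The example assumes a one-dimensional ridge regression without intercept, tuned over a handful of `lam` values; the data, the candidate list, and all function names are illustrative assumptions.

```python
import random

def fit_ridge(xs, ys, lam):
    """1-D ridge regression without intercept: w = sum(x*y) / (sum(x*x) + lam)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def mse(w, xs, ys):
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)

def split(indices, k):
    """Return k (train, validation) pairs of index lists from an already shuffled list."""
    pairs = []
    for fold in range(k):
        validation = indices[fold::k]
        held_out = set(validation)
        pairs.append(([i for i in indices if i not in held_out], validation))
    return pairs

def fit_and_score(xs, ys, train, validation, lam):
    w = fit_ridge([xs[i] for i in train], [ys[i] for i in train], lam)
    return mse(w, [xs[i] for i in validation], [ys[i] for i in validation])

def inner_cv_mse(xs, ys, indices, lam, k_in):
    """Inner loop: K_in-fold CV error of one hyperparameter setting."""
    return sum(fit_and_score(xs, ys, tr, va, lam) for tr, va in split(indices, k_in)) / k_in

def nested_cv(xs, ys, candidates, k_out=5, k_in=3, seed=0):
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    outer_errors = []
    for outer_train, outer_val in split(indices, k_out):
        # Select the hyperparameter using the outer training portion only.
        best_lam = min(candidates,
                       key=lambda lam: inner_cv_mse(xs, ys, outer_train, lam, k_in))
        # Refit on the whole outer training portion, score on the outer fold.
        outer_errors.append(fit_and_score(xs, ys, outer_train, outer_val, best_lam))
    return sum(outer_errors) / k_out, outer_errors

# Synthetic data (an assumption for the demo): y = 2x + noise.
rng = random.Random(1)
xs = [rng.uniform(-1, 1) for _ in range(60)]
ys = [2.0 * x + rng.gauss(0, 0.5) for x in xs]

nested_score, outer_errors = nested_cv(xs, ys, candidates=[0.0, 0.1, 1.0, 10.0])
print("outer-fold MSEs:", outer_errors)
print("nested CV estimate:", nested_score)
```

Different outer folds may select different values of `lam`, which is why the result describes the selection procedure rather than one fixed model.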

Nested CV is expensive - it trains $K_\text{out} \times K_\text{in} \times H$ models in the inner loops, where $H$ is the number of hyperparameter settings, plus one refit per outer fold. It is most useful when data is scarce and you need both a reliable performance estimate and principled hyperparameter selection. In practice, with large datasets, a simple train/val/test split is usually sufficient and nested CV is unnecessary.
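A quick count makes the cost tangible. The sketch below tallies the inner-loop fits and adds the one refit per outer fold that the procedure above calls for; the specific values (5 outer folds, 3 inner folds, 20 settings) are arbitrary examples.

```python
def nested_cv_fit_count(k_out, k_in, n_settings):
    """Number of model fits in nested CV, including one refit per outer fold."""
    inner_fits = k_out * k_in * n_settings  # hyperparameter search
    refits = k_out                          # retrain the winner in each outer fold
    return inner_fits + refits

def plain_cv_fit_count(k, n_settings):
    """Number of model fits when tuning with ordinary K-fold CV."""
    return k * n_settings

print(nested_cv_fit_count(k_out=5, k_in=3, n_settings=20))  # 305
print(plain_cv_fit_count(k=5, n_settings=20))               # 100
```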

Repeated Cross-Validation

Even K-fold CV has variance from the random choice of how the folds are formed. A different random partition might give a different average score. Repeated K-fold CV runs K-fold multiple times, each time with a different random partition of the data, and averages all the results.

For example, 10 repetitions of 5-fold CV train $10 \times 5 = 50$ models and average 50 validation scores. The resulting estimate depends much less on any one random partition than a single run of 5-fold CV does. The tradeoff is computational: 10 times as many model fits.
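Generating the splits for repeated K-fold only requires reshuffling before each repetition. The sketch below is an illustration with made-up names (`repeated_k_fold_splits`); it produces the 50 train/validation splits from the example and checks how often each example is validated.

```python
import random
from collections import Counter

def repeated_k_fold_splits(n, k, repeats, seed=0):
    """Return repeats * k (train, validation) index pairs, reshuffling each repetition."""
    rng = random.Random(seed)
    splits = []
    for _ in range(repeats):
        indices = list(range(n))
        rng.shuffle(indices)  # a fresh random partition for this repetition
        for fold in range(k):
            validation = indices[fold::k]
            held_out = set(validation)
            train = [i for i in indices if i not in held_out]
            splits.append((train, validation))
    return splits

splits = repeated_k_fold_splits(n=100, k=5, repeats=10)
times_validated = Counter(i for _, validation in splits for i in validation)

print("number of model fits:", len(splits))
print("times each example is validated:", set(times_validated.values()))
```

Each example is validated once per repetition, so here every example appears in a validation fold 10 times.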

Repeated CV is most useful when your dataset is small, each fold estimate is noisy, and you want a stable estimate for comparing two similar models. For large datasets, a single run of K-fold CV is already stable enough.

Computational Cost

K-fold CV requires $K$ model fits, each on $(K-1)/K$ of the data, so it costs roughly $K$ times as much as training a single model. For models that are cheap to train (linear regression, shallow trees), this is trivial. For large neural networks, K-fold CV may be prohibitive: training a large language model 10 times is not feasible.

In practice, the computational budget determines which CV strategy is viable:

| Budget | Strategy |
| --- | --- |
| Abundant | 10-fold CV or nested CV |
| Moderate | 5-fold CV |
| Tight | 3-fold CV or single validation split |
| Very tight | Single train/val split, make it large |

For neural networks, a common compromise is to use a single validation set but report the average over multiple independent runs with different random seeds - this captures some of the variance that K-fold CV would reveal, at lower cost.

Summary

| Method | Bias | Variance | Cost | When to use |
| --- | --- | --- | --- | --- |
| Single holdout | Medium | High | 1x | Large datasets |
| K-fold (K=5 or 10) | Low | Medium | Kx | Standard choice |
| Leave-one-out | Very low | High | nx | Very small datasets |
| Stratified K-fold | Low | Medium | Kx | Imbalanced classification |
| Time-series forward chain | Low | Medium | Kx | Temporal data |
| Nested CV | Very low | Low | $K_\text{out} \times K_\text{in} \times H$x | Small data, rigorous evaluation |
