

Suppose you study for an exam using a set of practice problems, and the night before the test you check your score by redoing the same problems. You get 98%. Should you feel confident? No - because you already know those problems. You have memorized them, and the score tells you nothing about how you will perform on problems you have never seen. This is exactly the situation a machine learning model faces when you evaluate it on its own training data. The model has, in some sense, “seen” every training example and adjusted its parameters to fit them. A training accuracy of 100% is therefore not evidence of a good model; it is equally consistent with a model that has memorized rather than generalized.

The only reliable way to measure whether a model has learned something real is to evaluate it on data it has never seen. This requires setting aside a portion of your data before training begins and never touching it during model development. In practice, you need not one but two held-out sets: a validation set for the many decisions you make during development (which architecture, which learning rate, when to stop training), and a test set for one final, unbiased measure of performance at the very end. These three sets - train, validation, test - serve fundamentally different purposes, and mixing them up is one of the most common and consequential mistakes in applied machine learning.

Understanding this separation is not just a procedural detail. It sits at the heart of what it means to claim that a model generalizes. The train-validation-test framework is how you distinguish “this model works” from “this model works on the problems I have been solving.” Every downstream decision about model selection, hyperparameter tuning, and production deployment depends on getting this right.


Why Training Error Is Meaningless

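A model with enough capacity can fit any finite dataset with zero error (provided no two identical inputs carry different labels). A degree-$n$ polynomial can interpolate $n+1$ points with distinct $x$-values exactly. A neural network with enough parameters can memorize every training example. When this happens, training error is 0 whether or not the model is any good on new data, so the number carries no information about generalization.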
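To make the interpolation claim concrete, here is a minimal sketch (not from the original post) that passes the unique degree-9 polynomial through 10 noisy samples using Lagrange interpolation. The underlying function `true_fn`, the noise level, and all helper names are assumptions invented for the demo.

```python
import random

def lagrange_predict(xs, ys, x):
    """Evaluate the unique polynomial through the points (xs, ys) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        basis = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                basis *= (x - xj) / (xi - xj)
        total += yi * basis
    return total

def mse(xs, ys, predict):
    """Mean squared error of predict over the points (xs, ys)."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def true_fn(x):
    # Hypothetical ground truth, chosen only for illustration.
    return 2.0 * x + 1.0

rng = random.Random(0)
train_x = [float(i) for i in range(10)]
train_y = [true_fn(x) + rng.gauss(0, 1) for x in train_x]
test_x = [i + 0.5 for i in range(9)]  # unseen points between the training points
test_y = [true_fn(x) + rng.gauss(0, 1) for x in test_x]

def model(x):
    return lagrange_predict(train_x, train_y, x)

train_mse = mse(train_x, train_y, model)
test_mse = mse(test_x, test_y, model)
print(f"train MSE: {train_mse:.6f}")
print(f"test MSE:  {test_mse:.6f}")
```

The polynomial reproduces every training label, noise included, so the training error is zero by construction. The test error is the only number here that says anything about how the fit behaves between the points it memorized.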

Even with regularization, the training error is an optimistically biased estimate of generalization error. The model’s parameters were chosen to minimize training error, so the training set has been “used up” in a statistical sense. For a model whose parameters are fit to the training set, the expected training error is typically lower than the expected test error. The gap between them is the generalization gap, and understanding it (via bias and variance analysis) is the subject of the overfitting literature.

The practical consequence: never report training error as your model’s performance. The only number that matters for deployment is performance on data the model has not seen.


The Three-Way Split

The standard approach partitions available data into three non-overlapping sets:

Training set - used to fit model parameters. Gradient descent, maximum likelihood, the perceptron update - all of these run on training data. This is typically the largest set (60-80% of total data).

Validation set - used to make decisions during development. Which of your five candidate architectures is best? Should you use a learning rate of 0.01 or 0.001? Should you train for 50 or 100 epochs? All of these decisions are made by evaluating on the validation set. The validation set is not used to compute gradients, but it is used by you, the practitioner, to make choices. Typical size: 10-20%.

Test set - used exactly once, at the very end, to report final performance. It is the unbiased estimate of how your model will perform in deployment. Typical size: 10-20%.

A common split for moderately sized datasets is 70/15/15 or 80/10/10. For large datasets (millions of examples), you need far fewer than 10% for validation and test - 1% of 10 million examples is still 100,000 examples, more than enough for a reliable estimate. Conversely, for very small datasets (a few hundred examples), you may need to put 80-90% into training and use cross-validation instead of a fixed validation set (see the Cross-Validation - Testing Your Model on Data It Has Never Seen post).
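As a rough illustration of the partition described in this section, the sketch below shuffles once and cuts the data into three non-overlapping lists. The function name `three_way_split` and its default fractions are assumptions for this example, and the random shuffle is only appropriate when examples are independent (not for time series).

```python
import random

def three_way_split(examples, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle once, then cut into non-overlapping train/val/test lists."""
    if not 0 < val_frac + test_frac < 1:
        raise ValueError("val_frac + test_frac must be between 0 and 1")
    shuffled = list(examples)              # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split reproducible
    n = len(shuffled)
    n_test = round(n * test_frac)
    n_val = round(n * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

data = list(range(1000))
train, val, test = three_way_split(data)
print(len(train), len(val), len(test))  # 700 150 150
```

Fixing the seed matters more than it looks: if the split changes between runs, examples drift between train and test over the life of a project, which quietly contaminates the test set.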


The Contamination Problem

Here is the subtle issue that trips up even experienced practitioners. The test set gives an unbiased estimate of performance only if no decision during model development was informed by the test set. The moment you look at test set performance and use it to decide anything - which model to deploy, whether to collect more data, whether to add a regularizer - the test set is no longer unbiased. It has become a second validation set.

This happens more easily than you might think. A team trains 20 models, evaluates them on the test set to pick the best, and reports that model’s test accuracy. But they have effectively used the test set for model selection. The reported accuracy is optimistically biased, because out of 20 models, the one that happened to do well on the test set was selected - even if that advantage is partly due to chance.

The rule is stark: the test set is touched exactly once. Before that moment, it does not exist as far as model development is concerned. If you have already used your test set to make any decision, the remedy is to acquire more data, retrain, and establish a new test set.
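The 20-model scenario can be simulated directly. In this sketch (an illustration under made-up assumptions, not a result from the post) every model has the same true accuracy of 70%, so any difference between them on the test set is pure noise. We pick the best test score, then re-evaluate on fresh data, and average over many repetitions.

```python
import random

def simulated_accuracy(rng, true_acc, n_examples):
    """Accuracy of a model whose predictions are each correct with probability true_acc."""
    correct = sum(rng.random() < true_acc for _ in range(n_examples))
    return correct / n_examples

def average(values):
    return sum(values) / len(values)

def selection_bias_experiment(n_trials=100, n_models=20, n_test=100,
                              true_acc=0.7, seed=0):
    """Compare the score of the 'best on test' model with its score on fresh data."""
    rng = random.Random(seed)
    selected_scores, fresh_scores = [], []
    for _ in range(n_trials):
        # All models are equally good; test-set differences are chance alone.
        test_scores = [simulated_accuracy(rng, true_acc, n_test)
                       for _ in range(n_models)]
        selected_scores.append(max(test_scores))  # pick the winner on the test set
        # The winner has the same true accuracy, so fresh data is a new draw.
        fresh_scores.append(simulated_accuracy(rng, true_acc, n_test))
    return average(selected_scores), average(fresh_scores)

reported, honest = selection_bias_experiment()
print(f"reported (selected on test set): {reported:.3f}")
print(f"same model on fresh data:        {honest:.3f}")
```

The reported number sits above the honest one even though no model is actually better than any other. The size of the inflation grows with the number of models compared and shrinks with the size of the test set.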


Data Leakage

Data leakage occurs when information from the validation or test set bleeds into the training process, giving the model an unfair advantage it will not have at deployment. Leakage inflates apparent performance and leads to models that fail in production.

Normalization before splitting. A common mistake: compute the mean and standard deviation of each feature across the entire dataset, then split into train/val/test. Now the training set’s normalized features were computed using statistics from the validation and test examples. The model has seen, in aggregate, what the test distribution looks like. The correct procedure: fit the normalization (and any preprocessing - PCA, imputation, scaling) on the training set only, then apply that same transformation to the validation and test sets.
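A minimal sketch of the correct order of operations for a single feature, using only the standard library. The helper names are hypothetical; in scikit-learn the same pattern is usually written by calling `fit` on a `StandardScaler` with the training data and `transform` on the other sets.

```python
import statistics

def fit_standardizer(train_values):
    """Compute mean and standard deviation from the training values only."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values)
    return mean, (std if std > 0 else 1.0)  # guard against constant features

def apply_standardizer(values, mean, std):
    """Apply a previously fitted transformation; never refit here."""
    return [(v - mean) / std for v in values]

# Toy data, invented for illustration.
train = [1.0, 2.0, 3.0, 4.0, 5.0]
val = [6.0, 7.0]
test = [8.0, 9.0]

mean, std = fit_standardizer(train)             # statistics come from train alone
train_z = apply_standardizer(train, mean, std)
val_z = apply_standardizer(val, mean, std)      # reuse the training statistics
test_z = apply_standardizer(test, mean, std)
print(mean, round(std, 4))
```

Note that the validation and test features will generally not have mean 0 and variance 1 after this step. That is expected: they are being measured against the training distribution, which is exactly what happens at deployment.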

Feature selection on all data. Running a feature importance analysis or correlation filter on the full dataset before splitting means the selected features were chosen using validation and test set labels. The selected features then look more predictive on the held-out sets than they really are. Always do feature selection inside the training fold.

Target encoding before splitting. Encoding categorical features using statistics of the target variable (e.g., replacing each category with its mean target value) must be computed only on training data. Computing it on the full dataset leaks target information from test examples into the training pipeline.
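The same fit-on-train, apply-elsewhere pattern works for target encoding. The sketch below is one simple way to do it; the function names and the choice to fall back to the training-set global mean for unseen categories are assumptions for this example.

```python
from collections import defaultdict

def fit_target_encoding(categories, targets):
    """Learn category -> mean target from training rows only."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    mapping = {c: sums[c] / counts[c] for c in sums}
    global_mean = sum(targets) / len(targets)
    return mapping, global_mean

def apply_target_encoding(categories, mapping, global_mean):
    """Encode any split with the training mapping; unseen categories get the global mean."""
    return [mapping.get(c, global_mean) for c in categories]

# Toy data, invented for illustration.
train_cat = ["a", "a", "b", "b", "b"]
train_y = [1, 0, 1, 1, 0]
test_cat = ["a", "b", "c"]  # "c" never appears in training

mapping, global_mean = fit_target_encoding(train_cat, train_y)
test_encoded = apply_target_encoding(test_cat, mapping, global_mean)
print(test_encoded)
```

Test labels are never read at any point. A further refinement, not shown here, is to compute the encoding for the training rows themselves out-of-fold, since each training row's own label otherwise contributes to its encoded feature.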

Temporal leakage. Using future information to predict the past. If you are predicting whether a customer will churn this month, any features derived from data after the current month are leaking. This also applies to the split itself: shuffling a time series and splitting randomly means your model trains on data from the future relative to some of its validation examples. Time series data requires chronological splitting.

The general principle: the test set (and validation set, during preprocessing) should be treated as if it does not exist when computing any statistics or selecting any transformations used in the training pipeline.


Stratified Splits

When class frequencies are unequal - say, 95% negative and 5% positive in a fraud detection problem - a random split can leave the validation set with very few positive examples, or, in a small dataset, none at all. The validation set then has a different class distribution than the training set, making the validation estimate unreliable.

Stratified splitting preserves the class proportions in each split. If the full dataset is 95/5, each of train, validation, and test is also 95/5. This ensures that each split is a representative sample of the full distribution.

In scikit-learn, train_test_split accepts a stratify parameter. For multi-label problems, where each example can carry several labels at once, stratification becomes more complex, but the principle is the same: each split should reflect the full label distribution.
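To show what stratification does under the hood, here is a small standard-library sketch for the single-label case. It is not scikit-learn's implementation; the function name `stratified_split` and its parameters are assumptions for this example. The idea is simply to split each class separately and then combine.

```python
import random
from collections import defaultdict

def stratified_split(labels, holdout_frac=0.2, seed=0):
    """Return (train_idx, holdout_idx) with class proportions approximately preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, holdout_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                       # shuffle within each class
        n_holdout = round(len(indices) * holdout_frac)
        holdout_idx.extend(indices[:n_holdout])
        train_idx.extend(indices[n_holdout:])
    return sorted(train_idx), sorted(holdout_idx)

def positive_rate(labels, indices):
    return sum(labels[i] for i in indices) / len(indices)

# Toy 95/5 dataset, invented for illustration: 50 positives, 950 negatives.
labels = [1] * 50 + [0] * 950
train_idx, holdout_idx = stratified_split(labels)
print(positive_rate(labels, train_idx), positive_rate(labels, holdout_idx))  # 0.05 0.05
```

Calling the function a second time on the training indices gives a stratified validation set as well. With very rare classes, rounding means the proportions are only approximately preserved.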


Temporal Data: Never Split Randomly

For time-series data, random splitting is wrong by construction. Suppose you are predicting stock prices and you randomly assign 80% of days to training and 20% to testing. Your training set includes days from 2023, and your test set includes days from 2019. You are training on the future and testing on the past - your model can implicitly learn from post-test information.

The correct approach is chronological splitting:

  • Train on the earliest portion of the data.
  • Validate on the next portion.
  • Test on the most recent portion.

This respects the causal structure: we only use past information to predict the future. It also gives a realistic measure of deployment performance, where the model always predicts events that occur after its training cutoff.
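The three steps above can be sketched as a sort followed by two cuts. The record format (a tuple whose first element is an ISO date string) and the function name are assumptions for this example.

```python
def chronological_split(records, val_frac=0.15, test_frac=0.15,
                        key=lambda record: record[0]):
    """Sort by time, then give the oldest records to train and the newest to test."""
    ordered = sorted(records, key=key)
    n = len(ordered)
    n_test = round(n * test_frac)
    n_val = round(n * val_frac)
    train_end = n - n_val - n_test
    train = ordered[:train_end]
    val = ordered[train_end:n - n_test]
    test = ordered[n - n_test:]
    return train, val, test

# Hypothetical records: (ISO date, value), deliberately given out of order.
# ISO-formatted dates sort chronologically when compared as strings.
records = [(f"2023-01-{day:02d}", float(day)) for day in range(20, 0, -1)]
train, val, test = chronological_split(records)
print(train[-1][0], val[0][0], val[-1][0], test[0][0])
```

There is no shuffle and no seed: the split is fully determined by the timestamps. Every validation record is later than every training record, and every test record is later than both.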

For temporal data, there is an additional complication: the statistical properties of the data may drift over time (concept drift). A model trained on data from 2020 may degrade by 2024 because the underlying distribution has shifted. The chronological split naturally tests for this, whereas a random split would hide it.


The Variance of a Single Test Estimate

A single test set estimate has variance. With 1000 test examples and an accuracy of 90%, the standard error is $\sqrt{0.9 \cdot 0.1 / 1000} \approx 0.0095$, giving a 95% confidence interval of roughly $[88.1\%, 91.9\%]$. With only 100 test examples, the confidence interval widens to $[84.1\%, 95.9\%]$, which is almost useless for comparing models.
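The arithmetic above is easy to wrap in a helper for checking your own numbers. This sketch uses the same normal approximation as the text (accuracy plus or minus 1.96 standard errors); the function name is an assumption for this example.

```python
import math

def accuracy_confidence_interval(accuracy, n_examples, z=1.96):
    """Normal-approximation interval for an accuracy measured on n_examples."""
    standard_error = math.sqrt(accuracy * (1 - accuracy) / n_examples)
    return accuracy - z * standard_error, accuracy + z * standard_error

for n in (100, 1000, 10000):
    low, high = accuracy_confidence_interval(0.9, n)
    print(f"n={n:>5}: [{low:.3f}, {high:.3f}]")
# n=  100: [0.841, 0.959]
# n= 1000: [0.881, 0.919]
# n=10000: [0.894, 0.906]
```

The interval width shrinks with the square root of the test set size, so halving it takes four times as many examples. The normal approximation becomes unreliable for very small test sets or accuracies close to 0% or 100%, where an alternative such as the Wilson interval is usually a safer choice.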

For model selection on small datasets, a single validation set is too noisy to reliably identify the best model. This is the motivation for cross-validation: instead of a single validation split, rotate through multiple splits and average the results. Cross-validation gives a lower-variance estimate of model performance and makes better use of limited data for model selection. The tradeoff and mechanics are covered in the next post.

The key insight here: the test set is for one final estimate, not for iterative improvement. Its variance is the fundamental limit on how precisely you can measure your model’s performance, and you should factor this into how you interpret reported numbers.


| Set | Purpose | Size (typical) | Seen during training? |
|---|---|---|---|
| Train | Fit model parameters | 60-80% | Yes |
| Validation | Tune hyperparameters, select architecture | 10-20% | No (but informs decisions) |
| Test | Final unbiased performance estimate | 10-20% | No |
