Interactive prototype

Train / Val / Test Split

Understand how data is split for training, tuning, and final evaluation while avoiding leakage.

Structured teaching notes

Connect the interaction to the core idea.

These notes are written to sit below the interactive prototype, preserve the same teaching flow, and help the learner name what the visualization is showing.

Background

A train / val / test split is not only about percentages; it is about assigning the right job to each subset and keeping the target honest. The prototype makes this unusually clear by showing that a split can produce a flattering dev score while the model is still being optimized for the wrong data distribution. The training set is where the model learns its parameters, but the training distribution alone does not define the final target. The validation (dev) set is where you compare ideas, tune hyperparameters, and decide which direction is actually better. The test set is the final honest check and should stay closed until the end.

In this prototype, the most important lesson is distribution matching: dev and test should reflect the future data you care about, even if that makes the intermediate score look worse. A high number only matters if it is both relevant to future users and still trustworthy.

Important formulas
Train / validation / test = three distinct roles, not three interchangeable copies of the same job

The full dataset is partitioned into three disjoint subsets: the training set fits the parameters, the validation set guides model selection, and the test set provides the final estimate.
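The partition described above can be sketched in a few lines. This is a minimal illustration, not the prototype's own code: the function name `three_way_split`, the 70 / 15 / 15 fractions, and the fixed seed are all assumptions. It also assumes the pooled data already reflects the target distribution; a purely random split does nothing to fix a mismatch between the pool and future data.

```python
import random


def three_way_split(items, val_fraction=0.15, test_fraction=0.15, seed=0):
    """Shuffle once, then cut into disjoint train / val / test lists."""
    if not 0 < val_fraction + test_fraction < 1:
        raise ValueError("val_fraction + test_fraction must be between 0 and 1")

    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible

    n_test = round(len(shuffled) * test_fraction)
    n_val = round(len(shuffled) * val_fraction)

    test = shuffled[:n_test]                  # final check, opened once at the end
    val = shuffled[n_test:n_test + n_val]     # model selection and tuning
    train = shuffled[n_test + n_val:]         # parameter fitting
    return train, val, test


train, val, test = three_way_split(range(1000))
print(len(train), len(val), len(test))
```

Because the three lists are slices of one shuffled copy, no example can land in two subsets, which is the simplest guard against leakage between roles.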

Validation error vs training error helps reveal whether the model is learning structure or only fitting the training set

A low training error paired with a much higher validation error suggests the model is only fitting the training set. When the two errors are close, the model is more likely learning structure that carries over to unseen data.
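A toy case makes the comparison concrete. In the sketch below, everything is hypothetical: the labels are coin flips, so there is no structure to learn, and the "model" simply memorizes the training pairs. Training error is zero by construction, while validation error stays near chance, which is the signature of pure memorization.

```python
import random

rng = random.Random(0)

# Hypothetical toy data: each example is (unique_id, label), and the labels are
# coin flips, so there is no real structure to learn.
examples = [(i, rng.randint(0, 1)) for i in range(400)]
train, val = examples[:300], examples[300:]

# A "model" that memorizes the training set and falls back to the majority
# training label for inputs it has never seen.
memory = dict(train)
majority = int(sum(label for _, label in train) * 2 >= len(train))


def predict(x):
    return memory.get(x, majority)


def error_rate(data):
    return sum(predict(x) != y for x, y in data) / len(data)


train_error = error_rate(train)
val_error = error_rate(val)
print(f"train error = {train_error:.2f}, val error = {val_error:.2f}")
print(f"gap = {val_error - train_error:.2f}")
```

The gap between the two numbers, not the training error alone, is what signals overfitting here.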

Repeated test peeking biases the final test score

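This is not a numeric law but the core rule of the prototype: every time a decision is made after looking at the test score, the choice adapts a little to that particular test set, so the final test score becomes an optimistically biased estimate.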
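The effect of peeking can be simulated under simple assumptions. In this sketch (all names and sizes are illustrative), every candidate "model" guesses labels at random and ignores its input, so its true accuracy is 0.5. Selecting the candidate with the best test score still produces a number well above chance, and that advantage disappears on fresh data.

```python
import random

rng = random.Random(0)
N_TEST, N_FRESH, N_CANDIDATES = 100, 10_000, 200


def random_labels(n):
    return [rng.randint(0, 1) for _ in range(n)]


def accuracy(preds, labels):
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


test_labels = random_labels(N_TEST)

# Every candidate guesses at random, so its true accuracy is 0.5.
# "Peeking" means scoring each candidate on the test set and keeping the best.
test_scores = [accuracy(random_labels(N_TEST), test_labels) for _ in range(N_CANDIDATES)]
peeked_score = max(test_scores)

# The selected candidate is still a random guesser, so its predictions on
# fresh data are again coin flips against fresh labels.
fresh_score = accuracy(random_labels(N_FRESH), random_labels(N_FRESH))

print(f"best test score after {N_CANDIDATES} peeks: {peeked_score:.2f}")
print(f"same kind of model on fresh data: {fresh_score:.2f}")
```

Nothing about the candidates improved between the two measurements; only the selection step used the test set, and that alone inflated the reported score.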

Pros
  • Separates learning, model selection, and final evaluation into distinct stages with distinct responsibilities.
  • Helps detect overfitting because training and held-out performance can be compared directly.
  • Keeps model development aligned with future deployment when dev and test reflect the real target distribution.
  • Protects the final reported number from bias, as long as the test set is not reused during iteration.
Cons
  • Poor split design can create a false sense of progress, especially when dev/test do not match future user data.
  • Holding out validation and test data reduces the number of samples available for training.
  • Repeatedly tuning on the dev set can still create subtle overfitting to that split.
  • If the team keeps peeking at test results, the final score stops being an honest final estimate.
Quick example

If web images are cleaner but future users mainly upload noisy mobile-app images, a web-heavy dev set may make the model look stronger than it really is. The prototype shows exactly this trap: the dev score can rise while performance on the images users actually upload does not follow.
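The arithmetic behind this trap is a weighted average. The sketch below uses made-up numbers throughout: the per-source accuracies (0.95 on web, 0.70 on mobile) and both mixtures are assumptions chosen only to illustrate the gap, not values from the prototype.

```python
def expected_accuracy(mixture, accuracy_by_source):
    """Accuracy on a dataset whose examples are drawn from `mixture` of sources."""
    if abs(sum(mixture.values()) - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to 1")
    return sum(weight * accuracy_by_source[source] for source, weight in mixture.items())


# Hypothetical per-source accuracies, chosen only for illustration.
accuracy_by_source = {"web": 0.95, "mobile": 0.70}

web_heavy_dev = {"web": 0.9, "mobile": 0.1}    # dev set built mostly from web images
future_traffic = {"web": 0.1, "mobile": 0.9}   # what users are assumed to upload

dev_score = expected_accuracy(web_heavy_dev, accuracy_by_source)
real_world_score = expected_accuracy(future_traffic, accuracy_by_source)
print(f"dev score: {dev_score:.3f}")
print(f"real-world score: {real_world_score:.3f}")
```

With these assumed numbers the same model scores 0.925 on the web-heavy dev set and 0.725 on the mobile-heavy traffic, so the dev score overstates real-world performance by 20 points without any change to the model.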

Common mistake

A common mistake is to ask only, 'Which split gives the prettiest score?' The better question is, 'Which dev/test split best represents the future data I actually care about?' Another mistake is to keep reopening the test set during iteration and still call the final test number unbiased.
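One way to make the second mistake harder to commit is to enforce the rule in code. The sketch below is a hypothetical guard, not part of the prototype: the class name `LockedTestSet` and its `evaluate_once` method are invented for illustration, and it assumes examples are (input, label) pairs and the model is a plain function.

```python
class LockedTestSet:
    """Holds the test set and allows exactly one evaluation."""

    def __init__(self, test_examples):
        self._test_examples = list(test_examples)
        self._opened = False

    def evaluate_once(self, predict):
        if self._opened:
            raise RuntimeError("test set already used; a second score is not a clean estimate")
        self._opened = True
        correct = sum(predict(x) == y for x, y in self._test_examples)
        return correct / len(self._test_examples)


# Hypothetical usage with a tiny labeled set and a toy model.
lockbox = LockedTestSet([(1, 1), (2, 0), (3, 1), (4, 0)])
final_score = lockbox.evaluate_once(lambda x: x % 2)
print(final_score)
```

A second call to `evaluate_once` raises `RuntimeError`, which turns "do not reopen the test set" from a team convention into a failure that is hard to ignore.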