train_val_test_split_visualization.py

Interactive Playground

Controls

Pick a preset, adjust each split's source mix, change the data scale, and simulate repeated development cycles.

Quick presets

Jump between the main teaching cases without adjusting everything manually.

Data scale

Big data

This changes the overall split sizes, not the source mix inside each split.
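The data-scale control can be sketched as a function that turns a total example count into split sizes. The preset fractions below (60/20/20 for small data, 98/1/1 for big data) are an assumption based on a common rule of thumb, not values read from the playground, and `split_sizes` is a hypothetical helper. Only the sizes change here; the app/web mix inside each split is a separate decision.

```python
# Hypothetical split fractions (train, dev, test) for two data-scale presets.
# These follow a common rule of thumb; the playground's exact numbers may differ.
SPLIT_FRACTIONS = {
    "small": (0.6, 0.2, 0.2),    # e.g. a few thousand examples
    "big": (0.98, 0.01, 0.01),   # e.g. a million examples
}


def split_sizes(n_examples: int, scale: str) -> tuple[int, int, int]:
    """Return (train, dev, test) example counts for a data-scale preset."""
    train_frac, dev_frac, _ = SPLIT_FRACTIONS[scale]
    n_train = round(n_examples * train_frac)
    n_dev = round(n_examples * dev_frac)
    # Give any rounding remainder to test so the counts always sum to n_examples.
    n_test = n_examples - n_train - n_dev
    return n_train, n_dev, n_test
```

With a million examples, 1% is still 10,000 dev and 10,000 test examples, which is usually enough to compare ideas and report a final number.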

Train app data

20%

What fraction of the training set comes from realistic mobile-app images rather than clean web images.
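A minimal sketch of what the train-mix control does, assuming you hold two separate pools of examples (web and app) as plain lists. `build_train_split` is a hypothetical helper, not part of any library.

```python
import random


def build_train_split(web_pool, app_pool, n_train, app_fraction, seed=0):
    """Sample a training set in which app_fraction of the examples come from app_pool.

    Hypothetical helper: both pools are plain lists of examples.
    """
    rng = random.Random(seed)
    n_app = round(n_train * app_fraction)
    n_web = n_train - n_app
    # Sample without replacement from each source, then shuffle so the two
    # sources are interleaved rather than stored in contiguous blocks.
    train = rng.sample(app_pool, n_app) + rng.sample(web_pool, n_web)
    rng.shuffle(train)
    return train
```

For example, `build_train_split(web_pool, app_pool, n_train=500, app_fraction=0.20)` returns 500 examples, 100 of which are app images. Raising `app_fraction` changes what the model learns from, but not what dev and test measure.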

Dev / Val app data

This split should represent the data you care about when choosing models.

Test app data

Test should usually come from the same future distribution as dev, but stay unopened during iteration.
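One way to keep dev and test on the same distribution is to draw both from a single shuffled pool of target data (for example, app images that resemble what future users send). This is a sketch under that assumption; `build_dev_test` is a hypothetical name.

```python
import random


def build_dev_test(target_pool, n_dev, n_test, seed=0):
    """Draw disjoint dev and test sets from one pool of target-distribution data.

    Shuffling once and slicing keeps dev and test on the same distribution
    while guaranteeing that no example appears in both.
    """
    if n_dev + n_test > len(target_pool):
        raise ValueError("not enough target-distribution examples")
    shuffled = list(target_pool)
    random.Random(seed).shuffle(shuffled)
    dev = shuffled[:n_dev]
    test = shuffled[n_dev:n_dev + n_test]
    return dev, test
```

Drawing dev and test from different sources (say, dev from app images and test from web images) would let a model look good on one and fail on the other, which defeats the purpose of a final check.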

Development cycles

Higher means you tried more ideas, tuned more settings, and pushed harder on dev performance.

Test usage

Peek at test while iterating (bad)

Leave this off for a proper final check. Turn it on to watch trust in the final test score collapse.
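If you want the "leave this off" rule enforced in code rather than by discipline, a small wrapper can refuse to score the test set more than once. This is a hypothetical guard class, not a feature of any particular library.

```python
class SealedTestSet:
    """Wrap a test set so that it can be scored only once (hypothetical guard)."""

    def __init__(self, examples):
        self._examples = list(examples)
        self._opened = False

    def evaluate_once(self, score_fn):
        """Return score_fn(examples) the first time; raise on every later call."""
        if self._opened:
            raise RuntimeError("test set already used; further peeks bias the estimate")
        self._opened = True
        return score_fn(self._examples)
```

Any `score_fn` that takes the list of examples works, for instance a function that computes accuracy of the final model.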

Split builder

Future users vs. what each split actually contains

Chart: future users (90% app) compared with what the Train (98%), Dev, and Test splits contain.

Web images: cleaner, easier, more professional

App images: blurrier, noisier, more realistic

Role of each split

This is why train / dev / test are not interchangeable

Train

Learn parameters

Use this split for gradient updates and fitting weights. More training data usually helps, but the training distribution does not by itself define the target you are optimizing for.

Dev / Val

Choose ideas

Use this split to compare architectures, hyperparameters, and data strategies. It should reflect the future data you care about.

Test

Final honest check

Open this only at the end. If you keep using it during iteration, the final score stops being an unbiased estimate.
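The three roles can be sketched end to end. The model (one-dimensional ridge regression, y ≈ w·x with penalty λ), the synthetic data, and the candidate λ values are all assumptions chosen only to keep the example short; the point is which split each step touches.

```python
import random


def fit_ridge_1d(data, lam):
    """Train: closed-form weight w for y ≈ w * x with L2 penalty lam."""
    sxy = sum(x * y for x, y in data)
    sxx = sum(x * x for x, _ in data)
    return sxy / (sxx + lam)


def mse(w, data):
    """Mean squared error of the prediction w * x on a list of (x, y) pairs."""
    return sum((y - w * x) ** 2 for x, y in data) / len(data)


def run_workflow(seed=0):
    rng = random.Random(seed)
    # Synthetic data from y = 2x + noise, purely for illustration.
    xs = [rng.uniform(-1, 1) for _ in range(300)]
    data = [(x, 2.0 * x + rng.gauss(0, 0.5)) for x in xs]
    train, dev, test = data[:200], data[200:250], data[250:]

    # Train learns the weight for each candidate setting.
    candidates = {lam: fit_ridge_1d(train, lam) for lam in (0.0, 1.0, 10.0, 100.0)}
    # Dev chooses among ideas: keep the setting with the lowest dev error.
    best_lam = min(candidates, key=lambda lam: mse(candidates[lam], dev))
    # Test gives the final estimate, computed once for the chosen model only.
    test_mse = mse(candidates[best_lam], test)
    return best_lam, candidates[best_lam], test_mse
```

Note that `test` is read exactly once, after every choice has already been made on `dev`.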

Development timeline

More cycles mean more tuning pressure on the split you are watching

Healthy workflow: iterate on dev, then check test once at the end.
Unhealthy workflow: keep peeking at test and slowly overfit your project decisions to it.
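Why peeking hurts can be shown with a small simulation. It assumes every idea you try has the same true accuracy and models each idea's test score as an independent finite-sample draw (a simplification). Picking the best-looking idea by test score still reports a number above the true accuracy, and the gap grows with the number of cycles. `peeking_bias` is a hypothetical name.

```python
import random


def peeking_bias(n_cycles, n_test=200, true_acc=0.80, seed=0):
    """Return how far the best observed test score sits above the true accuracy.

    Every idea has the same true accuracy, so any positive gap is pure
    selection bias from choosing among ideas by their test score.
    """
    rng = random.Random(seed)
    observed = []
    for _ in range(n_cycles):
        # One evaluation: count correct predictions on n_test examples.
        correct = sum(rng.random() < true_acc for _ in range(n_test))
        observed.append(correct / n_test)
    return max(observed) - true_acc
```

Comparing `peeking_bias(1)` with `peeking_bias(100)` shows the reported score drifting upward even though no idea is actually better; the same pressure applies to dev, which is why a separate, untouched test set is needed.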

Current diagnosis

How trustworthy and useful the current splits are

Status: Wrong target

Metrics: Dev score · Real-world score · Target match · Trust in final test

Biggest issue: Dev and test are still mostly clean web images, but future users will mostly send mobile-app images.

What to do next: Move dev and test closer to the future app distribution, even if the reported dev score drops.

Beginner takeaway: A split can reward you with a high dev score while you are still optimizing the wrong target.
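The diagnosis can be reproduced with back-of-the-envelope arithmetic: the score on any split is a mixture of per-source accuracies, weighted by that split's source mix. Only the 90% app figure comes from the playground; the per-source accuracies and the 10% app share for dev are invented for illustration, and `target_match` is a hypothetical metric (one minus the total variation distance between two two-source mixes), not the playground's formula.

```python
# Hypothetical per-source accuracies for a model tuned mostly on web images.
ACC_WEB = 0.95
ACC_APP = 0.70
FUTURE_APP_FRACTION = 0.90  # from the playground: future users are 90% app


def expected_score(app_fraction, acc_app=ACC_APP, acc_web=ACC_WEB):
    """Accuracy on a split in which app_fraction of the examples are app images."""
    return app_fraction * acc_app + (1 - app_fraction) * acc_web


def target_match(app_fraction, future=FUTURE_APP_FRACTION):
    """1.0 when the split's source mix equals the future mix, lower as they diverge."""
    return 1 - abs(app_fraction - future)


dev_score = expected_score(0.10)                  # dev is mostly web images
real_world = expected_score(FUTURE_APP_FRACTION)  # what future users experience
```

Under these assumptions the dev score is about 0.925 while the real-world score is about 0.725. Moving dev to 90% app lowers the dev number but makes it equal to the score you actually care about.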

Core rule

The sentence to remember

Train learns the weights.
Dev chooses among ideas.
Test gives the final honest estimate.

The best split is not the one with the prettiest intermediate score. It is the one whose dev/test sets reflect future data and stay honest.

Train / Val / Test Split: Data Partitioning
