A train/val/test split is not only about percentages; it is about assigning the right job to each subset and keeping the target honest. Your prototype makes this unusually clear by showing that a split can produce a flattering dev score while still optimizing for the wrong data distribution. The training set is where the model learns its parameters, but the training distribution alone does not define the final target. The validation (dev) set is where you compare ideas, tune hyperparameters, and decide which direction is actually better. The test set is the final honest check: it should stay closed until the end, because every decision made after looking at it leaks information and inflates the reported score.
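The three roles above can be sketched as a single shuffle followed by two cuts. This is a minimal illustration, not the prototype's own code: the function name `three_way_split`, the 10% / 10% fractions, and the fixed seed are all assumptions chosen for the example.

```python
import random

def three_way_split(items, dev_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle once with a fixed seed, then cut into disjoint subsets.

    Hypothetical helper for illustration; the fractions are assumptions.
    """
    items = list(items)
    rng = random.Random(seed)      # fixed seed keeps the split reproducible
    rng.shuffle(items)

    n_test = int(len(items) * test_frac)
    n_dev = int(len(items) * dev_frac)

    test = items[:n_test]                   # final check, opened once at the end
    dev = items[n_test:n_test + n_dev]      # compare ideas, tune hyperparameters
    train = items[n_test + n_dev:]          # fit model parameters
    return train, dev, test

train, dev, test = three_way_split(range(1000))
print(len(train), len(dev), len(test))
```

Fixing the seed matters more than the exact fractions: if the split changes between runs, dev scores from different experiments are no longer comparable.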
In this prototype, the most important lesson is distribution matching: dev and test should reflect the future data you care about, even if that makes the intermediate score look worse. A high score only matters if it is both relevant to future users and still trustworthy, meaning it was measured on data the model was not tuned against.
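One way to apply distribution matching is to draw dev and test only from the data that resembles future usage, and let the training set absorb everything else. The sketch below assumes two hypothetical pools, `source_pool` (plentiful but off-target data) and `target_pool` (scarce data that looks like what future users will send); the names, sizes, and function are illustrative and not taken from the prototype.

```python
import random

def target_matched_split(source_pool, target_pool, n_dev, n_test, seed=0):
    """Draw dev and test only from the target pool.

    Hypothetical helper: training gets all source data plus whatever
    target data is left over after dev and test are carved out.
    """
    target = list(target_pool)
    if n_dev + n_test > len(target):
        raise ValueError("not enough target examples for dev and test")

    rng = random.Random(seed)
    rng.shuffle(target)

    dev = target[:n_dev]                     # same distribution as test
    test = target[n_dev:n_dev + n_test]      # same distribution as dev
    train = list(source_pool) + target[n_dev + n_test:]
    return train, dev, test

# Toy data: each example is tagged with the pool it came from.
source = [("source", i) for i in range(1000)]
target = [("target", i) for i in range(100)]

train, dev, test = target_matched_split(source, target, n_dev=40, n_test=40)
print(len(train), len(dev), len(test))
```

With this layout the dev score may drop compared with a dev set mixed from both pools, but it measures the thing you actually care about, and dev and test stay comparable because they come from the same distribution.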