Training/Validation/Test Sets in Machine Learning
Definitions
The data sets used during training are:
- The data set used to optimize the model’s parameters is called the training set.
- The data set used to optimize the model’s hyperparameters is called the validation set.
The data set used after training is:
- The data set used to evaluate the model’s performance after training is called the test set.
The value of the loss function on the training/validation/test set is called the training/validation/test loss.
Explanation
Training Set: As the name suggests, the training set is used for training, specifically for optimizing the parameters of the network through backpropagation.
Validation Set: The validation set is used to check whether the model is training well while the parameters are being optimized on the training set; it does not affect backpropagation. If, after enough epochs, performance on the training set is high but performance on the validation set is low, the number of layers or nodes is adjusted and training is run again. In grand terms, the validation set is used to optimize the model’s design; more plainly, it is used for hyperparameter optimization. It is therefore closely involved in the model’s training: it may not be used to optimize the parameters, but saying it is not used for training at all is a stretch. In practice, during deep learning, performance on the validation set is monitored and hyperparameters are tweaked until that performance is satisfactorily high. Since the model’s design is naturally adjusted until the validation set performs well, a separate data set is needed to evaluate the model’s final performance.
Test Set: Since the training set is involved in optimizing the model’s parameters and the validation set in optimizing its hyperparameters, neither is suitable for measuring performance. Therefore, data belonging to neither the training set nor the validation set is used to assess the model’s final performance. This data is referred to as the test set.
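To make the three-way split concrete, here is a minimal sketch in NumPy; the sample count, seed, and 70/15/15 split ratio are illustrative assumptions, not prescribed by the text.

```python
import numpy as np

# Illustrative split of 1000 samples into training / validation / test
# sets with an assumed 70/15/15 ratio, after shuffling the indices.
rng = np.random.default_rng(seed=0)

n = 1000
indices = rng.permutation(n)  # shuffle so the split is random

n_train = int(0.70 * n)
n_valid = int(0.15 * n)

train_idx = indices[:n_train]                   # optimize parameters
valid_idx = indices[n_train:n_train + n_valid]  # tune hyperparameters
test_idx = indices[n_train + n_valid:]          # final evaluation only

print(len(train_idx), len(valid_idx), len(test_idx))  # 700 150 150
```

The three index arrays are disjoint by construction, which is the essential property: no sample used to fit parameters or tune hyperparameters leaks into the final performance measurement.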
The usage of each data set can be shown in pseudocode as follows:
$$ \begin{align*} &\text{{\bf for} $1 \le \text{epoch} \le N$ {\bf do}} \\ &\qquad \text{training loss $\leftarrow$ LOSS(model($x_{train}$), $y_{train}$)} \\ &\qquad \text{backpropagation for training loss} \\ &\qquad \text{validation loss $\leftarrow$ LOSS(model($x_{valid}$), $y_{valid}$)} \\ &\qquad \text{print and check training and validation losses} \\ &\textbf{end for} \\ &\text{test loss $\leftarrow$ LOSS(model($x_{test}$), $y_{test}$)} \end{align*} $$
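The pseudocode above can be mirrored with a runnable toy example; everything here (the synthetic data $y = 2x$, the one-parameter linear model, the mean-squared-error loss, and the learning rate) is an illustrative assumption chosen to keep the loop structure visible.

```python
import numpy as np

# Toy analogue of the pseudocode: a linear model y = w * x fitted by
# gradient descent on a training set, monitored on a validation set,
# and evaluated once on a test set.
rng = np.random.default_rng(seed=0)

def loss(pred, y):
    return float(np.mean((pred - y) ** 2))  # mean squared error

# Synthetic data from y = 2x, split three ways.
x = rng.uniform(-1, 1, size=300)
y = 2.0 * x
x_train, y_train = x[:200], y[:200]
x_valid, y_valid = x[200:250], y[200:250]
x_test, y_test = x[250:], y[250:]

w = 0.0    # model parameter, optimized on the training set
lr = 0.5   # learning rate: a hyperparameter
N = 50     # number of epochs: a hyperparameter

for epoch in range(1, N + 1):
    pred = w * x_train
    training_loss = loss(pred, y_train)
    # "backpropagation for training loss": gradient of MSE w.r.t. w
    grad = np.mean(2 * (pred - y_train) * x_train)
    w -= lr * grad
    # validation loss is only computed and monitored; no gradient step
    validation_loss = loss(w * x_valid, y_valid)

# test loss is computed once, after training is finished
test_loss = loss(w * x_test, y_test)
```

Note that only the training loss drives the parameter update; the validation loss is observed inside the loop (and would guide hyperparameter changes such as `lr` or `N`), while the test loss appears only after the loop ends.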