Cross-validation
Model Validation
Data analysis often produces a model whose performance must then be evaluated. If the model only explains the given data well but is useless in practice, the analysis is meaningless. For this reason, the entire data set is split into one part used to obtain the model and another part used to evaluate its performance.
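For a concrete picture, here is a rough sketch in Python using scikit-learn; the synthetic data, the linear model, and the 80/20 split ratio are illustrative assumptions, not part of the original discussion.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Synthetic data: 100 observations, 3 predictors, a known linear signal plus noise
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.3, size=100)

# Hold out 20% of the data; the model never sees it while being fitted
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
print("R^2 on training data:", model.score(X_train, y_train))
print("R^2 on test data:    ", model.score(X_test, y_test))
```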
Obtaining a model means using given data to explain other data. In this context, the given data is called the training data, in line with the notion of 'learning' in fields such as machine learning. The other data is called the test data. Strictly speaking, the test data is not created for an actual test; rather, it is data the model has never seen, so it can be used to check the model's performance.
This is the idea behind 'validating a model'. Just as various tests exist to check whether an analysis was carried out correctly, it is important to verify whether the obtained model is actually useful. From this perspective, questions such as whether the regression coefficients are significant, whether there are too many variables, or whether the model passes a goodness-of-fit test are not the main focus of validation.
If the model obtained from the training data also performs well on the test data, it can be inferred that the model will perform well on other data too. If the process of studying for the SAT is the training, then taking mock exams is the test. By taking mock exams repeatedly, we can continually check whether we are likely to achieve adequate performance on the SAT itself.
But imagine a student who, before the actual test, does nothing but repeat the September mock exam of the SAT math section. Even if they initially score 15, they can eventually score 100 by solving the same problems over and over and memorizing the answers. It is hard to expect their score on the actual SAT to improve, however. This phenomenon is called overfitting: having over-optimized for a limited set of problems, the student cannot adapt to the unpredictable questions of the real exam.
However, suppose the student prepares again, this time studying not only the past ten years of SAT papers but also the June and September mock exams. Scoring 100 on every one of these tests would be difficult, but their actual SAT score would undoubtedly improve. That is the natural outcome of repeatedly putting in a great deal of effort, and it is the essence of what we call studying.
$k$-Fold
Cross-validation is a method for detecting or avoiding overfitting. Instead of splitting the data into just two parts once, it creates several different splits so that the most flexibly trained model can be selected. Since the split is random, a model may occasionally perform exceptionally well purely by chance; cross-validation makes it possible to spot such cases and suspect overfitting.
There is no reason why dataB, which served as training data while dataE was the test data, could not itself be used as test data. The role of test data can be rotated among the splits, checking the performance each time. In particular, cross-validation with the number of splits set to $k$ is called $k$-fold Cross Validation.
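As a rough sketch of this rotation, again in Python with scikit-learn and synthetic data (the choice of $k=5$ and of a linear model are illustrative assumptions):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.3, size=100)

# k = 5: each of the five folds takes a turn as the test data,
# while the remaining four folds are used for training
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, test_idx in kf.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))

print("per-fold R^2:", np.round(scores, 3))
print("mean R^2:", np.mean(scores))
```

Comparing the per-fold scores gives a sense of how much the performance depends on which part of the data happened to be held out.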
If the number of splits is $k=n$, so that the cross-validation leaves out a single observation at a time, the method is also called Jackknifing.
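A minimal sketch of the $k = n$ case, assuming scikit-learn's LeaveOneOut splitter and the same synthetic setup as above:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = rng.normal(size=(30, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.3, size=30)

# k = n: every single observation becomes the test set exactly once
loo = LeaveOneOut()
squared_errors = []
for train_idx, test_idx in loo.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    prediction = model.predict(X[test_idx])
    squared_errors.append((prediction[0] - y[test_idx][0]) ** 2)

print("leave-one-out mean squared error:", np.mean(squared_errors))
```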