In the field of machine learning, cross validation is a very important process. In this post I will try to explain the concept in simpler terms with an example.
Cross validation, as the word "validation" suggests, is how we ensure the quality of what a model has learned. When we say "the model's performance is good," we would like to believe that the model performs well in general and that the performance test is not cheating. How can we check and ensure the quality of learning?
Let's take an example by imagining a student, John, and suppose we are his teacher. John does not yet know two-digit addition (e.g., 11 + 20, 38 + 2), so we start by creating 50 examples of two-digit addition like the ones below.
2 + 9 = 11
4 + 11 = 15
12 + 32 = 44
83 + 29 = 112
After practicing, John has learned how to solve two-digit addition. How can we test whether he fully understands it, rather than just memorizing each pair of numbers and its answer? Yes, you are right: as our own teachers did for us, we should give him "unseen" exercises and see whether he can solve them, so that we can tell whether he essentially understands addition. This part is validation.
Back to machine learning. In the machine learning world, we cannot easily create new exercises, since generating new data can be computationally or economically expensive. Instead, in the case above, we randomly pick and set aside 10 equations (about 20%) at the beginning, and let John practice repeatedly on the remaining 40 equations (80%), which we call the training set. Then we can test him on the 10 equations we saved for testing purposes, which we call the test set. Note that the training and test sets must not share any duplicated equations, because otherwise we cannot tell whether John literally memorized the answers.
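The split above can be sketched in a few lines of Python. The 50/40/10 numbers come from John's example; the variable names are my own:

```python
import random

random.seed(0)  # fixed seed so the split is reproducible

# Generate 50 unique two-digit addition problems as (a, b) pairs.
problems = set()
while len(problems) < 50:
    problems.add((random.randint(10, 99), random.randint(10, 99)))
problems = list(problems)

random.shuffle(problems)
test_set = problems[:10]      # ~20%, held out and never practiced
training_set = problems[10:]  # ~80%, used for repeated practice

# No problem appears in both sets, so a correct test answer cannot
# come from having memorized a practiced problem.
assert not set(test_set) & set(training_set)
print(len(training_set), len(test_set))  # 40 10
```

Generating the problems as a set before splitting is what guarantees the "no duplicates between training and test" rule from the paragraph above.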
Now let's move on to the "cross" in cross validation. As a teacher, we would like to be confident that John fully understands two-digit addition if he can solve more than 8 of the 10 equations in the test set. However, a single test can be misleading: perhaps the test happened to be extremely easy for him, with the 10 equations being almost one-digit addition like 10 + 1, 10 + 2, 20 + 3. To avoid this kind of luck for John (or bad luck for us), we test him with different test selections. If John were to forget everything a week later, we could let him practice on a different training set and then score him on the corresponding different test set. By repeating this, if we observe that John consistently solves about 80% of the equations in each test, we can safely conclude that John learned two-digit addition with roughly 80% accuracy.
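Repeating the train/test split over different selections is exactly k-fold cross validation. Here is a minimal sketch; `john_answers` is a made-up stand-in for a model that is perfect on memorized problems but only right about 80% of the time on unseen ones, so the cross-validated score should land near 0.8:

```python
import random

random.seed(0)

# 50 unique two-digit addition problems, shuffled once up front.
problems = set()
while len(problems) < 50:
    problems.add((random.randint(10, 99), random.randint(10, 99)))
problems = list(problems)
random.shuffle(problems)

def john_answers(problem, practiced):
    """Toy 'model': perfect on practiced problems, ~80% correct otherwise."""
    a, b = problem
    if problem in practiced:
        return a + b  # memorized during practice
    return a + b if random.random() < 0.8 else a + b + 1  # partial understanding

k = 5  # five folds: every problem lands in the test set exactly once
fold = len(problems) // k
scores = []
for i in range(k):
    test_set = problems[i * fold:(i + 1) * fold]
    training_set = set(problems[:i * fold] + problems[(i + 1) * fold:])
    correct = sum(john_answers(p, training_set) == p[0] + p[1] for p in test_set)
    scores.append(correct / len(test_set))

print(sum(scores) / len(scores))  # average score over the 5 folds, near 0.8
```

Because the folds rotate, the memorization shortcut never helps on the test set, and the averaged score estimates John's true generalization accuracy rather than his luck on one particular selection of 10 equations.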
There are some technically inaccurate parts in this example, but I hope it helps build some intuition about cross validation. As part of the Feynman technique, I find the last part the shakiest: John's example is limited when it comes to explaining the repetition, and I probably should have chosen a different example.