One day, you read data from your parquet or csv files, explore several key columns, and perform EDA. Then, you import some amazing machine learning packages, split data into training and testing, train, cross-validate and tune on training data, evaluate on testing data, and everything seems good, so far.
However, when you apply your model in production, damn, it’s bad. What could possibly go wrong?
Well, you might be overfitting your data.
The idea of splitting data into training and testing is that the testing data should be unseen by the model and, more importantly, by the modeler. Many developers overlook the ‘modeler’ part.
Loading the whole dataset introduces a risk of human-in-the-loop overfitting, because it gives the modeler insight into the testing data. If a modeler sees the whole dataset and makes decisions that affect the modeling, then this is overfitting, even though the testing data is never available to the model during training.
Instead of splitting the data at a later stage of the project, I believe it’s better to split it before performing any analysis. If the data is not timestamped, it’s probably safe to randomly split the data file into two files: development and testing. Make sure you never touch the testing file until you are ready to evaluate your model.
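To make the idea concrete, here is a minimal sketch of such a split for a csv file, using only the Python standard library. The function name `split_csv`, the 20% test fraction, the fixed seed, and the file names in the usage note are all assumptions for illustration, not a prescribed recipe. The point is that rows are routed to one file or the other without anyone looking at them.

```python
import csv
import random


def split_csv(source_path, dev_path, test_path, test_fraction=0.2, seed=42):
    """Randomly route each row of a CSV file to a development or testing file.

    Returns (dev_row_count, test_row_count). Row contents are never inspected,
    only copied, so nothing about the testing data leaks to the modeler.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    dev_rows = 0
    test_rows = 0
    with open(source_path, newline="") as src, \
            open(dev_path, "w", newline="") as dev, \
            open(test_path, "w", newline="") as test:
        reader = csv.reader(src)
        dev_writer = csv.writer(dev)
        test_writer = csv.writer(test)

        header = next(reader, None)
        if header is None:  # empty source file: nothing to split
            return 0, 0
        # Both output files keep the header row.
        dev_writer.writerow(header)
        test_writer.writerow(header)

        for row in reader:
            # Each row lands in testing with probability test_fraction.
            if rng.random() < test_fraction:
                test_writer.writerow(row)
                test_rows += 1
            else:
                dev_writer.writerow(row)
                dev_rows += 1
    return dev_rows, test_rows
```

With hypothetical file names, a call would look like `split_csv("data.csv", "development.csv", "testing.csv")`. From then on, all exploration and tuning would read only `development.csv`. Note that the split is per row, so the testing share is only approximately 20%, and this sketch assumes rows are independent (no timestamps or groups that must stay together).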
After we evaluate the model on the testing set, the results will stick in our brain, and any decisions we make based on those results can potentially lead to overfitting. This kind of overfitting is very subtle.