
How a cross-validation strategy can drastically impact model performance.

Submitted by Style Pass, 2021-10-27 17:30:10

There is a common trope in data science that beginner data scientists often outperform more experienced ones, reporting models with 99.9% accuracy, while experienced data scientists rarely get above 80%.

Like human learning, machine learning should follow this three-step protocol: learn, validate, then test. Most data scientists are fully aware of how important the third step is, because building models that only work on the data you already have is not very useful. But for those just starting out in data science, it is a common mistake to neglect it.
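The protocol above amounts to splitting the data three ways before any modeling starts. A minimal sketch, using hypothetical toy data (the sizes and the 60/20/20 ratio are illustrative assumptions, not a rule from the article):

```python
import random

random.seed(0)

# Hypothetical dataset of 100 (feature, target) samples.
data = [(i, 2 * i) for i in range(100)]
random.shuffle(data)

# Three-way split: learn on 60%, validate (tune) on 20%, and keep the
# final 20% untouched until all experimentation is done.
n = len(data)
train = data[: int(0.6 * n)]
valid = data[int(0.6 * n) : int(0.8 * n)]
test = data[int(0.8 * n) :]

print(len(train), len(valid), len(test))  # 60 20 20
```

The point of the third set is that it is consulted exactly once: any score computed on data the model was tuned against is an optimistic estimate.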

Often, ignoring this step results in models that achieve very high performance during experimentation, but then collapse when deployed and used on real, never-before-seen data. This comes from two main phenomena:

Because machine learning models are very flexible, overfitting can, and will, happen a lot. With thousands of parameters, you can always build a model that perfectly fits your experimental data, yet fails to predict new data.
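This can be reproduced in a few lines of pure Python with hypothetical toy data: an exact polynomial interpolant through ten noisy points plays the role of the over-parameterized model, driving training error to zero while error on fresh points from the same process stays much higher.

```python
import random

random.seed(0)

def lagrange_predict(xs, ys, x):
    # Exact interpolation: the degree n-1 polynomial through all n
    # training points, i.e. a model flexible enough to fit anything.
    total = 0.0
    for i, xi in enumerate(xs):
        term = ys[i]
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

# Hypothetical data: a simple linear trend plus noise.
xs_train = [i / 9 for i in range(10)]
ys_train = [2 * x + random.gauss(0, 0.1) for x in xs_train]

# The interpolant hits every training point, so training MSE is zero.
train_mse = sum(
    (lagrange_predict(xs_train, ys_train, x) - y) ** 2
    for x, y in zip(xs_train, ys_train)
) / len(xs_train)

# New points from the same process fall between the training points,
# where the over-flexible polynomial has also fitted the noise.
xs_test = [(i + 0.5) / 9 for i in range(9)]
ys_test = [2 * x + random.gauss(0, 0.1) for x in xs_test]
test_mse = sum(
    (lagrange_predict(xs_train, ys_train, x) - y) ** 2
    for x, y in zip(xs_test, ys_test)
) / len(xs_test)

print(train_mse, test_mse)  # test error is far above the training error
```

A held-out test set is precisely what catches this gap between in-sample and out-of-sample error before deployment.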

The other issue, data leakage, occurs when your model memorizes the relationship between samples and targets instead of learning it. It is common with data that is split into groups, such as categories of goods or people: the model may memorize each group's most frequent target instead of the relationship between its features and the target.
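This is exactly where the cross-validation strategy matters. A minimal sketch with hypothetical grouped toy data: a "model" that only memorizes per-group mean targets looks excellent under a random split (every group leaks into both sides), but collapses under a group-aware split that holds out whole groups, in the spirit of scikit-learn's GroupKFold.

```python
import random
from statistics import mean

random.seed(0)

# Hypothetical toy data: 10 groups (e.g. product categories) of 20
# samples each; the target is essentially determined by the group.
groups = [g for g in range(10) for _ in range(20)]
targets = [g + random.gauss(0, 0.1) for g in groups]

def fit(train_idx):
    # A "model" that memorizes the mean target of every group it saw.
    by_group = {}
    for i in train_idx:
        by_group.setdefault(groups[i], []).append(targets[i])
    means = {g: mean(v) for g, v in by_group.items()}
    default = mean(targets[i] for i in train_idx)  # fallback for unseen groups
    return means, default

def mse(model, test_idx):
    means, default = model
    return mean((means.get(groups[i], default) - targets[i]) ** 2 for i in test_idx)

idx = list(range(len(groups)))
random.shuffle(idx)

# Random split: every group appears on both sides, so pure memorization
# scores almost perfectly and the leakage goes unnoticed.
random_mse = mse(fit(idx[:150]), idx[150:])

# Group-aware split: whole groups 8 and 9 are held out, and the memorized
# means are useless for groups the model has never seen.
train_idx = [i for i in idx if groups[i] < 8]
test_idx = [i for i in idx if groups[i] >= 8]
grouped_mse = mse(fit(train_idx), test_idx)

print(random_mse, grouped_mse)  # the group-aware score is far worse
```

The second number is the honest estimate of how the model will behave on a group it has never encountered, which is the situation it will face once deployed.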
