BGSE Data Science Kaggle Competition
During our second term as data scientists, the class split into 10 teams of 3 and participated in an “in-class” Kaggle competition. It left every team depleted after late nights and long days spent obsessing over and executing ideas that, often as not, reduced accuracy.
“At one point, I nearly threw my laptop at the wall.”
a data scientist moments after the deadline
The task: predict the popularity (an ordinal class from 1 to 5) of Mashable news articles from 60 metadata features, many of which turned out to be useless. You can read more on the UCI Machine Learning Repository.
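As a rough illustration of the setup (not any team's actual code), a baseline treats this as plain 5-class classification with a random forest. The data below is a synthetic stand-in with the same shape as the real problem; all hyperparameters are arbitrary:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 60))    # 60 metadata features (synthetic stand-in)
y = rng.integers(1, 6, size=1000)  # popularity classes 1-5

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_tr, y_tr)
acc = (clf.predict(X_te) == y_te).mean()  # accuracy on the held-out split
```

On random labels this hovers near chance (0.2); the point is only the shape of the pipeline, not the score.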
The top 2 teams pulled ahead of the pack with accuracies of 0.55567 and 0.56744, while groups 3 through 10 all finished within 0.01 (i.e., one percentage point) of each other.
A collection of lessons learned from the top 3 groups:
- The third-place team used a single random forest, which several teams found outperformed both multiple forests and ensembles of a forest with other models.
- The second-place team used an ensemble of 3 random forests and 3 generalized boosted regression models. Leveraging the ordinal nature of the classification, the team grouped the 3’s, 4’s and 5’s together and achieved higher accuracy on the resulting 3-class problem.
- The first-place team used a rolling window for prediction (with a single random forest), given the time-series nature of the data.
- The first-place team also created new features from the URL to detect title keywords unique to each class. This method affected only 400 of the 9,644 test observations but improved accuracy by 1% – huge in Kaggle-land.
- Keep systematic, meticulous, obsessive track of your randomness (seeds), methods and results.
- Trust your cross-validation (don’t trust the public leaderboard): the third-place team submitted the model with the best cross-validation score on the training set, even though it was not their best score on the public leaderboard.
- Use all your data: many teams extracted new features from the URL.
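The class-grouping trick from the second-place team can be sketched in one line: collapse the sparse upper classes into a single bucket so the model faces a 3-class problem. The mapping below is an assumption about how the grouping worked, based only on the description above:

```python
def group_classes(labels):
    """Map popularity classes 1-5 onto {1, 2, 3}: 3, 4 and 5 share one bucket."""
    return [min(label, 3) for label in labels]

# group_classes([1, 2, 3, 4, 5]) -> [1, 2, 3, 3, 3]
```

A model trained on the grouped labels can still produce a valid submission if the grouped bucket is mapped back to the most likely original class.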
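The rolling-window idea exploits the time ordering of the articles: train on a window of recent rows, predict the next chunk, then slide forward. This is a generic sketch of the scheme, not the winning team's code; `fit` and `predict` are placeholders for any model:

```python
def rolling_window_predict(X, y, window, step, fit, predict):
    """Train on rows [i, i+window), predict the next `step` rows, slide by `step`."""
    preds = []
    i = 0
    while i + window + step <= len(X):
        model = fit(X[i:i + window], y[i:i + window])
        preds.extend(predict(model, X[i + window:i + window + step]))
        i += step
    return preds

# Toy usage: a "model" that always predicts the window's majority class.
majority_fit = lambda X, y: max(set(y), key=y.count)
const_predict = lambda model, X: [model] * len(X)
```

Running it on `X = list(range(10))`, `y = [1]*5 + [2]*5` with `window=4, step=2` yields predictions that shift from 1 to 2 as the window moves into the later labels.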
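Extracting title keywords from the URL might look like the following. Mashable URLs end in a hyphenated title slug, so tokenizing the slug and keeping tokens that appear under only one class gives class-specific keywords. The helper names and the "unique to a class" rule are assumptions for illustration:

```python
import re
from collections import Counter

def title_tokens(url):
    """Split the trailing URL slug (e.g. .../viral-cat-video/) into title words."""
    slug = url.rstrip("/").rsplit("/", 1)[-1]
    return [t for t in re.split(r"[-_]", slug) if t.isalpha()]

def class_keywords(urls, labels):
    """Hypothetical helper: tokens that occur under exactly one class."""
    counts = {}
    for url, label in zip(urls, labels):
        counts.setdefault(label, Counter()).update(title_tokens(url))
    unique = {}
    for label, own in counts.items():
        others = Counter()
        for other_label, other_counts in counts.items():
            if other_label != label:
                others.update(other_counts)
        unique[label] = {t for t in own if t not in others}
    return unique
```

Membership of a test article's tokens in one of these keyword sets can then be added as a feature, which matches the spirit of the "only 400 articles affected" observation: most slugs contain no class-unique words.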
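The "trust your cross-validation" rule reduces to a simple selection criterion: among candidate submissions, pick by local CV score, not by public-leaderboard score. The scores below are made-up numbers purely to illustrate the decision:

```python
# Hypothetical candidate submissions with illustrative (fabricated) scores.
candidates = {
    "rf_single":   {"cv": 0.545, "leaderboard": 0.538},
    "rf_ensemble": {"cv": 0.531, "leaderboard": 0.552},  # leaderboard-overfit
}

# Select by cross-validation score, ignoring the public leaderboard.
chosen = max(candidates, key=lambda name: candidates[name]["cv"])
# chosen == "rf_single"
```

The public leaderboard is scored on a small slice of the test set, so chasing it selects for noise; CV on the full training set is the less biased estimate.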