BGSE Data Science Kaggle Competition

During our second term as data scientists, the class split into 10 teams of 3 and participated in an “in-class” Kaggle competition. It left every team depleted from late nights and many long days spent obsessing over and executing ideas that often ended up reducing accuracy.


“At one point, I nearly threw my laptop at the wall.”  

    - a data scientist, moments after the deadline



The challenge

Predict the popularity of Mashable news articles on a scale of 1 to 5 (an ordinal classification problem) using 60 metadata variables. Many of these features were useless. You can read more about the dataset on the UCI Machine Learning Repository.


The outcome

The top 2 teams pulled ahead of the pack with accuracies of 0.55567 and 0.56744, while groups 3 through 10 were all within 0.01 of each other (i.e. one percentage point of accuracy).

A collection of lessons learned from the top 3 groups:

  • Third place used a single random forest, which multiple teams found to work better than multiple forests or an ensemble of a forest with other models.
  • Second place used an ensemble of 3 random forests and 3 generalized boosted regression models. Leveraging the ordinal nature of the classification, the team grouped the 3’s, 4’s and 5’s into one class and achieved higher accuracy predicting the resulting 3-class problem.
  • The first place team used a rolling window for prediction (with a single random forest), given the time-series nature of the data.
  • The first place team also created new features from the URL to detect title keywords unique to each class. This method only affected 400 of the 9,644 test-set articles but improved accuracy by 1%, which is huge in Kaggle-land.
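
The URL-keyword idea in the last bullet can be sketched roughly as follows. This is only an illustration under stated assumptions, not the winning team's code: the function names, the `min_count` threshold and the example URLs are invented here, and the sketch assumes the article title appears as a hyphen-separated slug in the last path segment of the URL.

```python
import re
from collections import Counter
from urllib.parse import urlparse


def url_keywords(url):
    """Return the lowercase words in the last path segment (the article slug)."""
    path = urlparse(url).path.strip("/")
    slug = path.split("/")[-1] if path else ""
    return [word for word in re.split(r"[^a-z0-9]+", slug.lower()) if word]


def class_specific_keywords(urls, labels, min_count=2):
    """Map each class to keywords that occur in at least `min_count` of its
    URLs and in no URL of any other class (hypothetical selection rule)."""
    counts = {}  # label -> Counter of keywords, counted once per URL
    for url, label in zip(urls, labels):
        counts.setdefault(label, Counter()).update(set(url_keywords(url)))

    unique = {}
    for label, counter in counts.items():
        seen_elsewhere = set()
        for other_label, other_counter in counts.items():
            if other_label != label:
                seen_elsewhere.update(other_counter)
        unique[label] = {
            word
            for word, count in counter.items()
            if count >= min_count and word not in seen_elsewhere
        }
    return unique


# Invented example URLs and popularity classes, for illustration only.
example_urls = [
    "http://mashable.com/2013/01/07/viral-cat-video/",
    "http://mashable.com/2013/01/08/viral-dog-video/",
    "http://mashable.com/2013/01/09/quarterly-earnings-report/",
    "http://mashable.com/2013/01/10/earnings-call-video/",
]
example_labels = [5, 5, 1, 1]

print(url_keywords(example_urls[0]))  # ['viral', 'cat', 'video']
print(class_specific_keywords(example_urls, example_labels))
```

Each selected keyword could then become a binary feature (present or absent in the URL). Because such keywords only fire on a small share of articles, the effect is concentrated on a few hundred predictions, which matches the scale described above.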


Other curiosities

  • Keep systematic, meticulous and obsessive track of your randomness, methods and results.
  • Trust your cross-validation (don’t trust the public leaderboard): The third place team chose to use the submission for which they got the best cross-validation score on the training set, but not the best score on the public leaderboard.
  • Use all your data: many teams extracted new features from the URL.

Using H2O for Kaggle: Guest Post by Gaston Besanson and Tim Kreienkamp

From the H2O site: http://h2o.ai/blog/2015/05/h2o-kaggle-guest-post/


In this special H2O guest blog post, Gaston Besanson and Tim Kreienkamp talk about their experience using H2O for competitive data science. They are both students in the new Master of Data Science Program at the Barcelona Graduate School of Economics and used H2O in an in-class Kaggle competition for their Machine Learning class. Gaston’s team came in second, scoring 0.92838 in overall accuracy, slightly surpassed by Tim’s team with 0.92964, on a subset of the famous “Forest Cover” dataset.

What is your background prior to this challenge?

Tim: We both are students in the Master of Data Science at the Graduate School of Economics in Barcelona. I come from a business background. I took part in a few Kaggle challenges before, but didn’t have a formal machine learning background before this class.

Gaston: I have a mixed background in Economics, Finance and Law. I had no prior experience with Kaggle or Machine Learning other than Andrew Ng’s online course :).

Could you give a brief introduction to the dataset and the challenges associated with it?

Tim: The good thing about this dataset is that it is relatively “clean” (no missing values etc.) and small (7 MB of training data). This allows for fast iteration and testing out a couple of different methods and hunches relatively quickly (relatively: a classmate of ours spent $300 on AWS trying to train support vector machines). The main challenge I see is the multiclass nature. This always makes things harder, as basically one has to train 7 models (due to the one-vs-all nature of multiclass classification).

Gaston: Yes, this dataset is a classic on Kaggle: Forest Cover Type Prediction. Adding to what Tim said, there are 7 types of trees and 54 features (10 quantitative variables, like Elevation, and 44 binary variables: 4 binary wilderness areas and 40 binary soil type variables). What caught our attention was how highly unbalanced the dataset was: classes 1 and 2 represented 80% of the training data.

What feature engineering and preprocessing techniques did you use?

Gaston: Our team added an extra layer to this competition: predicting the type of tree in a region as accurately as possible, with the purpose of minimizing fires. Even though we used the same loss for each type of misclassification (in other words, all trees are equally important), we decided to create new features. We created six new variables to try to identify features important to fire risk, and we normalized all 60 features on both the training and the test sets.

Tim: We included some difference and interaction terms. However, we didn’t scale the numerical features or use any unsupervised dimension reduction techniques. I briefly tried to do supervised feature learning with H2O Deep Learning – it gave me really impressive results in cross-validation, but broke down on the test set.

Editor’s note: L1/L2/Dropout regularization or fewer neurons can help avoid overfitting

Which supervised learning algorithms did you try and to what success?

Tim: I tried H2O’s implementation of Gradient Boosting, Random Forest, Deep Learning (MLP with stochastic gradient descent), and the standard R implementation of SVM and k-NN. k-NN performed poorly, so did SVM – Deep Learning overfit, as I already mentioned. The tree based methods both performed very well in our initial tests. We finally settled for Random Forest, since it gave the best results and was faster to train than Gradient Boosting.

Gaston: We tried KNN, SVM, Random Forest all from different packages, with not that great results. And finally we used H2O’s implementation of GBM – we ended up using this model because it introduces a lot of freedom into the model design. The model we used had the following attributes: Number of trees: 250; Maximum Depth: 18; Minimum Rows: 10; Shrinkage: 0.1.

What feature selection techniques did you try?

Tim: We didn’t try anything fancy (like LASSO) for this challenge. Instead, we decided to take advantage of the fact that random forests can compute feature importances. I used this to code my own recursive elimination procedure. At each iteration, a random forest is trained and cross-validated (ten fold), the feature importances are computed, the worst two features are discarded, and the next iteration begins with the remaining features. The resulting cross-validation errors at each stage made up a nice “textbook-like” curve, where the error first decreased with fewer features and then sharply increased again at the end. We then chose the set of features that gave the second-best cross-validation error, so as not to overfit through feature selection.
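
The elimination loop Tim describes can be sketched as below. This is a minimal, library-free illustration rather than his actual code: the names `recursive_elimination`, `second_best` and `toy_evaluate` are hypothetical, and the `evaluate` callback stands in for "train a random forest, run ten-fold cross-validation, read off the feature importances". The toy evaluator and its error values are invented so the example runs on its own.

```python
def recursive_elimination(features, evaluate, drop_per_round=2, min_features=1):
    """Repeatedly score a feature set and drop its least important features.

    `evaluate(features)` must return `(cv_error, importances)`, where
    `importances` maps each feature name to a score (higher = more important).
    Returns a list of `(feature_list, cv_error)`, one entry per round.
    """
    history = []
    current = list(features)
    while len(current) >= min_features:
        cv_error, importances = evaluate(current)
        history.append((list(current), cv_error))
        if len(current) - drop_per_round < min_features:
            break  # not enough features left for another round
        ranked = sorted(current, key=lambda name: importances[name])
        to_drop = set(ranked[:drop_per_round])
        current = [name for name in current if name not in to_drop]
    return history


def second_best(history):
    """Pick the feature set with the second-lowest cross-validation error."""
    ranked = sorted(history, key=lambda item: item[1])
    return ranked[1] if len(ranked) > 1 else ranked[0]


def toy_evaluate(features):
    """Invented stand-in for 'random forest + ten-fold CV' (error in percent)."""
    importances = {f: 1.0 if f.startswith("signal") else 0.1 for f in features}
    n_noise = sum(f.startswith("noise") for f in features)
    n_signal = sum(f.startswith("signal") for f in features)
    cv_error = 10 + n_noise + 20 * (3 - n_signal)
    return cv_error, importances


all_features = ["signal_a", "noise_1", "signal_b", "noise_2",
                "signal_c", "noise_3", "noise_4"]
history = recursive_elimination(all_features, toy_evaluate)
chosen_features, chosen_error = second_best(history)
```

In the toy run the error first falls as noise features are removed and then jumps once real signal is discarded, which mirrors the curve described above. In practice `evaluate` would wrap whatever model and cross-validation routine is in use.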

Gaston: Actually, we did not do any feature selection other than removing the variables that did not have any variance, which if I am not mistaken was one in the original dataset (before feature creation). Nor did we turn the binary variables into categorical ones (one for wilderness areas and one for soil type). We had a naïve approach of sticking with the story of fire risk no matter what; maybe next time we will change the approach.

Why did you use H2O and what were the major benefits?

Tim: We were constrained by our teachers in the sense that we could only use R – that forced me out of my scikit-learn comfort zone. So I looked for something as accurate and fast. As an occasional Kaggler, I am familiar with Arno’s forum post, and so I decided to give H2O a shot – and I didn’t regret it at all. Apart from the nice R interface, the major benefit is the strong parallelization – this way we were able to make the most of our AWS academic grants.

Gaston: I came across H2O just by searching the web and reading about the alternatives available in R after the GBM package proved really untestable. Just to add to what Tim said, I think H2O will be my weapon of choice in the near future.

For a more detailed description of the methods used and results obtained, see the report of Gaston’s and Tim’s teams.

Under the Curve: A primer on kaggle competitions

Why Kaggle? 

Recently, there has been some controversy around Kaggle. People point out that it frees you from a lot of the seemingly boring, tedious work you have to do in applied machine learning and is not like working on a real business problem. I partly agree with them: in practice, it might not always be worth putting in 40 more hours to get an improvement of 10^-20 in ROC-AUC. Still, I think everybody studying Data Science should do at least one Kaggle competition. Why? Because of the invaluable practical lessons you’ll hardly find in the classroom. And because they are actually fun.

Without doubt, studying Chernoff bounds, proving convergence of gradient descent and examining the multivariate Gaussian distribution have their place in the Data Science curriculum. Kaggle competitions, however, are for most students probably the first opportunity to get their hands dirty with data sets other than ‘Iris’, ‘German Credit’ or ‘Earnings’, to name a few. This forces us to deal with issues we rarely encounter in the classroom. How to ‘munge’ a dataset that is in a weird format? What to do with data that is too large to fit in main memory? Or how to actually train, tune and validate that magic support vector machine on a real data set?

Probably the most important benefit of Kaggle competitions, though, is that they introduce us to the “tricks of the trade”, the little bit of ‘dark magic’ that separates somebody who knows about machine learning from somebody who successfully applies machine learning. While certainly not a Kaggle approach, I will use the remainder of this post to share some of my experiences, mixed with some machine learning folk knowledge and Kaggle best practices I picked up during the (few) competitions I have participated in.

The Setup

Newbies often ask: Do I need new equipment to participate? The answer is: Not at all. While some competitors have access to clusters etc. (mostly from work), which is probably convenient for trying out different things fast, most competitions are won using commodity hardware. The stack of choice for most Kagglers is Python (mostly 2.7.6, because of its module compatibility) along with scikit-learn (sklearn), numpy, pandas and matplotlib. This is what will get you through most competitions. Sometimes people use domain-specific libraries (like scikit-image, for image processing tasks) or Vowpal Wabbit (an online learning library for ‘Big Data’).

Sklearn is preferred over R by many Kagglers for several reasons. First, it is a Python library. This brings cleaner syntax, slightly faster execution, fewer crashes (at least on my laptop) and more efficient memory handling. It also enables us to have the whole workflow, from data munging through exploration to the actual machine learning, in Python. Pandas brings convenient, fast I/O and an R-like data structure, while numpy helps with linear algebra and matplotlib is for MATLAB-like visualization. For beginners, it might be most convenient to install the Anaconda Python distribution, which contains all the mentioned libraries.

Be an Engineer

Let’s face it. You are not going to be the smartest guy in the room just because you use a support vector machine. Anybody can do that. It is contained in the library everybody uses and thus reduced to writing one line of code. Since everybody has access to the same algorithms, it is not about the algorithms. It is about what you feed them. One Kaggle ‘master’ told ‘Handelsblatt’ (an important economic newspaper in Germany): ‘If you want to predict if somebody has diabetes, most people try height and weight first. But you won’t get a good result before you combine them to a new feature – the body mass index.’ This is consistent with what Pedro Domingos states in his well-known article ‘A Few Useful Things to Know about Machine Learning’: ‘Feature Engineering is the Key […]. First-timers are often surprised by how little time in a machine learning project is spent actually doing machine learning.’
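
The body mass index from the quote is a one-line engineered feature: weight in kilograms divided by the square of height in metres. A minimal sketch follows, where the row layout and the column names `weight_kg` and `height_m` are assumptions made for illustration.

```python
def body_mass_index(weight_kg, height_m):
    """Combine two raw measurements into one feature: kg / m^2."""
    if height_m <= 0:
        raise ValueError("height must be positive")
    return weight_kg / height_m ** 2


def add_bmi(rows):
    """Return copies of the rows with an extra 'bmi' column.

    Each row is a dict with the hypothetical keys 'weight_kg' and 'height_m'.
    """
    return [
        dict(row, bmi=body_mass_index(row["weight_kg"], row["height_m"]))
        for row in rows
    ]


# Invented example row, for illustration only.
patients = add_bmi([{"weight_kg": 80.0, "height_m": 2.0}])
print(patients[0]["bmi"])  # 20.0
```

The point is not the arithmetic but the domain knowledge: a learner would have to discover the ratio on its own, which is harder than being handed it directly.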

That said, we should not forget that competitions are often won by slight differences in the test set loss, so patiently optimizing the hyperparameters of the learning algorithm might very well pay off in the end.

Be A Scientist

My father is a biochemist, a ‘hard scientist’. He does experiments in a ‘real’ lab (not a computer lab). No matter if an experiment is successful or unsuccessful, he and his grad students meticulously document everything. If they do not document something, it is basically lost work. You should do the same. Document every change you make to your model and the changes in performance it brings. Kaggle competitions run over several months and can be stressful at the end. To keep an overview, see what works and what doesn’t, and really improve, documentation is the key. Use at least an Excel file; git might be an even better idea.
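
One lightweight way to keep such a lab notebook is to append every run to a CSV file that opens in any spreadsheet program. The sketch below is an assumption about what is worth recording, not a standard: the function name `log_experiment` and the column set are invented here.

```python
import csv
import json
import os
from datetime import datetime, timezone

# Hypothetical set of columns; adapt to whatever you need to reproduce a run.
FIELDS = ["timestamp", "model", "params", "seed", "cv_score", "notes"]


def log_experiment(path, model, params, seed, cv_score, notes=""):
    """Append one experiment record to a CSV file.

    The header row is written only when the file is new or empty.
    """
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "model": model,
            # Sorted JSON keeps parameter strings comparable across runs.
            "params": json.dumps(params, sort_keys=True),
            "seed": seed,
            "cv_score": cv_score,
            "notes": notes,
        })
```

Calling `log_experiment("experiments.csv", "random_forest", {"n_trees": 500}, seed=42, cv_score=score)` after each run, and committing the file alongside the code, gives both the spreadsheet and the git history suggested above.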

Together We are Strong

Learning models are a little bit like humans. They have their strengths and weaknesses, and when you combine them, the weaknesses hopefully cancel out. It has been known for quite some time that ensembles of weak learners (like random forests, or gradient boosting) perform really well. A lot of Kagglers use that intuition and take it a step further. They build sophisticated models which perform well, but differently, and combine them in the end. This might range from simple methods, like averaging predictions or taking a vote, to more sophisticated procedures like “stacking” (google it).
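
The two simple combination schemes mentioned above, averaging predictions and taking a vote, can be sketched in a few lines. The function names and the tie-breaking rule (the earliest model wins a tie) are choices made for this illustration only.

```python
from collections import Counter


def average_probabilities(model_probs):
    """Average class probabilities across models.

    `model_probs[m][i][k]` is the probability that model `m` assigns
    to class `k` for sample `i`.
    """
    n_models = len(model_probs)
    return [
        [sum(column) / n_models for column in zip(*per_sample)]
        for per_sample in zip(*model_probs)
    ]


def majority_vote(model_labels):
    """Return the most frequent label per sample.

    `model_labels[m][i]` is the label predicted by model `m` for sample `i`.
    Ties go to the label predicted by the earliest model in the list.
    """
    voted = []
    for votes in zip(*model_labels):
        counts = Counter(votes)
        top = max(counts.values())
        voted.append(next(label for label in votes if counts[label] == top))
    return voted


# Three hypothetical models voting on three samples.
print(majority_vote([[1, 2, 3], [1, 2, 1], [2, 2, 3]]))  # [1, 2, 3]
```

Averaging tends to help most when the models make different mistakes, which is why the text stresses models that perform well "but differently".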

Don’t Overfit

Some people use the public leaderboard as the validation set. This is tempting, but might be less than optimal, since the loss is usually only calculated on a small subset of the test data (around 30%), which does not have to be representative. The danger of overfitting said subset is real. To overcome this, do extensive (say 10-fold) cross-validation locally. That is time consuming, but easy to do with sklearn. You can do it overnight. In the end, you can choose two submissions. Choose the one that is best on the public leaderboard, and the one that is best in local cross-validation.
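
To make the mechanics of local k-fold cross-validation concrete, here is a library-free sketch. In practice sklearn's own helpers (such as `KFold` and `cross_val_score`) would do this work; the names `k_fold_indices` and `cross_validate` and the `fit`/`predict` callback interface are assumptions made for this illustration.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and split them into k folds of near-equal size."""
    if n_samples < k:
        raise ValueError("need at least one sample per fold")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed keeps folds reproducible
    return [indices[i::k] for i in range(k)]


def cross_validate(X, y, fit, predict, k=10, seed=0):
    """Return the accuracy on each held-out fold.

    `fit(X_train, y_train)` returns a model and
    `predict(model, X_test)` returns one label per test sample.
    """
    scores = []
    for held_out in k_fold_indices(len(X), k, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(X)) if i not in held]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        predictions = predict(model, [X[i] for i in held_out])
        correct = sum(p == y[i] for p, i in zip(predictions, held_out))
        scores.append(correct / len(held_out))
    return scores
```

The mean of the returned scores is the local estimate to compare against the public leaderboard. A large gap between the two is a hint that a submission may be fitting the leaderboard subset rather than the problem.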

Now go to Kaggle, do the tutorials, and then take part in a real competition!

written by Tim Kreienkamp