Using H2O for Kaggle: Guest Post by Gaston Besanson and Tim Kreienkamp

From the H2O site


In this special H2O guest blog post, Gaston Besanson and Tim Kreienkamp talk about their experience using H2O for competitive data science. They are both students in the new Master of Data Science Program at the Barcelona Graduate School of Economics and used H2O in an in-class Kaggle competition for their Machine Learning class. Gaston’s team came in second, scoring 0.92838 in overall accuracy, slightly surpassed by Tim’s team with 0.92964, on a subset of the famous “Forest Cover” dataset.

What is your background prior to this challenge?

Tim: We both are students in the Master of Data Science at the Graduate School of Economics in Barcelona. I come from a business background. I took part in a few Kaggle challenges before, but didn’t have a formal machine learning background before this class.

Gaston: I have a mixed background in Economics, Finance and Law, with no prior experience of Kaggle or machine learning other than Andrew Ng’s online course :).

Could you give a brief introduction to the dataset and the challenges associated with it?

Tim: The good thing about this dataset is that it is relatively “clean” (no missing values, etc.) and small (7 MB of training data). This allows for fast iteration and for testing a couple of different methods and hunches quickly (relatively speaking: a classmate of ours spent $300 on AWS trying to train support vector machines). The main challenge I see is the multiclass nature. This always makes things harder, since with a one-vs-all approach to multiclass classification one basically has to train 7 models.
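Editor’s note: the one-vs-all idea Tim mentions can be sketched as follows. This is a minimal illustration in plain Python, not the code used in the competition; the label values are made up, and whether a given library decomposes the problem this way internally is not stated in the interview.

```python
from typing import Dict, List


def one_vs_rest_targets(labels: List[int]) -> Dict[int, List[int]]:
    """Build one binary target vector per class (1 = this class, 0 = any other)."""
    classes = sorted(set(labels))
    return {c: [1 if y == c else 0 for y in labels] for c in classes}


# Hypothetical cover-type labels coded 1..7 (illustrative values only).
labels = [1, 2, 2, 3, 7, 5, 1, 4, 6, 2]
targets = one_vs_rest_targets(labels)

# One binary problem, and hence one model, per class.
print(len(targets))
print(targets[2])
```

Each binary vector would then be fed to a separate classifier, and the class whose model is most confident wins.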

Gaston: Yes, this dataset is a Kaggle classic: Forest Cover Type Prediction. To add to what Tim said, there are 7 types of trees and 54 features (10 quantitative variables, like Elevation, and 44 binary variables: 4 binary wilderness areas and 40 binary soil type variables). What caught our attention was how highly unbalanced the dataset was: classes 1 and 2 represented 80% of the training data.
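Editor’s note: a class imbalance like the one Gaston describes is easy to check before modelling. The sketch below uses a made-up label vector chosen so that classes 1 and 2 cover 80% of the rows; the per-class counts are assumptions, not the real distribution.

```python
from collections import Counter
from typing import Dict, List


def class_shares(labels: List[int]) -> Dict[int, float]:
    """Return the fraction of rows belonging to each class."""
    counts = Counter(labels)
    total = len(labels)
    return {c: counts[c] / total for c in sorted(counts)}


# Made-up labels: 100 rows, with classes 1 and 2 dominating.
labels = [1] * 36 + [2] * 44 + [3] * 6 + [4] * 1 + [5] * 2 + [6] * 4 + [7] * 7
shares = class_shares(labels)
top_two = shares[1] + shares[2]
print(shares)
print(round(top_two, 2))
```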

What feature engineering and preprocessing techniques did you use?

Gaston: Our team added an extra layer to this competition: predicting the type of tree in a region as well as possible, with the purpose of minimizing fires. Even though we used the same loss for each type of misclassification (in other words, all trees are equally important), we decided to create new features. We created six new variables to try to capture characteristics relevant to fire risk, and we applied a normalization to the resulting 60 features on both the training and the test sets.
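Editor’s note: the interview does not say which normalization was used. A common choice, assumed here purely for illustration, is z-score scaling, where the means and standard deviations are computed on the training set only and then reused on the test set. The numbers below are toy values.

```python
from statistics import mean, pstdev
from typing import List, Tuple

Rows = List[List[float]]


def fit_scaler(train_rows: Rows) -> Tuple[List[float], List[float]]:
    """Compute per-column mean and standard deviation on the training rows."""
    columns = list(zip(*train_rows))
    means = [mean(col) for col in columns]
    # A constant column has zero deviation; fall back to 1.0 to avoid dividing by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds


def apply_scaler(rows: Rows, means: List[float], stds: List[float]) -> Rows:
    """Scale rows with statistics that were fitted elsewhere (the training set)."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


# Toy data: two features, the second one constant in the training set.
train = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]]
test = [[4.0, 12.0]]

means, stds = fit_scaler(train)
train_scaled = apply_scaler(train, means, stds)
test_scaled = apply_scaler(test, means, stds)
print(train_scaled)
print(test_scaled)
```

Reusing the training statistics keeps the two sets on the same scale and avoids leaking information from the test set into preprocessing.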

Tim: We included some difference and interaction terms. However, we didn’t scale the numerical features or use any unsupervised dimension reduction techniques. I briefly tried to do supervised feature learning with H2O Deep Learning – it gave me really impressive results in cross-validation, but broke down on the test set.

Editor’s note: L1/L2/Dropout regularization or fewer neurons can help avoid overfitting

Which supervised learning algorithms did you try and to what success?

Tim: I tried H2O’s implementation of Gradient Boosting, Random Forest, Deep Learning (MLP with stochastic gradient descent), and the standard R implementation of SVM and k-NN. k-NN performed poorly, as did SVM, and Deep Learning overfit, as I already mentioned. The tree-based methods both performed very well in our initial tests. We finally settled on Random Forest, since it gave the best results and was faster to train than Gradient Boosting.

Gaston: We tried KNN, SVM and Random Forest, all from different packages, without great results. Finally we used H2O’s implementation of GBM; we ended up using this model because it gives a lot of freedom in the model design. The model we used had the following attributes: Number of trees: 250; Maximum Depth: 18; Minimum Rows: 10; Shrinkage: 0.1.

What feature selection techniques did you try?

Tim: We didn’t try anything fancy (like LASSO) for this challenge. Instead, we decided to take advantage of the fact that random forests can compute feature importances. I used this to code my own recursive elimination procedure. At each iteration, a random forest was trained and cross-validated (ten-fold). The feature importances were computed, the two worst features were discarded, and the next iteration began with the remaining features. The resulting cross-validation errors at each stage made up a nice “textbook-like” curve, where the error first decreased with fewer features and then increased sharply at the end. We then chose the set of features that gave the second-best cross-validation error, so as not to overfit through feature selection.
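Editor’s note: the recursive elimination loop Tim describes can be sketched roughly as below. This is not his code. The `evaluate` callback stands in for "train a random forest, cross-validate it, and return the error plus feature importances"; here it is replaced by a toy function with made-up importances so that the sketch runs without any machine learning library. All feature names other than Elevation are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# evaluate(features) -> (cross-validation error, importance per feature)
Evaluate = Callable[[List[str]], Tuple[float, Dict[str, float]]]
History = List[Tuple[List[str], float]]


def recursive_elimination(features: List[str], evaluate: Evaluate,
                          drop_per_round: int = 2) -> History:
    """Repeatedly evaluate, then drop the least important features."""
    history: History = []
    current = list(features)
    while True:
        cv_error, importances = evaluate(current)
        history.append((list(current), cv_error))
        if len(current) <= drop_per_round:
            break  # dropping again would leave no features
        ranked = sorted(current, key=lambda f: importances[f])
        dropped = set(ranked[:drop_per_round])
        current = [f for f in current if f not in dropped]
    return history


def second_best(history: History) -> Tuple[List[str], float]:
    """Pick the feature set with the second-lowest error."""
    ranked = sorted(history, key=lambda item: item[1])
    return ranked[1] if len(ranked) > 1 else ranked[0]


# Toy stand-in for the random forest: fixed, made-up importances.
IMPORTANCE = {"Elevation": 5.0, "slope": 3.0, "hydro_dist": 2.0, "aspect": 1.0,
              "noise_a": 0.1, "noise_b": 0.2, "noise_c": 0.3, "noise_d": 0.4}
USEFUL = {"Elevation", "slope", "hydro_dist", "aspect"}


def toy_evaluate(features: List[str]) -> Tuple[float, Dict[str, float]]:
    """Error grows with noise features kept and with useful features lost."""
    noise_kept = sum(1 for f in features if f not in USEFUL)
    useful_lost = sum(IMPORTANCE[f] for f in USEFUL if f not in features)
    error = 0.05 + 0.01 * noise_kept + 0.02 * useful_lost
    return error, {f: IMPORTANCE[f] for f in features}


history = recursive_elimination(list(IMPORTANCE), toy_evaluate)
chosen_features, chosen_error = second_best(history)
for kept, err in history:
    print(len(kept), round(err, 2))
print(chosen_features)
```

With the toy function the error falls as the noise features are removed and jumps once useful features start to go, which mirrors the curve described in the answer above.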

Gaston: Actually, we did not do any feature selection other than removing the variables that had no variance, of which, if I am not mistaken, there was one in the original dataset (before feature creation). Nor did we turn the binary variables into categorical ones (one for wilderness areas and one for soil type). We took the naïve approach of sticking with the fire-risk story no matter what; maybe next time we will change the approach.
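Editor’s note: the two steps Gaston mentions, dropping zero-variance columns and (the step his team skipped) collapsing binary indicator columns into a single categorical variable, could look like the sketch below. The column names and values are illustrative assumptions, not taken from the competition files.

```python
from typing import Dict, List

Row = Dict[str, float]


def zero_variance_columns(rows: List[Row]) -> List[str]:
    """Names of columns that take a single value across all rows."""
    return [name for name in rows[0] if len({row[name] for row in rows}) == 1]


def collapse_indicators(row: Row, prefix: str) -> str:
    """Turn the one-hot columns sharing a prefix into one category label."""
    active = [name for name, value in row.items()
              if name.startswith(prefix) and value == 1]
    if len(active) != 1:
        raise ValueError(f"expected exactly one active {prefix} column, got {active}")
    return active[0]


# Toy rows with hypothetical column names; Soil_Type3 never varies.
rows = [
    {"Elevation": 2596, "Soil_Type1": 1, "Soil_Type2": 0, "Soil_Type3": 0},
    {"Elevation": 2804, "Soil_Type1": 0, "Soil_Type2": 1, "Soil_Type3": 0},
    {"Elevation": 2785, "Soil_Type1": 1, "Soil_Type2": 0, "Soil_Type3": 0},
]

constant = zero_variance_columns(rows)
soil = [collapse_indicators(row, "Soil_Type") for row in rows]
print(constant)
print(soil)
```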

Why did you use H2O and what were the major benefits?

Tim: We were constrained by our teachers in the sense that we could only use R – that forced me out of my scikit-learn comfort zone. So I looked for something as accurate and fast. As an occasional Kaggler, I am familiar with Arno’s forum post, and so I decided to give H2O a shot – and I didn’t regret it at all. Apart from the nice R interface, the major benefit is the strong parallelization – this way we were able to make the most of our AWS academic grants.

Gaston: I came across H2O just by searching the web and reading about the alternatives available within R after the GBM package proved really untestable. Just to add to what Tim said, I think H2O will be my weapon of choice in the near future.

For a more detailed description of the methods used and results obtained, see the report of Gaston’s and Tim’s teams.


Renyi Hour: Searching for information traces in brain (growing) data

On March 12th we had a more than interesting talk by Adrià Tauste.


Modern neural recording technologies can monitor the activity of an ever-increasing number of neurons simultaneously. Conceptually, this exciting moment urges a paradigm shift from single-unit (neuron) to network (population) hypotheses and analysis for understanding brain functions and disorders. In practice this motivates the application of advanced multivariate tools to population analysis guided by newly formulated questions. In this context, I am interested in analyzing how information about external stimuli is encoded, communicated and transformed by neurons to produce behavior. Motivated by this question I will present a study of simultaneous single-neuron recordings from two monkeys performing a decision-making task based on previously perceived stimuli. By using a non-parametric method to estimate directional correlations between many pairs of neurons we were able to infer a distributed network of interactions that was activated during the key stages of the task. Interestingly, these interactions mostly vanished when the monkeys received the stimuli but had no incentive to perform the task. I will end by discussing new directions of this work along both biological and methodological lines.

Link to the Presentation

Renyi Hour: 3D and enhancing the user experience in (Big) Data visualization

On Feb 19th, we had an amazing talk by Professor Josep Blat:

Visualizations usually deal with a range of objectives, from exploration of raw data to presentation of analysed data – and different combinations of both extremes.

In this talk we will discuss some interactive visualizations, mostly using 3D graphics, of data coming from sport (for instance, football matches or regattas) and digital cinema, especially from the point of view of the potential user and of what can be added to the user’s experience.

Link to the Presentation

Visit their research group site to see their ongoing projects!

Renyi Hour: Using online ratings to better understand the effect of popularity on evaluations

This past Thursday the 12th, we had a really interesting talk by Gael Le Mens.

In the presentation, I will give a quick summary of the theory and explain how I used a large dataset of online ratings (all the ratings of all the restaurants in San Francisco since 2004 on the website) to test the assumptions of the model and measure the extent to which a sampling account can explain the association between popularity and quality estimates. Time permitting, I will discuss related data analyses pertaining to the rating behavior of online user communities.

People often evaluate popular alternatives more positively than unpopular alternatives. This has been attributed to inferences about quality on the basis of popularity, motivated cognition or mere exposure. In this paper, we propose an alternative explanation for the evaluative advantage of popular alternatives. Our theory emphasizes the role of the information samples people have about popular and unpopular alternatives. Under the assumption that, after a poor experience, people are more likely to sample again popular alternatives than unpopular alternatives, we show that systematic information biases will emerge. This information bias frequently provides popular alternatives with an evaluative advantage as compared to unpopular alternatives. Our sampling-based account complements existing explanations that focus on how people process information about popular and unpopular alternatives.
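The sampling mechanism in the abstract can be illustrated with a small simulation. This is only a sketch under assumed parameters, not the model from the paper: both alternatives have the same true quality (mean 0), an agent’s estimate is the average of their experiences so far, and after a poor impression the agent returns with a high probability if the alternative is popular and a low probability if it is not. The probabilities 0.9 and 0.1, the number of agents and the number of periods are all made up.

```python
import random


def simulate_mean_estimate(p_resample_after_bad: float, n_agents: int = 5000,
                           n_periods: int = 20, seed: int = 0) -> float:
    """Average final estimate of an alternative whose true mean quality is 0."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_agents):
        experiences = [rng.gauss(0.0, 1.0)]  # the first experience always happens
        estimate = experiences[0]
        for _ in range(n_periods - 1):
            # After a good impression the agent always returns; after a poor
            # one the agent returns only with the given probability.
            if estimate >= 0 or rng.random() < p_resample_after_bad:
                experiences.append(rng.gauss(0.0, 1.0))
                estimate = sum(experiences) / len(experiences)
        total += estimate
    return total / n_agents


popular = simulate_mean_estimate(p_resample_after_bad=0.9)
unpopular = simulate_mean_estimate(p_resample_after_bad=0.1)
print(round(popular, 3), round(unpopular, 3))
```

Because poor first impressions of the unpopular alternative are rarely revisited, they are rarely corrected, so its average estimate ends up lower even though the underlying quality is identical.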

Link to the Presentation

Renyi Hour: SocialPoint

This January the data science students at BGSE were lucky to have Sharon Biggar, Head of Analytics at Socialpoint, come in to talk about her innovative work at the mobile game leader and how they generate insights from 70TB of data about the tens of millions of unique users who interact with a Socialpoint game every month.
Beyond just a snapshot of techniques and “stack”, Sharon covered specific analyses that drove shareholder value, lessons learned in deciding on infrastructure, what she would have done differently and, quite interesting for new-to-market data scientists, the immense growth Socialpoint has experienced in recent years.
For more information about Socialpoint, contact Sharon Biggar; to be our next expert guest, talk to Jordan McIver, data science student here in Barcelona at BGSE.

Renyi Hour: Key takeaways from the Strata+Hadoop Conference (Nov 2014)

On January 8th, Jordan McIver reviewed key takeaways from attending the Strata+Hadoop Conference focused on Big Data which took place in November 2014 here in Barcelona.

Topics covered included summaries of relevant talks and materials available at the conference, interesting organizations and technologies encountered there, and more specific information about contacts and networking progress resulting from the conference. He concluded the talk with a networking discussion.
Slides will be available on request. Please contact Jordan through LinkedIn.