Recently, there has been some controversy around Kaggle. People point out that it frees you from a lot of the seemingly boring, tedious work you have to do in applied machine learning and is not like working on a real business problem. And while I partly agree with them – in practice, it might not always be worth putting in 40 more hours to get an improvement of 10^-20 in ROC-AUC – I think everybody studying data science should do at least one Kaggle competition. Why? Because of the invaluable practical lessons you’ll hardly find in the classroom. And because they are actually fun.

Without doubt, studying Chernoff bounds, proving convergence of gradient descent and examining the multivariate Gaussian distribution have their place in the data science curriculum. Kaggle competitions, however, are for most students probably the first opportunity to get their hands dirty with data sets other than ‘Iris’, ‘German Credit’ or ‘Earnings’, to name a few. This forces us to deal with issues we rarely encounter in the classroom: How do you ‘munge’ a data set that comes in a weird format? What do you do with data that is too large to fit in main memory? How do you actually train, tune and validate that magic support vector machine on a real data set?

Probably the most important benefit of Kaggle competitions, though, is that they introduce us to the ‘tricks of the trade’ – the little bit of ‘dark magic’ that separates somebody who knows about machine learning from somebody who successfully applies machine learning. While there is certainly no single ‘Kaggle approach’, I will use the remainder of this post to share some of my experiences, mixed with some machine learning folk knowledge and Kaggle best practices I picked up during the (few) competitions I have participated in.
Newbies often ask: Do I need new equipment to participate? The answer is: not at all. While some competitors have access to clusters etc. (mostly from work), which is convenient for trying out different things quickly, most competitions are won using commodity hardware. The stack of choice for most kagglers is Python (mostly 2.7.6, because of its module compatibility) along with scikit-learn (sklearn), numpy, pandas and matplotlib. This is what will get you through most competitions. Sometimes people use domain-specific libraries (like scikit-image for image processing tasks) or Vowpal Wabbit (an online learning library for ‘Big Data’). Sklearn is preferred over R by many kagglers for several reasons. First, it is a Python library. This brings cleaner syntax, slightly faster execution, fewer crashes (at least on my laptop) and more efficient memory handling. It also enables us to keep the whole workflow – from data munging over exploration to the actual machine learning – in Python. Pandas brings convenient, fast I/O and an R-like data structure, numpy helps with linear algebra, and matplotlib provides MATLAB-like visualization. For beginners, it might be most convenient to install the Anaconda Python distribution, which contains all the mentioned libraries.
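To make the workflow concrete, here is a minimal sketch of how these pieces fit together: pandas for loading and exploring, numpy for handing the raw matrix to sklearn. The column names and the file name `train.csv` are made up for illustration.

```python
import pandas as pd

# In a real competition you would load the provided file, e.g.:
#   df = pd.read_csv("train.csv")   # file name is hypothetical
# For illustration, build a tiny frame by hand instead:
df = pd.DataFrame({"height_cm": [170, 182, 165],
                   "weight_kg": [70.0, 95.0, 54.0]})

print(df.describe())    # quick numeric summary for exploration
X = df.to_numpy()       # the raw matrix you would feed to sklearn
print(X.shape)          # (3, 2)
```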
Be an Engineer
Let’s face it: you are not going to be the smartest person in the room just because you use a support vector machine. Everybody can do that – it is contained in the library everybody uses and thus reduced to one line of code. Since everybody has access to the same algorithms, it is not about the algorithms. It is about what you feed them. One Kaggle ‘master’ told ‘Handelsblatt’ (an important economic newspaper in Germany): ‘If you want to predict whether somebody has diabetes, most people try height and weight first. But you won’t get a good result before you combine them into a new feature – the body mass index.’ This is consistent with Pedro Domingos’ well-known article ‘A Few Useful Things to Know about Machine Learning’, where he states: ‘Feature engineering is the key […]. First-timers are often surprised by how little time in a machine learning project is spent actually doing machine learning.’
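The BMI example above takes one line of pandas. A minimal sketch (column names and values are invented for illustration):

```python
import pandas as pd

# Hypothetical training data: height and weight alone carry little signal.
df = pd.DataFrame({
    "height_m":  [1.70, 1.82, 1.65],
    "weight_kg": [70.0, 95.0, 54.0],
})

# Engineer the combined feature the model could not easily learn on its own:
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2

print(df["bmi"].round(1).tolist())  # [24.2, 28.7, 19.8]
```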
That said, we should not forget that competitions are often won by slight differences in the test set loss, so patiently optimizing the hyperparameters of the learning algorithm might very well pay off in the end.
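With sklearn, that patient tuning is mostly a matter of letting a grid search run. A hedged sketch on synthetic data – the grid values here are illustrative, not a recommendation:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for real competition data.
X, y = make_classification(n_samples=200, random_state=0)

# Exhaustively cross-validate a small, illustrative hyperparameter grid.
grid = GridSearchCV(
    SVC(),
    param_grid={"C": [0.1, 1, 10], "gamma": ["scale", 0.01]},
    scoring="roc_auc",   # the metric many competitions rank by
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```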
Be A Scientist
My father is a biochemist, a ‘hard scientist’. He does experiments in a ‘real’ lab (not a computer lab). No matter whether an experiment is successful or unsuccessful, he and his grad students meticulously document everything. If they do not document something, it is basically lost work. You should do the same. Document every change you make to your model and the change in performance it brings. Kaggle competitions run over several months and can get stressful towards the end. To keep an overview, see what works and what doesn’t, and really improve, documentation is key. Use at least an Excel file – git might be an even better idea.
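Even a few lines of Python will do as a lab notebook. A minimal sketch of an append-only experiment log (the file name and the scores are made up):

```python
import csv
import datetime

def log_run(path, description, cv_score, lb_score):
    """Append one experiment to a CSV log: date, change, local CV, leaderboard."""
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow(
            [datetime.date.today().isoformat(), description, cv_score, lb_score]
        )

# Hypothetical entry after a feature-engineering change:
log_run("experiments.csv", "added bmi feature", 0.872, 0.868)
```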
Together We are Strong
Learning models are a little bit like humans. They have their strengths and weaknesses, and when you combine them, the weaknesses hopefully cancel out. It has been known for quite some time that ensembles of weak learners (as in random forests or gradient boosting) perform really well. A lot of kagglers use that intuition and take it a step further: they build sophisticated models which perform well, but differently, and combine them in the end. This ranges from simple methods, like averaging predictions or taking a majority vote, to more sophisticated procedures like “stacking” (google it).
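The simplest of these blends, averaging predicted probabilities, is one line of numpy. A sketch with invented predictions from three hypothetical models:

```python
import numpy as np

# Hypothetical predicted probabilities from three dissimilar models.
preds = np.array([
    [0.9, 0.2, 0.6],   # e.g. a random forest
    [0.8, 0.3, 0.4],   # e.g. gradient boosting
    [0.7, 0.1, 0.5],   # e.g. a linear model
])

# Average across models: individual errors tend to cancel out.
blend = preds.mean(axis=0)
print(blend)   # approximately [0.8, 0.2, 0.5]
```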
Some people use the public leaderboard as their validation set. This is tempting, but can be less than optimal, since the leaderboard loss is usually calculated on only a small subset (say 30%) of the test data, which does not have to be representative. The danger of overfitting to that subset is real. To overcome this, do extensive (say, 10-fold) cross-validation locally. That is time-consuming, but easy to do with sklearn – you can let it run overnight. In the end, you can choose two submissions: pick the one that is best on the public leaderboard, and the one that is best in local cross-validation.
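The local cross-validation itself is a one-liner in sklearn. A sketch on synthetic data – in a competition you would substitute your real training set and model:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the competition's training data.
X, y = make_classification(n_samples=300, random_state=0)

# 10-fold CV gives a far more stable estimate than a 30% leaderboard slice.
scores = cross_val_score(
    RandomForestClassifier(random_state=0), X, y, cv=10, scoring="roc_auc"
)
print(scores.mean(), scores.std())
```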
Now go to Kaggle, do the tutorials, and then take part in a real competition!