Can Computers See?

Computers today can process information from images, notably thanks to object recognition. This ability has improved greatly over the past few years and reached human levels on complex tasks in 2015. The algorithm that allows them to do this is called a Convolutional Neural Network (CNN). This method also enabled Google’s AI to beat one of the best Go players in the world and helped build self-driving cars.
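For readers who have never seen one, here is a minimal sketch of a CNN in PyTorch. It is purely illustrative (a toy network on 32x32 images, not the architecture presented in the talk): stacked convolution and pooling layers learn local image filters, and a final linear layer turns them into class scores.

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    """Toy convolutional network: 3-channel 32x32 images -> 10 classes."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # learn local filters
            nn.ReLU(),
            nn.MaxPool2d(2),                             # downsample 32 -> 16
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                             # 16 -> 8
        )
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))

logits = TinyCNN()(torch.randn(1, 3, 32, 32))  # one random image -> 10 class scores
```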

This week, the Renyi Hour invited Dario Garcia-Gasulla from the Barcelona Supercomputing Center, who introduced the Data Science students to this method.

You can find the presentation slides here: 01-06-16_BGSE

 

Challenges in Digital Marketing, Vol. 1

Next week the Big Data conference starts at BGSE, so the Renyi Hour decided to do some preliminary research on one of the topics that will be discussed there: what challenges does digital marketing face? We went directly to the source of information – a practitioner. Sidharth Gummall is a Client Success Manager at Datorama, “a global marketing intelligence company that provides the world’s first marketing data hub to help executives, analysts and IT professionals quickly manage and confidently execute upon vast amounts of data” (according to their LinkedIn profile).

Before asking Sidharth for his opinion on the particular question we had in mind, we squeezed him for some basic knowledge of what exactly digital marketing is and why it is so important. Sidharth highlighted three important aspects:

Goal: before we do any kind of marketing we need to define what the goal of the campaign is – to raise awareness, to increase sales, etc.

Objectives: there are different types – impressions (whether the person has seen the ad), clicks (whether they visit the advertised website) and impact (whether there is actually a sale).

Intermediary: the so-called ad server – it collects the information that is needed (such as the number of clicks, impressions or the customers’ information). Most of the customers’ data comes from cookies, which store your profile information even while you are just browsing through some random websites.

So, having had a glimpse of digital marketing slang, what makes this type of marketing different from any other? The quantity and variety of information you can extract is tremendous – you can find out exactly who the person across the screen is, what he likes, where he is, what he has browsed before, whether he is interested. This is information that TV, radio and flyer ads can rarely provide. Although Sidharth also mentioned that TV spots are still the most expensive marketing tool, digital marketing is conquering the market exponentially, which makes the connection between Big Data and digital marketing obvious. Acquiring data with every click and push of a button on a web-connected device is the reality nowadays, and there is no doubt that its quantity accumulates faster and faster. Therefore, we need to master machine learning tools in order to analyze and extract relevant insights from the big data pile.

Having had this introductory “course”, we are ready to see what challenges digital marketing faces. According to Sidharth there are three major ones:

  1. Ad-blocker policies, which prevent marketing agencies from reaching potential customers
  2. The ad server (the intermediary which collects the data, like the number of views) is not very flexible. For example, let’s say that a website where an ad is placed is modified: the number of views is now called “views” whereas before it was named “impressions”. This prevents the ad server from acquiring the necessary data, because it keeps looking for a field called “impressions”, which no longer exists on the website (a toy illustration follows this list).
  3. Loss of information when users clean their cookies, and the inability to restore that data
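To make challenge 2 concrete, here is a toy sketch (hypothetical field names, not Datorama’s actual system) of how a collector could tolerate a website renaming its “impressions” metric to “views”: it maps whatever field names the site exposes onto the ad server’s own schema.

```python
# Illustrative only: a toy metric collector with an alias map, so a renamed
# field ("impressions" -> "views") does not break data acquisition.
FIELD_ALIASES = {
    "impressions": ["impressions", "views"],
    "clicks": ["clicks", "click_throughs"],
}

def extract_metrics(raw: dict) -> dict:
    """Map the website's field names onto the ad server's canonical schema."""
    return {
        canonical: next((raw[name] for name in aliases if name in raw), None)
        for canonical, aliases in FIELD_ALIASES.items()
    }

print(extract_metrics({"views": 1200, "clicks": 37}))
# {'impressions': 1200, 'clicks': 37}
```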

 

Now backed up with knowledge and Sidharth’s ideas, we are eager to find out what the talk on Tuesday (22.03.2016) will bring. Will the presented challenges be the same, or will we be surprised by other insights?

 


 

BGSE Data Science Kaggle Competition

During our second term as data scientists, the class split into 10 teams of 3 and participated in an “in-class” kaggle competition. It left every team depleted from late-night efforts and many long days spent obsessing over and executing ideas which often resulted in reduced accuracy.

 

“At one point, I nearly threw my laptop at the wall.”  

– a data scientist, moments after the deadline

 

 

The challenge

Predict the popularity, on a scale of 1-5 (ordinal classification), of Mashable news articles using 60 metadata features. Many of these features were useless. You can read more on the UCI Machine Learning Repository.

 

The outcome

The top 2 teams pulled ahead of the pack with accuracies of 0.55567 and 0.56744, while groups 3 through 10 were all within 0.01 (i.e. 1 percentage point) of each other.

A collection of lessons learned from the top 3 groups:

  • Third place used a single random forest, which multiple teams found to be better than multiple forests or a forest ensembled with other models.
  • Second place used an ensemble of 3 random forests and 3 generalized boosted regression models. Leveraging the ordinal nature of the classification, the team grouped the 3’s, 4’s and 5’s together and achieved higher accuracy predicting the resulting 3-class problem.
  • The first place team used a rolling window for prediction (with a single random forest), given the time-series nature of the data.
  • The first place team also created new features from the URL to detect title keywords unique to each class (a sketch of the idea follows below). This method only affected 400 of the 9,644 test observations but improved accuracy by 1% – huge in kaggle-land.
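To make the URL trick concrete, here is a rough sketch of the idea in Python/pandas. It is not the winning team’s code: the column names (url, popularity) and the keyword-purity rule are assumptions for illustration.

```python
import pandas as pd

# Hypothetical column names; Mashable URLs end in the article slug,
# e.g. ".../2014/01/01/some-title-words/".
train = pd.read_csv("train.csv")
train["title_words"] = (train["url"].str.rstrip("/")
                                    .str.split("/").str[-1]
                                    .str.split("-"))

# Title keywords that only ever appear in one popularity class (and at least
# a few times) become simple indicator features.
exploded = train[["title_words", "popularity"]].explode("title_words")
purity = exploded.groupby("title_words")["popularity"].agg(["nunique", "count"])
keywords = purity[(purity["nunique"] == 1) & (purity["count"] >= 3)].index

for kw in keywords:
    train[f"kw_{kw}"] = train["title_words"].apply(lambda words, kw=kw: int(kw in words))
```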

 

Other curiosities

  • Keep systematic, meticulous and obsessive track of your randomness, methods and results.
  • Trust your cross-validation (don’t trust the public leaderboard): The third place team chose to use the submission for which they got the best cross-validation score on the training set, but not the best score on the public leaderboard.
  • Use all your data: many teams extracted new features from the URL.

Dragoncity Hackathon by Aimee Barciauskas

Just beyond the Agbar Tower (Barcelona’s answer to London’s Gherkin) is the home of SocialPoint, a local game development company, where six other data scientists and I arrived at 9am this past Saturday morning. Riding up to the 10th floor, we were presented with 270-degree views of the cityscape and the Mediterranean, ready to participate in Barcelona’s first ever data hackathon.
Sharon Biggar began the introduction at 9:30am: we had 24 hours to predict 7-day churn for SocialPoint’s Dragon City game using data on players’ first 48 hours of game play. I can’t say I played the game, but I have to ask: how adorable are their graphics?


We would be presenting our results the following morning to a panel of 4 judges: Andrés Bou (CEO, SocialPoint), Sharon Biggar (Head of Analytics, SocialPoint), Tim Van Kasteren (Data Scientist, Schibsted Media Group) and … Christian Fons-Rosen?! Awesome! Soon after that welcome surprise, we were unleashed on our data challenge:

 

  • Training data on 524,966 users: each user’s row included a churn value of 0 (not churned) or 1 (churned), in addition to 40 other variables on game play, such as num_sessions, attacks, dragon breedings, cash_spent, and the country where the user registered.
  • Test data for 341,667 users: the same variables, excluding churn.
  • Prizes: prizes would be awarded in both the accuracy track (first and second prizes of €750 and €500) and the business insights track (one prize of €500).
  • Objective evaluation of the accuracy track: Achieve the highest AUC (area under the ROC curve) for an unknown test set.
    • We were provided with a baseline logistic regression model which achieved an AUC of 0.723 (a minimal sketch of such a baseline appears after this list).
    • A public scoreboard was available to test predictions; however, to avoid overfitting, submissions to the public scoreboard were only evaluated against ⅕ of the final test data set.
  • Subjective evaluation of the business insights track: Each group would present to the panel of judges on actionable insights to address churn.
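For reference, a baseline of this kind takes only a few lines in scikit-learn. This is a hedged sketch, not the organizers’ model: the file name, the churn column name and the restriction to numeric features are assumptions.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

train = pd.read_csv("train.csv")                       # hypothetical file name
X = train.drop(columns=["churn"]).select_dtypes(include="number")
y = train["churn"]

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
baseline.fit(X_tr, y_tr)

# AUC on a local hold-out set, analogous to the 0.723 baseline we were given.
print(roc_auc_score(y_val, baseline.predict_proba(X_val)[:, 1]))
```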

 

After poking around at the data all morning, barely nudging the baseline AUC by .01, Roger shot our team to the top of the rankings with a 0.96. It was nice to see our team (the g-priors) on top.

 

Sadly, it didn’t last. After lunch, an organizer announced we couldn’t use the variable that was a near-exact proxy of churn: date_last_logged can predict churn with near-perfect accuracy. The only reason it was not perfect was that some people logged back in after the window for measuring churn had closed. Other teams had found the same trick, but all scores were wiped from the public leaderboard and the lot of us were back down to dreams of a 0.73.

 

The rest of the 24 hours were a cycle of idea, execution, and frustration. Between us we worked through gradient boosting, lasso regression, the business insights presentation and many iterative brute-force analyses of variable combinations, interactions and transformations.

 

Some of us were up until 6:30am, emailing colleagues with new ideas.

 

We arrived at 9am again the next morning to submit our final results and finish our business insights presentation. Christian Fons-Rosen arrived soon after, surprised to see us as well.

 

Well, when I arrived at SocialPoint on Sunday morning, after having slept a beautiful 9 hours, and I saw the faces of Aimee, Roger, and Uwe…  For a moment I thought: ‘Wow, it must have been a great night at Razzmatazz or Sala Apolo.’ … It took me a while to understand what really happened that night!

 

Though it may not have been a scientifically methodical process, if you bang your head against a data set for 24 hours, you will learn something. Maybe participating won’t have an impact on exam results, but it certainly gave this data hacker more insights into how to practically approach a data challenge.

BGSE Data Science Team


 

  1. My team included Roger Cuscó, Harihara Subramanyam Sreenivasan, and Uwe-Herbert Hönig. Domagoj Fizulic, Niti Mishra and Guglielmo Pelino also participated on other teams.
  2. At least that’s what we were told! Full details on the hackathon can be found on BCNAnalytic’s website.
  3. Churn rate is the rate at which users stop playing the game within a specified time period.

 

Personal Approach

The moral from last Wednesday’s Renyi Hour is the “personal approach.” The main message from Alberto Barroso del Toro (Senior Manager, Advanced Business Analytics) and Alan Fortuny Sicart (Senior Data Scientist), both from Indra Business Consulting, is:

“Cuando haces preguntas aleatorias obtienes respuestas aleatorias” – Roberto Rigobón, MIT Sloan School of Management

English translation: “When you ask random questions, you get random answers.”

So although we are dealing with big data, which of course means enormous data quantity, variety, etc., the most important part is not to look at it as just random facts or digits. You need to understand the business and get to know the field. Only then can you ask the “specific” questions, seek the “specific” answers and, hopefully, see how to optimize the process. To be successful, researchers need not only the usual package of data analytics skills; they should also leave their comfortable chairs, “get dirty” and explore the business field.

Using H2O for Kaggle: Guest Post by Gaston Besanson and Tim Kreienkamp

From the H2O blog: http://h2o.ai/blog/2015/05/h2o-kaggle-guest-post/


In this special H2O guest blog post, Gaston Besanson and Tim Kreienkamp talk about their experience using H2O for competitive data science. They are both students in the new Master of Data Science Program at the Barcelona Graduate School of Economics and used H2O in an in-class Kaggle competition for their Machine Learning class. Gaston’s team came in second, scoring 0.92838 in overall accuracy, slightly surpassed by Tim’s team with 0.92964, on a subset of the famous “Forest Cover” dataset.

What is your background prior to this challenge?

Tim: We both are students in the Master of Data Science at the Graduate School of Economics in Barcelona. I come from a business background. I took part in a few Kaggle challenges before, but didn’t have a formal machine learning background before this class.

Gaston: I have a mixed background in Economics, Finance and Law. With no prior experience on Kaggle or Machine Learning other than Andrew Ng’s online course :).

Could you give a brief introduction to the dataset and the challenges associated with it?

Tim: The good thing about this dataset is that it is relatively “clean” (no missing values, etc.) and small (7 MB of training data). This allows for fast iteration and testing out a couple of different methods and hunches relatively quickly (relatively – a classmate of ours spent $300 on AWS trying to train support vector machines). The main challenge I see is the multiclass nature – this always makes it harder, as basically one has to train 7 models (due to the one-vs-all nature of multiclass classification).

Gaston: Yes, this dataset is a classic on Kaggle: Forest Cover Type Prediction. As Tim said, and adding to it, there are 7 types of trees and 54 features (10 quantitative variables, like Elevation, and 44 binary variables: 4 binary wilderness areas and 40 binary soil types). What came to our attention was how highly unbalanced the dataset was: classes 1 and 2 represented 80% of the training data.

What feature engineering and preprocessing techniques did you use?

Gaston: Our team added an extra layer to this competition: to predict the type of tree in a region as well as possible, with the purpose of minimizing fires. Even though we used the same loss for each type of misclassification – in other words, all trees are equally important – we decided to create new features. We created six new variables to try to identify features important to fire risk, and we applied a normalization to all 60 features on both the training and the test sets.

Tim: We included some difference and interaction terms. However, we didn’t scale the numerical features or use any unsupervised dimension reduction techniques. I briefly tried to do supervised feature learning with H2O Deep Learning – it gave me really impressive results in cross-validation, but broke down on the test set.

Editor’s note: L1/L2/Dropout regularization or fewer neurons can help avoid overfitting
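Following the editor’s note, here is a hedged sketch of what those regularization options look like, written against h2o’s Python interface (the class was constrained to R, but the same parameters exist in the R interface). The file and column names are assumptions based on the Forest Cover data.

```python
import h2o
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

h2o.init()
train = h2o.import_file("train.csv")                  # hypothetical file name
train["Cover_Type"] = train["Cover_Type"].asfactor()  # multiclass target
features = [c for c in train.columns if c != "Cover_Type"]

# Fewer neurons plus L1/L2 penalties and dropout, per the editor's note.
dl = H2ODeepLearningEstimator(
    activation="RectifierWithDropout",
    hidden=[64, 64],                    # smaller than the default [200, 200]
    l1=1e-5, l2=1e-5,                   # weight penalties
    input_dropout_ratio=0.1,
    hidden_dropout_ratios=[0.2, 0.2],
    epochs=20,
    nfolds=5,                           # cross-validate to spot overfitting early
)
dl.train(x=features, y="Cover_Type", training_frame=train)
```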

Which supervised learning algorithms did you try and to what success?

Tim: I tried H2O’s implementation of Gradient Boosting, Random Forest, Deep Learning (MLP with stochastic gradient descent), and the standard R implementation of SVM and k-NN. k-NN performed poorly, so did SVM – Deep Learning overfit, as I already mentioned. The tree based methods both performed very well in our initial tests. We finally settled for Random Forest, since it gave the best results and was faster to train than Gradient Boosting.

Gaston: We tried k-NN, SVM and Random Forest, all from different packages, with not-so-great results. Finally we used H2O’s implementation of GBM – we ended up using this model because it introduces a lot of freedom into the model design. The model we used had the following attributes: Number of trees: 250; Maximum Depth: 18; Minimum Rows: 10; Shrinkage: 0.1.
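For reference, those settings translate roughly as follows in h2o’s Python interface (Gaston’s team worked in R; the file and column names here are assumptions).

```python
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator

h2o.init()
train = h2o.import_file("train.csv")                  # hypothetical file names
test = h2o.import_file("test.csv")
train["Cover_Type"] = train["Cover_Type"].asfactor()
features = [c for c in train.columns if c != "Cover_Type"]

gbm = H2OGradientBoostingEstimator(
    ntrees=250,      # "Number of trees: 250"
    max_depth=18,    # "Maximum Depth: 18"
    min_rows=10,     # "Minimum Rows: 10"
    learn_rate=0.1,  # "Shrinkage: 0.1"
)
gbm.train(x=features, y="Cover_Type", training_frame=train)
preds = gbm.predict(test)
```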

What feature selection techniques did you try?

Tim: We didn’t try anything fancy (like LASSO) for this challenge. Instead, we decided to take advantage of the fact that random forests can compute feature importances. I used this to code my own recursive elimination procedure. At each iteration, a random forest was trained and cross-validated (ten-fold), the feature importances were computed, the worst two features were discarded, and the next iteration began with the remaining features. The resulting cross-validation errors at each stage made up a nice “textbook-like” curve, where the error first decreased with fewer features and at the end increased sharply again. We then chose the set of features that gave the second-best cross-validation error, so as not to overfit by feature selection.
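Tim’s procedure can be sketched in a few lines of scikit-learn. This is a reconstruction of the idea for illustration, not his original code.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def recursive_elimination(X, y, min_features=10, drop_per_round=2):
    """Backwards elimination driven by random-forest feature importances.

    Returns a list of (feature_subset, mean 10-fold CV accuracy) pairs, one per
    iteration, so the 'textbook-like' error curve can be inspected afterwards.
    """
    features = list(X.columns)
    history = []
    while len(features) >= min_features:
        rf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
        score = cross_val_score(rf, X[features], y, cv=10).mean()  # ten-fold CV
        rf.fit(X[features], y)
        history.append((list(features), score))
        # Drop the two least important features and iterate.
        worst = set(np.argsort(rf.feature_importances_)[:drop_per_round])
        features = [f for i, f in enumerate(features) if i not in worst]
    return history
```

One would then pick the subset with the second-best cross-validation score, as described above, rather than the very best.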

Gaston: Actually, we did not do any feature selection other than removing the variables that had no variance, which if I am not mistaken was one variable in the original dataset (before feature creation). Nor did we turn the binary variables into categorical ones (one for wilderness areas and one for soil type). We had a naïve approach of sticking with the story of fire risk no matter what; maybe next time we will change the approach.

Why did you use H2O and what were the major benefits?

Tim: We were constrained by our teachers in the sense that we could only use R – that forced me out of my scikit-learn comfort zone. So I looked for something similarly accurate and fast. As an occasional Kaggler, I am familiar with Arno’s forum post, and so I decided to give H2O a shot – and I didn’t regret it at all. Apart from the nice R interface, the major benefit is the strong parallelization – this way we were able to make the most of our AWS academic grants.

Gaston: I came across H2O just by searching the web and reading about alternatives within R possibilities after the GBM package proved really untestable. Just to add to what Tim said, I think H2O will be my weapon of choice in the near future.

For a more detailed description of the methods used and results obtained, see the report of Gaston’s and Tim’s teams.

Under the Curve: A primer on kaggle competitions

Why Kaggle? 

Recently, there has been some controversy around kaggle. People point out that it frees you from a lot of the seemingly boring, tedious work you have to do in applied machine learning and is not like working on a real business problem. And while I partly agree with them – in practice, it might not always be worth putting in 40 more hours to get an improvement of 10^-20 in ROC-AUC – I think everybody studying data science should do at least one kaggle competition. Why? Because of the invaluable practical lessons you’ll hardly find in the classroom. And because they are actually fun.

Without doubt, studying Chernoff bounds, proving convergence of gradient descent and examining the multivariate Gaussian distribution have their place in the data science curriculum. Kaggle competitions, however, are for most students probably the first opportunity to get their hands dirty with data sets other than ‘Iris’, ‘German Credit’ or ‘Earnings’, to name a few. This forces us to deal with issues we rarely encounter in the classroom. How do you ‘munge’ a dataset that is in a weird format? What do you do with data that is too large to fit in main memory? Or how do you actually train, tune and validate that magic support vector machine on a real data set?

Probably the most important benefit of kaggle competitions, though, is that they introduce us to the “tricks of the trade”, the little bit of ‘dark magic’ that separates somebody who knows about machine learning from somebody who successfully applies machine learning. While this is certainly not a definitive guide, I will use the remainder of this post to share some of my experiences, mixed with some machine learning folk knowledge and kaggle best practices I picked up during the (few) competitions I have participated in.

The Setup

Newbies often ask: do I need new equipment to participate? The answer is: not at all. While some competitors have access to clusters etc. (mostly from work), which is probably convenient for trying out different things fast, most competitions are won using commodity hardware. The stack of choice for most kagglers is Python (mostly 2.7.6, because of its module compatibility) along with scikit-learn (sklearn), numpy, pandas and matplotlib. This is what will get you through most competitions. Sometimes people use domain-specific libraries (like scikit-image for image processing tasks) or Vowpal Wabbit (an online learning library for ‘Big Data’). Sklearn is preferred over R by many kagglers for several reasons. First, it is a Python library. This brings cleaner syntax, slightly faster execution, fewer crashes (at least on my laptop) and more efficient memory handling. It also lets us keep the whole workflow – from data munging, through exploration, to the actual machine learning – in Python. Pandas brings convenient, fast I/O and an R-like data structure, while numpy helps with linear algebra and matplotlib provides matlab-like visualization. For beginners, it might be most convenient to install the Anaconda Python distribution, which contains all the mentioned libraries.
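As a deliberately minimal example of that workflow, here is what an end-to-end kaggle baseline often looks like with this stack; file and column names are placeholders.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

train = pd.read_csv("train.csv")   # pandas: fast I/O, R-like data frame
test = pd.read_csv("test.csv")

X = train.drop(columns=["id", "target"])
y = train["target"]

model = RandomForestClassifier(n_estimators=300, n_jobs=-1).fit(X, y)

submission = pd.DataFrame({"id": test["id"], "target": model.predict(test[X.columns])})
submission.to_csv("submission.csv", index=False)
```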

Be an Engineer

Let’s face it: you are not going to be the smartest guy in the room just because you use a support vector machine. Everybody can do that. It is contained in the library everybody uses and thus reduced to writing one line of code. Since everybody has access to the same algorithms, it is not about the algorithms. It is about what you feed them. One kaggle ‘master’ told ‘Handelsblatt’ (an important economic newspaper in Germany): ‘If you want to predict if somebody has diabetes, most people try height and weight first. But you won’t get a good result before you combine them into a new feature – the body mass index.’ This is consistent with Pedro Domingos’ well-known article ‘A Few Useful Things to Know about Machine Learning’, where he states: ‘Feature engineering is the key […]. First-timers are often surprised by how little time in a machine learning project is spent actually doing machine learning.’
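The diabetes example fits in two lines of pandas; the column names are of course made up.

```python
import pandas as pd

# Raw height and weight hide the signal; the engineered feature (BMI) exposes it.
patients = pd.DataFrame({"height_m": [1.80, 1.65, 1.72], "weight_kg": [95, 55, 88]})
patients["bmi"] = patients["weight_kg"] / patients["height_m"] ** 2
```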

That said, we should not forget that competitions are often won by slight differences in the test set loss, so patiently optimizing the hyperparameters of the learning algorithm might very well pay off in the end.
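“Patiently optimizing the hyperparameters” usually means something like a cross-validated grid (or random) search. A minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

grid = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid={"n_estimators": [100, 300],
                "learning_rate": [0.05, 0.1],
                "max_depth": [3, 5]},
    cv=5, scoring="roc_auc", n_jobs=-1)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)  # the small edge that wins competitions
```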

Be A Scientist

My father is a biochemist, a ‘hard scientist’. He does experiments in a ‘real’ lab (not a computer lab). No matter if an experiment is successful or unsuccessful, he and his grad students meticulously document everything. If they do not document something, it is basically lost work. You should do the same. Document every change you make to your model and the change in performance it brings. Kaggle competitions run over several months and can be stressful at the end. To keep an overview, see what works and what doesn’t, and really improve, documentation is key. Use at least an Excel file – git might be an even better idea.

Together We are Strong

Learning models are a little bit like humans: they have their strengths and weaknesses, and when you combine them, the weaknesses hopefully cancel out. It has been known for quite some time that ensembles of weak learners (like random forests, or gradient boosting) perform really well. A lot of kagglers take that intuition a step further. They build sophisticated models which perform well, but differently, and combine them in the end. This ranges from simple methods, like averaging predictions or taking a vote, to more sophisticated procedures like “stacking” (google it).
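A minimal version of the simple end of that spectrum, on synthetic data: three different models whose predicted probabilities are blended by a plain average (stacking would instead train a second-level model on those predictions).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, n_features=25, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

models = [RandomForestClassifier(n_estimators=200, random_state=1),
          GradientBoostingClassifier(random_state=1),
          LogisticRegression(max_iter=1000)]

# Average the predicted probabilities of the individual models.
probs = np.mean([m.fit(X_tr, y_tr).predict_proba(X_te)[:, 1] for m in models], axis=0)
blended_predictions = (probs > 0.5).astype(int)
```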

Don’t Overfit

Some people use the public leaderboard as the validation set. This is tempting, but might be less than optimal, since the loss is usually only calculated on a small subset of the test data (often around 30%) which does not have to be representative. The danger of overfitting to said subset is real. To overcome this, do extensive (say 10-fold) cross-validation locally. That is time consuming, but easy to do with sklearn, and you can run it overnight. In the end, you can choose two submissions: the one that is best on the public leaderboard, and the one that is best in local cross-validation.
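The local validation itself really is one line in sklearn. A sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=5000, n_features=30, random_state=0)

# 10-fold cross-validation: slower than a leaderboard probe, but far more stable.
scores = cross_val_score(RandomForestClassifier(n_estimators=300, n_jobs=-1),
                         X, y, cv=10, scoring="roc_auc")
print(scores.mean(), scores.std())
```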

Now go to kaggle, do the tutorials, and then take part in a real competition!

written by Tim Kreienkamp