Just beyond Agbar Tower (i.e. Barcelona’s Gherkin) is the home of SocialPoint, a local game development company, where six other data scientists and I arrived at 9am this past Saturday. Riding up to the 10th floor, we were presented with 270-degree views of the cityscape and the Mediterranean, ready to participate in Barcelona’s first ever data hackathon.
Sharon Biggar began the introduction at 9:30am: We had 24 hours to predict 7-day churn for SocialPoint’s Dragon City game using data on players’ first 48 hours of game play. I can’t say I played the game, but I can ask, how adorable are their graphics?
We would be presenting our results the following morning to a panel of 4 judges: Andrés Bou (CEO, SocialPoint), Sharon Biggar (Head of Analytics, SocialPoint), Tim Van Kasteren (Data Scientist, Schibsted Media Group) and … Christian Fons-Rosen?! Awesome! Soon after that welcome surprise, we were unleashed on our data challenge:
- Training data on 524,966 users: Each user’s row included a churn value of 0 (not churned) or 1 (churned), in addition to 40 other variables on game play. These included variables such as num_sessions, attacks, dragon breedings, cash_spent, and country where the user registered.
- Test data for 341,667 users: excluding the churn variable
- Prizes: Prizes would be awarded in both the accuracy track (first and second prizes of €750 and €500) and the business insights track (one prize of €500).
- Objective evaluation of the accuracy track: Achieve the highest AUC (area under the ROC curve) for an unknown test set.
- We were provided a baseline logistic regression model which achieved an AUC of 0.723.
- A public scoreboard was available to test predictions; however, to avoid overfitting, submissions to the public scoreboard were evaluated against only ⅕ of the final test data set.
- Subjective evaluation of the business insights track: Each group would present to the panel of judges on actionable insights to address churn.
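For readers who haven't met the accuracy-track metric before, here is a minimal sketch of how AUC can be computed from churn labels and predicted scores. It uses the rank-sum identity in plain Python and is only an illustration: it is not the organizers' scoring code, and the toy labels and scores are made up.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity.

    labels: list of 0/1 values (1 = churned); scores: predicted churn scores.
    Tied scores receive their average rank.
    """
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j to the end of the group of tied scores starting at i.
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    n_pos = sum(label for _, label in pairs)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one churned and one retained user")
    rank_sum_pos = sum(r for r, (_, label) in zip(ranks, pairs) if label == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


# Made-up toy data: three retained users (0) and three churned users (1).
y_true = [0, 0, 1, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.7, 0.9]
print(roc_auc(y_true, y_score))
```

An AUC of 0.5 means the scores rank churners no better than chance, and 1.0 means every churner is scored above every retained player, which is why a baseline of 0.723 left plenty of room.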
After we had poked around at the data all morning, barely nudging the baseline AUC up by 0.01, Roger shot our team to the top of the rankings with a 0.96. It was nice to see our team (the g-priors) on top.
Sadly, it didn’t last. After lunch, an organizer announced that we couldn’t use date_last_logged, a variable that was a near-exact proxy for churn and could predict it with near-perfect accuracy. The only reason it was not perfect was that some people logged back in after the window for measuring churn had closed. Other teams had found the same trick, but all scores were wiped from the public leaderboard and the lot of us were back down to dreams of a 0.73.
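In hindsight, a variable like this is easy to screen for: score every column on its own and flag any that nearly separates churners from everyone else. The sketch below is one hedged way to do that in plain Python. The function names, the 0.95 threshold, and the toy rows are all assumptions for illustration, and `days_since_last_login` is a hypothetical numeric stand-in for date_last_logged.

```python
def single_feature_auc(labels, values):
    """AUC of one raw feature used directly as the churn score (ties count 0.5)."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def flag_leaky_features(rows, label_key, threshold=0.95):
    """Return {feature: auc} for features that alone nearly separate the classes.

    A feature is flagged when its AUC is >= threshold or <= 1 - threshold,
    because a strongly inverse relationship leaks just as much.
    """
    labels = [row[label_key] for row in rows]
    flagged = {}
    for name in rows[0]:
        if name == label_key:
            continue
        auc = single_feature_auc(labels, [row[name] for row in rows])
        if auc >= threshold or auc <= 1 - threshold:
            flagged[name] = auc
    return flagged


# Made-up toy rows; days_since_last_login stands in for date_last_logged.
rows = [
    {"churn": 0, "num_sessions": 9, "days_since_last_login": 0},
    {"churn": 0, "num_sessions": 4, "days_since_last_login": 1},
    {"churn": 0, "num_sessions": 2, "days_since_last_login": 2},
    {"churn": 1, "num_sessions": 5, "days_since_last_login": 6},
    {"churn": 1, "num_sessions": 1, "days_since_last_login": 7},
    {"churn": 1, "num_sessions": 3, "days_since_last_login": 5},
]
print(flag_leaky_features(rows, "churn"))
```

The pairwise comparison is quadratic, so on half a million users it would only be practical on a sample; a rank-based AUC would scale better.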
The rest of the 24 hours was a cycle of idea, execution, and frustration. Between us, we tried gradient boosting, lasso regression, a business insights presentation, and many iterative brute-force analyses of variable combinations, interactions and transformations.
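To give a flavor of the brute-force part, here is a small sketch of how candidate transformations and pairwise interactions might be generated for one player's row. It is not the code we ran: `expand_features`, the choice of log1p, the naming scheme, and the toy values are assumptions, and only the column names come from the data set.

```python
import math
from itertools import combinations


def expand_features(row, numeric_keys):
    """Return a copy of row with log1p transforms and pairwise products added."""
    out = dict(row)
    for key in numeric_keys:
        # log1p tames heavy-tailed counts and is defined at zero.
        out[f"log1p_{key}"] = math.log1p(row[key])
    for a, b in combinations(numeric_keys, 2):
        # One interaction term per unordered pair of variables.
        out[f"{a}_x_{b}"] = row[a] * row[b]
    return out


# Made-up values for a single player.
player = {"num_sessions": 4, "attacks": 10, "cash_spent": 0}
expanded = expand_features(player, ["num_sessions", "attacks", "cash_spent"])
print(sorted(expanded))
```

With 40 variables this already yields 780 pairwise products, which is roughly why the cycle of idea, execution, and frustration never ran out of things to try.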
Some of us were up until 6:30am, emailing colleagues with new ideas.
We arrived at 9am again the next morning to submit our final results and finish our business insights presentation. Christian Fons-Rosen arrived soon after, surprised to see us as well.
“Well, when I arrived on Sunday morning at SocialPoint after having slept beautiful 9 hours and I saw the faces of Aimee, Roger, and Uwe… For a moment I thought: ‘Wow, it must have been a great night at Razzmatazz or Sala Apolo.’ … It took me a while to understand what really happened that night!”
Though it may not have been a scientifically methodical process, if you bang your head against a data set for 24 hours, you will learn something. Maybe participating won’t have an impact on exam results, but it certainly gave this data hacker more insights into how to practically approach a data challenge.
My team included Roger Cuscó, Harihara Subramanyam Sreenivasan, and Uwe-Herbert Hönig. Domagoj Fizulic, Niti Mishra and Guglielmo Pelino also participated on other teams.
At least that’s what we were told! Full details on the hackathon can be found on BCNAnalytics’ website.
Churn rate is the rate at which users stop playing the game within a specified time period.