Competition Report: Monster Legends Datathon

Just a few days after officially starting classes for the first term of the master program, we received an email about a gaming data hackathon organized by SocialPoint – a big name in mobile gaming. The company is spread across 7 floors of a 10-storey building in the heart of Barcelona, with amazing views of the city. Having just finished the brush-ups, we were not quite aware of the kind of data challenges that lay ahead in such competitions, but it was surely going to be a learning experience, so we went ahead and registered a team.

We arrived at the venue on the morning of 15 October to learn about the challenge at hand. We were provided with gameplay data from the company’s most successful game, Monster Legends. If you have some time at your disposal (not quite the case for most of us doing the master), you should try this game, though I should warn you it can be addictive! The basic aim of the game is to collect the strongest monsters and build up your team to win as many battles as possible against other players. And this brings me to the problem we had to solve: given the list of monsters currently owned by a player, predict the monster they don’t already have and are most likely to choose next.

We were given:

  • A training dataset of 70,000 players and all the monsters they owned.
  • A test dataset of 30,000 players, each with one random monster removed from their list of monsters – this removed monster was the one to be predicted.
  • Basic information about each of the 100,000 players (country, sex, platform, days_played, payer_dummy, etc.).
  • Features of each monster available in the game.
  • A baseline Item-Based Collaborative Filtering (IBCF) model in R, which gave an accuracy of 15.5% (a minimal sketch of this kind of model follows below).
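
For readers curious what such a baseline looks like, here is a minimal item-based collaborative filtering sketch in R – purely illustrative, not the organizers’ actual script: build a binary player-by-monster matrix, compute item-item cosine similarities, and score each unowned monster by its total similarity to the monsters the player already owns.

# Minimal IBCF sketch (illustrative, not the provided baseline).
# `ownership` is a binary players x monsters matrix with monster ids as column names.
recommend_ibcf <- function(ownership, player) {
  co    <- crossprod(ownership)                 # co-ownership counts between monsters
  norms <- sqrt(diag(co))
  norms[norms == 0] <- 1                        # guard against monsters nobody owns
  sim   <- co / outer(norms, norms)             # item-item cosine similarity
  diag(sim) <- 0
  owned  <- ownership[player, ] == 1
  scores <- colSums(sim[owned, , drop = FALSE]) # similarity to the player's monsters
  scores[owned] <- -Inf                         # never recommend an already-owned monster
  names(which.max(scores))                      # predicted next monster
}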

There were 2 tracks on which each team would be judged:

  1. Objective evaluation of the precision of the predictions, tracked on a live leaderboard – submissions were scored on only 20% of the held-out solutions to discourage overfitting.
  2. Subjective evaluation of a Business Insights presentation, to be delivered the next morning, with suggestions to SocialPoint on how to possibly increase revenue from the game, based on our analysis of the provided data.

There were cash prizes for both tracks – 1st and 2nd prizes of €750 and €500 for the accuracy track, and a single prize of €500 for the business insights track.

We would be presenting our results to Horacio Martos (CEO, SocialPoint), Sharon Biggar (Head of Analytics, SocialPoint), Manuel Bruscas (Director of Analytics, eDreams) and our program co-director Christian Fons-Rosen.

With all this information at our disposal, the 24-hour countdown started and we put on our thinking caps to brainstorm. We began by visualizing as much data as possible to get a general sense of what we had been given. Apparently our brains were waiting for some calorie intake, because only after lunch (a fancy courtesy of the organizers) did we come up with a basic idea: model the game as a chronological process and predict the next monster for a player based on the player’s current level.
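
Very roughly, and with the caveat that our actual implementation differed in the details, the idea can be sketched as follows: estimate a typical acquisition level for each monster from the training data (e.g. the median level of its owners) and recommend, for each test player, the unowned monster whose typical level is closest to the player’s own level. The column names below (player_id, monster_id, level) are illustrative rather than the exact names in the provided files.

# Level-based sketch of the chronological idea (one possible reading, not our exact code).
# `train` has one row per (player_id, monster_id) pair; `players` has player_id and level.
typical_level <- tapply(
  players$level[match(train$player_id, players$player_id)],
  train$monster_id,
  median
)

recommend_by_level <- function(player_level, owned_monsters) {
  candidates <- setdiff(names(typical_level), owned_monsters)
  candidates[which.min(abs(typical_level[candidates] - player_level))]
}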

Two of us set out to implement this approach while the other two continued tweaking the baseline model in the hope of achieving better precision. However, despite several submissions and hours of Stack Overflow and reading up on various techniques, we failed to cross the 15% mark.

Much to our disappointment, our chronological approach didn’t give very promising results either, and DeepYellow (our team) was still lingering around 17% by the time we left the venue on Saturday evening. Some of us stayed up all night trying to implement different ideas – clustering players on features other than level, and other techniques such as user-based filtering. I believe the competitiveness of the event – with regular submissions from all teams even at 3 AM – and the prospect of trying new models kept us going.

After all our efforts, we were still at 19% at the end of the 24 hours and clearly not the winning team – the winners managed to achieve 37% accuracy. Given our unsuccessful attempts at analyzing the data, we were not very motivated about the presentation and gave a rather basic one that was not particularly insightful. However, we learned a lot from the presentations of the other teams.

At the end of the event, we were given a tour of the offices and then got back to our lecture slides, albeit not with cash prizes but with ideas on how to tackle a data problem in practice and, hopefully, some newly acquired knowledge. Hackathons are always a fun way of squeezing out your creative side in a limited amount of time, with the added bonus of meeting many like-minded people.

Our team consisted of Akhil Lohia, Nandan Rao, Robert Lange and Roman Kopso (an exchange student at UPC who was included in our team to ensure all teams had 4 members).

About the author (Akhil Lohia):

I graduated with a bachelor’s degree in Economics from the Indian Institute of Technology Kanpur in June 2015. I have been involved in economics research projects in the capacity of a research assistant. My deep fascination with technology and the ever-increasing use of data to provide more useful, context-aware services to users led to my decision to choose the Data Science program at BGSE in a smart city like Barcelona.


Finding an optimal group allocation

The Task

During our first week of class, the BGSE Data Science class of 2017 was asked to form groups of 3 (and one group of 4) for the upcoming group projects of the first term. The question at hand was how, and on what basis, to form these groups.

After some discussions, we decided to go with the following approach: each of the 25 students provides a ranking of their fellow students from 1 (most preferred) to 24 (least preferred), considering each person’s previous academic background, interests, career aspirations, etc. From this input we would then compute the group allocation that is collectively most favorable, i.e. the one that minimizes the sum over all within-group links, weighted by the rankings.
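
In other words, writing r_{ij} for the rank that student i assigned to student j, the objective is

\min_{P} \sum_{g \in P} \sum_{\{i,j\} \subset g} \left( r_{ij} + r_{ji} \right)

where P ranges over all partitions of the students into groups of the required sizes.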

Finding the optimal allocation turned out to be non-trivial. In the following, I discuss a possible approach to the problem.

Representation as Graph
First, note that the problem at hand can be formulated on a complete, undirected graph in which each student is a vertex and the weight of the edge between two vertices A and B is the sum of the rank student A gave to student B and vice versa. The aggregated rank matrix therefore becomes the adjacency matrix of that graph.
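
In R, assuming ranks is the 25 × 25 matrix of submitted rankings (with an unused diagonal), this adjacency matrix is simply the rank matrix plus its transpose:

# symmetric cost matrix: cost[A, B] = rank(A -> B) + rank(B -> A)
cost <- ranks + t(ranks)
diag(cost) <- 0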

The goal now is to form complete subgraphs of size 3 (and one subgraph with 4 vertices) such that the total weight of the within-group edges is minimized.

Brute-Force
A graph with 25 vertices allows for over 2.3\cdot 10^{18} different ways of assigning its vertices to the groups:

{25 \choose 4}{21 \choose 3}{18 \choose 3}{15 \choose 3}{12 \choose 3}{9 \choose 3}{6 \choose 3} = 2308743493056000000

Even accounting for the fact that the seven groups of three are interchangeable (i.e. dividing by 7!), roughly 4.6\cdot 10^{14} distinct partitions remain. A brute-force approach – i.e. computing all possible combinations and then picking the combination with the minimum total edge weight – can therefore be deemed infeasible.
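
The count itself is quickly verified in R:

# one group of 4 and seven groups of 3 drawn from 25 students
choose(25, 4) * prod(choose(seq(21, 6, by = -3), 3))
# [1] 2.308743e+18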

Literature Review

The next step was to review the literature for similar algorithmic problems and associated efficient solutions.
The stable matching (or marriage) problem considers finding a stable matching between two equally sized sets of elements, given an ordering of preferences for each element. It differs from our problem in that each vertex belongs to one of two disjoint sets and only pairwise matching is allowed. Furthermore, the solution to the problem is stable (i.e. no two individuals have an incentive to re-match) but not necessarily optimal from all individuals’ points of view.
In the similar stable roommates problem, all elements (roommates) belong to a single pool; however, the problem is still restricted to pairwise matching, and the efficient algorithm by Irving (1985) does not generalize to larger groups.

Proposed Algorithm
With no ready-to-use algorithm on hand, we were forced to design a new group-matching algorithm.
The proposed algorithm starts from any valid solution and then iteratively improves it by swapping unstable pairs of students from different groups. The produced solution is therefore guaranteed to be stable for pairs of individuals, but not necessarily for 3 or more individuals trading “under the table”. It follows that the algorithm is likely to find a “good” solution, but not necessarily an optimal one.

The pseudo-code of the algorithm is given below:

INPUT: adjacency matrix, group_sizes
OUTPUT: group allocation
--------------------------------------------------------------------------
form groups of given group sizes
initial solution: randomly assign vertices to initial groups
--------------------------------------------------------------------------
WHILE solution has changed
    FOREACH pair of vertices (i, j)
        cur_cost <- sum(edge weights of i in group(i))
                    + sum(edge weights of j in group(j))
        alt_cost <- sum(edge weights of i if in group(j))
                    + sum(edge weights of j if in group(i))
        IF alt_cost < cur_cost
            swap i and j
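
A compact R implementation of this heuristic could look as follows – a sketch along the lines of our script rather than a verbatim copy. It assumes the symmetric cost matrix from above and an integer vector groups holding the initial random assignment:

# Swap heuristic: keep exchanging students between groups while doing so
# lowers the total within-group cost.
allocate_groups <- function(cost, groups) {
  n <- nrow(cost)
  # edge weight between student i and the current members of group g
  member_cost <- function(i, g) sum(cost[i, groups == g & seq_len(n) != i])
  repeat {
    changed <- FALSE
    for (i in 1:(n - 1)) for (j in (i + 1):n) {
      if (groups[i] == groups[j]) next
      cur <- member_cost(i, groups[i]) + member_cost(j, groups[j])
      # after the swap, i joins j's group (without j) and vice versa
      alt <- member_cost(i, groups[j]) + member_cost(j, groups[i]) - 2 * cost[i, j]
      if (alt < cur) {
        groups[c(i, j)] <- groups[c(j, i)]   # swap i and j
        changed <- TRUE
      }
    }
    if (!changed) break
  }
  groups
}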

Complexity

The algorithm compares {N \choose 2}\approx\frac{N^2}{2} pairs of students in each WHILE loop iteration. The number of iterations is harder to bound: in the worst case, the algorithm starts from the worst possible solution and each swap lowers the total cost only by the smallest possible amount, so the number of swaps is limited only by the gap between the worst and the best achievable total cost.

Tests with an implemented version of the algorithm (in R), however, showed that for graphs of size 25, the algorithm converges after an average of 20 swaps and no more than 3 WHILE loop iterations. Thus, the algorithm is believed to be efficient in the average case.

Further tests with different initializations for the same rank matrix confirmed that the algorithm only converges to a local minimum in the solution space. Running it several times with different random initializations and keeping the solution with the lowest total cost (as sketched below) is likely to produce a good allocation, but by no means guarantees the global optimum.
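
For example, one can restart the heuristic from many random initial assignments and keep the cheapest allocation found – again a sketch, reusing cost and allocate_groups from above:

# total within-group cost of an allocation
total_cost <- function(groups) {
  sum(sapply(unique(groups), function(g) {
    members <- which(groups == g)
    sum(cost[members, members]) / 2          # each within-group edge is counted twice
  }))
}

sizes  <- c(4, rep(3, 7))                    # one group of 4, seven groups of 3
starts <- replicate(100, sample(rep(seq_along(sizes), times = sizes)), simplify = FALSE)
runs   <- lapply(starts, function(s) allocate_groups(cost, s))
best   <- runs[[which.min(sapply(runs, total_cost))]]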

Final Note
The described group allocation process based on rankings turned out to be not self-enforcing and therefore was not actually used in the end.

References
Irving, Robert W. (1985), “An efficient algorithm for the “stable roommates” problem”, Journal of Algorithms, 6 (4): 577–595, doi:10.1016/0196-6774(85)90033-1

About The Author (Hans-Peter Höllwirth):
I have several years of work experience in the software industry as a programmer, analyst, and project manager, and graduated from the University of Edinburgh with a bachelor’s degree in Computer Science and Management Science in 2016. My interests in machine learning, computing, and finance led to my decision to join the BGSE Data Science program.

 

 

 

 

return("Hola mundo")

Hello and welcome to the 2016/2017 edition of the Data Science Blog!

I know it is already a little late for the first entry but we have big plans for the future: hackathon reports, technical essays, interviews and personal diary entries are only a few of those. Our general aim is to inform you about all the wonderful details of our experience at the Barcelona Graduate School of Economics and the specifics of the Data Science program.

If there are any topics or questions you are particularly interested in, please let Nandan (nandan.rao@barcelonagse.eu) or me (robert.lange@barcelonagse.eu) know.

Next week Hans-Peter Höllwirth is going to write about one of the first practical problems we encountered during our master’s: finding an optimal group allocation for our projects during the first term.

Rob

P.S.: Here is a picture of almost all of us from the Data Science Networking Evening.

 

Can Computers See?

Computers today have the ability to process information from images, notably thanks to object recognition. This ability has improved greatly over the past few years and reached human-level performance on some complex tasks in 2015. The algorithm that allows them to do this is the Convolutional Neural Network (CNN). This method also enabled Google’s AI to beat one of the best Go players in the world and is a key building block of self-driving cars.

This week, the Renyi Hour invited Dario Garcia-Gasulla from the Barcelona Supercomputing Center, who introduced the Data Science students to this method.

You can find the presentation slides here: 01-06-16_BGSE

 

Meeting with Dr Lahiri


from the series Silicon Valley (2014)

Last Thursday, we had the pleasure of talking with Dr Mayank Lahiri about large-scale algorithms in tech companies and about the job situation in Silicon Valley and in Europe.

Mayank is currently working at IceRoad, a startup he founded. Prior to that, he worked at Google and Facebook. He received his PhD in 2011 from the University of Chicago, where he worked with Prof. Berger-Wolf on “Measuring and Mining Dynamic Networks”.

Dr Lahiri first told us how data warehousing and analysis are done in big tech companies. He presented practical methods for large-scale network analysis and knowledge graph representations, and highlighted the approaches he finds most promising. In this area, at Google, Mayank was part of the team in charge of analyzing and improving the indexing mechanisms, including the web graph and PageRank; at Facebook, he built a system for analytics on Facebook Pages as well as various backend systems.

We then had a Q&A session about data science and startups in Silicon Valley. We talked about Mayank’s background and his work experience in tech companies, and discussed the challenges he faced while creating his startup, IceRoad. He presented his vision of entrepreneurship for data-oriented companies, and in that regard we asked him what he thinks successful tech companies will look like in a few years – a tough one, actually! Finally, we learned about his ambitions for his current project.

Dr Lahiri was really eager to help us and encouraged us to look for job opportunities in Silicon Valley. He confirmed that this is still where most of the exciting data science innovation (and investment) happens. Some of the DS students might join him there!

Physics and Bayesian Statistics

This week’s Renyi Hour session took the BGSE students out of their comfort zone. We had a talk by Johannes Bergström, a postdoctoral researcher at the Universitat de Barcelona, about physics and the implications of Bayesian statistics in this field. As someone totally new to the field of physics, I can say it was quite an adventure and a challenge to fully grasp all the concepts, but as a “hopefully-to-be” data scientist I was happy to see that more and more fields are open not only to the frequentist but also to the Bayesian point of view. If you want to have a taste of what the talk was about, here you can find the slides:

presentation_slides