Pandas and Data Scientists

Brij Kishore Pandey wanted to see whether pandas, a Python library, is a popular tool in the data science community. He analyzed data from Stack Overflow to find out.



You can see the whole article here: Pandas among Data Scientists (Trend Analysis) 



Can Computers See?

Computers today can process information from images, notably thanks to object recognition. This ability has improved greatly over the past few years and reached human levels on complex tasks in 2015. The algorithm that makes this possible is called a Convolutional Neural Network (CNN). This method also helped Google’s AI beat one of the best Go players in the world and is used to build self-driving cars.

This week, the Renyi Hour invited Dario Garcia-Gasulla from the Barcelona Supercomputing Center, who introduced the Data Science students to this method.

You can find the presentation slides here: 01-06-16_BGSE


Challenges in Digital Marketing Vol1

The Big Data conference at BGSE starts next week, so Renyi Hour decided to do some preliminary research on one of the topics that will be discussed there: what challenges does digital marketing face? We went directly to the source of information, a practitioner. Sidharth Gummall is a Client Success Manager at Datorama, “a global marketing intelligence company that provides the world’s first marketing data hub to help executives, analysts and IT professionals quickly manage and confidently execute upon vast amounts of data.” (according to their LinkedIn profile)

Before asking Sidharth for his opinion on the particular question we had in mind, we squeezed him for some basic knowledge of what exactly digital marketing is and why it is so important. Sidharth highlighted three important aspects:

Goal: before we do any kind of marketing we need to define what the goal of the campaign is – to raise awareness, to increase sales, etc.

Objectives: there are different types – impression (whether the person has seen the ad), click (whether they visited the advertised website) and impact (whether there was actually a sale).

Intermediary: it is called an ad server – it collects the information that is needed (like the number of clicks, impressions or just the customers’ information). Most of the customers’ data comes from cookies, which store your profile information even while you are browsing through random websites.

So, with a glimpse of the digital marketing jargon, what makes this type of marketing different from any other? The quantity and variety of information you can extract is tremendous: you can find out exactly who the person on the other side of the screen is, what they like, where they are, what they have browsed before, and whether they are interested. This is information that TV, radio and flyer ads can rarely provide. Sidharth mentioned that TV spots are still the most expensive marketing tool, but digital marketing is taking over the market rapidly, which makes the connection between Big Data and digital marketing obvious. Data is now acquired with every click and every push of a button on a web-connected device, and there is no doubt that it accumulates faster and faster. Therefore, we need to master machine learning tools in order to analyze the big data pile and extract relevant insights from it.

With this introductory “course” behind us, we are ready to see what challenges digital marketing faces. According to Sidharth there are three major ones:

  1. Ad blockers, which prevent marketing agencies from assessing potential customers.
  2. The ad server (the intermediary which collects the data, like the number of views) is not very flexible. For example, let’s say that a website where an ad is placed is modified, and the number of views is now called views whereas before it was named impressions. This prevents the ad server from acquiring the necessary data, because it will look for a field called impressions, which no longer exists in the website’s structure.
  3. Loss of information when users clear their cookies, and the inability to restore that data.
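To make the second challenge concrete, here is a minimal sketch of how a collector could tolerate a renamed field. It is only an illustration under our own assumptions: the alias table, the `read_metric` helper and the sample page data are hypothetical and do not describe how any real ad server works.

```python
# Hypothetical alias table: canonical metric name -> field names a site might use.
FIELD_ALIASES = {
    "impressions": ("impressions", "views"),
    "clicks": ("clicks",),
}


def read_metric(page_data, metric):
    """Return the value of `metric`, trying each known field name in turn."""
    for field in FIELD_ALIASES.get(metric, (metric,)):
        if field in page_data:
            return page_data[field]
    # Fail loudly instead of silently recording nothing.
    raise KeyError(f"no field found for metric {metric!r}")


# Made-up data: the same site before and after the rename.
old_site = {"impressions": 1200, "clicks": 37}
new_site = {"views": 1350, "clicks": 41}

old_count = read_metric(old_site, "impressions")
new_count = read_metric(new_site, "impressions")
print(old_count, new_count)
```

A lookup that only knows the name impressions would fail on the second site; the alias list is one simple way to keep collecting the same number under its new name.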


Now, backed up with knowledge and Sidharth’s ideas, we are eager to find out what the talk on Tuesday (22.03.2016) will cover. Will the challenges presented be the same, or will we be surprised by other insights?


An Alumnus on Campus

This week at the Renyi Hour (RH), we had an interesting speaker: Jordan McIver, a BGSE Data Science alumnus. He tried to show the class of 2016 what it is actually like to be a data scientist in the real world. If you missed his presentation, you can read the interview he gave to the Renyi Hour below and get a glimpse of what you missed. After the interview you can also find his presentation slides and contact information.


RH person: What do you think is the future of DS (data science)?

Jordan: I think it’s going to be the driver of competitive advantage for most industries. As things become digital, predominantly digital, customers will expect a Google-level type of experience. Every industry that doesn’t have data scientists is going to be disrupted. So eventually, you’ll have industries which are disrupted or industries which now have DS as the driver.

RH person: You talked about data hubs earlier. Barcelona is one of them. Why do you think that is? Maybe it’s the location?

Jordan: Well, the big thing is that the city council has done a lot. So a lot of investment has been done to make sure it works. If you go to one area of the town, they took all of those buildings and turned them into accelerators and incubators. Certainly, Barcelona is a great place to live. And there is general engineering talent as well. But Europe needs more hubs. There is overflow in London and Berlin and they can’t have everything. But I think it’s mostly because of the investment.

RH person: You actually worked in the field of DS before. Why did you decide to do a master’s in DS?

Jordan: I started in consulting and analysis of data in SQL and Excel, doing some programming. I was learning something and also applying some statistics. I recognized I didn’t know a lot. I needed to get a kick-up. I wasn’t going to learn these things just on my own. Some people can, but I needed to dive into it. Also, I managed things I didn’t know about. I wanted to be the person who knows the things he manages.

RH person: So now you can say that you know them? The master’s was beneficial?

Jordan: Yes, absolutely. I think it’s not going to be an end point. I don’t know other master’s fields, but potentially in environmental science you take what you know and then apply it. Here you take what you know, find what you like and then go learn it for real. Especially in the application to the real world. So it’s shifting how it works. If you are expecting to come out of it and not have to know a lot more, it’s going to be tough. But if you want to find something you like, it’s great. Finding your area, having a happy life that you’ll also be successful in.

RH person: What did you want to be when you were a small child? I think DS wasn’t it?

Jordan: I wanted to be like a sports player, like a hockey player.

RH person: What would be your advice for the class of 2016?

Jordan: Be grateful for all the knowledge you are presented with. Learn as much of it as you can. Really, break your back and try to learn from the great people you have here. And realize that this is the one year of your life that you’re not working, so find your own way, find what you want to do. Just become like a data evangelist for yourself. And meet as many people as possible. Later you won’t have time to go to meetups, have a beer with people and get up late. Go do it NOW, because when you start a job, it’s going to be a lot more difficult.


Jordan McIver contact information:



PRESENTATION: BGSE_Valtech_Agile_Data_160205

Dragoncity Hackathon by Aimee Barciauskas

Just beyond Agbar Tower (i.e. Barcelona’s Gherkin) is the home of SocialPoint, a local game development company, where six other data scientists [1] and I arrived at 9am this past Saturday morning. Riding up to the 10th floor, we were presented with 270-degree views of the cityscape and the Mediterranean, ready to participate in Barcelona’s first ever data hackathon [2].
Sharon Biggar began the introduction at 9:30am: we had 24 hours to predict 7-day churn [3] for SocialPoint’s Dragon City game using data on players’ first 48 hours of game play. I can’t say I played the game, but I can ask, how adorable are their graphics?


We would be presenting our results the following morning to a panel of 4 judges: Andrés Bou (CEO, SocialPoint), Sharon Biggar (Head of Analytics, SocialPoint), Tim Van Kasteren (Data Scientist, Schibsted Media Group) and … Christian Fons-Rosen?! Awesome! Soon after that welcome surprise, we were unleashed on our data challenge:


  • Training data on 524,966 users: Each user’s row included a churn value of 0 (not churned) or 1 (churned), in addition to 40 other variables on game play. These included variables such as num_sessions, attacks, dragon breedings, cash_spent, and country where the user registered.
  • Test data for 341,667 users: excluding the churn variable
  • Prizes: Prizes would be awarded in both the accuracy track (first and second prizes of €750 and €500) and the business insights track (one prize of €500).
  • Objective evaluation of the accuracy track: Achieve the highest AUC (area under the ROC curve) for an unknown test set.
    • We were provided a baseline logistic regression model which achieved an AUC of 0.723.
    • A public scoreboard was available to test predictions; however, to avoid overfitting, submissions to the public scoreboard were only evaluated against ⅕ of the final test data set.
  • Subjective evaluation of the business insights track: Each group would present to the panel of judges on actionable insights to address churn.
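
For readers who have not met AUC before, here is a minimal sketch of what the accuracy track was measuring. It uses the standard pairwise definition (the probability that a randomly chosen churned user gets a higher predicted score than a randomly chosen retained user). The `roc_auc` function and the toy numbers are our own illustration, not the organizers’ scoring code or the hackathon data.

```python
def roc_auc(labels, scores):
    """AUC as the share of (churned, retained) pairs ranked correctly.

    Ties between a churned and a retained user count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one churned and one retained user")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (made-up values): 1 = churned, 0 = not churned.
churned = [1, 0, 1, 1, 0, 0]
predicted = [0.9, 0.2, 0.6, 0.4, 0.5, 0.1]

auc = roc_auc(churned, predicted)
print(round(auc, 3))
```

An AUC of 0.5 means the ranking is no better than a coin flip and 1.0 means every churned user is ranked above every retained one. The pairwise loop is quadratic, so for data sets of the hackathon’s size you would use a sort-based implementation from a library instead.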


After poking around at the data all morning, barely nudging the baseline AUC by 0.01, Roger shot our team to the top of the rankings with a 0.96. It was nice to see our team (the g-priors) on top.


Sadly, it didn’t last. After lunch, an organizer announced we couldn’t use the variable which was a near exact proxy of churn: date_last_logged can predict churn with near perfect accuracy. The only reason it was not perfect was that some people logged back in after the window for measuring churn had closed. Other teams had found this same trick, but all scores were wiped from the public scoreboard and the lot of us were back down to dreams of a 0.73.


The rest of the 24 hours were a cycle of idea, execution, and frustration. Between us, we worked through gradient boosting, lasso regression, the business insights presentation and many iterative brute-force analyses of variable combinations, interactions and transformations.
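
Why date_last_logged gives the game away can be sketched in a few lines. The exact churn definition is our assumption here (no login during the 7 days that follow the first 48 hours), and the three users are invented for illustration; `label_churn` and `leaky_prediction` are hypothetical helper names.

```python
from datetime import date, timedelta


def label_churn(registered, logins):
    """Assumed definition: churned (1) if there is no login in the
    7 days that follow the first 48 hours of play."""
    start = registered + timedelta(days=2)
    end = start + timedelta(days=7)
    return 0 if any(start <= d < end for d in logins) else 1


def leaky_prediction(registered, date_last_logged):
    """Predict churn from the last login date alone."""
    start = registered + timedelta(days=2)
    return 1 if date_last_logged < start else 0


reg = date(2016, 1, 1)
# Invented users: login dates per user.
users = {
    "stayed": [reg, reg + timedelta(days=4)],
    "left": [reg, reg + timedelta(days=1)],
    "returned_late": [reg, reg + timedelta(days=20)],
}

results = {}
for name, logins in users.items():
    truth = label_churn(reg, logins)
    guess = leaky_prediction(reg, max(logins))
    results[name] = (truth, guess)
    print(name, "churned:", truth, "predicted:", guess)
```

The rule is right for everyone except the user who came back after the window had closed, which is the same exception described above. A feature recorded after the outcome it is supposed to predict will almost always look this good.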


Some of us were up until 6:30am, emailing colleagues with new ideas.


We arrived at 9am again the next morning to submit our final results and finish our business insights presentation. Christian Fons-Rosen arrived soon after, surprised to see us as well.


Well, when I arrived on Sunday morning at SocialPoint after having slept a beautiful 9 hours and I saw the faces of Aimee, Roger, and Uwe… For a moment I thought: ‘Wow, it must have been a great night at Razzmatazz or Sala Apolo.’ … It took me a while to understand what really happened that night!


Though it may not have been a scientifically methodical process, if you bang your head against a data set for 24 hours, you will learn something. Maybe participating won’t have an impact on exam results, but it certainly gave this data hacker more insights into how to practically approach a data challenge.

BGSE Data Science Team


  1. My team included Roger Cuscó, Harihara Subramanyam Sreenivasan, and Uwe-Herbert Hönig. Domagoj Fizulic, Niti Mishra and Guglielmo Pelino also participated on other teams.
  2. At least that’s what we were told! Full details on the hackathon can be found on BCNAnalytics’ website.
  3. Churn rate is the rate at which users stop playing the game within a specified time period.


Personal Approach

The moral of last Wednesday’s Renyi Hour is the “personal approach.” The main message from Alberto Barroso del Toro, Senior Manager Advanced Business Analytics, and Alan Fortuny Sicart, Senior Data Scientist, both from Indra Business Consulting, is:

“Cuando haces preguntas aleatorias obtienes respuestas aleatorias” Roberto Rigobon, MIT Sloan.

English translation: “When you ask random questions, you get random answers.”

So although we are dealing with big data, which of course means enormous data quantity, variety, etc., the most important part is not to look at it as just random facts or digits. You need to understand the business and get to know the field. Only then can you ask the “specific” question, seek the “specific” answers and hopefully see how to optimize the process. To be successful, researchers need not only the usual package of data analytics skills; they should also leave their comfortable chairs, “get dirty” and explore the business field.