Pandas and Data Scientists


Brij Kishore Pandey wanted to see whether pandas, a Python library, is a popular tool in the data science community. He took data from Stack Overflow and came up with the following results:

[Image: results of the trend analysis]

 

You can see the whole article here: Pandas among Data Scientists (Trend Analysis) 

 

 


Can Computers See?

Computers today can process information from images, notably thanks to object recognition. This ability has improved greatly over the past few years and reached human levels on complex tasks in 2015. The algorithm that makes this possible is called a Convolutional Neural Network (CNN). The same method also helped Google’s AI beat one of the best Go players in the world and is used in building self-driving cars.
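To give a feel for the "convolutional" part of the name, here is a minimal sketch of the core operation: sliding a small filter over an image and summing the products at each position. It is not taken from the talk; the function name `convolve2d`, the toy image and the hand-picked filter are all illustrative assumptions. A real CNN stacks many such filters and learns their values from data.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (valid positions only) and sum the products.

    Like most deep learning libraries, this does not flip the kernel, so
    strictly speaking it is a cross-correlation.
    """
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    output = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            total = 0
            for i in range(k_rows):
                for j in range(k_cols):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


# Toy 4x4 "image" (assumption): dark left half (0), bright right half (1).
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# Hand-picked vertical-edge filter (assumption); a trained CNN learns such values.
kernel = [
    [-1, 1],
    [-1, 1],
]

feature_map = convolve2d(image, kernel)
print(feature_map)  # [[0, 2, 0], [0, 2, 0], [0, 2, 0]]
```

The non-zero middle column of the output marks where the image changes from dark to bright, which is the kind of low-level pattern the first layers of a CNN tend to pick up.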

This week, the Renyi Hour invited Dario Garcia-Gasulla from the Barcelona Supercomputing Center, who introduced the Data Science students to this method.

You can find the presentation slides here: 01-06-16_BGSE

 

Physics and Bayesian Statistics

This week in the Renyi Hour session, the BGSE students got out of their comfort zone. We had a talk with Johannes Bergstron, a postdoctoral researcher at Universitat de Barcelona, about physics and the implications of Bayesian statistics in this field. As a person totally new to physics, I can say that it was quite an adventure and a challenge to fully grasp all the concepts, but as a “hopefully-to-be” data scientist I was happy to see that more and more fields are open to the Bayesian point of view as well as the frequentist one. If you want a taste of what the talk was about, you can find the slides here.

presentation_slides

Challenges in Digital Marketing Vol1

Next week the Big Data conference starts at BGSE, so Renyi Hour decided to do some preliminary research on one of the topics that will be discussed there: what challenges does digital marketing face? We went directly to the source of information, a practitioner. Sidharth Gummall is a Client Success Manager at Datorama, “a global marketing intelligence company that provides the world’s first marketing data hub to help executives, analysts and IT professionals quickly manage and confidently execute upon vast amounts of data.” (according to their LinkedIn profile)

Before asking Sidharth for his opinion on the particular question we had in mind, we squeezed him for some basic knowledge of what exactly digital marketing is and why it is so important. Sidharth highlighted three important aspects:

Goal: before we do any kind of marketing we need to define what the goal of the campaign is – to raise awareness, to increase sales, etc.

Objectives: there are different types – impression (whether the person has seen the ad), click (whether the person visited the advertised website) and impact (whether there is actually a sale).

Intermediary: this is called an ad server – it collects the information that is needed (such as the number of clicks, impressions or just the customers’ information). Most of the customers’ data comes from cookies, which store your profile information even while you are just browsing some random websites.

So, having had a glimpse of the digital marketing slang, what makes this type of marketing different from any other? The quantity and variety of information you can extract is tremendous: you can find out exactly who the person on the other side of the screen is, what they like, where they are, what they have browsed before and whether they are interested. This is information that TV, radio and flyer ads can rarely provide. Sidharth mentioned that TV spots are still the most expensive marketing tool, but digital marketing is conquering the market rapidly, which makes the connection between Big Data and digital marketing obvious. Acquiring data with every click and push of a button on a single web-connected device is the reality nowadays, and there is no doubt that the quantity accumulates faster and faster. Therefore, we need to master machine learning tools in order to analyze the big data pile and extract relevant insights from it.

After this introductory “course”, we are ready to see what challenges digital marketing faces. According to Sidharth, there are three major ones:

  1. The ad blocker policy, which prevents marketing agencies from assessing potential customers
  2. The ad server (the intermediary which collects the data, such as the number of views) is not very flexible. For example, let’s say that a website where an ad is placed is modified: the number of views is now called views, whereas before it was called impressions. This prevents the ad server from acquiring the necessary data, because it will look for a field called impressions, which no longer exists in the website’s structure.
  3. Loss of information when users clear their cookies, and the inability to restore that data
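The second challenge can be illustrated with a small sketch. This is not how Datorama or any real ad server works; the field names, the alias table `METRIC_ALIASES` and the helper `read_metric` are hypothetical and only show why a lookup tied to a single field name breaks when a site renames that field, and how a list of known aliases softens the problem.

```python
# Hypothetical alias table: each canonical metric lists the names a site might use.
METRIC_ALIASES = {
    "impressions": ("impressions", "views"),
    "clicks": ("clicks",),
}


def read_metric(page_data, metric):
    """Return the value of `metric`, trying every known alias in order."""
    for name in METRIC_ALIASES.get(metric, (metric,)):
        if name in page_data:
            return page_data[name]
    return None  # the metric is genuinely missing


old_layout = {"impressions": 1200, "clicks": 35}
new_layout = {"views": 1300, "clicks": 40}  # the site renamed the field

# A strict lookup finds nothing after the rename...
strict = new_layout.get("impressions")

# ...while the alias-aware lookup still recovers the number.
tolerant = read_metric(new_layout, "impressions")

print(strict, tolerant)  # None 1300
```

The alias table still has to be maintained by hand, so this only reduces the inflexibility Sidharth describes rather than removing it.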

 

Now, backed up with knowledge and Sidharth’s ideas, we are eager to find out what the talk on Tuesday (22.03.2016) is going to cover. Will the challenges presented be the same, or will we be surprised by other insights?

 

“At one point, I nearly threw my laptop at the wall.”

 

BGSE Data Science Kaggle Competition

During our second term as data scientists, the class split into 10 teams of 3 and participated in an “in-class” Kaggle competition. It left every team depleted from late-night efforts and many long days spent obsessing over and executing ideas which often resulted in reduced accuracy.

 

“At one point, I nearly threw my laptop at the wall.”  

                                                                            a data scientist moments after the deadline

 

 

The challenge

Predict the popularity of Mashable news articles on a scale of 1 to 5 (ordinal classification) using 60 metadata variables. Many of these features were useless. You can read more on the UCI Machine Learning Repository.

 

The outcome

The top 2 teams pulled ahead of the pack with accuracies of 0.55567 and 0.56744, while groups 3 through 10 were all within .01 (i.e. 1 percentage point of accuracy).

A collection of lessons learned from the top 3 groups:

  • Third place used a single random forest, which multiple teams found to be better than multiple forests or a forest ensembled with other models.
  • Second place used an ensemble of 3 random forests and 3 generalized boosted regression models. Leveraging the ordinal nature of the classification, the team grouped the 3’s, 4’s and 5’s together and achieved higher accuracy predicting the resulting 3-class problem.
  • The first place team used a rolling window for prediction, given the time-series nature of the data (with a single random forest).
  • The first place team created new features from the URL to detect title keywords unique to each class. This method only affected 400 of the 9,644 articles in the test set but improved accuracy by 1% – huge in Kaggle-land.
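Two of the ideas above, pulling title keywords out of the URL and merging the top classes, can be sketched in a few lines. This is not the teams' actual code: the helper names, the example URL and the assumption that the title slug is the last path segment of the URL are all illustrative.

```python
import re
from urllib.parse import urlparse


def url_title_keywords(url):
    """Split the last path segment of an article URL into lowercase words.

    Assumes the title slug is the final path segment, for example
    .../2013/01/07/some-article-title/
    """
    path = urlparse(url).path.strip("/")
    slug = path.split("/")[-1] if path else ""
    return [word for word in re.split(r"[^a-z0-9]+", slug.lower()) if word]


def collapse_classes(label):
    """Merge popularity classes 3, 4 and 5 into a single class 3."""
    return min(label, 3)


# Made-up URL in the assumed layout.
example_url = "http://mashable.com/2013/01/07/example-article-title/"
keywords = url_title_keywords(example_url)
print(keywords)  # ['example', 'article', 'title']

labels = [1, 2, 3, 4, 5]
collapsed = [collapse_classes(y) for y in labels]
print(collapsed)  # [1, 2, 3, 3, 3]
```

From here, keyword counts per class could be turned into indicator features for a model, though how the winning team did that is not described above.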

 

Other curiosities

  • Keep systematic, meticulous and obsessive track of your randomness, methods and results.
  • Trust your cross-validation (don’t trust the public leaderboard): The third place team chose to use the submission for which they got the best cross-validation score on the training set, but not the best score on the public leaderboard.
  • Use all your data: many teams extracted new features from the URL.
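The first two points can be combined in a small sketch: a seeded k-fold split, so that every cross-validation score can be reproduced exactly, plus a helper that averages accuracy over the folds. It uses only the standard library. The function names and the majority-class "model" are assumptions for illustration, not what the teams used, and it assumes there are at least as many samples as folds.

```python
import random
from collections import Counter


def k_fold_indices(n_samples, k, seed):
    """Return k (train, validation) index splits from a seeded shuffle."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # same seed, same folds
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, validation))
    return splits


def cross_val_accuracy(fit, predict, X, y, k=5, seed=0):
    """Average validation accuracy over k folds."""
    scores = []
    for train, val in k_fold_indices(len(X), k, seed):
        model = fit([X[i] for i in train], [y[i] for i in train])
        correct = sum(predict(model, X[i]) == y[i] for i in val)
        scores.append(correct / len(val))
    return sum(scores) / len(scores)


# Stand-in "model" (assumption): always predict the most common training label.
def fit_majority(X_train, y_train):
    return Counter(y_train).most_common(1)[0][0]


def predict_majority(model, x):
    return model


X = list(range(20))       # dummy features
y = [1] * 15 + [2] * 5    # dummy labels
score = cross_val_accuracy(fit_majority, predict_majority, X, y, k=5, seed=42)
```

Logging the seed next to each score is one simple way to keep track of your randomness: rerunning with the same seed reproduces the same folds and therefore the same score.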

An Alumnus on Campus

This week in the Renyi Hour (RH), we had an interesting speaker – Jordan McIver, a BGSE Data Science alumnus. He tried to show the class of 2016 what it is actually like to be a data scientist in the real world. If you missed his presentation, you can read the interview he gave to the Renyi Hour below and get a glimpse of what you missed. After the article you can also find his presentation slides and contact information.

 

RH person: What do you think is the future of DS (data science)?

Jordan: I think it’s going to be the driver of competitive advantage for most industries. As things become digital, predominantly digital, customers will expect Google-level type of experience.  Every industry that doesn’t have data scientists is going to be disrupted. So eventually, you’ll have industries which are disrupted or industries which now have DS as the driver.

RH person: You talked about data hubs earlier. Barcelona is one of them. Why do you think that is? Maybe it’s the location?

Jordan: Well, the big thing is that the city council has done a lot. So a lot of investment has been done to make sure it works. If you go to one area of the town, they took all of those buildings and let them be accelerators and incubators. Certainly, Barcelona is a great place to live. And there is general engineering talent as well. But Europe needs more hubs. There is overflow in London and Berlin and they can’t have everything. But I think mostly because of the investment.

RH person: You actually worked in the field of DS before. Why did you decide to do a master’s in DS?

Jordan: I started in consulting and analysis of data in SQL and Excel, doing some programming. I was learning something and also applying some statistics. I recognized I didn’t know a lot. I needed to get a kick-up. I wasn’t going to learn these things just on my own. Some people can, but I needed to dive into it. Also I managed things I didn’t know about. I wanted to be the person who knew the things he manages.

RH person: So now you can say that you know them? The master’s was beneficial?

Jordan: Yes, absolutely. I think it’s not going to be an end point. I don’t know other master fields, but potentially the environmental science – you take what you know and then apply it. Here you take what you know, find what you like and then go learn it for real. Especially in the application to the real world. So it’s shifting how it works. If you are expecting to come out of it and not have to know a lot more, it’s going to be tough. But if you want to find something you like, it’s great. Finding your area, having a happy life that also you’ll be successful in.

RH person: What did you want to be when you were a small child? I think DS wasn’t it?

Jordan: I wanted to be like a sports player, like a hockey player.

RH person: What would be your advice for the class of 2016?

Jordan: Be grateful for all the knowledge you are presented. Learn the most of it. Really, break your back and try to learn from the great people you have here. And realize that this is the one year of your life that you’re not working, so find your own way, find what you want to do. Just become like a data evangelist for yourself. And meet as many people as possible. You won’t have time to go to meetups, have a beer with people and get up late. Go do it NOW, because when you start a job, it’s going to be a lot more difficult.


 

Jordan McIver contact information:

email: jordan.j.mciver@gmail.com

LinkedIn: https://www.linkedin.com/in/jmcdatascience

PRESENTATION: BGSE_Valtech_Agile_Data_160205