Can Computers See?

Computers today can extract information from images, notably through object recognition. This ability has improved greatly over the past few years and reached human levels on complex tasks in 2015. The algorithm that makes this possible is the Convolutional Neural Network (CNN). This method also helped Google’s AI beat one of the best Go players in the world, and it is used in building self-driving cars.

This week, the Renyi Hour invited Dario Garcia-Gasulla from the Barcelona Supercomputing Center, who introduced the Data Science students to this method.

You can find the presentation slides here: 01-06-16_BGSE

 


Physics and Bayesian Statistics

This week in the Renyi Hour session, the BGSE students stepped out of their comfort zone. We had a talk with Johannes Bergstron, a postdoctoral researcher at Universitat de Barcelona, about physics and the implications of Bayesian statistics in this field. As a person totally new to physics, I can say that it was quite an adventure and a challenge to fully grasp all the concepts, but as a “hopefully-to-be” data scientist I was happy to see that more and more fields are open not only to the frequentist but also to the Bayesian point of view. If you want a taste of what the talk was about, you can find the slides here.

presentation_slides

Challenges in Digital Marketing Vol1

Next week the Big Data conference starts at BGSE, so Renyi Hour decided to do some preliminary research on one of the topics that will be discussed there: what challenges does digital marketing face? We went directly to the source of information, a practitioner. Sidharth Gummall is a Client Success Manager at Datorama, “a global marketing intelligence company that provides the world’s first marketing data hub to help executives, analysts and IT professionals quickly manage and confidently execute upon vast amounts of data.” (according to their profile on LinkedIn)

Before asking Sidharth for his opinion on the particular question we had in mind, we squeezed him for some basic knowledge of what exactly digital marketing is and why it is so important. Sidharth highlighted three important aspects:

Goal: before we do any kind of marketing we need to define what the goal of the campaign is – to raise awareness, to increase sales, etc.

Objectives: there are different types – impression (whether the person has seen the ad), click (whether they visited the advertised website) and impact (whether there was actually a sale).

Intermediary: it is called an ad server, and it collects the information that is needed (such as the number of clicks, the number of impressions or just the customers’ information). Most of the customers’ data comes from cookies, which store your profile information even while you are browsing through random websites.

So, with a glimpse of the digital marketing slang, what makes this type of marketing different from any other? The quantity and variety of information you can extract is tremendous: you can find out exactly who the person on the other side of the screen is, what they like, where they are, what they have browsed before and whether they are interested. This is information that TV, radio and flyer ads can rarely provide. Sidharth mentioned that TV spots are still the most expensive marketing tool, but digital marketing is conquering the market rapidly, which makes the connection between Big Data and digital marketing obvious. Data is acquired with every click and push of a button on any web-connected device, and there is no doubt that it accumulates faster and faster. Therefore, we need to master machine learning tools in order to analyze the big data pile and extract relevant insights from it.

With this introductory “course” behind us, we are ready to see what challenges digital marketing faces. According to Sidharth there are three major ones:

  1. Ad blockers, which prevent marketing agencies from assessing potential customers.
  2. The ad server (the intermediary which collects the data, like the number of views) is not very flexible. For example, let’s say that a website where an ad is placed is modified, and the number of views is now called views whereas before it was named impressions. This prevents the ad server from acquiring the necessary data, because it will look for a field called impressions, which no longer exists in the website’s structure.
  3. Loss of information when users clear their cookies, and the inability to restore that data.
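
The second challenge, a renamed field breaking data collection, can be illustrated with a small sketch. This is not how Datorama or any real ad server works; the alias table, the function name and the sample records are all hypothetical, and the point is only that mapping several possible site-side names onto one canonical metric name makes the collector tolerant of a rename.

```python
# Hypothetical alias table: each canonical metric lists the names a site might use.
FIELD_ALIASES = {
    "impressions": ("impressions", "views"),
    "clicks": ("clicks",),
}


def normalize_record(raw):
    """Map a raw site record onto canonical metric names.

    Metrics that appear under none of their known aliases are simply omitted,
    so a missing field is visible to the caller instead of silently zeroed.
    """
    normalized = {}
    for canonical, aliases in FIELD_ALIASES.items():
        for alias in aliases:
            if alias in raw:
                normalized[canonical] = raw[alias]
                break
    return normalized


# Invented example records: the same site before and after its redesign.
old_layout = {"impressions": 1200, "clicks": 37}
new_layout = {"views": 1350, "clicks": 41}

print(normalize_record(old_layout))
print(normalize_record(new_layout))
```

A collector that looks only for the literal key impressions would lose the second record; with the alias table, both layouts yield the same canonical keys.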

 

Now, equipped with this knowledge and Sidharth’s ideas, we are eager to find out what the talk on Tuesday (22.03.2016) is going to cover. Will the challenges presented be the same, or will we be surprised by other insights?

 


 

BGSE Data Science Kaggle Competition

During our second term as data scientists, the class split into 10 teams of 3 and participated in an “in-class” Kaggle competition. It left every team depleted from late-night efforts and many long days spent obsessing over and executing ideas that often resulted in reduced accuracy.

 

“At one point, I nearly threw my laptop at the wall.”  

a data scientist, moments after the deadline

 

 

The challenge

Predict the popularity, on an ordinal scale from 1 to 5 (ordinal classification), of Mashable news articles using 60 metadata variables. Many of these features were useless. You can read more on the UCI Machine Learning Repository.

 

The outcome

The top 2 teams pulled ahead of the pack with accuracies of 0.55567 and 0.56744, while groups 3 through 10 were all within 0.01 of one another (i.e. 1 percentage point of accuracy).

A collection of lessons learned from the top 3 groups:

  • Third place used a single random forest, which multiple teams found to be better than multiple forests or a forest ensembled with other models.
  • Second place used an ensemble of 3 random forests and 3 generalized boosted regression models. Leveraging the ordinal nature of the classification, the team grouped the 3’s, 4’s and 5’s together, and achieved higher accuracy predicting the resulting 3-class problem.
  • The first place team used a rolling window for prediction (with a single random forest), given the time-series nature of the data.
  • The first place team created new features from the URL to detect title keywords unique to each class. This method only affected 400 of the 9644 test-set articles but improved accuracy by 1%, which is huge in Kaggle-land.
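
Two of the ideas above, grouping the top classes and mining the URL for class-specific keywords, can be sketched in a few lines. This is not the teams’ actual code: the function names, the `min_count` threshold and the example.com URLs are all invented for illustration, and the real features were surely more elaborate.

```python
import re
from collections import Counter


def collapse_label(label):
    """Group popularity classes 3, 4 and 5 into one class, giving a 3-class problem."""
    return min(label, 3)


def url_tokens(url):
    """Return the set of lowercase words in the last path segment (the title slug)."""
    slug = url.rstrip("/").split("/")[-1]
    return set(re.findall(r"[a-z]+", slug.lower()))


def class_unique_keywords(urls, labels, min_count=2):
    """For each class, find slug words seen in at least min_count of its
    articles and in no article of any other class."""
    counts = {}
    for url, label in zip(urls, labels):
        counts.setdefault(label, Counter()).update(url_tokens(url))
    unique = {}
    for label, counter in counts.items():
        seen_elsewhere = set()
        for other_label, other_counter in counts.items():
            if other_label != label:
                seen_elsewhere |= set(other_counter)
        unique[label] = {
            word for word, count in counter.items()
            if count >= min_count and word not in seen_elsewhere
        }
    return unique


# Invented toy data, not taken from the competition set.
urls = [
    "http://example.com/2014/01/01/viral-cat-video/",
    "http://example.com/2014/01/02/viral-dog-video/",
    "http://example.com/2014/01/03/quarterly-earnings-report/",
    "http://example.com/2014/01/04/annual-earnings-report/",
]
labels = [5, 5, 1, 1]

keywords = class_unique_keywords(urls, labels)
collapsed = [collapse_label(label) for label in [1, 2, 3, 4, 5]]
```

Each keyword set can then be turned into a binary feature per article (does the slug contain a keyword unique to class k?), which only fires for the minority of articles whose titles contain such words.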

 

Other curiosities

  • Keep systematic, meticulous and obsessive track of your randomness, methods and results.
  • Trust your cross-validation (don’t trust the public leaderboard): The third place team chose to use the submission for which they got the best cross-validation score on the training set, but not the best score on the public leaderboard.
  • Use all your data: many teams extracted new features from the URL.
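
The first two points, tracking your randomness and trusting cross-validation, go together: a cross-validation score is only comparable between experiments if the folds are reproducible. Below is a minimal standard-library sketch under that assumption. The helper names and the majority-class “model” are placeholders for illustration, not what any team used.

```python
import random
from collections import Counter


def k_fold_indices(n_samples, k, seed):
    """Shuffle the sample indices with a fixed seed and deal them into k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_val_accuracy(fit, predict, X, y, k=5, seed=0):
    """Average held-out accuracy over k folds; the same seed gives the same folds."""
    scores = []
    for held_out in k_fold_indices(len(X), k, seed):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        model = fit([X[i] for i in train], [y[i] for i in train])
        correct = sum(predict(model, X[i]) == y[i] for i in held_out)
        scores.append(correct / len(held_out))
    return sum(scores) / len(scores)


def fit_majority(X, y):
    """Placeholder model: remember the most frequent training label."""
    return Counter(y).most_common(1)[0][0]


def predict_majority(model, x):
    """Placeholder prediction: always return the remembered label."""
    return model


# Invented toy data: 20 samples, 12 of class 1 and 8 of class 2.
X = [[i] for i in range(20)]
y = [1] * 12 + [2] * 8

score = cross_val_accuracy(fit_majority, predict_majority, X, y, k=5, seed=42)
print(score)
```

Logging the seed next to every score makes it possible to rerun an experiment exactly, and to choose the final submission by cross-validation score rather than by the public leaderboard.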

Renyi Hour: Searching for information traces in brain (growing) data

On March 12th we had a more than interesting talk by Adrià Tauste.

Abstract

Modern neural recording technologies can monitor the activity of an ever-increasing number of neurons simultaneously. Conceptually, this exciting moment urges a paradigm shift from single-unit (neuron) to network (population) hypotheses and analysis for understanding brain functions and disorders. In practice this motivates the application of advanced multivariate tools to population analysis guided by newly formulated questions. In this context, I am interested in analyzing how information about external stimuli is encoded, communicated and transformed by neurons to produce behavior. Motivated by this question I will present a study of simultaneous single-neuron recordings from two monkeys performing a decision-making task based on previously perceived stimuli. By using a non-parametric method to estimate directional correlations between many pairs of neurons we were able to infer a distributed network of interactions that was activated during the key stages of the task. Interestingly, these interactions mostly vanished when the monkeys received the stimuli but had no incentive to perform the task. I will end by discussing new directions of this work along both biological and methodological lines.

Link to the Presentation

Renyi Hour: 3D and enhancing the user experience in (Big) Data visualization

On Feb 19th, we had an amazing talk by Professor Josep Blat:

Visualizations usually deal with a range of objectives, from exploration of raw data to presentation of analysed data – and different combinations of both extremes.

In this talk we will discuss some interactive visualizations, mostly using 3D graphics, of data coming from sport (for instance, football matches or regattas) and digital cinema, especially from the point of view of the potential user and of what can be added to the user’s experience.

Link to the Presentation

Visit their research group site to see their ongoing projects!

Renyi Hour: Using online ratings to better understand the effect of popularity on evaluations

This past Thursday the 12th, we had a really interesting talk by Gael Le Mens.

In the presentation, I will give a quick summary of the theory and explain how I used a large dataset of online ratings (all the ratings of all the restaurants in San Francisco since 2004 on the website Yelp.com) to test the assumptions of the model and measure the extent to which a sampling account can explain the association between popularity and quality estimates. Time permitting, I will discuss related data analyses pertaining to the rating behavior of online user communities.

Abstract:
People often evaluate popular alternatives more positively than unpopular alternatives. This has been attributed to inferences about quality on the basis of popularity, motivated cognition or mere exposure. In this paper, we propose an alternative explanation for the evaluative advantage of popular alternatives. Our theory emphasizes the role of the information samples people have about popular and unpopular alternatives. Under the assumption that, after a poor experience, people are more likely to sample again popular alternatives than unpopular alternatives, we show that systematic information biases will emerge. This information bias frequently provides popular alternatives with an evaluative advantage as compared to unpopular alternatives. Our sampling-based account complements existing explanations that focus on how people process information about popular and unpopular alternatives.
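
The sampling mechanism described in the abstract can be illustrated with a toy simulation. This is a deliberately simplified sketch and not the model from the paper: here both alternatives have the same true quality (mean 0), an agent’s estimate is just their most recent experience, they always try again after a good experience, and after a poor one they try again only with some probability, assumed higher for the popular alternative. All parameter values are arbitrary.

```python
import random


def mean_final_estimate(resample_prob_after_bad, n_agents=2000, n_periods=20, seed=0):
    """Average final quality estimate across agents for one alternative.

    Experiences are drawn from N(0, 1), so the true quality is 0. After a poor
    experience (estimate < 0) the agent samples again only with probability
    resample_prob_after_bad; otherwise the poor estimate is kept uncorrected.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_agents):
        estimate = rng.gauss(0.0, 1.0)  # first experience
        for _ in range(n_periods):
            if estimate >= 0 or rng.random() < resample_prob_after_bad:
                estimate = rng.gauss(0.0, 1.0)  # new experience replaces the old one
        total += estimate
    return total / n_agents


# Assumed probabilities of returning after a poor experience.
popular = mean_final_estimate(0.9)
unpopular = mean_final_estimate(0.1)
print(popular, unpopular)
```

Under these assumptions the unpopular alternative ends up with a clearly lower average estimate even though its quality is identical, because poor impressions of it are rarely revisited and therefore rarely corrected.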

Link to the Presentation