Pandas and Data Scientists


Brij Kishore Pandey wanted to see whether pandas, a Python library, is a popular tool among the data science community. He took data from Stack Overflow and came up with the following results:

[Image: chart of the results]

 

You can see the whole article here: Pandas among Data Scientists (Trend Analysis) 

 

 

Can Computers See?

Computers today can process information from images, notably thanks to object recognition. This ability has improved greatly over the past few years and reached human levels on complex tasks in 2015. The algorithm behind it is called a Convolutional Neural Network (CNN). This method also enabled Google’s AI to beat one of the best Go players in the world, and it is used in self-driving cars.

This week, the Renyi Hour invited Dario Garcia-Gasulla from the Barcelona Supercomputing Center, who introduced the Data Science students to this method.

You can find the presentation slides here: 01-06-16_BGSE

 

Physics and Bayesian Statistics

This week in the Renyi Hour session, the BGSE students got out of their comfort zone. We had a talk with Johannes Bergstron, a postdoctoral researcher at Universitat de Barcelona, about physics and the implications of Bayesian statistics in this field. As a person totally new to the field of physics, I can say that it was quite an adventure and a challenge to fully grasp all the concepts, but as a “hopefully-to-be” data scientist I was happy to see that more and more fields are open to not only the frequentist but also the Bayesian point of view. If you want a taste of what the talk was about, you can find the slides here.

presentation_slides

Challenges in Digital Marketing Vol1

The Big Data conference at BGSE starts next week, so Renyi Hour decided to do some preliminary research on one of the topics that will be discussed there: what challenges does digital marketing face? We went directly to the source of information, a practitioner. Sidharth Gummall is a Client Success Manager at Datorama, “a global marketing intelligence company that provides the world’s first marketing data hub to help executives, analysts and IT professionals quickly manage and confidently execute upon vast amounts of data.” (according to their LinkedIn profile)

Before asking Sidharth for his opinion on this particular question, we squeezed him for some basic knowledge of what exactly digital marketing is and why it is so important. Sidharth highlighted three important aspects:

Goal: before we do any kind of marketing we need to define what the goal of the campaign is – to raise awareness, to increase sales, etc.

Objectives: there are different types – impression (whether the person has seen the ad), click (whether they visited the advertised website) and impact (whether there was actually a sale).

Intermediary: it is called an ad server – it collects the information that is needed (such as the number of clicks, impressions or just the customers’ information). Most of the customers’ data comes from cookies, which store your profile information even while you are just browsing through some random websites.

So, having had a glimpse of the digital marketing slang, what makes this type of marketing different from any other type? The quantity and variety of information you can extract is tremendous – you can find out exactly who the person on the other side of the screen is, what he likes, where he is, what he has browsed before and whether he is interested. This is information which TV, radio and flyer ads can rarely provide. Sidharth also mentioned that although TV spots are still the most expensive marketing tool, digital marketing is conquering the market exponentially, which makes the connection between Big Data and digital marketing obvious. Acquiring data with every click and push of a button on a single web-connected device is the reality nowadays, and there is no doubt that its quantity accumulates faster and faster. Therefore, we need to master machine learning tools in order to analyze and extract relevant insights from the big data pile.

With this introductory “course” behind us, we are ready to see what challenges digital marketing faces. According to Sidharth there are three major ones:

  1. Ad blockers, which prevent marketing agencies from assessing potential customers.
  2. The ad server (the intermediary which collects the data, such as the number of views) is not very flexible. For example, let’s say that a website where an ad is placed is modified, so that the number of views is now called views whereas before it was named impressions. This prevents the ad server from acquiring the necessary data, because it will look for a field called impressions, which no longer exists on the website.
  3. Loss of information when users clear their cookies, and the inability to restore that data.
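
To make the second challenge a bit more concrete, here is a minimal sketch of how a collector could tolerate a renamed field by looking a metric up under several known names. This is not how Datorama or any real ad server works; the alias table, the function name `read_metric` and the sample payloads are all hypothetical and only serve as illustration.

```python
# Hypothetical alias table: canonical metric name -> names a website might use.
METRIC_ALIASES = {
    "impressions": ("impressions", "views"),
    "clicks": ("clicks",),
}


def read_metric(payload, metric, aliases=METRIC_ALIASES):
    """Return the value of `metric` from `payload`, trying each known alias in turn."""
    for name in aliases.get(metric, (metric,)):
        if name in payload:
            return payload[name]
    return None  # the metric is missing under every name we know about


# Made-up payloads: the same site before and after it renamed the field.
old_page = {"impressions": 1200, "clicks": 37}
new_page = {"views": 1350, "clicks": 41}

print(read_metric(old_page, "impressions"))
print(read_metric(new_page, "impressions"))
```

A strict lookup such as `new_page["impressions"]` would fail after the rename, which is the inflexibility described above. The alias table only helps for renames that someone has already noticed and registered.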

 

Now, backed up with knowledge and Sidharth’s ideas, we are eager to find out what the talk on Tuesday (22.03.2016) is going to cover. Will the presented challenges be the same, or will we be surprised by other insights?

 

“At one point, I nearly threw my laptop at the wall.”

 

BGSE Data Science Kaggle Competition

During our second term as data scientists, the class split into 10 teams of 3 and participated in an “in-class” Kaggle competition. It left every team depleted from late-night efforts and many long days spent obsessing over and executing ideas which often resulted in reduced accuracy.

 

“At one point, I nearly threw my laptop at the wall.”

– a data scientist, moments after the deadline

 

 

The challenge

Predict the popularity class (1 to 5, an ordinal classification) of Mashable news articles using 60 variables of metadata. Many of these features were useless. You can read more on the UCI Machine Learning Repository.

 

The outcome

The top 2 teams pulled ahead of the pack with accuracies of 0.55567 and 0.56744, while groups 3 through 10 were all within 0.01 of each other (i.e. one percentage point of accuracy).

A collection of lessons learned from the top 3 groups:

  • Third place used a single random forest, which multiple teams found to be better than multiple forests or a forest ensembled with other models.
  • Second place used an ensemble of 3 random forests and 3 generalized boosted regression models. Leveraging the ordinal nature of the classification, the team grouped 3’s, 4’s and 5’s, and achieved higher accuracy predicting the 3-class problem.
  • The first place team used a rolling window for prediction, given the time-series nature of the data (with a single random forest).
  • The first place team created new features from the URL to detect title keywords unique to each class. This method only affected 400 of the 9,644 articles in the test set but improved accuracy by 1% – huge in Kaggle-land.
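
As an illustration of the URL idea in the last point, here is a minimal sketch that splits the final part of an article URL into title words and then keeps the words that appear in only one popularity class. It is not the winning team's code: the function names, the example URLs and the labels are made up, and it assumes the title is the last hyphen-separated segment of the URL path.

```python
from collections import defaultdict
from urllib.parse import urlparse


def title_tokens(url):
    """Split the last path segment of a URL into lower-case title words."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    if not segments:
        return []
    return [t for t in segments[-1].lower().split("-") if t]


def class_unique_tokens(urls, labels):
    """Map each class label to the title words that occur in that class only."""
    classes_seen = defaultdict(set)  # word -> set of classes it appears in
    for url, label in zip(urls, labels):
        for token in set(title_tokens(url)):
            classes_seen[token].add(label)
    unique = defaultdict(set)
    for token, classes in classes_seen.items():
        if len(classes) == 1:
            unique[next(iter(classes))].add(token)
    return dict(unique)


# Made-up URLs and popularity labels, for illustration only.
urls = [
    "http://example.com/2013/01/07/apple-iphone-launch/",
    "http://example.com/2013/01/08/apple-stock-drop/",
    "http://example.com/2013/01/09/cat-video-viral/",
]
labels = [5, 2, 5]

print(title_tokens(urls[0]))
print(class_unique_tokens(urls, labels))
```

Words found this way could then be turned into indicator features for the model. On real data they should be learned from the training set only, otherwise the cross-validation score will look better than it deserves.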

 

Other curiosities

  • Keep systematic, meticulous and obsessive track of your randomness, methods and results.
  • Trust your cross-validation (don’t trust the public leaderboard): The third place team chose to use the submission for which they got the best cross-validation score on the training set, but not the best score on the public leaderboard.
  • Use all your data: many teams extracted new features from the URL.
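
The first two points can be sketched in a few lines: a seeded k-fold split keeps the randomness reproducible, and the submission is chosen by cross-validated accuracy on the training set instead of by the public leaderboard. This is a toy version using only the standard library, not what any team actually ran; the helper names, the two placeholder "models" and the data are assumptions made for the example.

```python
import random
from collections import Counter


def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 with a fixed seed and deal the indices into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # fixed seed: the same split on every run
    return [idx[i::k] for i in range(k)]


def cv_accuracy(fit, X, y, k=5, seed=0):
    """Mean held-out accuracy of `fit` over k folds."""
    scores = []
    for held_out in k_fold_indices(len(X), k, seed):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        model = fit([X[i] for i in train], [y[i] for i in train])
        hits = sum(model(X[i]) == y[i] for i in held_out)
        scores.append(hits / len(held_out))
    return sum(scores) / len(scores)


def pick_by_cv(candidates, X, y, k=5, seed=0):
    """Return the name of the candidate with the best CV accuracy, plus all scores."""
    scores = {name: cv_accuracy(fit, X, y, k, seed) for name, fit in candidates.items()}
    return max(scores, key=scores.get), scores


def fit_majority(X, y):
    """Placeholder model: always predict the most common training label."""
    top = Counter(y).most_common(1)[0][0]
    return lambda x: top


def fit_always_five(X, y):
    """Placeholder model: always predict class 5, whatever the data says."""
    return lambda x: 5


# Toy data: 20 rows, 15 of class 1 and 5 of class 2.
X = list(range(20))
y = [1] * 15 + [2] * 5

best, scores = pick_by_cv({"majority": fit_majority, "always_five": fit_always_five}, X, y)
print(best, scores)
```

Logging the seed, the candidate name and the score for every run is a cheap way to keep the "meticulous track" recommended above.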

Renyi Hour: 3D and enhancing the user experience in (Big) Data visualization

On Feb 19th, we had an amazing talk by Professor Josep Blat:

Visualizations usually deal with a range of objectives, from exploration of raw data to presentation of analysed data – and different combinations of both extremes.

In this talk we will discuss some interactive visualizations, mostly using 3D graphics, of data coming from sport (for instance, football matches or regattas) and digital cinema, especially from the point of view of the potential user and what can be added to the user’s experience.

Link to the Presentation

Visit their research group site to see their ongoing projects!

RENYI HOUR: Key takeaways from the Strata+Hadoop Conference (Nov 2014)

On January 8th, Jordan McIver reviewed key takeaways from attending the Strata+Hadoop Conference, focused on Big Data, which took place in November 2014 here in Barcelona.

Topics covered included summaries of relevant talks and materials available at the conference, interesting organizations and technologies encountered there, and more specific information about contacts and networking progress resulting from the conference. He concluded the talk with a networking discussion.
Slides will be available on request. Please contact Jordan through LinkedIn.

March 5th: Visit to the Barcelona Supercomputing Center

On March 5th, the BGSE Data Science Students are visiting the Barcelona Supercomputing Center at Technical University of Catalonia (UPC).

After visiting the facilities, Professor Jordi Torres will give a lecture on:

Next generation big data systems: Cognitive Computing

Big Data is the newest and one of the most exciting technologies currently available to business and society. It allows companies to gain the edge over their competitors in business and, in many ways, to benefit their customers directly. For customers too, the influences of big data are far-reaching, but the technology is often so subtle that consumers have no idea that big data is actually helping make their lives easier. Sensors and devices from various sources are generating massive amounts of data that are fundamental for people to socialize, exchange information, and consume services from several places. All of these technologies are key to constructing services that could be much smarter, more proactive, and customized according to user needs. This will be achieved with the advent of new technologies such as cognitive computing, which will have the potential to augment our reasoning capabilities and empower us to make better-informed real-time decisions. Undoubtedly, Cognitive Computing will change the way we work and live. These and other trends will shape next-generation big data systems in important ways.
In this presentation we will talk about how more varied data channels, increasingly diverse analytics methods, and new technology shifts will impact the next generation of big data systems.

Jordi Torres has a Master’s degree in Computer Science from the Technical University of Catalonia (UPC Barcelona Tech, 1988) and also holds a Ph.D. from the same institution (Best UPC Computer Science Thesis Award, 1993). Currently he is a full professor in the Computer Architecture Department at UPC Barcelona Tech. He has more than twenty-five years of experience in research and development of advanced distributed and parallel systems in the High Performance Computing Group at UPC. He was a visiting researcher at the “Center for Supercomputing Research & Development” at Urbana-Champaign (Illinois, USA, 1992). His principal interest as a researcher is Processing and Analyzing Big Data in a Sustainable Cloud. This involves making modern distributed and parallel cloud computing environments more efficient, as required by today’s Big Data Analytics challenges. He has about 150 research publications and has been involved in several conferences in the area. He was a member of the European Center for Parallelism of Barcelona (CEPBA) (1994-2004) and a member of the board of managers of CEPBA-IBM Research Institute (CIRI) (2000-2004). In 2005 the Barcelona Supercomputing Center – Centro Nacional de Supercomputación (BSC) was founded and he was nominated as Manager for the Autonomic Systems and eBusiness Platforms research line at BSC. He has worked, and continues to work, on a number of EU and industrial research and development projects. He teaches Computer Science courses at UPC Barcelona Tech (High Performance Computing, Cloud Computing, Big Data, …). He has been Vice-dean of Institutional Relations at the Computer Science School (1998-2001), and a member of the Catedra Telefonica-UPC, where he worked on teaching innovation (2003-2005). He has also participated in numerous academic management activities and institutional representation. He acts as an expert on these topics for various organizations and companies, and mentors entrepreneurs.
From 2009 to 2012 he collaborated with Spanish mass media to disseminate ICT, and he published two books about science and technology for a general audience.