Dragoncity Hackathon by Aimee Barciauskas

Just beyond Agbar Tower (i.e. Barcelona’s Gherkin) is the home of SocialPoint, a local game development company, where six other data scientists and I arrived at 9am this past Saturday. Riding up to the 10th floor, we were presented with 270-degree views of the cityscape and the Mediterranean, ready to participate in Barcelona’s first ever data hackathon.
Sharon Biggar began the introduction at 9:30am: We had 24 hours to predict 7-day churn for SocialPoint’s Dragon City game using data on players’ first 48 hours of game play. I can’t say I’ve played the game, but I can ask: how adorable are their graphics?


We would be presenting our results the following morning to a panel of 4 judges: Andrés Bou (CEO, SocialPoint), Sharon Biggar (Head of Analytics, SocialPoint), Tim Van Kasteren (Data Scientist, Schibsted Media Group) and … Christian Fons-Rosen?! Awesome! Soon after that welcome surprise, we were unleashed on our data challenge:

 

  • Training data on 524,966 users: Each user’s row included a churn value of 0 (not churned) or 1 (churned), in addition to 40 other variables on game play, such as num_sessions, attacks, dragon breedings, cash_spent, and the country where the user registered.
  • Test data for 341,667 users: the same variables, with churn excluded.
  • Prizes: Prizes would be awarded in both the accuracy track (first and second prizes of €750 and €500) and the business insights track (one prize of €500).
  • Objective evaluation of the accuracy track: Achieve the highest AUC (area under the ROC curve) on an unseen test set.
    • We were provided a baseline logistic regression model which achieved an AUC of 0.723 (a minimal sketch of this kind of baseline follows the list).
    • A public scoreboard was available for testing predictions; however, to avoid overfitting, submissions to the public scoreboard were evaluated against only ⅕ of the final test data set.
  • Subjective evaluation of the business insights track: Each group would present actionable insights for addressing churn to the panel of judges.
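For context, here is a minimal sketch of what such a logistic regression baseline might look like, scored by AUC. The file name and the choice of features are assumptions of mine (the variables named are among those listed above); this is not the organizers' code.

```r
# Hypothetical sketch of a logistic regression churn baseline scored by AUC.
# File name and feature choice are assumptions, not the organizers' code.
library(pROC)

train <- read.csv("train.csv")  # assumed file name for the 524,966-user training set

# Logistic regression on a handful of the game-play variables described above
fit <- glm(churn ~ num_sessions + attacks + cash_spent,
           data = train, family = binomial)

# In-sample AUC; the competition scored predictions on a hidden test set
pred <- predict(fit, type = "response")
auc(roc(train$churn, pred))
```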

 

After poking around at the data all morning, barely nudging the baseline AUC up by 0.01, Roger shot our team to the top of the rankings with a 0.96. It was nice to see our team (the g-priors) on top.

 

Sadly, it didn’t last. After lunch, an organizer announced we couldn’t use the variable that was a near-exact proxy for churn: date_last_logged predicts churn with near-perfect accuracy. The only reason it was not perfect is that some people logged back in after the window for measuring churn had closed. Other teams had found the same trick, but all scores were wiped from the public leaderboard and the lot of us were back down to dreams of a 0.73.

 

The rest of the 24 hours was a cycle of idea, execution, and frustration. Between us we ran gradient boosting, lasso regression, the business insights presentation, and many iterative, brute-force analyses of variable combinations, interactions, and transformations (a rough sketch of one such attempt follows).
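For illustration only, here is a rough sketch of the kind of lasso run we iterated on; it reuses the hypothetical `train` data frame from the sketch above and is not the code we actually wrote that night.

```r
# Illustrative lasso-penalised logistic regression for churn (not our exact code).
library(glmnet)

# Numeric design matrix from all game-play variables (drop the intercept column)
X <- model.matrix(churn ~ ., data = train)[, -1]
y <- train$churn

# Cross-validated lasso (alpha = 1), tuned for AUC
cv_fit <- cv.glmnet(X, y, family = "binomial", alpha = 1, type.measure = "auc")

# Which variables survive the penalty at the best lambda
coef(cv_fit, s = "lambda.min")
```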

 

Some of us were up until 6:30am, emailing colleagues with new ideas.

 

We arrived at 9am again the next morning to submit our final results and finish our business insights presentation. Christian Fons-Rosen arrived soon after, surprised to see us as well.

 

Well, when I arrived on Sunday morning at SocialPoint after having slept beautiful 9 hours and I saw the faces of Aimee, Roger, and Uwe…  For a moment I thought: ‘Wow, it must have been a great night at Razzmatazz or Sala Apolo.’ … It took me a while to understand what really happened that night!

 

Though it may not have been a scientifically methodical process, if you bang your head against a data set for 24 hours, you will learn something. Maybe participating won’t have an impact on exam results, but it certainly gave this data hacker more insight into how to approach a data challenge in practice.

BGSE Data Science Team


 

  1. My team included Roger Cuscó, Harihara Subramanyam Sreenivasan, and Uwe-Herbert Hönig. Domagoj Fizulic, Niti Mishra and Guglielmo Pelino also participated on other teams.
  2. At least that’s what we were told! Full details on the hackathon can be found on BCNAnalytic’s website.
  3. Churn rate is the rate at which users stop playing the game within a specified time period.

 

Coffee Talks: How important is a win in Basketball?

Following the fun and interesting post Scraping and Analyzing Baseball Data with R, here it is replicated using basketball (NBA) data.

Here is the PDF

And below is the Rmd code.

# Introduction

After reading the fun and interesting post *Scraping and Analyzing Baseball Data with R*[^1], I decided to replicate it with Basketball (NBA) data.

[^1]: http://blog.yhathq.com/posts/scraping-and-analyzing-baseball-data-with-r.html


# Data

The data used comes from the **ESPN** site[^2]: attendance and standings from 2002 to 2014, that is, from the 2001-2002 regular season to the 2013-2014 regular season (the last season is held out for the comparison at the end). The attempt here is to determine how much winning (games won) impacts attendance (average attendance during the season).

[^2]: http://espn.go.com/nba/

# Caveats

There are some issues with the data that were not addressed:

1. Each regular season has 82 games: 41 played at home and 41 on the road. The 2011-2012 season had 66 games instead of 82. This season has not been removed. An alternative would be to use the winning percentage instead of the number of games won, but that would be a loss in interpretation.
2. Several teams moved to other cities[^3] during this period and others changed venues (larger stadiums). An alternative would be to remove these teams. This was not done.

[^3]: The Charlotte Hornets changed city in 2001 and became the New Orleans Hornets. After Hurricane Katrina, the team moved to Oklahoma City for two seasons, returning to New Orleans in 2007. Finally, in 2013 the team was renamed the Pelicans. To add more "complexity" to the story, in 2004 Charlotte got a new team called the Charlotte Bobcats, which in 2014 changed its name to Hornets (but it was not the same original franchise).

# Scraping with R

Here the `XML` library is used. The data is read from two different tables: one has the Regular Season Standings and the other the Average Attendance. The different years were obtained using a `for` loop (old habits, not efficient, I know). From the Standings table the following columns were obtained: `Team name`, `# Wins`, `# Losses` and `Percentage of wins`; and from the Attendance table: `Team name` and `Average Attendance`. To each new data frame a column of `Year` is added.


```{r}

# Library needed
library(XML)

# Obtaining Standings
dg <- data.frame(TEAM = character(0),
                 Win = numeric(0), Loss = numeric(0), Percentage = numeric(0),
                 YEAR = numeric(0))

for (i in 2002:2014) {
  dg <- rbind(dg,
              cbind(as.data.frame(readHTMLTable(paste0("http://espn.go.com/nba/standings/_/year/",
                                                       i, "/group/1"),
                                                header = TRUE, as.data.frame = TRUE,
                                                skip.rows = 1,
                                                stringsAsFactors = FALSE))[1:4], i))
}
colnames(dg) <- c("Team", "Win", "Loss", "Percentage", "Year")

# remove the additional information in the team names, like "z - Indiana"
dg$Team <- sub('.*\\- ', '', dg$Team)

# change Wins to numeric
dg$Win <- as.numeric(dg$Win)

head(dg)

# Obtaining Attendance
df <- data.frame(TEAM = character(0),
                 ATTENDANCE = numeric(0),
                 YEAR = numeric(0))

for (i in 2002:2014) {
  df <- rbind(df,
              cbind(as.data.frame(readHTMLTable(paste0("http://espn.go.com/nba/attendance/_/year/",
                                                       i),
                                                header = TRUE, as.data.frame = TRUE,
                                                skip.rows = 1,
                                                stringsAsFactors = FALSE))[, c(2, 5)], i))
}
colnames(df) <- c("Team", "AvgAttendance", "Year")

# change Attendance to numeric
df$AvgAttendance <- as.numeric(gsub(",", "", df$AvgAttendance))

head(df)
```

## Getting the names correct(?)

The idea here is to homogenize the names in both tables, because one table has the teams by city name (for example, "San Antonio" instead of "Spurs"), while the other has the teams by team name (for example, "Spurs"). For this I used a package called `RecordLinkage` and a dictionary with all the names of the teams, including old and new teams.

The attendance table uses the current team names even for past seasons' data. In other words, the Hornets name is used for information related to the Bobcats. *Nice mess. Can it be fixed? Yes. But! This is a coffee talk*.

```{r, message=FALSE, warning= FALSE}
# Dictionary of team names, including old and new franchises
names <- c("Atlanta Hawks", "Boston Celtics", "Brooklyn Nets", "New Jersey Nets",
           "Charlotte Hornets", "Charlotte Bobcats", "Chicago Bulls", "Cleveland Cavaliers",
           "Dallas Mavericks", "Denver Nuggets", "Detroit Pistons", "Golden State Warriors",
           "Houston Rockets", "Indiana Pacers", "Los Angeles Clippers", "Los Angeles Lakers",
           "Memphis Grizzlies", "Miami Heat", "Milwaukee Bucks", "Minnesota Timberwolves",
           "New Orleans Pelicans", "NO/Oklahoma City Hornets", "New Orleans Hornets",
           "New York Knicks", "Oklahoma City Thunder", "Seattle SuperSonics", "Orlando Magic",
           "Philadelphia 76ers", "Phoenix Suns", "Portland Trail Blazers", "Sacramento Kings",
           "San Antonio Spurs", "Toronto Raptors", "Utah Jazz", "Washington Wizards")

# Library and closest-match helper (highest Levenshtein similarity)
library(RecordLinkage)

ClosestMatch2 <- function(string, stringVector){
  distance <- levenshteinSim(string, stringVector)
  stringVector[distance == max(distance)]
}

# Loop that replaces each name in the Attendance table
for (i in 1:length(df$Team)){
  if (df$Team[i] == "Pelicans") {
    df$Team[i] <- "New Orleans Pelicans"
  } else {
    df$Team[i] <- ClosestMatch2(df$Team[i], names)
  }
}

# Loop that replaces each name in the Standings table
for (i in 1:length(dg$Team)){
  dg$Team[i] <- ClosestMatch2(dg$Team[i], names)
}

```

## Crunch Time = Analysis Time

First the two data frames are merged with a "join" and rows with missing data are removed. Some data is lost in the merge because of information that does not match correctly, for example the Brooklyn Nets in the table that has only the team names. That information is lost.

```{r}
# join standings and attendance, then drop rows with missing data
clean <- merge(dg, df, by = c("Team", "Year"))
clean_data <- clean[complete.cases(clean), ]

# keep Team, Win and AvgAttendance; hold out the 2013-2014 season for comparison
training <- clean_data[clean_data$Year != 2014, ][, c(1, 3, 6)]

test <- clean_data[clean_data$Year == 2014, ][, c(1, 3, 6)]

```

The plot of average attendance against wins suggests **that there is something there** *[sic]*.

```{r, fig.width = 10, fig.height = 5, fig.fullwidth = TRUE, fig.cap = "Average Attendance and Wins Relationship"}
library(ggplot2)
ggplot(training, aes(x=Win, y=AvgAttendance)) + geom_point()
```

A linear model is used to try to better understand the relationship. *Coffee talking*. The idiosyncratic effect of each team is also considered (different stadium sizes, different "stories"): the model regresses average attendance on wins plus a dummy for each team, with no common intercept (`AvgAttendance ~ 0 + Win + Team`).

```{r}
result<-summary(lm(AvgAttendance ~ 0 + Win + Team, data= training))
data.frame(Coef=round(result$coefficients[,1],0))
```

Every coefficient is significant. **It looks like an additional win brings about 77 more people to the stadium**. Sounds small, but the regular season has 82 games.
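A back-of-envelope reading of that coefficient (my own arithmetic, not part of the model output): roughly 77 extra fans per home game, spread over a season's 41 home games.

```{r}
# Back-of-envelope: ~77 extra fans per home game x 41 home games per season
77 * 41
```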

# Conclusion

By the end of the second coffee, the predicted values are compared with the actual information (average attendance and wins) for the 2013-2014 regular season.

```{r}

# predicted attendance for each 2014 team: Win coefficient times wins
# plus that team's estimated fixed effect
predict_value <- matrix(0, 29, 1)
for (i in 1:29){
  predict_value[i, 1] <- round(result$coefficients[1, 1] * test$Win[i] +
                                 result$coefficients[i + 1, 1], 0)
}

# absolute percentage difference between actual and predicted attendance
compare <- data.frame(Team = test$Team,
                      Abs_Diff_Per = round(abs(test$AvgAttendance - predict_value) /
                                             test$AvgAttendance * 100, 2))

compare

```

**It looks like this simple model is not that bad at predicting! The average difference is 5.5% with respect to the actual average attendance.**
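Presumably that 5.5% figure is just the mean of the absolute percentage differences in the `compare` table above; a quick check:

```{r}
# Average absolute percentage difference across the 2013-2014 teams
mean(compare$Abs_Diff_Per)
```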