Coffee Talks: How important is a win in Basketball?

Following the fun and interesting post Scraping and Analyzing Baseball Data with R, here is a replication using basketball (NBA) data.

Here is the PDF

And below the Rmd code.

# Introduction

After reading the fun and interesting post *Scraping and Analyzing Baseball Data with R*[^1], I decided to replicate it with Basketball (NBA) data.


# Data

The data comes from the **ESPN** site[^2]: attendance and standings from 2002 to 2013, that is, from the 2001-2002 regular season to the 2012-2013 regular season. The goal is to estimate how much winning (each game won) impacts attendance (average attendance during the season).



There are some issues with the data that were not addressed:

1. Each regular season has 82 games, 41 played at home and 41 on the road. The 2011-2012 season had 66 games instead of 82. This season has not been removed. An alternative is to use the winning percentage instead of the number of games won, but that would make the result harder to interpret.
2. Several teams moved to other cities[^3] during this period and others changed venues (larger stadiums). An alternative would be to remove these teams. This was not done.
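
The winning-percentage alternative from the first item can be kept in units of wins by rescaling to an 82-game pace. This is only a minimal sketch, not something the post does: the toy rows and the `Win82` column name are made up for illustration, and it assumes a standings data frame with `Win` and `Loss` columns.

```r
# Toy standings: one full 82-game season and one shortened 66-game season.
standings <- data.frame(
  Team = c("Team A", "Team A"),
  Year = c(2011, 2012),
  Win  = c(58, 50),
  Loss = c(24, 16)
)

# Winning percentage, then wins rescaled to an 82-game pace.
standings$Percentage <- standings$Win / (standings$Win + standings$Loss)
standings$Win82 <- standings$Percentage * 82

print(standings)
```

With this rescaling a coefficient on `Win82` would still read as "attendance per additional win", which is the interpretation the post wants to keep.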

[^3]: The Charlotte Hornets changed city in 2002 and became the New Orleans Hornets. After Hurricane Katrina, the team moved to Oklahoma City for two seasons, coming back to New Orleans in 2007. Finally, in 2013 the team's name was changed to Pelicans. To add more "complexity" to the story, in 2004 Charlotte got a new team called the Charlotte Bobcats, which in 2014 changed its name to Hornets (but it was not the same original franchise).

# Scraping with R

Here the `XML` library is used. The data is read from two different tables: one has the regular season standings and the other the average attendance. The different years were obtained using a `for` loop (old habits, not efficient, I know). From the standings table the following columns were obtained: `Team name`, `# Wins`, `# Losses` and `Percentage of wins`. From the attendance table: `Team name` and `Average Attendance`. A `Year` column is added to each new data frame.


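```{r, message=FALSE, warning= FALSE}
# Library needed
library(XML)

# "..." stands for the ESPN page address and the standings column selection,
# which are not reproduced here

# Obtaining Standings
dg <- data.frame(TEAM = character(0),
                 Win = numeric(0), Loss = numeric(0), Percentage = numeric(0),
                 Year = numeric(0))

for (i in 2002:2014) {
  dg <- rbind(dg,
              cbind(data.frame(readHTMLTable(...,
                                             header = TRUE, as.data.frame = TRUE, skip.rows = 1,
                                             stringsAsFactors = FALSE))[, ...], i))
}
colnames(dg) <- c("Team", "Win", "Loss", "Percentage", "Year")
# remove the additional information in the team names, like "z - Indiana"
dg$Team <- sub('.*\\- ', '', dg$Team)
# change Wins to numeric
dg$Win <- as.numeric(dg$Win)

# Obtaining Attendance
df <- data.frame(TEAM = character(0), AvgAttendance = numeric(0), Year = numeric(0))

for (i in 2002:2014) {
  df <- rbind(df,
              cbind(data.frame(readHTMLTable(...,
                                             header = TRUE, as.data.frame = TRUE, skip.rows = 1,
                                             stringsAsFactors = FALSE))[, c(2, 5)], i))
}
colnames(df) <- c("Team", "AvgAttendance", "Year")

# change Attendance to numeric
df$AvgAttendance <- as.numeric(gsub(",", "", df$AvgAttendance))
```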
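
The same read-per-year-then-stack pattern can be sketched without growing a data frame inside a `for` loop. In this minimal sketch `read_season()` is a hypothetical stand-in for the scraping call (it just returns made-up rows as text, the way an HTML table would arrive), so only the accumulation and cleaning steps are illustrated.

```r
# Hypothetical stand-in for the scraping step: one season of raw rows,
# all columns as character, with made-up values.
read_season <- function(year) {
  data.frame(
    Team = c("z - Indiana", "x - Miami", "Orlando"),
    Win = c("56", "54", "23"),
    AvgAttendance = c("17,501", "19,722", "16,785"),
    stringsAsFactors = FALSE
  )
}

years <- 2012:2014

# Build one data frame per season, tag it with its year, then bind once.
seasons <- lapply(years, function(y) cbind(read_season(y), Year = y))
raw <- do.call(rbind, seasons)

# Same cleaning steps as in the post.
raw$Team <- sub('.*\\- ', '', raw$Team)                             # drop "z - " style prefixes
raw$Win <- as.numeric(raw$Win)                                      # text to number
raw$AvgAttendance <- as.numeric(gsub(",", "", raw$AvgAttendance))   # strip thousands separator

str(raw)
```

Binding once with `do.call(rbind, ...)` avoids copying the accumulated data frame on every iteration, which is the inefficiency the "old habits" remark alludes to.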


## Getting the names correct(?)

The idea here is to homogenize the names in both tables, because one table lists the teams by city (for example, "San Antonio" instead of "Spurs"), while the other lists them by team name (for example, "Spurs"). For this I used a package called `RecordLinkage` and a dictionary with all the team names, including old and new teams.

The attendance table, which has the teams’ names, applies the current team names to past data. In other words, the Hornets name is used for information related to the Bobcats. *Nice mess. Can it be fixed? Yes. But! This is a coffee talk*.
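
To see how dictionary matching behaves on both kinds of names, here is a minimal sketch that swaps in base R's `adist()` with `partial = TRUE` (approximate substring matching) instead of `RecordLinkage`. The helper name `closest_match` and the five-name dictionary are made up for illustration; this is not the function used in the post.

```r
# Small dictionary of full team names (a subset, for illustration).
team_names <- c("San Antonio Spurs", "Indiana Pacers", "Brooklyn Nets",
                "New Jersey Nets", "Utah Jazz")

# Return every dictionary entry at minimal edit distance from `string`.
# partial = TRUE lets `string` match a substring of the full name.
closest_match <- function(string, dictionary) {
  d <- utils::adist(string, dictionary, partial = TRUE)[1, ]
  dictionary[d == min(d)]
}

closest_match("Spurs", team_names)        # nickname only
closest_match("San Antonio", team_names)  # city only
closest_match("Nets", team_names)         # two franchises share the nickname
```

The last call returns more than one candidate, which is the kind of ambiguity that makes relocated or renamed franchises hard to match automatically.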

```{r, message=FALSE, warning= FALSE}

names <- c("Atlanta Hawks", "Boston Celtics", "Brooklyn Nets", "New Jersey Nets", "Charlotte Hornets", "Charlotte Bobcats",
"Chicago Bulls", "Cleveland Cavaliers", "Dallas Mavericks", "Denver Nuggets", "Detroit Pistons",
"Golden State Warriors", "Houston Rockets", "Indiana Pacers", "Los Angeles Clippers", "Los Angeles Lakers",
"Memphis Grizzlies", "Miami Heat", "Milwaukee Bucks", "Minnesota Timberwolves", "New Orleans Pelicans",
"NO/Oklahoma City Hornets", "New Orleans Hornets", "New York Knicks", "Oklahoma City Thunder",
"Seattle SuperSonics", "Orlando Magic", "Philadelphia 76ers", "Phoenix Suns", "Portland Trail Blazers",
"Sacramento Kings", "San Antonio Spurs", "Toronto Raptors", "Utah Jazz", "Washington Wizards")

# Library and Function
library(RecordLinkage)
# Return the dictionary entry (or entries) most similar to `string`
ClosestMatch2 <- function(string, stringVector) {
  distance <- levenshteinSim(string, stringVector)
  stringVector[distance == max(distance)]
}

# Loop that changes the names in the Attendance table
for (i in seq_along(df$Team)) {
  if (df$Team[i] == "Pelicans") { df$Team[i] <- "New Orleans Pelicans" }
  df$Team[i] <- ClosestMatch2(df$Team[i], names)
}

# Loop that changes the names in the Wins table
for (i in seq_along(dg$Team)) {
  dg$Team[i] <- ClosestMatch2(dg$Team[i], names)
}
```

## Crunch Time = Analysis Time

First the two data frames are merged with a "join", and then rows with missing data are removed. In the merge some data is lost because the names do not match correctly. For example, the Brooklyn Nets appear only by team name in the original attendance table, so that information is lost.

```{r}
clean <- merge(dg, df, by = c("Team", "Year"))
```
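
A minimal sketch of the merge step on toy data shows how a name mismatch silently drops rows. The numbers are made up, and the split into `training` (seasons before 2014) and `holdout` (2014) is an assumption about how the data frame used later in the post was built.

```r
# Toy versions of the two tables after name matching (made-up numbers).
dg <- data.frame(Team = c("Indiana Pacers", "Miami Heat", "Brooklyn Nets",
                          "Indiana Pacers", "Miami Heat"),
                 Year = c(2013, 2013, 2013, 2014, 2014),
                 Win = c(49, 66, 49, 56, 54))
df <- data.frame(Team = c("Indiana Pacers", "Miami Heat", "New Jersey Nets",
                          "Indiana Pacers", "Miami Heat"),
                 Year = c(2013, 2013, 2013, 2014, 2014),
                 AvgAttendance = c(15000, 20000, 17000, 17500, 19700))

# Inner join: only (Team, Year) pairs present in both tables survive.
clean <- merge(dg, df, by = c("Team", "Year"))
clean <- na.omit(clean)

# Standings rows that found no partner in the attendance table.
lost <- dg[!(paste(dg$Team, dg$Year) %in% paste(clean$Team, clean$Year)), ]

# Assumed split: fit on seasons before 2014, hold out 2014 for comparison.
training <- subset(clean, Year < 2014)
holdout <- subset(clean, Year == 2014)
```

Inspecting `lost` after the merge is a cheap way to find out which teams fell through the name matching.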




The plot of average attendance against wins suggests **that there is something there** *[sic]*.

```{r, fig.width = 10, fig.height = 5, fig.fullwidth = TRUE, fig.cap = "Average Attendance and Wins Relationship"}
library(ggplot2)
ggplot(training, aes(x = Win, y = AvgAttendance)) + geom_point()
```

A linear model is used to better understand the relationship. *Coffee talking*. The idiosyncratic effect of each team (different stadium sizes, different "stories") is also considered.

```{r}
result <- summary(lm(AvgAttendance ~ 0 + Win + Team, data = training))
```

Every coefficient is significant. **It looks like an additional win brings 77 more people to the stadium**. It sounds small, but that is an average per game, and the regular season has 82 games (41 of them at home).
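
To make the model formula concrete, here is a minimal sketch on toy data. Attendance is constructed as a team baseline plus 77 per win, so the slope that `lm()` should recover is known in advance; the teams, baselines and win totals are made up.

```r
# Toy panel: three teams, four seasons each.
toy <- data.frame(
  Team = rep(c("A", "B", "C"), each = 4),
  Win = c(30, 41, 50, 60, 25, 35, 45, 55, 20, 33, 47, 58)
)
baseline <- c(A = 14000, B = 16500, C = 18000)
toy$AvgAttendance <- unname(baseline[as.character(toy$Team)]) + 77 * toy$Win

# "0 +" drops the common intercept, so every team gets its own level
# and Win carries the slope shared by all teams.
fit <- lm(AvgAttendance ~ 0 + Win + Team, data = toy)
round(coef(fit), 2)
```

The `Team` coefficients absorb differences such as stadium size, so the `Win` coefficient is estimated from how attendance moves within each team as its win total changes.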

# Conclusion

By the end of the second coffee, the predicted values are compared with the actual data (average attendance and wins) for the 2013-2014 regular season.

```{r}
for (i in 1:29) {
  ...
}
```

**It looks like this simple model is not that bad at predicting! The average difference is 5.5% with respect to the actual average attendance.**
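
One plausible way to compute such an average difference is the mean absolute percentage difference between predicted and actual attendance. The sketch below uses made-up numbers and a hypothetical `holdout` data frame; it is not necessarily the calculation behind the 5.5% figure.

```r
# Toy training data with an exact linear structure (made-up numbers).
training <- data.frame(
  Team = rep(c("A", "B"), each = 3),
  Win = c(30, 40, 50, 20, 40, 60)
)
training$AvgAttendance <- ifelse(training$Team == "A", 15000, 17000) + 50 * training$Win

fit <- lm(AvgAttendance ~ 0 + Win + Team, data = training)

# Held-out season: actual wins and actual average attendance.
holdout <- data.frame(Team = c("A", "B"),
                      Win = c(40, 30),
                      AvgAttendance = c(16000, 20000))

predicted <- predict(fit, newdata = holdout)

# Absolute difference as a percentage of the actual attendance, then averaged.
pct_diff <- abs(predicted - holdout$AvgAttendance) / holdout$AvgAttendance * 100
mean_pct_diff <- mean(pct_diff)
```

Because `predict()` is vectorised over the rows of `newdata`, no loop over the teams is needed for this comparison.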

