Predicting NFL stadium attendances with tidymodels

Introduction

In this post I’m going to be trying out the new tidymodels framework in R.

I’ve been reading through the corresponding book and it looks to me like a real game-changer for building robust statistical/ML models quickly in the R language. But reading only gets you so far! Time to test it out.

The data I’m going to be using is a #TidyTuesday dataset from last year on crowd attendances in the National Football League (NFL).

The tidymodels framework covers all stages of the modelling lifecycle, from data preprocessing and resampling to model building, tuning and evaluation. In this post I’m only going to look at data resampling, model building and model evaluation; I’ll cover the other stages in future posts.

I owe credit to Julia Silge, whose blog post served as the inspiration for this piece.

Load and inspect the data

library(tidyverse)  # provides read_csv(), the dplyr verbs, tidyr and ggplot2 used throughout

attendance <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-04/attendance.csv")
results <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-04/standings.csv")

Our first dataset is about attendance per team every week since the year 2000:

head(attendance)
## # A tibble: 6 x 8
##   team    team_name  year  total   home   away  week weekly_attendance
##   <chr>   <chr>     <dbl>  <dbl>  <dbl>  <dbl> <dbl>             <dbl>
## 1 Arizona Cardinals  2000 893926 387475 506451     1             77434
## 2 Arizona Cardinals  2000 893926 387475 506451     2             66009
## 3 Arizona Cardinals  2000 893926 387475 506451     3                NA
## 4 Arizona Cardinals  2000 893926 387475 506451     4             71801
## 5 Arizona Cardinals  2000 893926 387475 506451     5             66985
## 6 Arizona Cardinals  2000 893926 387475 506451     6             44296

And our second dataset gives us season results for each team - wins/losses, points for/against, rating and ranking stats, and whether the team made the playoffs - for every year since 2000.

head(results) 
## # A tibble: 6 x 15
##   team    team_name  year  wins  loss points_for points_against points_differen…
##   <chr>   <chr>     <dbl> <dbl> <dbl>      <dbl>          <dbl>            <dbl>
## 1 Miami   Dolphins   2000    11     5        323            226               97
## 2 Indian… Colts      2000    10     6        429            326              103
## 3 New Yo… Jets       2000     9     7        321            321                0
## 4 Buffalo Bills      2000     8     8        315            350              -35
## 5 New En… Patriots   2000     5    11        276            338              -62
## 6 Tennes… Titans     2000    13     3        346            191              155
## # … with 7 more variables: margin_of_victory <dbl>, strength_of_schedule <dbl>,
## #   simple_rating <dbl>, offensive_ranking <dbl>, defensive_ranking <dbl>,
## #   playoffs <chr>, sb_winner <chr>

Let’s join these datasets by team and year to get one large dataset combining data on results and attendances which we can use for modelling.

attendance_joined <- attendance %>%
  left_join(results,
    by = c("year", "team_name", "team")
  )

head(attendance_joined)
## # A tibble: 6 x 20
##   team   team_name  year  total   home   away  week weekly_attendan…  wins  loss
##   <chr>  <chr>     <dbl>  <dbl>  <dbl>  <dbl> <dbl>            <dbl> <dbl> <dbl>
## 1 Arizo… Cardinals  2000 893926 387475 506451     1            77434     3    13
## 2 Arizo… Cardinals  2000 893926 387475 506451     2            66009     3    13
## 3 Arizo… Cardinals  2000 893926 387475 506451     3               NA     3    13
## 4 Arizo… Cardinals  2000 893926 387475 506451     4            71801     3    13
## 5 Arizo… Cardinals  2000 893926 387475 506451     5            66985     3    13
## 6 Arizo… Cardinals  2000 893926 387475 506451     6            44296     3    13
## # … with 10 more variables: points_for <dbl>, points_against <dbl>,
## #   points_differential <dbl>, margin_of_victory <dbl>,
## #   strength_of_schedule <dbl>, simple_rating <dbl>, offensive_ranking <dbl>,
## #   defensive_ranking <dbl>, playoffs <chr>, sb_winner <chr>

Data exploration

As usual, we need to look at our data first to get a sense of it before starting any modelling. I like to do this using the skim function from R’s skimr package, which gives us an overview of the data:

skimr::skim(attendance_joined)
Table 1: Data summary
Name attendance_joined
Number of rows 10846
Number of columns 20
_______________________
Column type frequency:
character 4
numeric 16
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
team 0 1 5 13 0 32 0
team_name 0 1 4 10 0 32 0
playoffs 0 1 8 11 0 2 0
sb_winner 0 1 12 13 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1.00 2009.53 5.75 2000.0 2005.0 2010.0 2015.00 2019.0 ▇▇▇▇▇
total 0 1.00 1080910.03 72876.97 760644.0 1040509.0 1081089.5 1123230.00 1322087.0 ▁▁▇▆▁
home 0 1.00 540455.01 66774.65 202687.0 504360.0 543185.0 578342.00 741775.0 ▁▁▅▇▁
away 0 1.00 540455.01 25509.33 450295.0 524974.0 541757.0 557741.00 601655.0 ▁▂▇▇▂
week 0 1.00 9.00 4.90 1.0 5.0 9.0 13.00 17.0 ▇▆▆▆▇
weekly_attendance 638 0.94 67556.88 9022.02 23127.0 63245.5 68334.0 72544.75 105121.0 ▁▁▇▃▁
wins 0 1.00 7.98 3.08 0.0 6.0 8.0 10.00 16.0 ▂▆▇▆▂
loss 0 1.00 7.98 3.08 0.0 6.0 8.0 10.00 16.0 ▂▆▇▆▂
points_for 0 1.00 350.28 71.35 161.0 299.0 348.0 396.00 606.0 ▂▇▇▂▁
points_against 0 1.00 350.28 59.50 165.0 310.0 347.0 392.00 517.0 ▁▃▇▆▁
points_differential 0 1.00 0.00 101.01 -261.0 -75.0 1.5 73.00 315.0 ▂▆▇▅▁
margin_of_victory 0 1.00 0.00 6.32 -16.3 -4.7 0.1 4.60 19.7 ▂▆▇▅▁
strength_of_schedule 0 1.00 0.00 1.63 -4.6 -1.1 0.0 1.20 4.3 ▁▅▇▅▁
simple_rating 0 1.00 0.00 6.20 -17.4 -4.5 0.0 4.50 20.1 ▁▆▇▅▁
offensive_ranking 0 1.00 0.00 4.34 -11.7 -3.2 0.0 2.70 15.9 ▁▇▇▂▁
defensive_ranking 0 1.00 0.00 3.56 -9.8 -2.4 0.1 2.50 9.8 ▁▅▇▅▁

This looks like a pretty friendly dataset. None of the features look particularly skewed (see the histogram column on the right of the table above). The only obvious problem is the 638 missing weekly_attendance values, which we’ll need to deal with.
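As a quick extra check (my own addition on top of the skim output), we can confirm that all of the missing values sit in the weekly_attendance column - these are most likely bye weeks, when a team doesn’t play:

attendance_joined %>%
  # count NAs per column, then keep only the columns that have any
  summarise(across(everything(), ~ sum(is.na(.x)))) %>%
  pivot_longer(everything(), names_to = "column", values_to = "n_missing") %>%
  filter(n_missing > 0)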

Let’s use box and whisker plots to see how weekly attendance has varied over the 20 years of data we have. Has going to NFL matches become more popular in recent years?

attendance_joined %>%
  mutate(year = factor(year)) %>%
  ggplot(aes(year, weekly_attendance, fill = year)) +
  geom_boxplot(show.legend = FALSE, outlier.alpha = 0.5) +
  labs(
    x = "Year of NFL season",
    y = "Weekly NFL game attendance"
  ) +
  scale_y_continuous(labels = scales::comma)

Not much variation really. The 2020 figures would be interesting - presumably much lower due to the pandemic - but unfortunately we don’t have them.

I wonder if teams that got to the playoffs had a higher weekly attendance? Let’s use a histogram to check out the respective distributions.

attendance_joined %>%
  filter(!is.na(weekly_attendance)) %>% 
  group_by(team_name, year, margin_of_victory, playoffs) %>% 
  summarise(avg_weekly_attendance = as.integer(mean(weekly_attendance))) %>% 
  ggplot(aes(avg_weekly_attendance, fill = playoffs)) +
  geom_histogram(position = "identity", alpha = 0.4) +
  labs(
    x = "Average weekly attendance",
    y = "Number of teams",
    fill = NULL
  )

It looks like the Playoffs distribution sits slightly to the right of the No Playoffs one. Let’s see if we can view this difference more explicitly, using a boxplot.

attendance_joined %>%
  filter(!is.na(weekly_attendance)) %>% 
  group_by(team_name, year, margin_of_victory, playoffs) %>% 
  summarise(avg_weekly_attendance = as.integer(mean(weekly_attendance)))  %>% 
  ggplot(aes(playoffs, avg_weekly_attendance, fill = playoffs)) +
  geom_boxplot() +
  labs(
    x = "Playoff reached",
    y = "Average weekly attendance"
  )

The lower quartile and median are definitely higher for the teams that made it to the playoffs. We’ll need to take this into account later when we do the modelling.
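To put a rough number on that gap, here’s a minimal sketch (my own addition) comparing the median of the per-team, per-season average attendances for the two groups:

attendance_joined %>%
  filter(!is.na(weekly_attendance)) %>%
  # average attendance for each team-season, keeping the playoff label
  group_by(team_name, year, playoffs) %>%
  summarise(avg_weekly_attendance = mean(weekly_attendance), .groups = "drop") %>%
  # median of those averages within each playoff group
  group_by(playoffs) %>%
  summarise(median_attendance = median(avg_weekly_attendance))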

In a more comprehensive analysis I would do further EDA and then systematic feature selection, but that’s not the purpose of this exercise - I just want to practise building a model to predict weekly_attendance using several features of my choosing.

Let’s prepare the data for modelling by removing any rows without a weekly attendance and keeping only columns that I think may be predictors for attendance:

attendance_df <- attendance_joined %>%
  filter(!is.na(weekly_attendance)) %>%
  select(
    weekly_attendance, team_name, year, week,
    margin_of_victory, strength_of_schedule, playoffs
  )

head(attendance_df)
## # A tibble: 6 x 7
##   weekly_attendance team_name  year  week margin_of_victory strength_of_schedule
##               <dbl> <chr>     <dbl> <dbl>             <dbl>                <dbl>
## 1             77434 Cardinals  2000     1             -14.6                 -0.7
## 2             66009 Cardinals  2000     2             -14.6                 -0.7
## 3             71801 Cardinals  2000     4             -14.6                 -0.7
## 4             66985 Cardinals  2000     5             -14.6                 -0.7
## 5             44296 Cardinals  2000     6             -14.6                 -0.7
## 6             38293 Cardinals  2000     7             -14.6                 -0.7
## # … with 1 more variable: playoffs <chr>

Building models

Splitting the data

Now we can start using some tidymodels packages!

First up we need to split our data - we do this using the initial_split() function from the rsample package. We’ll use the default setting of a 75:25 split for training:test set size. Note that in a real-life ML project we would use resampling to build multiple training and test sets for evaluating model performance - this will be discussed later. For now a simpler approach will suffice.

Note that, as per our EDA above, we need to stratify the split so that each set contains roughly the same proportion of playoff teams - we do this using the strata argument.

library(tidymodels)

set.seed(1234)
attendance_split <- attendance_df %>%
  initial_split(strata = playoffs)

nfl_train <- training(attendance_split)
nfl_test <- testing(attendance_split)

Now that we’ve got our training and test sets, we can move on to the modelling itself.

Specifying and fitting models

For this exercise we will compare two models - a linear model and a random forest. To do this we will use functions from the parsnip package.

  1. Linear model

First let’s specify the model:

lm_spec <- linear_reg() %>% 
  set_engine(engine = "lm")

lm_spec
## Linear Regression Model Specification (regression)
## 
## Computational engine: lm

Next we fit the model on the training data:

lm_fit <- lm_spec %>% 
  fit(weekly_attendance ~ .,
      data = nfl_train)

lm_fit
## parsnip model object
## 
## Fit time:  26ms 
## 
## Call:
## stats::lm(formula = weekly_attendance ~ ., data = data)
## 
## Coefficients:
##          (Intercept)        team_nameBears      team_nameBengals  
##             -81107.9               -2879.8               -4875.5  
##       team_nameBills      team_nameBroncos       team_nameBrowns  
##               -361.1                2805.2                -324.1  
##  team_nameBuccaneers    team_nameCardinals     team_nameChargers  
##              -3063.7               -6139.8               -6489.3  
##      team_nameChiefs        team_nameColts      team_nameCowboys  
##               1974.3               -3392.8                6068.7  
##    team_nameDolphins       team_nameEagles      team_nameFalcons  
##                139.7                1259.2                -204.2  
##      team_nameGiants      team_nameJaguars         team_nameJets  
##               5447.1               -3095.5                4044.2  
##       team_nameLions      team_namePackers     team_namePanthers  
##              -3480.7                1114.1                1227.3  
##    team_namePatriots      team_nameRaiders         team_nameRams  
##               -214.2               -6324.7               -2884.8  
##      team_nameRavens     team_nameRedskins       team_nameSaints  
##               -398.9                6447.1                 381.0  
##    team_nameSeahawks     team_nameSteelers       team_nameTexans  
##              -1405.9               -3567.8                 264.1  
##      team_nameTitans      team_nameVikings                  year  
##              -1118.2               -3183.1                  74.7  
##                 week     margin_of_victory  strength_of_schedule  
##                -72.8                 137.6                 230.7  
##     playoffsPlayoffs  
##               -427.9
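As an optional aside (not part of the original workflow), the coefficients printed above can be pulled into a tibble with broom’s tidy(), which tidymodels attaches for us; this should work for the lm engine and makes the coefficients easy to sort or filter:

# coefficients as a tibble, largest absolute effects first
tidy(lm_fit) %>%
  arrange(desc(abs(estimate)))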

This model is now fitted and ready to use, so let’s move on to the second one.

  2. Random forest

Specify the model:

rf_spec <- rand_forest(mode = "regression") %>%
  set_engine("ranger")

rf_spec
## Random Forest Model Specification (regression)
## 
## Computational engine: ranger

Fit the model:

rf_fit <- rf_spec %>%
  fit(weekly_attendance ~ .,
    data = nfl_train
  )

rf_fit
## parsnip model object
## 
## Fit time:  7.4s 
## Ranger result
## 
## Call:
##  ranger::ranger(x = maybe_data_frame(x), y = y, num.threads = 1,      verbose = FALSE, seed = sample.int(10^5, 1)) 
## 
## Type:                             Regression 
## Number of trees:                  500 
## Sample size:                      7656 
## Number of independent variables:  6 
## Mtry:                             2 
## Target node size:                 5 
## Variable importance mode:         none 
## Splitrule:                        variance 
## OOB prediction error (MSE):       7.5e+07 
## R squared (OOB):                  0.082

Evaluating the models

Now we need to measure how effective our models were and compare them. One way to do this is to predict() the weekly attendance for the test set.

results_test <- lm_fit %>%
  predict(new_data = nfl_test) %>%
  mutate(
    truth = nfl_test$weekly_attendance,
    model = "lm"
  ) %>%
  bind_rows(rf_fit %>%
    predict(new_data = nfl_test) %>%
    mutate(
      truth = nfl_test$weekly_attendance,
      model = "rf"
    ))

This gives us a tibble with the predictions and true values for both models:

results_test %>% slice(c(1:5, (n() -5):n()))
## # A tibble: 11 x 3
##     .pred truth model
##     <dbl> <dbl> <chr>
##  1 59162. 65356 lm   
##  2 59017. 50289 lm   
##  3 58871. 37452 lm   
##  4 58798. 65711 lm   
##  5 66880. 73025 lm   
##  6 67178. 58613 rf   
##  7 68914. 66910 rf   
##  8 65276. 60926 rf   
##  9 62456. 65265 rf   
## 10 66187. 71504 rf   
## 11 67666. 90646 rf

Let’s visualise the predictions versus the truth for both models:

results_test %>%
  ggplot(aes(x = truth, y = .pred, color = model)) +
  geom_abline(lty = 2) +
  geom_point(alpha = 0.5) +
  scale_x_continuous(labels = scales::comma) +
  scale_y_continuous(labels = scales::comma) +
  facet_wrap(~model) +
  labs(y = "Predicted weekly attendance", x = "Actual weekly attendance") +
  coord_obs_pred() +  # scale and size the x- and y-axis uniformly
  theme(panel.spacing = unit(2, "lines"))  # widen the gap between plots for aesthetic purposes

It looks like neither model does a particularly good job of predicting the attendance, though it’s hard to tell which is better. Let’s try to establish this numerically rather than visually. The yardstick package provides the metric_set() function, with which we can compute any metrics of interest for both models.

As discussed here in the tidymodels textbook, the choice of which metric to examine can be critical. Accuracy, as measured by the RMSE, and correlation, as measured by R², are not the same thing, and in proper ML projects the choice of which to optimise should be made on a case-by-case basis. For now, we will simply evaluate both.

nfl_metrics <- metric_set(rmse, rsq)

results_test %>% 
  group_by(model) %>% 
  nfl_metrics(truth = truth, estimate = .pred) %>% 
  select(model, .metric, .estimate)
## # A tibble: 4 x 3
##   model .metric .estimate
##   <chr> <chr>       <dbl>
## 1 lm    rmse     8351.   
## 2 rf    rmse     8582.   
## 3 lm    rsq         0.142
## 4 rf    rsq         0.117

The linear model produced a lower error and a higher R² score, showing that for this particular dataset it’s the better model. Its typical error in predicting weekly attendance (the RMSE) is roughly 8,350 people, which isn’t too bad given how basic the model is!

Presumably the random forest model, being much more complex than the linear one, overfit to the training data, which is why it did a worse job on the test set (further analysis showed this to indeed be the case - code not shown).
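One simple way to probe this (a sketch of my own, not the unshown analysis mentioned above) is to compute each model’s RMSE on the training set and compare it with the test-set RMSE above - a much lower training error for the random forest would be consistent with overfitting:

# predictions on the *training* set for both fitted models
results_train <- lm_fit %>%
  predict(new_data = nfl_train) %>%
  mutate(truth = nfl_train$weekly_attendance, model = "lm") %>%
  bind_rows(rf_fit %>%
    predict(new_data = nfl_train) %>%
    mutate(truth = nfl_train$weekly_attendance, model = "rf"))

# training-set RMSE per model, for comparison with the test-set values above
results_train %>%
  group_by(model) %>%
  rmse(truth = truth, estimate = .pred)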

Resampling the data

As touched on earlier, in a real-life ML project we would not simply train models on the training set and then immediately apply them to the test set. The test set should not be touched until model selection is complete. Instead, we should take a resampling approach such as cross-validation.

Below, we use the vfold_cv() function from the rsample package to achieve this and then re-compute our metrics of interest for both models.

Linear model:

set.seed(1234)
nfl_folds <- vfold_cv(nfl_train, strata = playoffs)

lm_res <- fit_resamples(
  lm_spec,
  weekly_attendance ~ .,
  nfl_folds,
  control = control_resamples(save_pred = TRUE)
)

lm_res %>%
  collect_metrics() %>% 
  select(.metric, mean)
## # A tibble: 2 x 2
##   .metric     mean
##   <chr>      <dbl>
## 1 rmse    8335.   
## 2 rsq        0.148

Random forest model:

set.seed(1234)
nfl_folds <- vfold_cv(nfl_train, strata = playoffs)

rf_res <- fit_resamples(
  rf_spec,
  weekly_attendance ~ .,
  nfl_folds,
  control = control_resamples(save_pred = TRUE)
)

rf_res %>%
  collect_metrics() %>% 
  select(.metric, mean)
## # A tibble: 2 x 2
##   .metric     mean
##   <chr>      <dbl>
## 1 rmse    8617.   
## 2 rsq        0.117

So, using only the training set, we’ve obtained metric estimates very close to the ones we computed from the test set earlier. Using cross-validation in this way lets us be more economical with our data spending, and makes it possible to tune models without any risk of data leakage from the test set.
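Because we saved the predictions (save_pred = TRUE), we can also pull the held-out predictions from every fold with collect_predictions() and plot them against the truth - a quick sketch, shown here for the random forest resamples:

# held-out predictions from each of the ten folds for the random forest
rf_res %>%
  collect_predictions() %>%
  ggplot(aes(weekly_attendance, .pred, color = id)) +
  geom_abline(lty = 2, color = "gray50") +
  geom_point(alpha = 0.2) +
  scale_x_continuous(labels = scales::comma) +
  scale_y_continuous(labels = scales::comma) +
  labs(
    x = "Actual weekly attendance",
    y = "Predicted weekly attendance",
    color = "Fold"
  )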

Conclusions

In this post we:

  1. Loaded and explored the data, which allowed us to make more informed choices about how to model it later.
  2. Gained familiarity with the tidymodels framework of packages:
  • rsample for data splitting
  • parsnip for model specification and fitting
  • yardstick for model evaluation
  • rsample again, this time for cross-validation
  3. Established that, for this particular dataset, the linear model did a better job of predicting on new data than the random forest, which overfit to the training set.