Spicy Like Curry - To shoot or not to shoot?

Author

Nils Indreiten

Executive Summary

The idea for this project came from a conversation with a friend, we were discussing whether it is possible to predict whether a basketball player will make a shot or not. In specific, the question was whether Stephen Curry would make a shot given he attempts one. This project aims to predict whether Stephen Curry will make a shot at a play-by-play level, based on a variety of features.

The project is structured as follows: First general trends in the NBA are explored, honing in on the point guard position, which is Curry’s position. Then a few key features are selected and an XGBoost classifier is trained and tested. The model performance is then evaluated in unseen data, season 2024. The model was trained and tested on data from the 2022 and 2023 seasons. Finally, the findings are concluded and future research directions are suggested.

Exploratory Data Analysis & Data Wrangling

Data Collection

Since Stephen Curry is a point guard, let’s focus on the trends for that position. First, let’s take a look at the overall stats for all point guards in the NBA. The hoopR package let’s users retrieve NBA data at different levels of granularity, for this we need data at all levels of aggregations, therefore we retrieve team box, player box, and play by play data:

Show the code
# NBA team box data, 2022-2024 seasons:
nba_team_box <- hoopR::load_nba_team_box(2022:hoopR::most_recent_nba_season())

# NBA player box data, 2022-2024 seasons:
nba_player_box <- hoopR::load_nba_player_box(2022:hoopR::most_recent_nba_season())

# NBA play by play data, 2022-2024 seasons:
nba_pbp <- hoopR::load_nba_pbp(2022:hoopR::most_recent_nba_season())

We now have three dataframes, each denoting different information, then nba_team_box dataframe is the most broad, followed by nba_player_box and nba_pbp which are more granular. Next, using the nba_player_box dataframe as the starting point, we filter only for point guards and then aggregate the data to get a sense of the overall performance by player and team:

Show the code
# Team names data frame:

team_names <- nba_team_box %>% 
  select(team_name,team_display_name) %>% 
  unique()

# Filter nba_player_box, for point guards

nba_player_box %>% 
  filter(athlete_position_name == "Point Guard") -> point_guard_df

# Select columns that we need 
point_guard_df %>% 
  select(season,athlete_id,athlete_display_name,athlete_position_name,team_name,
         minutes,field_goals_made,three_point_field_goals_made,free_throws_made,
         rebounds,assists,offensive_rebounds,defensive_rebounds,steals,blocks) -> pg_stats

# PG stats summary  

pg_stats %>% 
  filter(season == 2023 & !grepl("Team",team_name)) %>%
  group_by(athlete_display_name, team_name) %>% 
  mutate(total_points =    sum(
    field_goals_made + three_point_field_goals_made + free_throws_made,
    na.rm = TRUE
  )) %>% 
  ungroup() %>%
  group_by(athlete_display_name, team_name) %>% 
  summarise(
  minutes = sum(minutes, na.rm = TRUE),
  field_goals_made = mean(field_goals_made, na.rm = TRUE),
  three_point_field_goals_made = mean(three_point_field_goals_made, na.rm = TRUE),
  free_throws_made = mean(free_throws_made, na.rm = TRUE),
  assists = mean(assists, na.rm = TRUE),
  rebounds = mean(offensive_rebounds + defensive_rebounds, na.rm = TRUE),
  total_points = sum(max(total_points))) %>%
  arrange(desc(total_points)) %>% 
  ungroup() -> pg_stats

pg_stats %>% 
  head(5) 
# A tibble: 5 × 9
  athlete_display_name team_name minutes field_goals_made three_point_field_go…¹
  <chr>                <chr>       <dbl>            <dbl>                  <dbl>
1 Trae Young           Hawks        2774             8.32                  2.18 
2 Shai Gilgeous-Alexa… Thunder      2417            10.4                   0.853
3 Luka Doncic          Mavericks    2389            10.9                   2.80 
4 Stephen Curry        Warriors     2435            10.2                   4.78 
5 Damian Lillard       Trail Bl…    2113             9.59                  4.21 
# ℹ abbreviated name: ¹​three_point_field_goals_made
# ℹ 4 more variables: free_throws_made <dbl>, assists <dbl>, rebounds <dbl>,
#   total_points <dbl>

The resulting data frame contains a summary of the player performance for the 2023 season, summarised by player & team. Let’s wrap the dataframe in a reactive framework, so that you (the reader), can filter and slice the data as you wish, where AFG refers to average field goals made, 3PAFG refers to average three point made, AFT refers to average free throws made, AST refers to average assists, REB refers to average rebounds, and TPTS refers to total points:

In terms of total points it looks like Stephen Curry comes in at 4th place, whilst Trae Young comes out on top. However, if we focus on the 3PAFG metric, Curry comes out on top. Let’s hone in on the top 6 point guards in terms of 2-pointers, 3-pointers, assists, rebounds and free throws made, scaling to accommodate the different magnitudes of each metric:

Show the code
pg_stats %>%
  select(team_name,athlete_display_name,field_goals_made,three_point_field_goals_made,free_throws_made,rebounds,assists) %>%
  rename(
    Team = team_name,
    Player = athlete_display_name,
    P2M = field_goals_made,
    P3M = three_point_field_goals_made,
    FTM = free_throws_made,
    REB = rebounds,
    AST = assists) %>% 
  mutate(Player = case_when(Team == "Nets" & Player == "Kyrie Irving" ~ "Kyrie Irving (BKN)",
                            Team == "Mavericks" & Player == "Kyrie Irving" ~ "Kyrie Irving (DAL)",
                            TRUE ~ Player)) %>%
  as.data.frame() %>% 
  arrange(desc(P2M)) %>% 
  slice_head(n = 6) -> Pbox.PG
  
attach(Pbox.PG)
X <- data.frame(P2M, P3M, FTM, REB, AST)

detach(Pbox.PG)
radialprofile(data=X, title=Pbox.PG$Player, std=TRUE,ncol.arrange = 3)

The first thing to note is that there are two entries for Kyrie Irving, one for the Brooklyn Nets and one for the Dallas Mavericks. This is due to a transfer sometime in the season, either way Irving’s performance in the Nets is comparable to his performance in Dallas, perhaps slightly better in Dallas at an overall level. Curry clearly differentiates himself in terms of 3-pointers, relative to the other point guards. Luka Doncic is more of a rebound and assist heavy point-guard, similarly Damian Lillard over indexes on assists. Shai Gilgeous Alexander stands out in terms of free throws made. Lets look at the shot chart for Stephen Curry in the 2023 season excluding free throws, how many shots did he make and miss?

Show the code
nba_pbp %>%  
  filter(season == 2023) %>% 
  select(coordinate_x_raw,coordinate_y_raw,scoring_play,clock_minutes) %>% 
  mutate(result=as.factor(if_else(scoring_play == TRUE, "made","missed")))-> curry_shots


subdata <- curry_shots %>% as.data.frame()
subdata$xx <- subdata$coordinate_x_raw-25
subdata$yy <- subdata$coordinate_y_raw-44

shotchart(data=subdata, x="xx", y="yy",z="result" ,type=NULL,scatter=TRUE)+
  theme(legend.position = "bottom")

We see a concentration of field goals made near the board, as well as quite an evenly distribution of shots along the 3-point line. There also seems to be a higher concentration of misses along the 3-point line, suggesting that whilst Curry might be making the most three points he is also missing a high number of 3-point shots. Let’s look at the density of shots made using a heatmap:

Show the code
shotchart(data=subdata, x="xx", y="yy", type="density-raster",scatter=FALSE)

This reveals a similar pattern, with a high density of shots made near the board and a large number of shots made along the top of the 3-point line.

Modelling

For this modelling exercise we will consider the play-by-play data and try to predict the outcome of a play, excluding free throw plays for Stephen Curry. we will need to do some data wrangling to retrieve some key information at the play-by-play level, such as the opponent athlete name, game id and filtering for all seasons equal to or less than 2023. These steps are outlined below:

First we get a distinct list of players and game IDs for the Golden State Warriors (Steph Curry’s team), this will help up us match the opponent athlete and the respective play-by-play data:

Show the code
# Get the distinct set of player ids:
nba_player_box %>%
  select(athlete_id,athlete_display_name) %>% 
  distinct() %>%
  arrange(athlete_id)-> player_ids

player_ids %>% 
  group_by(athlete_id) %>%
  mutate(row = row_number()) %>%
  filter(row==1) %>% 
  ungroup() %>% 
  select(athlete_id,athlete_display_name) -> player_ids

# Get Game IDs for the Golden State Warriors
nba_team_box %>% 
  filter(season <= 2023 & team_display_name == "Golden State Warriors") %>%
  select(game_id,opponent_team_name) -> game_ids

Second, we combine the player_ids and game_ids dataframes with the play-by-play data, giving us the base data for the modelling:

Show the code
nba_pbp %>%  
  filter(season <= 2023, 
           athlete_id_1 == 3975 & # Filter for Curry
           shooting_play == TRUE & # And only shooting plays
           !grepl("Free Throw",type_text)) %>% # exclude free throws
  select(game_id,coordinate_x_raw,coordinate_y_raw,scoring_play,
         clock_minutes, athlete_id_2,clock_minutes,clock_seconds,qtr) %>% 
  left_join(player_ids,by=c("athlete_id_2"="athlete_id")) %>% 
  inner_join(game_ids, by=join_by(game_id)) %>% 
  select(-athlete_id_2) %>% 
  rename(oponent_athlete=athlete_display_name) %>% 
  mutate(oponent_athlete = if_else(is.na(oponent_athlete),"no direct oponent",oponent_athlete),
         scoring_play = fct_rev(as_factor(scoring_play))) -> base_steph_pbp
# A tibble: 3,107 × 9
     game_id coordinate_x_raw coordinate_y_raw scoring_play clock_minutes
       <dbl>            <dbl>            <dbl> <fct>                <dbl>
 1 401442535               20                4 FALSE                    8
 2 401442535               36               29 FALSE                    5
 3 401442535               24                6 TRUE                     4
 4 401442535               48                9 TRUE                     0
 5 401442535               30                4 FALSE                    8
 6 401442535               22                9 TRUE                     5
 7 401442535               25               28 TRUE                     4
 8 401442535               26                4 TRUE                     3
 9 401442535               31                7 FALSE                    1
10 401442535               12               23 TRUE                     9
# ℹ 3,097 more rows
# ℹ 4 more variables: clock_seconds <dbl>, qtr <dbl>, oponent_athlete <chr>,
#   opponent_team_name <chr>

With our base dataframe, we can now create the training, testing and cross-validation folds:

Show the code
set.seed(123)
steph_split <- initial_split(base_steph_pbp, prop = 0.8, strata = scoring_play)
steph_train <- training(steph_split)
steph_test <- testing(steph_split)

steph_folds <- vfold_cv(steph_train, v = 10, strata = scoring_play)

With the base data in place, we are now able to create the splits for training and testing the model. We will use a 80/20 split for training and testing, and a 10-fold cross validation splits:

Show the code
# Create splits:
set.seed(123)
steph_split <- initial_split(base_steph_pbp, prop = 0.8, strata = scoring_play)
steph_train <- training(steph_split)
steph_test <- testing(steph_split)

steph_folds <- vfold_cv(steph_train, v = 10, strata = scoring_play)

The next step is to create a recipe for the model, we do this below, with the game_id variable as the ID variable and one hot encoding for all the nominal variables. The dependent variable is scoring_play, which will be a function of Curry’s position on the court (coordinate_x_raw, coordinate_y_raw), the opponent athlete, the quarter, the minutes and seconds left on the clock:

Show the code
steph_recipe <- recipe(scoring_play ~ ., data = steph_train) %>%
  update_role(game_id, new_role = "ID") %>% # Add ID step, for game level info
  step_novel(all_nominal(), -all_outcomes()) %>%
  step_dummy(all_nominal(), -all_outcomes()) 

In addition to creating dummy variables for the nominal variables, we will also account for the uniqueness of the play. That is, because there is an element of randomness in a basketball game, some of the players Curry’s has been up against may not be in the test set or vice versa. To account for this, we will use step_novel. Now that we have the recipe, we can create the model workflow:

Show the code
# Model spec:
xgb_spec <- boost_tree(
  trees = 5000,
  tree_depth = tune(),
  min_n = tune(),
  loss_reduction = tune(),
  sample_size = tune(),
  mtry = tune(),
  learn_rate = tune()
) %>%
  set_engine("xgboost") %>%
  set_mode("classification")

# Model workflow:
steph_workflow <- workflow() %>%
  add_recipe(steph_recipe) %>%
  add_model(xgb_spec)

# Tuning grid:
xgb_grid <- grid_latin_hypercube(
  tree_depth(),
  min_n(),
  loss_reduction(),
  sample_size = sample_prop(),
  finalize(mtry(), steph_train),
  learn_rate(),
  size = 30
)

doParallel::registerDoParallel() # Activate parallel computing
set.seed(1234)

# Results of the tuning:
xgb_results <-  tune_grid(
  steph_workflow,
  resamples = steph_folds,
  grid = xgb_grid,
  control = control_grid(save_pred = TRUE)
)

Let’s take a look at the fluctuation in terms of performance according to the model parameters:

Show the code
xgb_results %>%
  collect_metrics() %>%
  filter(.metric == "roc_auc") %>%
  select(mean, mtry:sample_size) %>%
  pivot_longer(mtry:sample_size,
               values_to = "value",
               names_to = "parameter"
  ) %>%
  ggplot(aes(value, mean)) +
  geom_point(alpha = 0.8, show.legend = FALSE, color="indianred") +
  facet_wrap(~parameter, scales = "free_x") +
  labs(x = NULL, y = "AUC")+
  scale_color_brewer(palette = "Spectral")+
  theme_minimal()

It looks like the AUC metric is stabilising at around 0.75-0.80. With these results in mind we finalise the workflow and perform the last fit:

Show the code
# Select best model and finalise workflow:

best_auc_model <- select_best(xgb_results,"roc_auc")

final_xgb <- finalize_workflow(
  steph_workflow,
  best_auc_model)

final_xgb %>%
  fit(data = steph_train) %>%
  extract_fit_parsnip() %>%
  vip(geom = "point") +
  theme_minimal()

Show the code
final_res <- last_fit(final_xgb, steph_split)

The variable with the largest influence on whether Curry will score, is whether there is a direct opponent or not, i.e. there is a much higher likelihood that the shot will go in if Curry is not against a specific opponent. The next two most important variables are coordinate_x_raw and coordinate_y_raw, which denote the position on the field. We can also plot the area under the curve and confusion matrix:

Show the code
final_xgb %>%
  fit(data = steph_train) -> fit_steph

augment(fit_steph,steph_test)%>% 
  roc_curve(truth = scoring_play, .pred_TRUE) %>% 
  autoplot()

The model has a area under the curve of 0.78 and an accuracy of 0.73, the confusion matrix is displayed below:

Show the code
collect_predictions(final_res) %>%
  conf_mat(scoring_play, .pred_class) %>%
  autoplot()+
  theme_minimal()

It looks like the model is better at predicting shots that don’t go in, whereas its accuracy is less clear cut when it comes to shots that actually went in.

Previously unseen data - 2024 NBA season

Up until now we have trained the model from the 2022 season up until the 2023 season. Let’s get the 2024 season data and make our predictions on that, to see how the model would perform with unseen data:

Show the code
# Get the data for the new season:
# So we filtered the data for the Golden State Warriors:

# NBA Team box data:
nba_team_box %>% 
  filter(season == 2024 & team_display_name == "Golden State Warriors") %>%
  select(game_id,opponent_team_name) -> game_ids
  
# NBA play-by-play data:  
nba_pbp %>%  
  filter(season == 2024) %>% 
  select(game_id,coordinate_x_raw,coordinate_y_raw,scoring_play,
         clock_minutes, athlete_id_2,clock_minutes,clock_seconds,qtr) %>% 
  left_join(player_ids,by=c("athlete_id_2"="athlete_id")) %>% 
  inner_join(game_ids, by=join_by(game_id)) %>% 
  select(-athlete_id_2) %>% 
  rename(oponent_athlete=athlete_display_name) %>% 
  mutate(oponent_athlete = if_else(is.na(oponent_athlete),"no direct oponent",oponent_athlete),
         scoring_play = fct_rev(as_factor(scoring_play))) -> base_steph_pbp_2024

Now we have Curry’s 2024 season play-by-play data. Before predicting on the 2024 data, we fit the model to the whole training data:

Show the code
# Fit the model to the whole data: 

final_xgb %>%
    fit(data = base_steph_pbp) -> fit_steph

augment(fit_steph, base_steph_pbp_2024) %>%
  roc_auc(truth = scoring_play, .pred_TRUE)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc binary         0.803
Show the code
# Assign to predictions to an object:

augment(fit_steph, base_steph_pbp_2024) -> steph_preds_2024

The model gives us an area under the curve of 80%. We can also break down the predictions according to specific sections of the basketball court and also overlay the time on the clock when the shot was made:

Show the code
# Plot the results:

steph_preds_2024 %>% 
  select(coordinate_x_raw,coordinate_y_raw,scoring_play,clock_minutes,.pred_class) %>%
  mutate(result=as.factor(if_else(.pred_class == TRUE,"made","missed")))->curry_shots_2024

subdata <- curry_shots_2024 %>% as.data.frame()
subdata$xx <- subdata$coordinate_x_raw-25
subdata$yy <- subdata$coordinate_y_raw-44

shotchart(data=subdata, x="xx", y="yy", z="clock_minutes", 
          num.sect=4, type="sectors", scatter=FALSE, result="result")+
  theme(legend.position = "bottom")+
  guides(fill = guide_legend(title = "Clock Minutes",title.position = "top",title.hjust = 0.5))

According to the model predictions, Curry is more likely to score when he is near the basket. Similarly, the model predicts that somewhere between 40-50% of shots from the bottom left and right corners of the court will go in.

Let’s take a look at the model results, specifically focusing on false negatives and positives:

Show the code
# Plot the results:

steph_preds_2024 %>% 
  select(coordinate_x_raw,coordinate_y_raw,scoring_play,.pred_class,clock_minutes) %>%
  mutate(result=as.factor(if_else(scoring_play == TRUE,"made","missed"))) %>% 
  mutate(result2 = as.factor(case_when(
    scoring_play == TRUE & .pred_class == FALSE ~ "false negative",
    scoring_play == FALSE & .pred_class == TRUE ~ "false positive",.default = "Truth"))) %>% 
  filter(result2 != "Truth") %>% 
  select(-result) %>% 
  rename(result=result2) -> curry_shots_2024


subdata <- curry_shots_2024 %>% as.data.frame()
subdata$xx <- subdata$coordinate_x_raw-25
subdata$yy <- subdata$coordinate_y_raw-44

shotchart(data=subdata, x="xx", y="yy",z="result" ,type=NULL,scatter=TRUE)+
  theme(legend.position = "bottom")

It looks like the model predicts more false negatives for three pointers, i.e. along the three point line, as shown by the red points. In contrast, for false positives the concentration is mostly around the backboard, there is also a concentration of false negatives slightly to the right of the backboard, however for the most part it seems there are more false positives around that area. In terms of accuracy, the model achieves 75% accuracy, which is not great, but not terrible either.

Conclusion & Future Research Directions

This project explores point guard trends in the NBA from season 2022 to 2024. It begins by visualising and aggregating key stats by player for the 2023 season. Next, we honed in on the top 6 point guards in the league, comparing their key stats, where Curry clearly stood out in terms of 3-pointers. We then refocused on Curry’s play by play data, given that was the original starting point of the project. Curry’s shot was overlayed onto the basketball court unveiling patterns in his shots.

The second part of the project involves building an XGBoost classifier, that predicts whether Curry’s shots will go in or not. The model was trained on the 2022 and 2023 seasons, and then used to predict plays in the 2024 season. The model achieved an AUC of 80% on the 2024 season data, with an accuracy of 75%. The play-by-play predictions were overlayed onto the basketball court, showing where the model predicted Curry would score, most importantly focusing on false negatives and false positives.

Future iterations of the model should include additional data points, such as age and height, to see if these factors have an impact on the model’s predictions. Similarly, training match data might help make the model predictions more robust. Another avenue of exploration, would be to experiment with feature engineering, such as using PCA to reduce the dimensionality of the data.