Sports analytics is a vibrant field in which data practitioners apply statistical methods for predictive analytics in a sports context. This project engages in that realm, homing in on football. The dataset comes from Kaggle and contains player attributes recorded at certain dates; only the most recent update is used for this project. The main aim of the project is to predict player valuation in euros. To do so, the dataset is first explored to identify key trends and characteristics. Next, an XGBoost model is trained and hyperparameter-tuned. Finally, the model outputs are examined, a model exploration dashboard is presented, and future directions for analysis are recommended. Let’s begin!
The original dataset had many more columns, but these were filtered for simplicity; future analyses may wish to incorporate more of the columns included in the original dataset. In addition, only the top five leagues in Europe are part of the dataset: Ligue 1, Serie A, Bundesliga, La Liga and the Premier League. Before we begin our EDA, it is useful to get a sense of some of the columns in the data:
fc24 %>%
glimpse()
## Rows: 3,467
## Columns: 20
## $ player_id <dbl> 24630, 49472, 106795, 118794, 138412, 140233, 140293,…
## $ long_name <chr> "José Manuel Reina Páez", "Ludovic Butelle", "Gianluc…
## $ league_name <chr> "La Liga", "Ligue 1", "Serie A", "Premier League", "P…
## $ club_name <chr> "Villarreal", "Stade de Reims", "Sassuolo", "Everton"…
## $ player_positions <chr> "GK", "GK", "GK", "GK", "CM, RB", "GK", "GK", "ST", "…
## $ overall <dbl> 77, 68, 69, 58, 77, 81, 71, 75, 80, 78, 68, 76, 72, 8…
## $ potential <dbl> 77, 68, 69, 58, 77, 81, 71, 75, 80, 78, 68, 76, 72, 8…
## $ value_eur <dbl> 1200000, 130000, 150000, 25000, 3300000, 2900000, 210…
## $ wage_eur <dbl> 14000, 4000, 4000, 3000, 52000, 21000, 14000, 22000, …
## $ age <dbl> 40, 40, 42, 39, 37, 37, 39, 37, 37, 30, 39, 37, 37, 3…
## $ height_cm <dbl> 188, 188, 184, 192, 175, 185, 193, 186, 172, 191, 187…
## $ weight_kg <dbl> 92, 84, 76, 87, 70, 78, 79, 86, 60, 84, 80, 65, 75, 8…
## $ weak_foot <dbl> 3, 3, 2, 1, 4, 3, 2, 2, 2, 3, 3, 3, 3, 3, 2, 3, 3, 3,…
## $ skill_moves <dbl> 1, 1, 1, 1, 3, 1, 1, 3, 4, 3, 1, 4, 1, 3, 1, 2, 1, 1,…
## $ pace <dbl> NA, NA, NA, NA, 53, NA, NA, 50, 82, 47, NA, 65, NA, 5…
## $ shooting <dbl> NA, NA, NA, NA, 70, NA, NA, 79, 68, 78, NA, 66, NA, 6…
## $ passing <dbl> NA, NA, NA, NA, 79, NA, NA, 64, 79, 74, NA, 77, NA, 7…
## $ dribbling <dbl> NA, NA, NA, NA, 76, NA, NA, 70, 80, 67, NA, 75, NA, 7…
## $ defending <dbl> NA, NA, NA, NA, 76, NA, NA, 36, 76, 77, NA, 76, NA, 8…
## $ physic <dbl> NA, NA, NA, NA, 75, NA, NA, 75, 55, 81, NA, 71, NA, 7…
We get a general view of each player and their attributes, such as age, height and weight. Play-related information is also included, such as shooting, pace and player position. The following section explores the relationships between these variables.
Intuitively we may think that as overall rating increases, so too does the player valuation. The plot below explores this relationship, colored by league.
fc24 %>%
  ggplot(aes(x = overall, y = value_eur, color = league_name)) +
  geom_jitter(alpha = 0.2) +
  scale_color_brewer(palette = "Set1") +
  scale_y_continuous(labels = scales::dollar_format(prefix = "€")) +
  labs(x = "Overall rating",
       y = "Valuation of players (in euros)") +
  theme_minimal(base_size = 10) +
  my_base_theme() +
  theme(legend.position = "bottom",
        legend.title = element_blank())
Figure 1. Relationship between valuation and overall rating.
The distribution seems to follow an exponential curve: the valuation of players gradually increases with their rating until a certain point is reached, after which the valuation increases drastically. We see that players with an overall rating of roughly 80 or above command high valuations, and players rated 90 or above have very high valuations. The visual is broken down by league in the figure below:
fc24 %>%
  ggplot(aes(x = overall, y = value_eur, color = league_name)) +
  geom_jitter(alpha = 0.2) +
  geom_smooth(se = FALSE) +
  facet_wrap(~league_name) +
  scale_color_brewer(palette = "Set1") +
  scale_y_continuous(labels = scales::dollar_format(prefix = "€")) +
  labs(x = "Overall rating",
       y = "Valuation of players (in euros)") +
  theme_minimal(base_size = 10) +
  my_base_theme() +
  theme(legend.position = "none")
Figure 2. Relationship between overall rating and valuation by league.
Here we see more clearly that this relationship repeats itself across all leagues. The extreme valuations are most evident in the Premier League, La Liga and Ligue 1. However, the Bundesliga and Serie A still see high valuations relative to overall player rating.
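One way to quantify this exponential pattern is a simple log-linear regression of valuation on rating; a minimal sketch, assuming fc24 is loaded as above:

```r
# Fit log(valuation) on overall rating (assumes `fc24` from above)
log_fit <- lm(log(value_eur) ~ overall, data = fc24)

# The slope is the change in log-valuation per rating point;
# exponentiating it gives the multiplicative effect of one extra point
exp(coef(log_fit)[["overall"]])
```

If the slope were, say, 0.25, each extra rating point would multiply the valuation by exp(0.25) ≈ 1.28, i.e. a roughly constant percentage increase, which is exactly the exponential shape seen in the plot.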
It would also be useful to understand how the distribution of player valuation differs across leagues; this is shown in the figure below:
fc24 %>%
  ggplot(aes(reorder(league_name, -value_eur), value_eur, fill = league_name)) +
  geom_violin() +
  geom_boxplot(alpha = 0.2) +
  theme_minimal(base_size = 10) +
  scale_fill_manual(values = thematic::okabe_ito(6)) +
  scale_y_continuous(labels = scales::comma, trans = "log10") +
  my_base_theme() +
  theme(legend.position = "none",
        legend.title = element_blank()) +
  labs(x = "League",
       y = "Valuation in euros")
Figure 3. Distribution of valuation across leagues in log10 scale.
The violin plot shows the distribution of player valuation in euros for each league. Note that the values are on a log10 scale, to better display the valuations that are dramatically higher in some leagues, particularly the Premier League. The graph is ordered in descending order, highlighting the central tendencies across leagues: notably, the Premier League has the highest median, whilst the Bundesliga has the lowest.
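The medians read off the boxplots can also be computed directly; a quick sketch, assuming fc24 as above:

```r
# Median valuation per league, highest first (assumes `fc24` from above)
fc24 %>%
  group_by(league_name) %>%
  summarise(median_value_eur = median(value_eur)) %>%
  arrange(desc(median_value_eur))
```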
The key player attributes are defined as shooting, defending, passing, dribbling, physicality and pace. Let’s uncover how these differ across leagues:
fc24 %>%
select(league_name, pace, shooting, passing, dribbling, defending, physic) %>%
na.omit() %>%
pivot_longer(cols = -league_name) %>%
group_by(league_name, name) %>%
summarise(mean = mean(value)) %>%
ungroup() %>%
ggplot(aes(reorder(interaction(league_name, name, sep = "_"), -mean), mean, fill = name)) +
geom_col(width = .8) +
facet_wrap(~league_name, scales = "free") +
theme_minimal(base_size = 10) +
scale_fill_manual(values = thematic::okabe_ito(6)) +
my_base_theme() +
theme(legend.position = "none",
legend.title = element_blank()) +
labs(title = "",
x = "",
y = "Average value") +
geom_text(aes(label = round(mean, 0)), vjust = .5, size = 5, hjust = 1.2, color = "white") +
scale_x_discrete(labels = function(x) sapply(strsplit(x, "_"), tail, n=1)) +
coord_flip()
Figure 4. Key attribute averages by league.
At an overall level, the average player stats are similar across the leagues. This is perhaps not too surprising given that these are the top leagues in Europe.
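The "similar across leagues" observation can be checked numerically by looking at how much the league-level means spread out for each attribute; a sketch, assuming fc24 as above:

```r
# Range of league-level means for each key attribute (assumes `fc24` from above);
# a small range indicates the leagues are similar on that attribute
fc24 %>%
  select(league_name, pace, shooting, passing, dribbling, defending, physic) %>%
  na.omit() %>%
  pivot_longer(cols = -league_name) %>%
  group_by(league_name, name) %>%
  summarise(mean = mean(value), .groups = "drop") %>%
  group_by(name) %>%
  summarise(range_of_means = max(mean) - min(mean))
```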
What is the average player valuation, and how certain are we of this estimate? First, the average player valuation, using the mean as the measure of central tendency:
fc24 %>%
summarise(mean_value_eur = mean(value_eur)) %>%
pull() -> obs_mean
obs_mean
## [1] 8785141
The mean valuation in our sample is €8,785,141. How sure are we about this, and with what level of certainty? To address this, and to get a sense of the range within which the valuation may vary season on season, we can use bootstrap resampling:
set.seed(123) # for reproducibility
bootstrap_distrib <- fc24 %>%
infer::specify(response = value_eur) %>%
generate(reps = 1000, type = "bootstrap") %>%
calculate(stat = "mean")
conf_int <- bootstrap_distrib %>%
get_confidence_interval(level = 0.95, point_estimate = obs_mean)
The bootstrap_distrib object resamples the data 1,000 times and calculates the mean for each resample. We visualise the distribution of these means below, along with a 95% confidence interval:
ggplot(bootstrap_distrib, aes(x = stat)) +
  geom_histogram(bins = 30, fill = "midnightblue", boundary = 0, color = "white") +
  geom_vline(aes(xintercept = obs_mean), color = "darkred", linewidth = 1) +
  geom_vline(aes(xintercept = conf_int$lower_ci), color = "blue", linetype = "dashed", linewidth = 1.5) +
  geom_vline(aes(xintercept = conf_int$upper_ci), color = "blue", linetype = "dashed", linewidth = 1.5) +
  labs(x = "Mean Valuation",
       y = "Frequency") +
  theme_minimal(base_size = 10) +
  my_base_theme() +
  scale_y_continuous(expand = c(0.005, 0.005)) +
  scale_x_continuous(labels = scales::dollar_format(prefix = "€ ")) +
  annotate(
    "text", x = obs_mean, y = 5, # adjust y to place the label
    label = paste("Mean =", scales::dollar_format(prefix = "€ ")(obs_mean)),
    hjust = .10,                 # horizontal adjustment of the label
    vjust = -.5,                 # vertical adjustment of the label
    color = "white",
    size = 4,                    # text size
    angle = 90,
    fontface = "bold"
  )
Figure 5. Bootstrap distribution of average player valuation.
At a 95% confidence level, we can expect the mean to lie between €8,273,882 and €9,296,300.
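As a sanity check, a very similar interval can be computed directly from the bootstrap draws using the percentile method; a minimal sketch, assuming bootstrap_distrib from above:

```r
# The 2.5th and 97.5th percentiles of the 1,000 bootstrap means
# should closely match the interval from get_confidence_interval()
quantile(bootstrap_distrib$stat, probs = c(0.025, 0.975))
```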
Prior to modeling it would be useful to get a sense of the extent to which variables are correlated with each other:
fc24 %>%
  select(pace, shooting, passing, dribbling, defending, physic, overall, potential,
         age, height_cm, weight_kg, skill_moves, wage_eur, value_eur) %>%
  na.omit() %>%
  cor() %>%
  corrplot::corrplot(method = 'shade')
Figure 6. Correlation between numeric variables.
The majority of the numeric variables in the dataset are positively correlated. In terms of the player attributes, pace, shooting, passing and dribbling are positively correlated. In contrast, height and weight are negatively correlated with pace, shooting, passing and dribbling, suggesting that these attributes decline as height and weight increase. Unsurprisingly, wage and value are highly positively correlated, similarly these are positively correlated with overall rating.
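To make the correlations with the target easier to scan than the full matrix, we can pull out the value_eur column and sort it; a sketch, assuming fc24 as above:

```r
# Correlation of each numeric variable with valuation, strongest first
# (assumes `fc24` from above)
cors <- fc24 %>%
  select(pace, shooting, passing, dribbling, defending, physic, overall, potential,
         age, height_cm, weight_kg, skill_moves, wage_eur, value_eur) %>%
  na.omit() %>%
  cor()

sort(cors[, "value_eur"], decreasing = TRUE)
```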
We are now ready to model our dataset. There are a few pre-processing steps: the first is to create dummy variables for the league_name variable, and the second is a log transformation of the dependent variable to reduce its skew. Another notable pre-processing step is to remove the NA values. The recipe is defined below:
set.seed(1234)
fc24_splits <- initial_split(fc24_model, prop = 0.8, strata = league_name) # Initial split object
fc24_training <- training(fc24_splits) # Training split
fc24_testing <- testing(fc24_splits) # Testing split
f24_folds <- vfold_cv(fc24_training, strata = value_eur, v = 10) # Create validation folds
fc24_recipe <- recipe(value_eur ~ league_name + overall + potential +
                        age + height_cm + weight_kg + weak_foot + skill_moves +
                        pace + shooting + passing + dribbling + defending + physic,
                      data = fc24_model) %>%
  step_naomit(all_numeric_predictors()) %>% # Omit NA values
  step_dummy(all_nominal()) %>% # Create dummy variables for league_name
  step_log(value_eur) # Log transformation of the valuation variable
We can take a look at how this transforms the data:
fc24_recipe %>%
prep() %>%
bake(new_data = NULL)
## # A tibble: 3,065 × 18
## overall potential age height_cm weight_kg weak_foot skill_moves pace
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 77 77 37 175 70 4 3 53
## 2 75 75 37 186 86 2 3 50
## 3 80 80 37 172 60 2 4 82
## 4 78 78 30 191 84 3 3 47
## 5 76 76 37 175 65 3 4 65
## 6 83 83 37 184 82 3 3 57
## 7 81 81 37 190 82 3 2 36
## 8 78 78 39 189 91 4 3 32
## 9 84 84 38 183 79 3 2 51
## 10 74 74 32 190 83 4 3 82
## # ℹ 3,055 more rows
## # ℹ 10 more variables: shooting <dbl>, passing <dbl>, dribbling <dbl>,
## # defending <dbl>, physic <dbl>, value_eur <dbl>, league_name_La.Liga <dbl>,
## # league_name_Ligue.1 <dbl>, league_name_Premier.League <dbl>,
## # league_name_Serie.A <dbl>
We see that the NA values have been removed, the value_eur variable is now on the log scale, and there are four new dummy variables, one for each league other than the reference level (the Bundesliga). Next, we specify the XGBoost model, flagging the trees, mtry and min_n parameters for tuning. The model spec and recipe are put together in a workflow object:
xgb_spec <-
boost_tree(
trees = tune(),
mtry = tune(),
min_n = tune(),
learn_rate = 0.01
) %>%
set_engine("xgboost") %>%
set_mode("regression")
xgb_wf <- workflow(fc24_recipe, xgb_spec)
xgb_wf
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: boost_tree()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 3 Recipe Steps
##
## • step_naomit()
## • step_dummy()
## • step_log()
##
## ── Model ───────────────────────────────────────────────────────────────────────
## Boosted Tree Model Specification (regression)
##
## Main Arguments:
## mtry = tune()
## trees = tune()
## min_n = tune()
## learn_rate = 0.01
##
## Computational engine: xgboost
To tune the hyperparameters, we create a Latin hypercube grid:
grid <- grid_latin_hypercube(
trees(),
finalize(mtry(),fc24_training),
min_n(),
size = 20
)
grid
## # A tibble: 20 × 3
## trees mtry min_n
## <int> <int> <int>
## 1 175 6 26
## 2 1128 13 18
## 3 1831 7 23
## 4 466 19 34
## 5 559 14 20
## 6 755 8 16
## 7 892 16 9
## 8 1782 18 36
## 9 1311 17 5
## 10 1259 11 39
## 11 647 20 23
## 12 1696 1 3
## 13 286 9 12
## 14 1511 7 32
## 15 58 3 29
## 16 1012 4 35
## 17 317 12 28
## 18 1998 2 14
## 19 1406 15 11
## 20 961 10 6
The grid is a tibble of candidate values to be passed to the model’s parameters during tuning. Parallel computing is activated and the model is tuned using the grid and the resampling folds:
doParallel::registerDoParallel()
set.seed(234)
xgb_res <- tune_grid(
xgb_wf,
resamples = f24_folds,
grid = grid,
control = control_grid(save_pred = TRUE)
)
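Calling registerDoParallel() with no arguments sets up an implicit cluster. If you prefer explicit control over the workers, a common pattern (not part of the original analysis) is to create and tear down the cluster yourself:

```r
# Explicit cluster set-up and tear-down around the tuning call
cl <- parallel::makePSOCKcluster(parallel::detectCores(logical = FALSE) - 1)
doParallel::registerDoParallel(cl)

# ... run tune_grid() here ...

parallel::stopCluster(cl) # release the workers
foreach::registerDoSEQ()  # fall back to sequential execution
```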
With the xgb_res object we can see how the different values of the grid affect performance.
xgb_res %>%
autoplot() +
theme_minimal()+
my_base_theme()
Figure 7. Effect of parameters on rmse and rsq.
There seem to be a few combinations that minimise rmse while maximising rsq. Let’s select the combination of parameters that minimises rmse and finalise the workflow:
best_xgb <- xgb_res %>%
select_best("rmse")
final_xgb <- finalize_workflow(xgb_wf, best_xgb)
final_xgb
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: boost_tree()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 3 Recipe Steps
##
## • step_naomit()
## • step_dummy()
## • step_log()
##
## ── Model ───────────────────────────────────────────────────────────────────────
## Boosted Tree Model Specification (regression)
##
## Main Arguments:
## mtry = 16
## trees = 1637
## min_n = 3
## learn_rate = 0.01
##
## Computational engine: xgboost
We can get the feature importance by pulling the fitted model and using the vip function from the vip package:
last_fit(final_xgb, fc24_splits) %>%
extract_fit_parsnip() %>%
vip::vip(num_features = 5)+
theme_minimal()+
my_base_theme()
Figure 8. Feature importance.
It looks like the five variables with the greatest impact on valuation are overall player rating, potential, age, dribbling and shooting. Overall rating is by far the most important variable for player valuation, followed by potential and age.
final_res %>%
collect_predictions() %>%
select(.pred,value_eur) %>%
mutate(.pred = exp(.pred), value_eur = exp(value_eur)) %>%
ggplot(aes(value_eur,.pred))+
geom_smooth(method = "lm", linetype="dashed",se=FALSE, color="black",linewidth=.5)+
geom_point(alpha=0.2, color="darkgrey")+
labs(x ="Actual valuation",
y="Predicted valuation")+
theme_minimal()+
scale_y_continuous(labels = scales::dollar_format(prefix = "€ ",scale = 1e-6,suffix = "M"))+
scale_x_continuous(labels = scales::dollar_format(prefix = "€ ",scale = 1e-6,suffix = "M"))+
my_base_theme()+
theme(plot.margin = margin(10, 18, 10, 10))
Figure 9. Predicted and actual player valuation.
The model seems to predict players with lower valuations more accurately, whereas for higher valuations it is less accurate. However, overall the model retains performance across the range of values. How far off are the predictions from the actual valuations on average? Both the mean and median variance, in absolute terms, are shown below for the last fit on the splits:
final_res %>%
collect_predictions() %>%
select(.pred,value_eur) %>%
mutate(variance = abs(exp(.pred)-exp(value_eur))) %>%
summarise(median_variance = median(variance),
mean_variance = mean(variance))
## # A tibble: 1 × 2
## median_variance mean_variance
## <dbl> <dbl>
## 1 94063. 320776.
The median variance is €94,062.64 and the mean variance is €320,776.50; in other words, some values drive the average variance up. These are likely the higher valuations we see in Figure 9. Let’s generate a SHAP plot to understand how the feature values affect the predicted valuation:
fc24_recipe %>%
prep()->fc24_prepped
fc24_shap <-
shap.prep(
xgb_model = extract_fit_engine(final_res),
X_train = bake(
fc24_prepped,
has_role("predictor"),
new_data = NULL,
composition = "matrix"
)
)
shap.plot.summary(fc24_shap)
Figure 10. SHAP values summary.
The SHAP plot above shows that the variable with the highest importance is overall rating: high overall ratings have a positive effect on the predicted valuation, while low overall ratings have a negative effect. In other words, the lower a player’s overall rating, the lower their valuation, and the higher the rating, the higher the valuation. There is a similar trend for a player’s potential. In contrast, higher values of age have a negative effect on the predicted valuation: the older the player, the lower their valuation. Let’s take a look at how the features impact individual observations:
fc24_small <- fc24_model %>%
slice_head(n=10) %>%
na.omit()
fc24_small_prep <- bake(
prep(fc24_recipe),
has_role("predictor"),
new_data = NULL,
composition = "matrix"
)
shap <- shapviz(extract_fit_engine(final_res),
X_pred = fc24_small_prep,
x = fc24_small)
sv_force(shap,row_id = 1)
Figure 11. Force plot for the first observation.
The force plot above shows more clearly how the value of each variable affects the prediction. For this specific observation the predicted valuation is 14.8 on the log scale, which is €2,676,445. This player is James Philip Milner, whose actual valuation is €3,330,000. Let’s take a look at another observation:
sv_force(shap,row_id = 4)
Figure 12. Force plot for the fourth observation.
This is another player aged 30, with an overall rating of 78. The predicted value of this player is €13,356,519; the actual valuation is €13,500,000, so the prediction wasn’t far off! Let’s take a random sample of players and output their predictions, with a few options for picking model diagnostics plots.
fit_xgb_boost <- boost_tree(
trees = 1041,
mtry = 13,
min_n = 3,
learn_rate = 0.01
) %>%
set_engine("xgboost") %>%
set_mode("regression") %>%
  fit(value_eur ~ overall + potential + # The recipe removes NA values and logs the dependent variable
        age + height_cm + weight_kg + weak_foot + skill_moves +
        pace + shooting + passing + dribbling + defending + physic,
      data = fc24_recipe %>%
        prep() %>%
        bake(fc24_training))
set.seed(1234)
fc24_recipe %>%
prep() %>%
bake(new_data = fc24_testing %>% na.omit()) %>%
slice_sample(n=10) -> test_df
explainer_xgb <- DALEX::explain(fit_xgb_boost,
data = test_df,
y = test_df$value_eur)
## Preparation of a new explainer is initiated
## -> model label : model_fit ( default )
## -> data : 10 rows 18 cols
## -> data : tibble converted into a data.frame
## -> target variable : 10 values
## -> predict function : yhat.model_fit will be used ( default )
## -> predicted values : No value for predict function target column. ( default )
## -> model_info : package parsnip , ver. 1.1.1 , task regression ( default )
## -> predicted values : numerical, min = 12.45435 , mean = 14.95609 , max = 17.17497
## -> residual function : difference between y and yhat ( default )
## -> residuals : numerical, min = -0.06908138 , mean = -0.002746482 , max = 0.05827043
## A new explainer has been created!
new_options <- ms_options(ms_title="Player Valuation Model Diagnostics (log scale)")
modelStudio::modelStudio(explainer_xgb, new_observation_n=5,options = new_options)
The default plot is the global feature importance, which we explored in previous figures. There is also a local breakdown, which plots the contribution of each variable to the predicted value (on the log scale) for a specific observation. A good set-up for the plots is Global feature importance, Shapley, Local breakdown, and Average vs. Target. The dashboard is interactive: values you select in one plot affect the visuals in the other plots, and you can select the observation id from the top right-hand corner. These give you a good idea of model performance; in addition, model metrics are listed in the bottom left corner.
We are able to generate accurate predictions, but we do not yet know how uncertain those predictions are. In other words, can we generate intervals for our predicted values, such that we can provide each prediction with an expected range at a 95% confidence level? Luckily, we can easily do so using the probably package. First, we return to the tuning process, specifying extract = I in the control argument:
set.seed(2)
xgb_res <- tune_grid(
  xgb_wf,
  resamples = f24_folds,
  grid = grid,
  control = control_grid(save_pred = TRUE, extract = I)
)
This generates the intervals, which we can extract into an object named test_set_intervals holding the intervals and the predicted values:
best_rmse <- select_best(xgb_res, "rmse")
conf_cv_res <- int_conformal_cv(xgb_res, parameters = best_rmse)
final_xgb <- finalize_workflow(
xgb_wf,
best_rmse
)
final_res <- last_fit(final_xgb, fc24_splits)
test_set_pred <- collect_predictions(final_res)
test_set_intervals <- predict(conf_cv_res, fc24_testing)
test_set_intervals
## # A tibble: 566 × 3
## .pred_lower .pred .pred_upper
## <dbl> <dbl> <dbl>
## 1 14.3 14.5 14.6
## 2 14.4 14.6 14.7
## 3 15.5 15.7 15.9
## 4 15.0 15.1 15.3
## 5 13.6 13.8 14.0
## 6 15.3 15.5 15.6
## 7 15.6 15.8 15.9
## 8 16.6 16.8 17.0
## 9 12.8 13.0 13.1
## 10 15.4 15.6 15.8
## # ℹ 556 more rows
The results are visualised below in an interactive manner such that you can explore the data points freely:
Figure 13. Intervals for predicted valuation.
Given that the actual outcome values are readily available, the coverage can be computed. It should be around 95%; let’s check:
coverage <- function(x) {
x %>%
mutate(in_bound = .pred_lower <= value_eur & .pred_upper >= value_eur) %>%
summarise(coverage = mean(in_bound) * 100)
}
coverage(test_set_intervals %>%
cbind(fc24_testing) %>%
select(.pred_lower, .pred, .pred_upper, value_eur))
## coverage
## 1 94.34629
It’s around 94.35%! What about width?
width <- function(x) {
  x %>%
    mutate(interval_width = .pred_upper - .pred_lower) %>%
    summarise(width = mean(interval_width) * 100) # mean width on the log scale, x100
}
width(test_set_intervals)
## # A tibble: 1 × 1
## width
## <dbl>
## 1 34.0
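Since the model works on the log scale, the reported width of 34.0 corresponds to a mean interval width of 0.34 in log-euros; exponentiating converts it to a multiplicative factor on the euro scale:

```r
# A log-scale width of 0.34 means the upper bound is about
# exp(0.34) times the lower bound on the euro scale
exp(0.34)
#> [1] 1.404948
```

So on average the upper bound is roughly 1.4 times the lower bound.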
Let’s try to generate intervals with adaptive widths. This can be done using conformalised quantile regression; this implementation uses random forests for the quantile regression step:
final_res %>% extract_workflow() -> xgb_fit
quant_int <-
probably::int_conformal_quantile(
xgb_fit,
train_data = fc24_training,
cal_data = cal,
level = 0.95,
ntree = 2000)
quant_int
Next, we generate predictions for the test set and bind them to the actual values:
test_quant_res <-
predict(quant_int, fc24_testing) %>%
bind_cols(fc24_testing)
We can see the predictions and intervals:
test_quant_res %>%
select(.pred_lower, .pred, .pred_upper, value_eur)
## # A tibble: 566 × 4
## .pred_lower .pred .pred_upper value_eur
## <dbl> <dbl> <dbl> <dbl>
## 1 14.4 14.4 15.5 14.5
## 2 14.5 14.5 15.6 14.3
## 3 14.8 15.6 16.9 15.3
## 4 14.8 15.2 16.2 14.7
## 5 13.4 13.8 14.8 13.8
## 6 14.5 15.5 16.6 15.6
## 7 15.3 15.7 16.7 15.6
## 8 14.8 16.8 17.1 16.8
## 9 12.4 13.0 14.5 12.5
## 10 15.3 15.6 16.4 15.6
## # ℹ 556 more rows
Finally we visualise the results:
Figure 14. Quantile regression forest intervals for predicted valuation.
Does the random forest quantile regression lead to an increase in coverage?
coverage(test_quant_res)
## # A tibble: 1 × 1
## coverage
## <dbl>
## 1 96.6
It does! What about width?
width(test_quant_res)
## # A tibble: 1 × 1
## width
## <dbl>
## 1 76.6
It looks like we are trading wider intervals for better coverage relative to the previous method. This matters because interval width will affect any bargaining strategy based on the predictions.
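The trade-off can be made concrete by converting both mean widths from the log scale to multiplicative factors on the euro scale:

```r
# Mean interval width as a ratio of upper to lower bound (euro scale);
# the widths 34.0 and 76.6 are the x100 log-scale values reported above
exp(0.340) # first method: upper bound ~1.40x the lower bound
exp(0.766) # quantile regression forest: upper bound ~2.15x the lower bound
```

The quantile regression forest intervals are considerably wider on average, which is the price paid for the higher coverage.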
This project explored a branch of sports analytics concerned with football player valuation. The dataset contains data from the top five leagues in Europe. After a brief EDA, the data was modeled to predict player valuation in euros. Finally, model diagnostic plots were explored, allowing the reader to make their own using an interactive plotting tool.
Future analyses may wish to expand the scope of the modeling, that is, to broaden the selection of leagues included in the data. Similarly, NA values for the core player attributes were omitted, which means goalkeepers were excluded from the process; a future opportunity would be to include those entries. Finally, a few things could be done differently from a pre-processing and model-tuning point of view. For example, more variables could be included, and pre-processing steps such as PCA could be used for dimensionality reduction. In terms of tuning, different hyperparameters could be tuned, potentially resulting in better-performing models.