Predicting Song Genre - Underground Rap or Dark Trap?

Author

Nils Indreiten

Executive Summary

This project explores Spotify song data from Kaggle. First, exploratory data analysis uncovers differences in the musical attributes of different genres. The correlations between the variables in the dataset are also explored, finding that attributes such as acousticness and energy are strongly negatively correlated. To find groups of related songs across the genres, dimensionality reduction with Uniform Manifold Approximation and Projection (UMAP) is used, revealing 4 main song groups when plotted against 2 UMAP components. Second, the dataset is narrowed down to only include underground rap and dark trap songs, with the aim of predicting the genre from the musical attributes. An xgboost model is trained and hyperparameter tuned to predict genre. Finally, the project is concluded and future research directions are recommended.

Exploratory Data Analysis

Let’s get an understanding of the distributions of the musical attributes across the genres:

Show the code
genres_v2 %>% select(where(is.numeric),
                     song_name,
                     -`Unnamed: 0`,
                     genre,
                     -time_signature) -> df
  
df %>% select(danceability,
              energy,
              speechiness,
              acousticness,
              liveness,
              valence,
              genre) %>% pivot_longer(!genre, names_to = "attribute") %>% mutate(attribute = fct_reorder(as.factor(attribute), value)) -> to_plot

to_plot %>% mutate(genre = as_factor(genre)) %>%  ggplot(aes(value, attribute)) +
  geom_boxplot(alpha = 0.2) +
  facet_wrap(vars(genre), ncol = 3) +
  labs(y="")+
  theme_minimal()

Figure 1. Difference in musical attributes across genres.

There are a few genres with particularly low values of acousticness; these include psytrance, trance, and techno. In terms of speechiness, rap, underground rap, and hip hop have higher average values compared to other genres. In contrast, when it comes to liveness, hardstyle and trance distinguish themselves as having higher median values. The distribution of valence is wide across the genres in the sample; however, hip hop, technohouse, and pop stand out as having higher median values relative to the other genres. The distinction between the danceability of genres is less clear, but underground rap, technohouse, and rap have median values above 0.75. Finally, hardstyle, psytrance, and trap distinguish themselves as more energetic relative to other genres.
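
These reads come from the boxplots alone, so it is worth backing at least one of them with numbers. A minimal check of the danceability claim, reusing the df built above:

Show the code
# Median danceability by genre, highest first
df %>%
  group_by(genre) %>%
  summarise(median_danceability = median(danceability, na.rm = TRUE)) %>%
  arrange(desc(median_danceability))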

It would be interesting to see how the variables are correlated with each other; for example, is the duration of a song strongly correlated with liveness?

Show the code
df %>% 
  select(-song_name, -genre) %>% cor() %>% 
  corrplot(method = "square")

Figure 2. Correlation between musical attributes.

The figure above shows that duration and instrumentalness are strongly positively correlated; in other words, the longer a song is, the more instrumental it tends to be. Similarly, loudness and energy are strongly positively correlated, suggesting that louder songs have more energy! Focusing on the negatively correlated variables, unsurprisingly acousticness and energy are strongly negatively correlated. Instrumentalness and speechiness are also strongly negatively correlated, suggesting that the more instrumental a song is, the fewer lyrics it has.
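
The square plot makes the strongest pairs easy to spot but hard to rank. A short sketch, using only tidyverse functions already loaded, that lists the five most strongly correlated attribute pairs behind Figure 2:

Show the code
df %>%
  select(-song_name, -genre) %>%
  cor() %>%
  as.data.frame() %>%
  rownames_to_column("var1") %>%
  pivot_longer(-var1, names_to = "var2", values_to = "r") %>%
  filter(var1 < var2) %>% # keep each pair once and drop the diagonal
  arrange(desc(abs(r))) %>%
  slice_head(n = 5)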

Next, we consider similarities across the musical attributes of the songs in different genres. We do so by using Uniform Manifold Approximation and Projection (UMAP) to visualise the data relative to two UMAP components.

Show the code
doParallel::registerDoParallel() # Activate parallel computing
set.seed(1234)
# All genres: keep only the first occurrence of each song name to remove duplicates
df %>% 
  na.omit() %>% 
  arrange(song_name) %>% 
  group_by(song_name) %>%
  mutate(number = row_number()) %>% 
  ungroup() %>% 
  filter(number == 1) %>% 
  select(-number) %>% 
  mutate(song_name = as_factor(song_name),
         genre = as_factor(genre)) -> df

all_genre_umap <-
  recipe(genre ~ ., data = df) %>%
  update_role(song_name, genre, new_role = "id") %>% 
  update_role_requirements("id", bake = FALSE) %>% 
  # step_orderNorm(all_numeric_predictors()) %>%
  step_normalize(all_numeric_predictors()) %>% 
  step_umap(all_predictors())

all_genre_umap %>%
  prep() %>%
  bake(new_data = NULL) %>%
  ggplot(aes(UMAP1, UMAP2, label = song_name)) +
  geom_point(aes(color = genre), alpha = 0.7, size = 2) +
  scale_color_brewer(palette = "Spectral")+
  theme_minimal()

Figure 3. UMAP projection of songs in different genres.

The plot above suggests that there may be 4 main groups of songs across all genres under UMAP. The bottom two groups seem to be predominantly made up of dark trap songs, which also suggests that there are differences between songs within that genre. In contrast, there are two larger groups towards the top of the plot. The group on the right-hand side contains a substantial number of rnb songs, with a pocket of emo songs towards its bottom right. The group to the left is far less clear, although we again see a large presence of emo songs; two subgroups seem to be forming from it, if only ever so slightly, with a concentration of emo songs across the middle.
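
The four groups above are read off the plot by eye. To make the grouping concrete, one could cluster the two UMAP components directly; a minimal sketch, where k = 4 is an assumption taken from the plot (note that re-running prep() re-fits UMAP, so the embedding will vary slightly):

Show the code
set.seed(1234)
umap_scores <- all_genre_umap %>% prep() %>% bake(new_data = NULL)

# k-means on the two UMAP components; k = 4 mirrors the visual impression
km <- umap_scores %>%
  select(UMAP1, UMAP2) %>%
  kmeans(centers = 4, nstart = 25)

# Which genres dominate each cluster?
umap_scores %>%
  mutate(cluster = factor(km$cluster)) %>%
  count(cluster, genre, sort = TRUE)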

Predictive Analytics

This section will focus on underground rap and dark trap songs only. Before creating our predictive model, let’s take a peek at the difference in the danceability of these 2 genres.

Show the code
df %>% filter(genre %in% c("Dark Trap", "Underground Rap")) %>%  
  ggplot(aes(danceability, fill =genre)) + 
  geom_histogram(alpha = 0.8, position = "identity") +
  scale_fill_brewer(palette = "Dark2") +
  scale_y_continuous(expand = c(0.001,0.001))+
  theme_bw()

Figure 4. Distribution of danceability of the dark trap and underground rap genres.

The figure above shows that a higher number of underground rap songs have high values of danceability. Before we begin modeling, let’s create a dataset made up only of underground rap and dark trap, as sketched below.
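
The filtering code itself is not shown here, so below is a minimal sketch of how model_df can be constructed, mirroring the re-factoring used later in the streamlined workflow:

Show the code
df %>%
  filter(genre %in% c("Underground Rap", "Dark Trap")) %>%
  # Re-factor so Dark Trap is the first (event) level:
  mutate(genre = as_factor(case_when(genre == "Dark Trap" ~ "Dark Trap",
                                     TRUE ~ "Underground Rap"))) %>%
  select(-song_name) -> model_df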

Now that we have our filtered dataset model_df, we can proceed to create the train and test datasets. In addition, we also create the cross-validation folds, using 10-fold cross-validation.

Show the code
# Create the splits & folds:
set.seed(1234)
music_split <- initial_split(model_df, strata = genre)
music_train <- training(music_split)
music_test <- testing(music_split)
music_folds <- vfold_cv(music_train, strata = genre)

We are using an xgboost model and will tune the tree_depth, min_n, loss_reduction, sample_size, mtry, and learn_rate hyperparameters. The model specification is outlined below:

Show the code
xgb_spec <- boost_tree(
  trees = 1000,
  tree_depth = tune(),
  min_n = tune(),
  loss_reduction = tune(),
  sample_size = tune(),
  mtry = tune(),
  learn_rate = tune()
) %>%
  set_engine("xgboost") %>%
  set_mode("classification")

For the feature engineering we will normalise all the predictors:

Show the code
# Create the recipe:

music_recipe <- recipe(genre~.,
                       data = model_df) %>% 
  step_normalize(all_numeric_predictors())

music_wf <-  workflow() %>% 
  add_model(xgb_spec) %>% 
  add_recipe(music_recipe)

In order to generate the combinations of hyperparameter values to be tested, we will use a Latin hypercube. This is outlined in the step below:

Show the code
xgb_grid <- grid_latin_hypercube(
  tree_depth(),
  min_n(),
  loss_reduction(),
  sample_size = sample_prop(),
  finalize(mtry(), music_train),
  learn_rate(),
  size = 30
)

We now put all the pieces together: the workflow, the 10-fold cross-validation folds, and the Latin hypercube grid. Finally, we specify that the predictions should be saved:

Show the code
doParallel::registerDoParallel() # Activate parallel computing
set.seed(1234)

xgb_results <-  tune_grid(
  music_wf,
  resamples = music_folds,
  grid = xgb_grid,
  control = control_grid(save_pred = TRUE)
)

The plot below shows how different values of the hyperparameters affect the area under the curve metric:

Show the code
xgb_results %>%
  collect_metrics() %>%
  filter(.metric == "roc_auc") %>%
  select(mean, mtry:sample_size) %>%
  pivot_longer(mtry:sample_size,
               values_to = "value",
               names_to = "parameter"
  ) %>%
  ggplot(aes(value, mean)) +
  geom_point(alpha = 0.8, show.legend = FALSE, color="midnightblue") +
  facet_wrap(~parameter, scales = "free_x") +
  labs(x = NULL, y = "AUC")+
  scale_color_brewer(palette = "Spectral")+
  theme_minimal()

Figure 5. Effect of parameter values on the area under the curve.

There are instances where the area under the curve drops to 0.72 or just below for several parameters: for tree_depth when the value is 3, for mtry when it is 7, and so on. We then select the best model according to the roc_auc metric:

Show the code
# Select best model and finalise workflow:

best_auc_model <- select_best(xgb_results, metric = "roc_auc")

final_xgb <- finalize_workflow(
  music_wf,
  best_auc_model
)

The hyperparameter values for the model with the highest performance are the following:

Show the code
best_auc_model
# A tibble: 1 × 7
   mtry min_n tree_depth learn_rate loss_reduction sample_size .config          
  <int> <int>      <int>      <dbl>          <dbl>       <dbl> <chr>            
1     7     8         13    0.00284    0.000000360       0.329 Preprocessor1_Mo…

The feature importance is shown in the plot below:

Show the code
final_xgb %>%
  fit(data = music_train) %>%
  extract_fit_parsnip() %>%
  vip(geom = "point") +
  theme_minimal()

Figure 6. Importance of model features.

It seems that instrumentalness is the most important feature for predicting whether the genre is dark trap, followed by speechiness and danceability. We fit the model one final time:

Show the code
final_res <- last_fit(final_xgb, music_split)

final_res %>%
  collect_predictions() %>%
  roc_curve(genre, `.pred_Dark Trap`) %>%
  ggplot(aes(x = 1 - specificity, y = sensitivity)) +
  geom_line(linewidth = 1.5, color = "midnightblue") +
  geom_abline(
    lty = 2, alpha = 0.5,
    color = "gray50",
    linewidth = 1.2
  )+theme_minimal()

Figure 7. Area under the curve for final model.

The accuracy of the model is 0.79 and the area under the curve is 0.86.

Show the code
collect_metrics(final_res)
# A tibble: 2 × 4
  .metric  .estimator .estimate .config             
  <chr>    <chr>          <dbl> <chr>               
1 accuracy binary         0.786 Preprocessor1_Model1
2 roc_auc  binary         0.864 Preprocessor1_Model1

It would be useful to look at the SHAP values:

Show the code
music_recipe %>% 
  prep() -> music_prep

music_shap <- shap.prep(
  xgb_model = extract_fit_engine(final_res),
  X_train = bake(
    music_prep,
    has_role("predictor"),
    new_data = music_train,
    composition = "matrix"
  )
)

shap.plot.summary(music_shap)

Figure 8. Global SHAP value summary.

Show the code
music_small <- model_df %>%
  slice_head(n = 10) %>% 
  na.omit()

# Bake the same ten rows so that X_pred and X refer to identical observations
music_small_prep <- bake(
    prep(music_recipe),
    has_role("predictor"),
    new_data = music_small,
    composition = "matrix"
  )

shap <- shapviz(extract_fit_engine(final_res),
                X_pred = music_small_prep,
                X = music_small)

sv_force(shap, row_id = 1)

Figure 9. Local force plot summary.

We can also visualise a confusion matrix:

Show the code
collect_predictions(final_res) %>%
  conf_mat(genre, .pred_class) %>%
  autoplot()+
  theme_minimal()

Figure 10. Confusion matrix of final fit predictions.

At an overall level the model performed well, correctly classifying most songs. However, it misclassified a larger number of dark trap songs as underground rap than underground rap songs as dark trap.
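
To make that asymmetry explicit, the same confusion matrix can be summarised into per-class statistics with yardstick; a short sketch:

Show the code
# Per-class summary statistics derived from the confusion matrix
collect_predictions(final_res) %>%
  conf_mat(genre, .pred_class) %>%
  summary() %>%
  filter(.metric %in% c("accuracy", "sens", "spec"))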

Towards a more streamlined approach

Let’s take a step back and make some adjustments to the workflows above. The UMAP plot is rather convoluted with all the genres included. Let’s repeat the exercise using only the two genres to see if we can pick out any interesting patterns within those two categories:

Show the code
genres_v2 %>% select(where(is.numeric),
                     song_name,
                     -`Unnamed: 0`,
                     genre,
                     -time_signature) -> df

doParallel::registerDoParallel() # Activate parallel computing
set.seed(1234)
# Two genres only: keep one row per song name, then filter
df %>% 
  na.omit() %>% 
  arrange(song_name) %>% 
  group_by(song_name) %>%
  mutate(number = row_number()) %>% 
  ungroup() %>% 
  filter(number == 1) %>% 
  select(-number) %>% 
  mutate(genre = as_factor(genre)) %>%
  filter(genre %in% c("Underground Rap", "Dark Trap")) -> df

sub_genre_umap <-
  recipe(genre ~ ., data = df) %>%
  update_role(song_name, genre, new_role = "id") %>% # keep song_name out of the UMAP predictors
  update_role_requirements("id", bake = FALSE) %>% 
  # step_orderNorm(all_numeric_predictors()) %>%
  step_normalize(all_numeric_predictors()) %>% 
  step_umap(all_predictors())

sub_genre_umap %>%
  prep() %>%
  bake(new_data = NULL) -> tt

tt %>% 
  ggplot(aes(UMAP1, UMAP2)) +
  geom_point(aes(color = genre), alpha = 0.5, size = 2) +
  scale_color_manual(values = c("purple","orange"))+
  theme_minimal()

Figure 11. UMAP components, Dark Trap and Underground Rap only.

Interestingly, if we consider only the two genres, we again see around 4 main clusters of songs when projected across the two UMAP components. The bottom two seem to be predominantly composed of underground rap songs. Let’s look at the distribution of each UMAP component, split by genre:

Show the code
df %>% 
  cbind(tt %>% 
          select(-genre)) %>%
  ggplot(aes(UMAP1, fill =genre)) + 
  geom_histogram(alpha = 0.8, position = "identity",bins=40) +
  scale_fill_brewer(palette = "Dark2") +
  scale_y_continuous(expand = c(0.001,0.001))+
  theme_bw() -> p1

df %>% 
  cbind(tt %>% 
          select(-genre)) %>%
  ggplot(aes(UMAP2, fill =genre)) + 
  geom_histogram(alpha = 0.8, position = "identity",bins=40) +
  scale_fill_brewer(palette = "Dark2") +
  scale_y_continuous(expand = c(0.001,0.001))+
  theme_bw() -> p2

plot_grid(p1, p2, labels = c('UMAP1', 'UMAP2'), label_size = 10,align = "v",ncol = 1,hjust = .007)

Figure 12. UMAP distribution across genres.

It looks like the components are picking up two different types of songs within the two genres: some that are similar and some that are more different. Songs projected along UMAP1 have similar values; these are the songs we see overlapping most clearly in the bottom two clusters. UMAP2 shows some overlap between the songs in the two genres, but each has a long tail in the opposite direction; in fact, more songs in the dark trap genre don’t share characteristics with underground rap than vice versa.

Another adjustment is to rework the workflow using the F-score instead; we will also consider the J-index:

Show the code
df %>%
  filter(genre %in% c("Underground Rap", "Dark Trap")) %>% 
  mutate(genre = as_factor(case_when(genre == "Dark Trap" ~ "Dark Trap", TRUE ~ "Underground Rap")))  %>%  # Re-factor for correct classification. 
  select(-song_name) -> model_df

# Create the splits & folds:
set.seed(1234)
music_split <- initial_split(model_df, strata = genre)
music_train <- training(music_split)
music_test <- testing(music_split)
music_folds <- vfold_cv(music_train, strata = genre)

# Spec
xgb_spec <- boost_tree(
  trees = 1000,
  tree_depth = tune(),
  min_n = tune(),
  loss_reduction = tune(),
  sample_size = tune(),
  mtry = tune(),
  learn_rate = tune()
) %>%
  set_engine("xgboost") %>%
  set_mode("classification")

# Create the recipe:

music_recipe <- recipe(genre~.,
                       data = model_df) %>% 
  step_normalize(all_numeric_predictors())

music_wf <-  workflow() %>% 
  add_model(xgb_spec) %>% 
  add_recipe(music_recipe)


xgb_grid <- grid_latin_hypercube(
  tree_depth(),
  min_n(),
  loss_reduction(),
  sample_size = sample_prop(),
  finalize(mtry(), music_train),
  learn_rate(),
  size = 30
)

doParallel::registerDoParallel() # Activate parallel computing
set.seed(1234)

metrics <- metric_set(f_meas, accuracy)

xgb_results <-  tune_grid(
  music_wf,
  resamples = music_folds,
  grid = xgb_grid,
  control = control_grid(save_pred = TRUE, extract = I),
  metrics = metrics
)

Let’s take a look at how this metric varies across the different hyperparameter values tried during tuning:

Show the code
xgb_results %>%
  collect_metrics() %>%
  filter(.metric == "f_meas") %>%
  select(mean, mtry:sample_size) %>%
  pivot_longer(mtry:sample_size,
               values_to = "value",
               names_to = "parameter"
  ) %>%
  ggplot(aes(value, mean)) +
  geom_point(alpha = 0.8, show.legend = FALSE, color="midnightblue") +
  facet_wrap(~parameter, scales = "free_x") +
  labs(x = NULL, y = "F-Score")+
  scale_color_brewer(palette = "Spectral")+
  theme_minimal()

Figure 13. F-score performance across parameter values.

Finally, let’s select the best model according to the F-score:

Show the code
best_f_model <- select_best(xgb_results, metric = "f_meas")

final_xgb <- finalize_workflow(
  music_wf,
  best_f_model
)

final_res <- fit(final_xgb, music_train)

final_res %>%
  predict(new_data = music_test,type="prob") %>% 
  bind_cols(music_test) -> music_genre_pred

We fitted the model to the training set and then made predictions on the test set. Let’s create a threshold such that if the predicted probability is over a certain value the song is classified as dark trap, and otherwise as underground rap:

Show the code
hard_pred_0.5 <- music_genre_pred %>%
  mutate(
    .pred = make_two_class_pred(
      estimate = `.pred_Dark Trap`, 
      levels = levels(genre), 
      threshold = .5
    )
  ) %>%
  select(genre, contains(".pred"))

hard_pred_0.5 %>% 
  count(.truth = genre, .pred)
# A tibble: 4 × 3
  .truth                    .pred     n
  <fct>                <clss_prd> <int>
1 Dark Trap             Dark Trap   768
2 Dark Trap       Underground Rap   288
3 Underground Rap       Dark Trap   172
4 Underground Rap Underground Rap   931

This threshold performs similarly to the first workflow we created. Let’s change the threshold to see how that alters the distribution:

Show the code
hard_pred_0.75 <- music_genre_pred %>%
  mutate(
    .pred = make_two_class_pred(
      estimate = `.pred_Dark Trap`, 
      levels = levels(genre), 
      threshold = .75
    )
  ) %>%
  select(genre, contains(".pred"))

hard_pred_0.75 %>% 
  count(.truth = genre, .pred)
# A tibble: 4 × 3
  .truth                    .pred     n
  <fct>                <clss_prd> <int>
1 Dark Trap             Dark Trap   547
2 Dark Trap       Underground Rap   509
3 Underground Rap       Dark Trap    38
4 Underground Rap Underground Rap  1065

This threshold classifies the underground rap songs better, but not the dark trap songs, almost splitting that sample in two. There is a tradeoff, which can be explored with the sensitivity and specificity metrics:

Show the code
library(yardstick)

sens(hard_pred_0.5,genre,.pred)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 sens    binary         0.727
Show the code
spec(hard_pred_0.5, genre, .pred)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 spec    binary         0.844
Show the code
sens(hard_pred_0.75, genre, .pred)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 sens    binary         0.518
Show the code
spec(hard_pred_0.75, genre, .pred)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 spec    binary         0.966

In this example, changing the threshold increased specificity at the expense of sensitivity. We can use the j_index to try to find the combination of the metrics that captures this tradeoff:

Show the code
j_index(hard_pred_0.5,genre,.pred)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 j_index binary         0.571
Show the code
j_index(hard_pred_0.75,genre,.pred)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 j_index binary         0.484

The J-index is defined as sensitivity + specificity − 1, so its maximum value is one, achieved when there are no false positives and no false negatives. If an increase in the threshold trades sensitivity for specificity, the index tells us whether the trade was worthwhile. In this case it drops from 0.571 to 0.484, so the change in threshold doesn’t provide any real benefit.
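
Since the index is just sensitivity + specificity − 1, the two values above can be verified directly from the estimates reported earlier:

Show the code
# J = sensitivity + specificity - 1
0.727 + 0.844 - 1 # threshold 0.50 -> 0.571
0.518 + 0.966 - 1 # threshold 0.75 -> 0.484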

Show the code
threshold_data <- music_genre_pred %>%
  threshold_perf(genre, `.pred_Dark Trap`, thresholds = seq(0.5, 1, by = 0.0025))

threshold_data %>%
  filter(.threshold %in% c(0.5, 0.6, 0.7))
# A tibble: 9 × 4
  .threshold .metric     .estimator .estimate
       <dbl> <chr>       <chr>          <dbl>
1        0.5 sensitivity binary         0.727
2        0.6 sensitivity binary         0.647
3        0.7 sensitivity binary         0.567
4        0.5 specificity binary         0.844
5        0.6 specificity binary         0.898
6        0.7 specificity binary         0.945
7        0.5 j_index     binary         0.571
8        0.6 j_index     binary         0.545
9        0.7 j_index     binary         0.512
Show the code
threshold_data <- threshold_data %>%
  filter(.metric != "distance") %>%
  mutate(group = case_when(
    # threshold_perf() labels these metrics "sensitivity"/"specificity"
    .metric %in% c("sensitivity", "specificity") ~ "1",
    TRUE ~ "2"
  ))

max_j_index_threshold <- threshold_data %>%
  filter(.metric == "j_index") %>%
  filter(.estimate == max(.estimate)) %>%
  pull(.threshold)

ggplot(threshold_data, aes(x = .threshold, y = .estimate, color = .metric, alpha = group)) +
  geom_line(linewidth=1) +
  theme_minimal() +
  scale_color_viridis_d(end = 0.9) +
  scale_alpha_manual(values = c(.4, 1), guide = "none") +
  geom_vline(xintercept = max_j_index_threshold, alpha = .6, color = "grey30", linewidth=1) +
  labs(
    x = "'Good' Threshold\n(above this value is considered 'good')",
    y = "Metric Estimate",
    title = "Performance by varying the threshold",
    subtitle = "Sensitivity or specificity alone might not be enough!\nVertical line = Max J-Index (0.51)"
  )

Figure 14. Maximum J-index and the sensitivity-specificity trade-off.

In this case the max J-index value is achieved with a threshold of 0.52.

Conclusion and Future Research Directions

This project explored the musical attributes of a sample of Spotify songs, distinguishing genres from each other. Dimensionality reduction was used to find common groups across genres, unveiling 4 main song cohorts when mapped against 2 UMAP components. The following section focused on predictive analytics, building and tuning a model to predict the genre of a song from its musical attributes. The area under the curve for the final hyperparameter-tuned model was 0.86. We then reworked the workflow, homing in on the two genres of interest, and highlighted the F-score as a better-suited alternative to the area under the curve metric. Finally, we considered how changing the threshold affects sensitivity and specificity, using the J-index as a measure of this tradeoff.

Future efforts may wish to work with workflowsets to screen a larger number of models that may lead to superior performance (a minimal sketch follows below). Similarly, feature engineering could be experimented with to improve model performance. Whilst this project focused only on a two-class problem, there is no reason why a multi-class classification model could not be developed. In addition, generating confidence intervals for the predicted probability of a class would yield more robust predictions. Finally, the project could be expanded by bringing in chart rating information and building a model to predict song ratings; this would be a regression problem.
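
As a pointer for the workflowsets suggestion, here is a minimal sketch of how several model specifications could be screened against the same recipe. The random forest specification is purely illustrative and assumes the ranger engine is installed:

Show the code
library(workflowsets)

# A hypothetical second model to screen alongside the xgboost spec
rf_spec <- rand_forest(mtry = tune(), min_n = tune(), trees = 1000) %>%
  set_engine("ranger") %>%
  set_mode("classification")

music_set <- workflow_set(
  preproc = list(normalised = music_recipe),
  models = list(xgb = xgb_spec, rf = rf_spec)
)

music_set %>%
  workflow_map("tune_grid",
               resamples = music_folds,
               grid = 10,
               metrics = metric_set(f_meas, roc_auc),
               seed = 1234) %>%
  rank_results(rank_metric = "f_meas")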