Superbowl

Data Understanding

This report is concerned with Superbowl ad YouTube data from the years 2000 up until 2020. The data contain a set of metrics for each brand that featured an advert, including view count, like count, dislike count and comment count. The table below summarises these across all the years:

# Summarise total engagement metrics per brand
youtube %>% 
  group_by(brand) %>% 
  summarise(`Total Views` = sum(view_count, na.rm = TRUE),
            `Total Likes` = sum(like_count, na.rm = TRUE),
            `Total Dislikes` = sum(dislike_count, na.rm = TRUE),
            `Total Comments` = sum(comment_count, na.rm = TRUE)) %>% 
  ungroup() %>% 
  # Format counts with thousands separators for display
  mutate(across(where(is.numeric), scales::comma)) %>% 
  rename(Brand = brand) %>% 
  gt() %>% 
  tab_header(
    title = md("**Superbowl Ad Youtube Metrics by Brand**"),
    subtitle = "From 2000-2020"
  )
Superbowl Ad Youtube Metrics by Brand
From 2000-2020
Brand Total Views Total Likes Total Dislikes Total Comments
Bud Light 13,933,757 104,380 13,731 8,571
Budweiser 37,990,605 88,765 18,235 11,881
Coca-Cola 32,377,767 160,231 46,297 2,647
Doritos 196,668,157 326,151 97,146 3,659
E-Trade 1,739,081 2,624 133 312
Hynudai 1,050,352 4,204 289 334
Kia 419,733 2,127 63 150
NFL 36,880,180 224,263 10,541 12,504
Pepsi 2,950,720 14,796 726 1,287
Toyota 1,135,190 5,316 385 533

It seems that Doritos is the best performing brand across all metrics except total comments, where the NFL reigns supreme. It might be interesting to do a further deep dive into the data, focusing on the number of likes relative to the total views for the year 2020. This is shown in the table below:

# Like-to-view ratio per ad for 2020
youtube %>% filter(year == 2020) %>%
  select(brand, title, where(is.numeric)) %>%
  na.omit() %>%
  group_by(brand, title) %>%
  summarise(l_to_v_ratio = like_count / view_count) %>%
  mutate(label = scales::percent(l_to_v_ratio, accuracy = 0.01)) %>% 
  ungroup() -> tt
# Create DF for table
tt %>% 
  select(brand,l_to_v_ratio) %>% 
  rename(Brand = brand,
         "Like to views ratio"= l_to_v_ratio)->for_table

# Create Table
reactable(
    for_table,
    pagination = FALSE,
    defaultColDef = colDef(
        cell = data_bars(for_table, 
                         number_fmt = scales::percent_format(accuracy = 0.01))
    )
)
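As a quick sanity check on the table above, a like-to-view ratio can be recomputed directly from the raw counts in base R, here using the counts reported for the Bud Light and Kia ads:

```r
# Like-to-view ratio = likes / views, expressed as a percentage
ratio_pct <- function(likes, views) round(likes / views * 100, 2)

ratio_pct(485, 47752)   # Bud Light (Post Malone ad): 1.02
ratio_pct(78, 17892)    # Kia: 0.44
```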

Bud Light has the highest like-to-view ratio for the year 2020. This advert featured Post Malone and had a total of 47,752 views and 485 likes. In contrast, Kia had the lowest like-to-view ratio, with 17,892 views and 78 likes. It might be interesting to look at the fluctuation of views over time; this is shown in the table below:

youtube %>% 
  select(brand,year,view_count) %>%
  rename(Brand = brand) %>% 
  group_by(Brand,year) %>% 
  summarise(Views=sum(view_count,na.rm = TRUE)) %>% 
  ungroup() %>% 
  mutate(year = paste0("'", str_sub(year, 3, 4))) %>% 
  pivot_wider(names_from = year,values_from = Views,values_fill=0) %>% 
  select(Brand, `'00`:`'14`, `'15`, `'16`, `'17`, `'18`, `'19`, `'20`) -> heatmap

reactable(
  heatmap,
  compact = TRUE,
  pagination = FALSE,
  showSortIcon = FALSE,
  defaultSorted = "'00",
  defaultSortOrder = 'desc',
  defaultColDef = colDef(
    maxWidth = 60,
    align = 'center',
    cell = tooltip(number_fmt = scales::comma),
    style = color_scales(heatmap, show_text = FALSE, span = TRUE,colors = viridis::viridis(2))
  ),
  columns = list(
    Brand = colDef(
      maxWidth = 175,
      align = 'left'
    )
  )
) %>% 
  add_title('Total Views Per Year')

Total Views Per Year

Doritos in 2012 and Budweiser in 2017 stand out with particularly strong viewership numbers.

Exploring variable relationships

In order to perform modelling on the dataset, we should become familiar with the relationships between the variables; for instance, are like count and view count positively or negatively correlated? The figure below explores these relationships:

youtube %>% 
  select(where(is.numeric),-X,-category_id,-favorite_count) %>% 
  na.omit() %>% 
  cor() -> res

corrplot(res, type = "upper", order = "hclust", 
         tl.col = "black", tl.srt = 45)

All the numeric variables in our data set are positively correlated, albeit to different extents. Consider the relationship between dislike count and year: the correlation is very small. In contrast, like count and view count are very strongly positively correlated, implying that as the number of views increases, so too does the number of likes.
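To make the correlation figures concrete, here is a minimal base-R illustration, on toy vectors rather than the ad data, of how `cor()` reports a strong positive linear relationship:

```r
# Hypothetical view and like counts where likes track views closely
views <- c(10000, 50000, 200000, 1000000, 5000000)
likes <- c(150, 600, 2500, 11000, 52000)   # roughly 1% of views

cor(views, likes)   # close to 1: likes rise almost in lockstep with views
```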

Does like count increase with view count?

The relationship between likes and views can be further explored by plotting the two variables, as shown in the figure below:

youtube %>% 
  ggplot(aes(like_count,view_count, color=brand))+
  geom_point(na.rm = TRUE)+
  geom_smooth(na.rm = TRUE, se=FALSE)+
  scale_y_continuous(labels=scales::number_format())+
  scale_x_continuous(labels=scales::number_format())+
  scale_color_manual(values = my_pal)+
  theme_minimal() +
  labs(color="Brand")+
  ylab("View Count")+
  xlab("Like Count")

Although this gives us a grasp of the direction of the correlation, given the very different scales at which the different brands perform, it makes more sense to explore the relationship on a log scale:

youtube %>% 
  ggplot(aes(like_count,view_count, color=brand))+
  geom_point(na.rm = TRUE)+
  geom_smooth(na.rm = TRUE, se=FALSE)+
  scale_y_log10(labels = scales::comma)+
  scale_x_log10(labels = scales::comma)+
  scale_color_manual(values = my_pal)+
  theme_minimal() +
  theme(legend.title = element_blank(),
        plot.title.position = 'plot')+
  ylab("View Count")+
  xlab("Like Count")+
  labs(title = "Relationship between views and likes (log scale)")

Looking at our data on a log scale makes it easier to examine the relationship and compare between the different brands. Unsurprisingly, the slope for the Doritos brand still stands out as one of the steepest. It might be worth examining the log transformation of the view count variable, in order to gain a better understanding of how this affects the distribution; this is done in the figure below:

library(highcharter, warn.conflicts = FALSE)
options(highcharter.summarise.inform = FALSE)
hchart(log(youtube$view_count), breaks = 20, name = "log(view count)", color = "midnightblue")
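The effect of the log transform on a right-skewed distribution can also be checked numerically. A small simulation, assuming for illustration that view counts are roughly log-normal:

```r
set.seed(123)
views_sim <- rlnorm(1000, meanlog = 10, sdlog = 1.5)  # right-skewed counts

# Simple moment-based skewness estimate
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

skewness(views_sim)        # strongly positive (long right tail)
skewness(log(views_sim))   # close to zero after the log transform
```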

We can now proceed to capture the variance of the view count with a few of the variables. The summary below outlines the independent variables and the model fit, including the R² value.

youtube %>% 
  select(brand, 
         funny,
         show_product_quickly,
         patriotic,
         celebrity, 
         danger, 
         animals,
         use_sex,
         view_count,
         like_count, 
         dislike_count) -> for_modelling

# Define recipe:
youtube_recipe1 <- recipe(view_count~.,data=for_modelling) %>% 
  step_log(all_numeric(),signed = TRUE)
# Specify model:
lm_mod1 <- linear_reg() %>% 
  set_engine("lm") %>% 
  set_mode("regression")
# specify workflow:
youtube_workflow1  <- workflow() %>% 
  add_model(lm_mod1) %>% 
  add_recipe(youtube_recipe1)

youtube_workflow1 %>% 
 fit(data=youtube) -> res1
res1 %>% 
  extract_fit_engine() %>% 
  summary()

Call:
stats::lm(formula = ..y ~ ., data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.1713 -0.4731 -0.0319  0.5345  1.8772 

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)    
(Intercept)               5.8695938  0.2348259  24.996   <2e-16 ***
brandBudweiser           -0.1981327  0.1968346  -1.007   0.3153    
brandCoca-Cola           -0.2307533  0.2391720  -0.965   0.3358    
brandDoritos             -0.2775058  0.2178496  -1.274   0.2042    
brandE-Trade             -0.3692176  0.2713491  -1.361   0.1751    
brandHynudai             -0.4576060  0.2216007  -2.065   0.0402 *  
brandKia                 -0.5392921  0.2838512  -1.900   0.0588 .  
brandNFL                 -0.4830126  0.3469425  -1.392   0.1654    
brandPepsi               -0.4037036  0.2162293  -1.867   0.0633 .  
brandToyota              -0.7102959  0.2977392  -2.386   0.0180 *  
funnyTRUE                 0.2567958  0.1510263   1.700   0.0906 .  
show_product_quicklyTRUE -0.0720586  0.1289002  -0.559   0.5768    
patrioticTRUE            -0.0007454  0.1750340  -0.004   0.9966    
celebrityTRUE            -0.3458089  0.1380389  -2.505   0.0130 *  
dangerTRUE               -0.0462742  0.1297755  -0.357   0.7218    
animalsTRUE              -0.1247735  0.1239087  -1.007   0.3151    
use_sexTRUE               0.2345903  0.1379419   1.701   0.0905 .  
like_count                0.9861287  0.0507802  19.420   <2e-16 ***
dislike_count             0.0668984  0.0593417   1.127   0.2609    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.832 on 206 degrees of freedom
  (22 observations deleted due to missingness)
Multiple R-squared:  0.9176,    Adjusted R-squared:  0.9104 
F-statistic: 127.4 on 18 and 206 DF,  p-value: < 2.2e-16

We can easily filter for the variables with a p-value of 0.05 or less by passing the model summary to the tidy() function:

res1 %>% 
  extract_fit_engine() %>% 
  summary() %>% 
  tidy() %>% 
  filter(p.value <= 0.05)
# A tibble: 5 × 5
  term          estimate std.error statistic  p.value
  <chr>            <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)      5.87     0.235      25.0  2.67e-64
2 brandHynudai    -0.458    0.222      -2.07 4.02e- 2
3 brandToyota     -0.710    0.298      -2.39 1.80e- 2
4 celebrityTRUE   -0.346    0.138      -2.51 1.30e- 2
5 like_count       0.986    0.0508     19.4  1.96e-48

According to our model output, the only variables with a statistically significant effect on view count are the Hyundai and Toyota brands, whether a celebrity was used in the ad, and the like count. All of these variables except like count have a negative effect on view count.
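Because both view count and like count enter the model on the log scale, the like_count coefficient of roughly 0.99 can be read as an elasticity: a 1% increase in likes is associated with approximately a 0.99% increase in views. A quick check of that interpretation:

```r
beta <- 0.986   # like_count coefficient from the model summary above

# In a log-log model, multiplying likes by 1.01 multiplies views by 1.01^beta
pct_change_views <- (1.01^beta - 1) * 100
round(pct_change_views, 2)   # ~0.99, i.e. almost one-for-one on the percentage scale
```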

Another model can be explored; in this case it might be interesting to examine the interaction between like count and brand. The model summary is shown below:

# Workflow with interaction term:
# Define recipe (note: unlike the first recipe, the counts are left on the raw scale here):
youtube_recipe2 <- recipe(view_count~.,data=for_modelling) %>% 
  step_dummy(brand) %>% 
  step_interact(terms = ~ like_count:starts_with("brand"))
# Specify model:
lm_mod2 <- linear_reg() %>% 
  set_engine("lm") %>% 
  set_mode("regression")
# specify workflow:
youtube_workflow  <- workflow() %>% 
  add_model(lm_mod2) %>% 
  add_recipe(youtube_recipe2)

youtube_workflow %>% 
 fit(data=youtube) -> res2

res2 %>% 
  extract_fit_engine() %>% 
  summary()

Call:
stats::lm(formula = ..y ~ ., data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-5460552   -98581     5458   103160  2780431 

Coefficients:
                               Estimate Std. Error t value Pr(>|t|)    
(Intercept)                   1.973e+05  1.534e+05   1.286  0.19980    
funnyTRUE                     5.446e+02  1.107e+05   0.005  0.99608    
show_product_quicklyTRUE     -1.251e+05  9.281e+04  -1.348  0.17934    
patrioticTRUE                 2.386e+05  1.241e+05   1.922  0.05608 .  
celebrityTRUE                -3.507e+03  9.898e+04  -0.035  0.97177    
dangerTRUE                    3.933e+04  9.230e+04   0.426  0.67047    
animalsTRUE                  -1.311e+05  8.833e+04  -1.484  0.13946    
use_sexTRUE                   8.681e+04  9.866e+04   0.880  0.37997    
like_count                    1.647e+01  7.246e+00   2.273  0.02411 *  
dislike_count                 4.804e+02  2.667e+01  18.011  < 2e-16 ***
brand_Budweiser              -2.393e+05  1.417e+05  -1.689  0.09280 .  
brand_Coca.Cola               6.405e+04  1.744e+05   0.367  0.71379    
brand_Doritos                -3.368e+05  1.509e+05  -2.232  0.02677 *  
brand_E.Trade                -1.751e+05  2.191e+05  -0.799  0.42514    
brand_Hynudai                -1.369e+05  1.744e+05  -0.785  0.43334    
brand_Kia                    -1.895e+05  2.711e+05  -0.699  0.48540    
brand_NFL                     6.765e+04  2.492e+05   0.271  0.78632    
brand_Pepsi                  -1.349e+05  1.612e+05  -0.837  0.40367    
brand_Toyota                 -1.290e+04  2.866e+05  -0.045  0.96415    
like_count_x_brand_Budweiser  3.459e+02  1.353e+01  25.560  < 2e-16 ***
like_count_x_brand_Coca.Cola  2.704e+01  1.018e+01   2.657  0.00852 ** 
like_count_x_brand_Doritos    4.621e+02  8.634e+00  53.517  < 2e-16 ***
like_count_x_brand_E.Trade    7.254e+02  4.764e+02   1.523  0.12946    
like_count_x_brand_Hynudai    1.745e+02  3.803e+02   0.459  0.64687    
like_count_x_brand_Kia        1.635e+02  1.076e+03   0.152  0.87939    
like_count_x_brand_NFL        1.134e+02  7.715e+00  14.700  < 2e-16 ***
like_count_x_brand_Pepsi      7.472e+01  9.710e+01   0.770  0.44248    
like_count_x_brand_Toyota    -2.643e+01  3.636e+02  -0.073  0.94214    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 585300 on 197 degrees of freedom
  (22 observations deleted due to missingness)
Multiple R-squared:  0.998, Adjusted R-squared:  0.9977 
F-statistic:  3555 on 27 and 197 DF,  p-value: < 2.2e-16

Comparing the two models

In order to determine which model should be used, we need to understand which one performs better. An ANOVA can be used to test whether the more complex model captures more of the variance than the simpler one:

# Interaction model (model 2)
res2 %>% 
  extract_fit_engine() -> y
# Baseline model (model 1)
res1 %>%
  extract_fit_engine() -> x
# ANOVA test of variance:
anova(x, y)
Analysis of Variance Table

Model 1: ..y ~ brand + funny + show_product_quickly + patriotic + celebrity + 
    danger + animals + use_sex + like_count + dislike_count
Model 2: ..y ~ funny + show_product_quickly + patriotic + celebrity + 
    danger + animals + use_sex + like_count + dislike_count + 
    brand_Budweiser + brand_Coca.Cola + brand_Doritos + brand_E.Trade + 
    brand_Hynudai + brand_Kia + brand_NFL + brand_Pepsi + brand_Toyota + 
    like_count_x_brand_Budweiser + like_count_x_brand_Coca.Cola + 
    like_count_x_brand_Doritos + like_count_x_brand_E.Trade + 
    like_count_x_brand_Hynudai + like_count_x_brand_Kia + like_count_x_brand_NFL + 
    like_count_x_brand_Pepsi + like_count_x_brand_Toyota
  Res.Df        RSS Df   Sum of Sq F Pr(>F)
1    206 1.4300e+02                        
2    197 6.7492e+13  9 -6.7492e+13         
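For reference, this is what an ANOVA comparison of properly nested models (same response, one model a subset of the other) looks like, using simulated data rather than the ad data:

```r
set.seed(1)
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
d$y <- 1 + 2 * d$x1 + rnorm(100)   # x2 has no real effect

m_small <- lm(y ~ x1, data = d)
m_big   <- lm(y ~ x1 + x2, data = d)

# Nested comparison: an F statistic and p-value are reported for the added term
anova(m_small, m_big)
```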

Notice that the ANOVA table reports no F statistic or p-value, and a negative sum of squares: the two models are not directly comparable, because the first is fit on log-transformed counts while the second is fit on the raw counts. On this evidence there is no indication that the more complex interaction model is better, so there is no point in using it. Similarly, we can compare the models using Akaike's Information Criterion (AIC); the lower the AIC, the better the model. The results are shown below:

AIC(x,y)
  df      AIC
x 20  575.899
y 29 6642.582

Since the first model has a much lower AIC, it is preferred, although strictly speaking AIC values are not comparable across models fit to differently transformed responses. We can proceed to generate bootstrap intervals for that model; the confidence intervals are outlined below:

# Bake data set:
youtube_bootstrap <- recipe(view_count~.,data=for_modelling) %>% 
  step_log(all_numeric(),signed=TRUE) %>% 
  prep() %>% 
  bake(new_data=NULL)
# generate bootstrapped intervals
doParallel::registerDoParallel()
reg_intervals(view_count~., 
              data = youtube_bootstrap,
              times=1000) -> bootstrapped_model
bootstrapped_model 
# A tibble: 18 × 6
   term                      .lower .estimate   .upper .alpha .method  
   <chr>                      <dbl>     <dbl>    <dbl>  <dbl> <chr>    
 1 animalsTRUE              -0.358   -0.129    0.117     0.05 student-t
 2 brandBudweiser           -0.631   -0.183    0.180     0.05 student-t
 3 brandCoca-Cola           -0.679   -0.226    0.191     0.05 student-t
 4 brandDoritos             -0.657   -0.275    0.0930    0.05 student-t
 5 brandE-Trade             -1.03    -0.359    0.329     0.05 student-t
 6 brandHynudai             -0.929   -0.456    0.00965   0.05 student-t
 7 brandKia                 -1.04    -0.544   -0.0343    0.05 student-t
 8 brandNFL                 -1.05    -0.478    0.0230    0.05 student-t
 9 brandPepsi               -0.823   -0.412    0.0195    0.05 student-t
10 brandToyota              -1.32    -0.703   -0.205     0.05 student-t
11 celebrityTRUE            -0.596   -0.345   -0.112     0.05 student-t
12 dangerTRUE               -0.289   -0.0485   0.188     0.05 student-t
13 dislike_count            -0.0710   0.0665   0.194     0.05 student-t
14 funnyTRUE                -0.0458   0.260    0.557     0.05 student-t
15 like_count                0.872    0.986    1.12      0.05 student-t
16 patrioticTRUE            -0.386   -0.00938  0.376     0.05 student-t
17 show_product_quicklyTRUE -0.320   -0.0712   0.187     0.05 student-t
18 use_sexTRUE              -0.0299   0.233    0.517     0.05 student-t

The results from the bootstrap further validate the previous findings; one key distinction is that Kia, rather than Hyundai, has a statistically significant effect on view count (Kia's interval excludes zero, while Hyundai's does not).
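The mechanics behind reg_intervals() can be sketched in base R: resample the rows with replacement, refit the model, and take intervals over the resampled coefficients. A simplified percentile-interval illustration on simulated data (rsample itself reports student-t intervals):

```r
set.seed(2022)
n <- 200
x <- rnorm(n)
d <- data.frame(x = x, y = 3 * x + rnorm(n))   # true slope = 3

# Refit the slope on 1000 bootstrap resamples of the rows
boot_slopes <- replicate(1000, {
  idx <- sample(n, replace = TRUE)
  coef(lm(y ~ x, data = d[idx, ]))[["x"]]
})

# Percentile 95% interval; an interval excluding zero suggests a significant effect
quantile(boot_slopes, c(0.025, 0.975))
```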

Conclusion, Recommendations and Future Directions

Key takeaways

This report explored data pertaining to Superbowl ad performance on a variety of metrics. A few visualisations were provided to elaborate on the fluctuation of these metrics over time. Thereafter, the relationships between the variables were explored, with an emphasis on view count and like count. Finally, modelling was performed to examine which variables had a statistically significant effect on view count: the Toyota and Hyundai brands, along with the use of a celebrity and the like count, appeared significant. A bootstrap exercise showed that Kia, rather than Hyundai, has a statistically significant effect on view count. Given that the bootstrap results differ slightly from the non-bootstrapped ones, it is recommended that a more granular analysis considering brand effects and their potential interactions with other variables be undertaken.

Recommendations and Future Directions

As per the results, it is recommended not to use celebrities in adverts, as this appears to negatively influence view count. The like count is a strong indicator of view count and should therefore be closely monitored, for example by employing social listening. It is recommended that this research exercise be performed on a longitudinal basis and across a greater range of years, in order to pick out other effects on view count.

Future analysis might involve engaging in predictive analytics with the use of non-parametric statistical techniques, with the aim of predicting viewership numbers.