Superbowl
Data Understanding
This report is concerned with Superbowl ad YouTube data from the years 2000 up until 2020. The data contains a set of metrics for each brand that featured an advert, including view count, like count, dislike count and comment count. The table below summarises these across all the years:
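For reference, the raw data can be loaded roughly as follows. This is a minimal sketch, and the TidyTuesday URL is an assumption rather than something documented in this report:

library(tidyverse)
# Assumed source: the TidyTuesday Superbowl ads dataset (2021-03-02)
youtube <- read_csv(
"https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-03-02/youtube.csv"
)
glimpse(youtube)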
Show the code
youtube %>%
group_by(brand) %>%
summarise(`Total Views` = sum(view_count,na.rm = TRUE),
`Total Likes` = sum(like_count,na.rm = TRUE),
`Total Dislikes` = sum(dislike_count,na.rm = TRUE),
`Total Comments`= sum(comment_count,na.rm = TRUE)) %>%
ungroup() %>%
mutate_if(is.numeric,scales::comma) %>%
rename(Brand=brand) %>%
gt() %>%
tab_header(
title = md("**Superbowl Ad Youtube Metrics by Brand**"),
subtitle = "From 2000-2022"
)| Superbowl Ad Youtube Metrics by Brand | ||||
| From 2000-2022 | ||||
| Brand | Total Views | Total Likes | Total Dislikes | Total Comments |
|---|---|---|---|---|
| Bud Light | 13,933,757 | 104,380 | 13,731 | 8,571 |
| Budweiser | 37,990,605 | 88,765 | 18,235 | 11,881 |
| Coca-Cola | 32,377,767 | 160,231 | 46,297 | 2,647 |
| Doritos | 196,668,157 | 326,151 | 97,146 | 3,659 |
| E-Trade | 1,739,081 | 2,624 | 133 | 312 |
| Hynudai | 1,050,352 | 4,204 | 289 | 334 |
| Kia | 419,733 | 2,127 | 63 | 150 |
| NFL | 36,880,180 | 224,263 | 10,541 | 12,504 |
| Pepsi | 2,950,720 | 14,796 | 726 | 1,287 |
| Toyota | 1,135,190 | 5,316 | 385 | 533 |
It seems that Doritos is the best performing brand across all metrics except total comments, where the NFL reigns supreme. It might be interesting to dig deeper into the data, focusing on the number of likes relative to the total views for the year 2020. This is shown in the table below:
Show the code
# Like to view ratio
youtube %>% filter(year == 2020) %>%
select(brand, title, where(is.numeric)) %>%
na.omit() %>%
group_by(brand,title) %>%
summarise(l_to_v_ratio =like_count/view_count) %>%
mutate(label=scales::percent(l_to_v_ratio,accuracy = 0.01)) %>%
ungroup() -> tt
# Create DF for table
tt %>%
select(brand,l_to_v_ratio) %>%
rename(Brand = brand,
"Like to views ratio"= l_to_v_ratio)->for_table
# Create Table
reactable(
for_table,
pagination = FALSE,
defaultColDef = colDef(
cell = data_bars(for_table,
number_fmt = scales::percent_format(accuracy = .01))
)
)

Bud Light has the highest like-to-view ratio for the year 2020. This advert featured Post Malone and had a total of 47,752 views and 485 likes. In contrast, Kia had the lowest like-to-view ratio, with 17,892 views and 78 likes. It might be interesting to look at the fluctuation of views over time, which is shown in the table below:
Show the code
youtube %>%
select(brand,year,view_count) %>%
rename(Brand = brand) %>%
group_by(Brand,year) %>%
summarise(Views=sum(view_count,na.rm = TRUE)) %>%
ungroup() %>%
mutate(year = paste0("'", str_sub(year, 3, 4))) %>%
pivot_wider(names_from = year,values_from = Views,values_fill=0) %>%
select(Brand,`'00`:`'14`,`'15`,`'16`,`'17`,`'18`,`'19`,`'20`)->heatmap
reactable(
heatmap,
compact = TRUE,
pagination = FALSE,
showSortIcon = FALSE,
defaultSorted = "'00",
defaultSortOrder = 'desc',
defaultColDef = colDef(
maxWidth = 60,
align = 'center',
cell = tooltip(number_fmt = scales::comma),
style = color_scales(heatmap, show_text = FALSE, span = TRUE,colors = viridis::viridis(2))
),
columns = list(
Brand = colDef(
maxWidth = 175,
align = 'left'
)
)
) %>%
add_title('Total Views Per Year')
Doritos in 2012 and Budweiser in 2017 stand out as brand-years with particularly strong viewership.
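This can be checked directly from the data; a small sketch (using the same youtube data frame) that pulls the top brand-year combinations by total views:

# Top brand-year combinations by total views
youtube %>%
group_by(brand, year) %>%
summarise(Views = sum(view_count, na.rm = TRUE), .groups = "drop") %>%
slice_max(Views, n = 2)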
Exploring variable relationships
In order to perform modelling on the dataset, we should become familiar with the relationships between the variables. For instance, are like count and view count positively or negatively correlated? The figure below explores these relationships:
Show the code
youtube %>%
select(where(is.numeric),-X,-category_id,-favorite_count) %>%
na.omit() %>%
cor() -> res
corrplot(res, type = "upper", order = "hclust",
tl.col = "black", tl.srt = 45) -> p
All the numeric variables in our data set are positively correlated, albeit to different extents. Consider the relationship between dislike count and year: the correlation is very small. In contrast, like count and view count are very strongly positively correlated, implying that as the number of views increases, so too does the number of likes.
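As a quick sanity check on the strongest relationship in the plot, the like/view correlation can also be computed directly; a minimal sketch:

# Pairwise correlation between likes and views, ignoring missing values
cor(youtube$like_count, youtube$view_count, use = "complete.obs")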
Does like count increase with view count?
The relationship between likes and views can be further explored by plotting the two variables; this is shown in the figure below:
Show the code
youtube %>%
ggplot(aes(like_count,view_count, color=brand))+
geom_point(na.rm = TRUE)+
geom_smooth(na.rm = TRUE, se=FALSE)+
scale_y_continuous(labels=scales::number_format())+
scale_x_continuous(labels=scales::number_format())+
scale_color_manual(values = my_pal)+
theme_minimal() +
labs(color="Brand")+
ylab("View Count")+
xlab("Like Count")
Although we can get a grasp of the direction of the correlation, given the very different scales at which the brands perform it makes more sense to explore the relationship on a log scale:
Show the code
youtube %>%
ggplot(aes(like_count,view_count, color=brand))+
geom_point(na.rm = TRUE)+
geom_smooth(na.rm = TRUE, se=FALSE)+
scale_y_log10(labels = scales::comma)+
scale_x_log10(labels = scales::comma)+
scale_color_manual(values = my_pal)+
theme_minimal() +
theme(legend.title = element_blank(),
plot.title.position = 'plot')+
ylab("View Count")+
xlab("Like Count")+
labs(title = "Relationship between views and likes (log scale)")
Looking at our data on a log scale makes it easier to examine the relationship and to compare the different brands. Unsurprisingly, the slope for the Doritos brand still stands out as one of the steepest. It might be worth examining the log transformation of the view count variable on its own, in order to gain a better understanding of how this affects the distribution; this is done in the figure below:
Show the code
library(highcharter, warn.conflicts = FALSE)
Show the code
options(highcharter.summarise.inform = FALSE)
hchart(log(youtube$view_count),breaks = 20, name="log(view count)", fill="midnightblue")

We can now attempt to capture the variance in view count using a few of the variables. The model summary below outlines the independent variables and the fit's R² value.
Show the code
youtube %>%
select(brand,
funny,
show_product_quickly,
patriotic,
celebrity,
danger,
animals,
use_sex,
view_count,
like_count,
dislike_count) -> for_modelling
# Define recipe:
youtube_recipe1 <- recipe(view_count~.,data=for_modelling) %>%
step_log(all_numeric(),signed = TRUE)
# Specify model:
lm_mod1 <- linear_reg() %>%
set_engine("lm") %>%
set_mode("regression")
# specify workflow:
youtube_workflow1 <- workflow() %>%
add_model(lm_mod1) %>%
add_recipe(youtube_recipe1)
youtube_workflow1 %>%
fit(data=youtube) -> res1
Show the code
res1 %>%
extract_fit_engine() %>%
summary()
Call:
stats::lm(formula = ..y ~ ., data = data)
Residuals:
Min 1Q Median 3Q Max
-3.1713 -0.4731 -0.0319 0.5345 1.8772
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.8695938 0.2348259 24.996 <2e-16 ***
brandBudweiser -0.1981327 0.1968346 -1.007 0.3153
brandCoca-Cola -0.2307533 0.2391720 -0.965 0.3358
brandDoritos -0.2775058 0.2178496 -1.274 0.2042
brandE-Trade -0.3692176 0.2713491 -1.361 0.1751
brandHynudai -0.4576060 0.2216007 -2.065 0.0402 *
brandKia -0.5392921 0.2838512 -1.900 0.0588 .
brandNFL -0.4830126 0.3469425 -1.392 0.1654
brandPepsi -0.4037036 0.2162293 -1.867 0.0633 .
brandToyota -0.7102959 0.2977392 -2.386 0.0180 *
funnyTRUE 0.2567958 0.1510263 1.700 0.0906 .
show_product_quicklyTRUE -0.0720586 0.1289002 -0.559 0.5768
patrioticTRUE -0.0007454 0.1750340 -0.004 0.9966
celebrityTRUE -0.3458089 0.1380389 -2.505 0.0130 *
dangerTRUE -0.0462742 0.1297755 -0.357 0.7218
animalsTRUE -0.1247735 0.1239087 -1.007 0.3151
use_sexTRUE 0.2345903 0.1379419 1.701 0.0905 .
like_count 0.9861287 0.0507802 19.420 <2e-16 ***
dislike_count 0.0668984 0.0593417 1.127 0.2609
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.832 on 206 degrees of freedom
(22 observations deleted due to missingness)
Multiple R-squared: 0.9176, Adjusted R-squared: 0.9104
F-statistic: 127.4 on 18 and 206 DF, p-value: < 2.2e-16
We can easily filter for the variables with a p-value of 0.05 or less by passing the model summary to the tidy() function:
Show the code
res1 %>%
extract_fit_engine() %>%
summary() %>%
tidy() %>%
filter(p.value <= 0.05)
# A tibble: 5 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 5.87 0.235 25.0 2.67e-64
2 brandHynudai -0.458 0.222 -2.07 4.02e- 2
3 brandToyota -0.710 0.298 -2.39 1.80e- 2
4 celebrityTRUE -0.346 0.138 -2.51 1.30e- 2
5 like_count 0.986 0.0508 19.4 1.96e-48
According to our model output, the only terms with a statistically significant effect on view count are the Hyundai and Toyota brands, whether a celebrity was used in the ad, and the like count. All of these except like count have a negative effect on view count.
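Since the recipe log-transforms view_count (and like_count), the coefficients are easier to read once back-transformed. A small illustrative calculation based on the estimates above:

# Dummy coefficients act multiplicatively on the original view-count scale
exp(-0.346) - 1 # celebrityTRUE: roughly a 29% drop in expected views
exp(-0.710) - 1 # brandToyota: roughly a 51% drop relative to the baseline brand (Bud Light)
# like_count is also logged, so its coefficient acts as an elasticity:
# a 1% increase in likes is associated with roughly a 0.99% increase in views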
Another model can be explored; in this case it might be interesting to look at the interaction between like count and brand. The model summary is shown below:
Show the code
# Workflow with interaction term:
# Define recipe:
youtube_recipe2 <- recipe(view_count~.,data=for_modelling) %>%
step_dummy(brand) %>%
step_interact(terms = ~ like_count:starts_with("brand"))
# Specify model:
lm_mod2 <- linear_reg() %>%
set_engine("lm") %>%
set_mode("regression")
# specify workflow:
youtube_workflow <- workflow() %>%
add_model(lm_mod2) %>%
add_recipe(youtube_recipe2)
youtube_workflow %>%
fit(data=youtube) -> res2
res2 %>%
extract_fit_engine() %>%
summary()
Call:
stats::lm(formula = ..y ~ ., data = data)
Residuals:
Min 1Q Median 3Q Max
-5460552 -98581 5458 103160 2780431
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.973e+05 1.534e+05 1.286 0.19980
funnyTRUE 5.446e+02 1.107e+05 0.005 0.99608
show_product_quicklyTRUE -1.251e+05 9.281e+04 -1.348 0.17934
patrioticTRUE 2.386e+05 1.241e+05 1.922 0.05608 .
celebrityTRUE -3.507e+03 9.898e+04 -0.035 0.97177
dangerTRUE 3.933e+04 9.230e+04 0.426 0.67047
animalsTRUE -1.311e+05 8.833e+04 -1.484 0.13946
use_sexTRUE 8.681e+04 9.866e+04 0.880 0.37997
like_count 1.647e+01 7.246e+00 2.273 0.02411 *
dislike_count 4.804e+02 2.667e+01 18.011 < 2e-16 ***
brand_Budweiser -2.393e+05 1.417e+05 -1.689 0.09280 .
brand_Coca.Cola 6.405e+04 1.744e+05 0.367 0.71379
brand_Doritos -3.368e+05 1.509e+05 -2.232 0.02677 *
brand_E.Trade -1.751e+05 2.191e+05 -0.799 0.42514
brand_Hynudai -1.369e+05 1.744e+05 -0.785 0.43334
brand_Kia -1.895e+05 2.711e+05 -0.699 0.48540
brand_NFL 6.765e+04 2.492e+05 0.271 0.78632
brand_Pepsi -1.349e+05 1.612e+05 -0.837 0.40367
brand_Toyota -1.290e+04 2.866e+05 -0.045 0.96415
like_count_x_brand_Budweiser 3.459e+02 1.353e+01 25.560 < 2e-16 ***
like_count_x_brand_Coca.Cola 2.704e+01 1.018e+01 2.657 0.00852 **
like_count_x_brand_Doritos 4.621e+02 8.634e+00 53.517 < 2e-16 ***
like_count_x_brand_E.Trade 7.254e+02 4.764e+02 1.523 0.12946
like_count_x_brand_Hynudai 1.745e+02 3.803e+02 0.459 0.64687
like_count_x_brand_Kia 1.635e+02 1.076e+03 0.152 0.87939
like_count_x_brand_NFL 1.134e+02 7.715e+00 14.700 < 2e-16 ***
like_count_x_brand_Pepsi 7.472e+01 9.710e+01 0.770 0.44248
like_count_x_brand_Toyota -2.643e+01 3.636e+02 -0.073 0.94214
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 585300 on 197 degrees of freedom
(22 observations deleted due to missingness)
Multiple R-squared: 0.998, Adjusted R-squared: 0.9977
F-statistic: 3555 on 27 and 197 DF, p-value: < 2.2e-16
Comparing the two models
In order to determine which model should be used, we need to understand which one performs better. An ANOVA can be used to test whether the more complex model captures more of the variance than the less complex one:
Show the code
# Second model (interaction)
res2 %>%
extract_fit_engine() ->y
# First model
res1 %>%
extract_fit_engine()->x
# ANOVA Test of variance:
anova(x,y)
Analysis of Variance Table
Model 1: ..y ~ brand + funny + show_product_quickly + patriotic + celebrity +
danger + animals + use_sex + like_count + dislike_count
Model 2: ..y ~ funny + show_product_quickly + patriotic + celebrity +
danger + animals + use_sex + like_count + dislike_count +
brand_Budweiser + brand_Coca.Cola + brand_Doritos + brand_E.Trade +
brand_Hynudai + brand_Kia + brand_NFL + brand_Pepsi + brand_Toyota +
like_count_x_brand_Budweiser + like_count_x_brand_Coca.Cola +
like_count_x_brand_Doritos + like_count_x_brand_E.Trade +
like_count_x_brand_Hynudai + like_count_x_brand_Kia + like_count_x_brand_NFL +
like_count_x_brand_Pepsi + like_count_x_brand_Toyota
Res.Df RSS Df Sum of Sq F Pr(>F)
1 206 1.4300e+02
2 197 6.7492e+13 9 -6.7492e+13
The ANOVA output does not show the interaction model to be a statistically significant improvement, so there is little point in using the more complex model with the interaction term. Note, however, that the two models are fit on different response scales (the first logs view_count, the second does not), so this comparison should be treated with caution. Similarly, we can compare the models using Akaike's Information Criterion (AIC); the lower the AIC, the better the model. The results are shown below:
Show the code
AIC(x,y)
  df      AIC
x 20  575.899
y 29 6642.582
Since the first model has a much lower AIC, it is the better candidate. We can proceed to generate bootstrap confidence intervals for that model, outlined below:
Show the code
# Bake data set:
youtube_bootstrap <- recipe(view_count~.,data=for_modelling) %>%
step_log(all_numeric(),signed=TRUE) %>%
prep() %>%
bake(new_data=NULL)
# generate bootstrapped intervals
doParallel::registerDoParallel()
reg_intervals(view_count~.,
data = youtube_bootstrap,
times=1000) -> bootstrapped_model
Show the code
bootstrapped_model
# A tibble: 18 × 6
term .lower .estimate .upper .alpha .method
<chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 animalsTRUE -0.358 -0.129 0.117 0.05 student-t
2 brandBudweiser -0.631 -0.183 0.180 0.05 student-t
3 brandCoca-Cola -0.679 -0.226 0.191 0.05 student-t
4 brandDoritos -0.657 -0.275 0.0930 0.05 student-t
5 brandE-Trade -1.03 -0.359 0.329 0.05 student-t
6 brandHynudai -0.929 -0.456 0.00965 0.05 student-t
7 brandKia -1.04 -0.544 -0.0343 0.05 student-t
8 brandNFL -1.05 -0.478 0.0230 0.05 student-t
9 brandPepsi -0.823 -0.412 0.0195 0.05 student-t
10 brandToyota -1.32 -0.703 -0.205 0.05 student-t
11 celebrityTRUE -0.596 -0.345 -0.112 0.05 student-t
12 dangerTRUE -0.289 -0.0485 0.188 0.05 student-t
13 dislike_count -0.0710 0.0665 0.194 0.05 student-t
14 funnyTRUE -0.0458 0.260 0.557 0.05 student-t
15 like_count 0.872 0.986 1.12 0.05 student-t
16 patrioticTRUE -0.386 -0.00938 0.376 0.05 student-t
17 show_product_quicklyTRUE -0.320 -0.0712 0.187 0.05 student-t
18 use_sexTRUE -0.0299 0.233 0.517 0.05 student-t
The results from the bootstrap further validate the previous findings. One key distinction is that, in the bootstrap intervals, Kia has a statistically significant effect on view count while Hyundai does not.
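The terms whose bootstrap intervals exclude zero can be pulled out programmatically; a minimal sketch on the bootstrapped_model tibble above:

# Keep only terms whose 95% interval does not contain zero
bootstrapped_model %>%
filter(.lower > 0 | .upper < 0)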
Conclusion, Recommendations and Future Directions
Key takeaways
This report explored data pertaining to Superbowl ad performance on a variety of metrics. A few visualisations were provided to elaborate on the fluctuation of these metrics over time. Thereafter, the relationships between the variables were explored, with an emphasis on view count and like count. Finally, modelling was performed to examine which variables had a statistically significant effect on view count; the Toyota and Hyundai brands, the use of a celebrity, and the like count appeared to have statistically significant effects. A bootstrap exercise showed that Kia, rather than Hyundai, has a statistically significant effect on view count. Given that the bootstrap results differ slightly from the non-bootstrapped ones, it is recommended that a more granular analysis considering brand effects and their potential interactions with other variables be carried out.
Recommendations and Future Directions
As per the results, it is recommended not to rely on celebrities in adverts, as their use was associated with lower view counts in this data. Like count is a strong indicator of view count and should therefore be closely monitored by employing social listening. It is also recommended that this research exercise be repeated on a longitudinal basis and across a greater range of years, in order to pick out other effects on view count.
Future analysis might involve predictive analytics using non-parametric techniques, with the aim of predicting viewership numbers.
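As a starting point for such an extension, here is a minimal, untested sketch of a random forest workflow in the same tidymodels style used above; the ranger engine and the reuse of for_modelling are assumptions, not part of the original analysis:

# Hypothetical non-parametric model for view_count
library(tidymodels)
rf_recipe <- recipe(view_count ~ ., data = for_modelling)
rf_mod <- rand_forest(trees = 500) %>%
set_engine("ranger") %>%
set_mode("regression")
rf_workflow <- workflow() %>%
add_model(rf_mod) %>%
add_recipe(rf_recipe)
# ranger does not handle missing values, so drop incomplete rows before fitting
rf_fit <- fit(rf_workflow, data = na.omit(for_modelling))
rf_fit %>% extract_fit_engine()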