Exploring Orange Juice Sales - Dominick’s Grocery Store

Author

Nils Indreiten

Overview

This report is concerned with the sales data for orange juice from the Grocery Store Dominick’s, a Chicago store chain. The data were collected in the 1990’s and is available from the Kilts Center at The University of Chicago’s Booth School of Business. The data contains variables about weekly prices as well as sales for three respective brands, and a variable to denote whether the brand was advertised that week. Lets take a look at the data:

Data summary
Name OJ
Number of rows 28947
Number of columns 4
_______________________
Column type frequency:
character 1
numeric 3
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
brand 0 1 9 11 0 3 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
sales 0 1 17312.21 27477.66 64.00 4864.00 8384.00 17408.00 716416.00 ▇▁▁▁▁
price 0 1 2.28 0.65 0.52 1.79 2.17 2.73 3.87 ▁▆▇▅▂
feat 0 1 0.24 0.43 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▂

It might be interesting to visualize the price broken down by brand

Show the code
# Store the log conversions in new df for easier plotting
 OJ |> 
  mutate(price=log(price),
         move= log(sales)) ->OJ2
# Log price
OJ2 %>% 
  ggplot(aes(brand,price)) +
  geom_boxplot(fill="indianred") +
  geom_hline(yintercept = 0)+
  labs(y = "Log(price)")+
  theme_light()

Equally the log sales by log price for each brand :

Show the code
# Log salesXprice
OJ2 %>% 
  ggplot(aes(price,move, color=brand))+
  geom_jitter()+
  scale_color_manual(values = pal)+
  labs(x="Log(price)",
       y="Log(sales)",
       color="Brand")+
  theme_light()

Modeling

Lets create a simple regression model sales as a function of price and brand:

Show the code
reg <- glm(log(sales)~brand+log(price), data=OJ)
reg |> 
  tidy()
# A tibble: 4 × 5
  term             estimate std.error statistic p.value
  <chr>               <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)        10.8      0.0145     745.        0
2 brandminute.maid    0.870    0.0129      67.3       0
3 brandtropicana      1.53     0.0163      93.8       0
4 log(price)         -3.14     0.0229    -137.        0
Show the code
# visualise:
OJ2 |> ggplot(aes(price,move, group=brand,color=brand))+
  geom_point(color="black")+
  stat_smooth(method="lm")+
  ylab("log(sales)")+
  xlab("log(price)")+
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'

The first thing to note is that \(\beta\)=-3.1 for the log(price) effect. In other words sales drop by around 3% for every 1% increase in price. The second is the effect sizes for the brand type, i.e. a dummy variable is created to account for the effect of the respective brand. However, it would be interesting to see the interaction effect between brand and log price. This can be defined using the * term in the input formula:

Show the code
interaction_reg_model <- lm(log(sales)~ log(price)*brand, data =OJ)
beta <- coef(interaction_reg_model)
interaction_reg_model |> 
  tidy()
# A tibble: 6 × 5
  term                        estimate std.error statistic   p.value
  <chr>                          <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)                  11.0       0.0207   529.    0        
2 log(price)                   -3.38      0.0362   -93.3   0        
3 brandminute.maid              0.888     0.0416    21.4   1.34e-100
4 brandtropicana                0.962     0.0464    20.7   1.13e- 94
5 log(price):brandminute.maid   0.0568    0.0573     0.991 3.22e-  1
6 log(price):brandtropicana     0.666     0.0535    12.4   1.97e- 35

The interaction term can be added to the slope to get the elasticities for the other brands. These are the terms containing the log(price):brand interaction:

Show the code
interaction_reg_model |> 
  tidy() |> 
  filter(str_detect(term, 'log'))
# A tibble: 3 × 5
  term                        estimate std.error statistic  p.value
  <chr>                          <dbl>     <dbl>     <dbl>    <dbl>
1 log(price)                   -3.38      0.0362   -93.3   0       
2 log(price):brandminute.maid   0.0568    0.0573     0.991 3.22e- 1
3 log(price):brandtropicana     0.666     0.0535    12.4   1.97e-35

It now becomes evident that the Tropicana customers are less sensitive than the others (-2.7) compared to -3.3 (log(price)). Take this in contrast to the -3.1 price sensitivity in the previous model. In other words the -3.1 price sensitivity was an average across the two distinct populations of consumers.

Our data set contains some information regarding whether the brand was promoted in an in-door display promo or a flier ad during the week the sales data was recorded. This allows for the exploration of the relationship between price and sales. The ads may positively impact sales at all price points, therefore impacting price sensitivity, and both of these effects can occur for each of the respective brands. This can be expressed in terms of a three way interaction between price, brand, and feat. This interaction model is defined below:

Show the code
ojreg <- glm(log(sales)~log(price)*brand*feat, data = OJ)
ojreg |> 
  tidy()
# A tibble: 12 × 5
   term                             estimate std.error statistic   p.value
   <chr>                               <dbl>     <dbl>     <dbl>     <dbl>
 1 (Intercept)                       10.4       0.0234    446.   0        
 2 log(price)                        -2.77      0.0388    -71.4  0        
 3 brandminute.maid                   0.0472    0.0466      1.01 3.11e-  1
 4 brandtropicana                     0.708     0.0508     13.9  5.26e- 44
 5 feat                               1.09      0.0381     28.7  6.70e-179
 6 log(price):brandminute.maid        0.783     0.0614     12.8  3.90e- 37
 7 log(price):brandtropicana          0.736     0.0568     12.9  3.17e- 38
 8 log(price):feat                   -0.471     0.0741     -6.35 2.17e- 10
 9 brandminute.maid:feat              1.17      0.0820     14.3  2.65e- 46
10 brandtropicana:feat                0.785     0.0987      7.95 1.90e- 15
11 log(price):brandminute.maid:feat  -1.11      0.122      -9.07 1.22e- 19
12 log(price):brandtropicana:feat    -0.986     0.124      -7.95 2.00e- 15

The brand specific elasticities are now the following:

Advertised Dominicks MinuteMaid Tropicana
Not Featured -2.8 -2.0 -2.0
Featured -3.2 -3.6 -3.5

This reveals that being featured leads to more price sensitivity. The Minute Maid and Tropicana elasticities drop from -2.0 to below -3.5 when features. In contrast, Dominick’s feature results in a drop from -2.8 to -3.2. One possible explanation is that being featured can increase market reach beyond brand loyalists, i.e. to include consumers that are more price sensitive than those making repeat purchases on a weekly basis. Alternatively, another explanation to why this might be happening might be due to the fact that featured products are often also discounted, this alludes that the demand curve may be nonlinear,the average consumer is more price sensitive at lower price points. However, the reality likely is that there is a combination of these effects. The keen observer might have noticed that relative to the model including the interaction between price and brand, in the three way interaction model, Minute Maid and Tropicana have almost identical elasticities. This is due to the fact that the one way interaction model, was confounding the effects of being featured and the brands.

The figure above shows that Minute Maid was featured more relative to Tropicana. Given that being featured leads to more price sensitivity, Minute Maid will appear more price sensitive when not accounting for the effect of being featured.

Bootstrapping

It might be interesting to get a confidence intervals for the coefficients and increase the certainty of their effect sizes using bootstrap:

Show the code
doParallel::registerDoParallel()
reg_intervals(log(sales)~log(price)*brand*feat, 
              data = OJ,
              times=1000) -> bootstrapped_model

We now have the bootstrapped estimated effect size and an upper and lower confidence interval:

Show the code
bootstrapped_model
# A tibble: 11 × 6
   term                                 .lower .estimate .upper .alpha .method  
   <chr>                                 <dbl>     <dbl>  <dbl>  <dbl> <chr>    
 1 brandminute.maid                    -0.0412    0.0464  0.141   0.05 student-t
 2 brandminute.maid:featAds             0.991     1.17    1.36    0.05 student-t
 3 brandtropicana                       0.596     0.706   0.813   0.05 student-t
 4 brandtropicana:featAds               0.603     0.787   0.961   0.05 student-t
 5 featAds                              0.999     1.09    1.19    0.05 student-t
 6 log(price)                          -2.86     -2.77   -2.68    0.05 student-t
 7 log(price):brandminute.maid          0.653     0.783   0.910   0.05 student-t
 8 log(price):brandminute.maid:featAds -1.44     -1.11   -0.789   0.05 student-t
 9 log(price):brandtropicana            0.608     0.737   0.857   0.05 student-t
10 log(price):brandtropicana:featAds   -1.26     -0.986  -0.704   0.05 student-t
11 log(price):featAds                  -0.696    -0.473  -0.242   0.05 student-t

We can visualize the bootstrapped effect size for each and their respective confidence intervals:

Show the code
bootstrapped_model %>% 
  ggplot(aes(.estimate, term,fill="midnightblue"))+
  geom_errorbar(linewidth=1.4,alpha = 0.7,
                aes(xmin=.lower, xmax=.upper))+
  geom_point()+
  theme_light()+
    theme(legend.position = "none")+
  labs(title = "Boostrapped Model Terms Confidence Intervals")

Conclusion

This report revisits data from the University of Chicago’s Booth School of Business. It explores data relating to orange juice brands, with the aim of establishing price elasticity and their effect on sales. Furthermore, the analysis uncovers interaction effects between price, brand, and being included in advertising/promotion. Finally, bootstrapped confidence intervals were developed for additional robustness.