Customer Conversion - A CRM Perspective

Objectives:

The objective of this report is to provide a concise overview of re-join campaign performance. The key KPIs considered are conversion rate, uplift versus control, and incremental conversions. In addition, the report applies some predictive analytics techniques to predict conversion.

The sample

The sample contains a few variables; let's take a quick look:

glimpse(gym_group)
## Rows: 60,486
## Columns: 14
## $ Dummy.GUID              <chr> "00004331-1feb-7f48-8a15-3c9ead0f8fcf", "00015…
## $ cancellation.date       <date> 2020-10-16, 2019-07-25, 2019-07-21, 2019-12-0…
## $ Gender                  <fct> MALE, FEMALE, FEMALE, MALE, MALE, FEMALE, MALE…
## $ Age                     <int> 32, 20, 17, 38, 32, 22, 23, 30, 29, 23, 29, 30…
## $ Cohort                  <fct> Control, Control, Control, Treated, Control, T…
## $ Member.Value            <fct> High, low, low, High, low, High, low, low, Hig…
## $ Campaign.Start.Date     <date> 2021-04-05, 2021-04-05, 2021-04-05, 2021-04-0…
## $ Campaign.End.Date       <date> 2021-06-01, 2021-06-01, 2021-06-01, 2021-06-0…
## $ Email                   <int> 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1…
## $ SMS                     <int> 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1…
## $ Converted               <int> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Conversion.Date         <date> NA, NA, NA, 2021-04-22, NA, NA, NA, NA, NA, N…
## $ First.Transaction.Value <dbl> 0, 0, 0, 8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Ongoing.Monthly.Amount  <dbl> 0.00, 0.00, 0.00, 15.99, 0.00, 0.00, 0.00, 0.0…

Exploratory Data Analysis:

It would be interesting to get an impression of the number of conversions per cohort over the duration of the campaign:

# Difference between the number of conversions in the control and treated groups
conversions_cohort <- gym_group %>% filter(Converted == 1) %>% select(Conversion.Date, Cohort) %>% group_by(Conversion.Date, Cohort) %>%  summarise(count=n())
conversions_cohort %>% ggplot(aes(Conversion.Date, count, color=Cohort))+ geom_line()+theme_classic()+scale_y_continuous(limits = c(0,300))+ylab("Count")+xlab("")+ggtitle("Difference in number of conversions between cohorts")

The distribution of age by gender gives an idea of the demographic profile of the sample:

# histogram of age by gender
gym_group %>% ggplot(aes(Age, after_stat(density), fill=Gender))+geom_histogram(alpha=.5, position="identity", binwidth = 1)+scale_x_continuous(limits = c(10,90))+scale_y_continuous(expand = c(0,0))+theme_minimal()

KPIs

Treated conversions

The number of conversions in the treated cohort:

treated <- gym_group %>% filter(Cohort == "Treated")
df <- tribble(~Conversion, ~Count, "Conversion", 4211, "No conversion", 38607)
knitr::kable(df)
Conversion      Count
Conversion       4211
No conversion   38607
p1 <- df %>% ggplot(aes(Conversion, Count, fill=Conversion))+geom_bar(stat = "identity",width=1)+theme_minimal()+theme(legend.position = "none")+scale_y_continuous(limits=c(0,40000),expand = c(0,0))+xlab("")+geom_text(aes(label=Count),vjust=0)+ggtitle("Treated")
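The counts above are hard-coded in a tribble; they can also be derived directly from the data. A sketch, assuming the `treated` subset created above:

```r
# Sketch: derive the conversion counts from the data instead of hard-coding
treated %>%
  count(Converted) %>%
  mutate(Conversion = ifelse(Converted == 1, "Conversion", "No conversion")) %>%
  select(Conversion, Count = n)
```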

Conversion rate of the cohorts:

df4 <- tribble(~Cohort, ~Conversion_rate,
               "Treated", 0.10907,
               "Control",0.081339)
kable(df4)
Cohort    Conversion_rate
Treated           0.10907
Control           0.08134

Therefore the incremental uplift in conversion rate is 2.77 percentage points (0.027731).
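The cohort rates and the uplift between them can be computed directly from the data rather than hard-coded. A sketch, assuming `Converted` is still the 0/1 integer column:

```r
# Sketch: cohort conversion rates and the uplift between them
rates <- gym_group %>%
  group_by(Cohort) %>%
  summarise(conversion_rate = mean(Converted))
uplift <- rates$conversion_rate[rates$Cohort == "Treated"] -
  rates$conversion_rate[rates$Cohort == "Control"]
```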

Incremental conversions:

The disparity between the control and treated cohorts is shown below, with the control rate indexed to 1.00:

df3 <- tribble(~Variable, ~ Rate, "Control", 1.00, "Treated", 1.34)
df3 %>% ggplot(aes(Variable, Rate))+geom_bar(stat="identity", fill="midnightblue")+theme_classic()+ geom_hline(yintercept = 1.00, linetype='dashed', col='grey')+ scale_y_continuous(limits = c(0.00,1.5), expand = c(0,0))+xlab("")+ylab("Index (Control = 1.00)")+ggtitle("Conversion Rate & Incremental Uplift")

An alternative visualisation is a waterfall chart, which makes it clearer that the uplift is 0.34 (i.e. 34% relative to control):

categories <- c("Control", "Uplift")
amount <- c(1.00, 0.34)
df4 <- data.frame(categories, amount)
waterfall(df4, 
          calc_total=TRUE, 
          total_axis_text = "Treated",
          total_rect_text_color="black",
          total_rect_color="goldenrod1") +
  scale_y_continuous(limits = c(0,1.5)) +
  labs(title = "Conversion Rate & Incremental Uplift", 
       y="Percent", 
       x="") +
  theme_minimal() 

In other words, this is the incremental difference in performance between the two cohorts, accounting for the conversion activity that would have taken place regardless of marketing activity.
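Incremental conversions can then be estimated as the size of the treated cohort multiplied by the uplift in conversion rate. A sketch, using the rates reported above:

```r
# Sketch: estimate incremental conversions attributable to the campaign
treated_n <- sum(gym_group$Cohort == "Treated")
uplift    <- 0.10907 - 0.081339   # treated minus control conversion rate
incremental_conversions <- treated_n * uplift
```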

Test for statistical significance

We can test for statistical significance to confirm the findings:

library(MASS)
chisq.test(table(gym_group$Converted, gym_group$Cohort))
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table(gym_group$Converted, gym_group$Cohort)
## X-squared = 80.1, df = 1, p-value <2e-16

We observe a Pearson chi-squared statistic of \(X^2\)(1) = 80.1, with p < 0.05. We therefore reject the null hypothesis: there is strong evidence of an association between treatment and conversion.

# Check assumptions: the test is valid with a reasonable sample size, fewer than 20% of cells with an expected count below 5, and no cell with an expected count below 1.
result <- chisq.test(table(gym_group$Converted, gym_group$Cohort))
result$expected
##    
##     Control Treated
##   0 16049.8 38896.2
##   1  1618.2  3921.8
# None of the expected frequencies are less than 5, so the chi-squared test is valid.
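As a complementary check, a two-sample proportion test reaches the same conclusion and additionally reports a confidence interval for the difference in conversion rates. A sketch:

```r
# Sketch: two-sample test of proportions between cohorts
tab <- table(gym_group$Cohort, gym_group$Converted)
prop.test(x = tab[, "1"], n = rowSums(tab))
```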

Two-Way ANOVA

The difference in conversion between groups can be tested with a two-way ANOVA, here using Cohort and Member.Value as the factors:

ano <- gym_group %>% dplyr::select(Converted, Cohort, Member.Value)
two.way <- aov(Converted~Cohort+Member.Value, data = ano)
tidy(two.way)
## # A tibble: 3 × 6
##   term            df   sumsq  meansq statistic    p.value
##   <chr>        <dbl>   <dbl>   <dbl>     <dbl>      <dbl>
## 1 Cohort           1    6.69  6.69        81.9  1.48e- 19
## 2 Member.Value     1   85.7  85.7       1049.   3.71e-228
## 3 Residuals    60483 4940.    0.0817      NA   NA

A Tukey HSD post-hoc test confirms that the differences between the groups are statistically significant:

tuke.two.way <- TukeyHSD(two.way)
tuke.two.way
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Converted ~ Cohort + Member.Value, data = ano)
## 
## $Cohort
##                     diff      lwr      upr p adj
## Treated-Control 0.023126 0.018117 0.028134     0
## 
## $Member.Value
##               diff       lwr       upr p adj
## low-High -0.076474 -0.081102 -0.071846     0

Groupwise differences

It might be interesting to find out between which pairs of groups the difference is statistically significant:

tukey.plot <- aov(Converted~Cohort:Member.Value, data = ano)
tuke.plot.test<- TukeyHSD(tukey.plot)
plot_with_gg <- as.data.frame(tuke.plot.test[["Cohort:Member.Value"]])
plot_with_gg <- rownames_to_column(plot_with_gg)
plot_with_gg %>% ggplot(aes(diff, rowname, color=rowname)) +geom_point()+geom_errorbar(aes(xmin=lwr,xmax=upr)) +theme_minimal()+ylab("")+xlab("")+scale_x_continuous(limits = c(-0.11, 0.05))+theme(legend.position = "none")+ggtitle("Difference in conversions (95% confidence interval)")

As the plot shows, there is no statistically significant difference in conversion between low-value members in the treated and control cohorts.

Predictive Analytics

It is possible to engage in some predictive analytics using machine learning models. In this case, let's try to predict Converted using Gender, Age, Cohort, Member.Value, Email, and SMS as predictors. Let's create a new dataset containing this information:

# create dataframe for model:
for_modelling <- gym_group %>% dplyr::select(Converted, Gender, Age, Cohort, Member.Value, Email, SMS)
for_modelling %>% head()
##   Converted Gender Age  Cohort Member.Value Email SMS
## 1         0   MALE  32 Control         High     1   1
## 2         0 FEMALE  20 Control          low     1   0
## 3         0 FEMALE  17 Control          low     0   1
## 4         1   MALE  38 Treated         High     0   1
## 5         0   MALE  32 Control          low     0   1
## 6         0 FEMALE  22 Treated         High     1   1
for_modelling$Converted <-factor(for_modelling$Converted, labels= c("not converted","converted"))

Data splitting and resampling

In order to train and test the model, we need to split the dataset into training and testing sets:

set.seed(123)
splits <- initial_split(for_modelling,strata = Converted)
gym_training <- training(splits)
gym_testing <- testing(splits)

Create validation set

A single validation set is created to assess model performance during tuning:

set.seed(234)
val_set <- validation_split(gym_training, strata = Converted, prop = 0.80)
val_set
## # Validation Set Split (0.8/0.2)  using stratification 
## # A tibble: 1 × 2
##   splits               id        
##   <list>               <chr>     
## 1 <split [36291/9073]> validation
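Note that in more recent versions of rsample (1.2+), `validation_split()` is superseded. An equivalent validation set can be sketched with the newer API (an assumption about rsample 1.2+, not the approach used in this report):

```r
# Sketch: the newer rsample API for a train/validation/test split
set.seed(234)
splits3 <- initial_validation_split(for_modelling, strata = Converted,
                                    prop = c(0.6, 0.2))
val_set_alt <- validation_set(splits3)
```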

Building the models

Logistic Regression

Specifying the model:

lr_mod <- 
  logistic_reg(penalty = tune(), mixture = 1) %>% 
  set_engine("glmnet")

Specifying the recipe for the model:

lr_recipe <- 
  recipe(Converted~.,data = gym_training) %>% 
  step_dummy(all_nominal(),-all_outcomes()) %>% 
  step_zv(all_predictors()) %>% 
  step_normalize(all_predictors())

Create the workflow:

lr_workflow <- 
  workflow() %>% 
  add_model(lr_mod) %>% 
  add_recipe(lr_recipe)

Create tuning grid:

lr_reg_grid <- tibble(penalty = 10^seq(-4, -1, length.out = 30))
lr_reg_grid %>% top_n(-5)
## Selecting by penalty
## # A tibble: 5 × 1
##    penalty
##      <dbl>
## 1 0.0001  
## 2 0.000127
## 3 0.000161
## 4 0.000204
## 5 0.000259

Train and tune the model:

lr_res <- lr_workflow %>%   tune_grid(val_set,
            grid = lr_reg_grid,
            control = control_grid(save_pred = TRUE),
            metrics = metric_set(roc_auc))

Visualising the effect of the penalty parameter on the area under the ROC curve:

lr_plot <- 
  lr_res %>% 
  collect_metrics() %>% 
  ggplot(aes(x = penalty, y = mean)) + 
  geom_point() + 
  geom_line() + 
  ylab("Area under the ROC Curve") +
  scale_x_log10(labels = scales::label_number())+theme_light()

lr_plot 

Identify the top-performing models:

top_models <-
  lr_res %>% 
  show_best("roc_auc", n = 15) %>% 
  arrange(penalty) 
top_models
## # A tibble: 15 × 7
##     penalty .metric .estimator  mean     n std_err .config              
##       <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
##  1 0.0001   roc_auc binary     0.644     1      NA Preprocessor1_Model01
##  2 0.000127 roc_auc binary     0.644     1      NA Preprocessor1_Model02
##  3 0.000161 roc_auc binary     0.644     1      NA Preprocessor1_Model03
##  4 0.000204 roc_auc binary     0.644     1      NA Preprocessor1_Model04
##  5 0.000259 roc_auc binary     0.644     1      NA Preprocessor1_Model05
##  6 0.000329 roc_auc binary     0.644     1      NA Preprocessor1_Model06
##  7 0.000418 roc_auc binary     0.644     1      NA Preprocessor1_Model07
##  8 0.000530 roc_auc binary     0.644     1      NA Preprocessor1_Model08
##  9 0.000672 roc_auc binary     0.644     1      NA Preprocessor1_Model09
## 10 0.000853 roc_auc binary     0.644     1      NA Preprocessor1_Model10
## 11 0.00108  roc_auc binary     0.643     1      NA Preprocessor1_Model11
## 12 0.00137  roc_auc binary     0.643     1      NA Preprocessor1_Model12
## 13 0.00174  roc_auc binary     0.643     1      NA Preprocessor1_Model13
## 14 0.00728  roc_auc binary     0.643     1      NA Preprocessor1_Model19
## 15 0.00924  roc_auc binary     0.643     1      NA Preprocessor1_Model20

Select the values of the best model and visualise the ROC curve:

lr_best <- 
  lr_res %>% 
  collect_metrics() %>% 
  arrange(penalty) %>% 
  slice(12)
lr_best
## # A tibble: 1 × 7
##   penalty .metric .estimator  mean     n std_err .config              
##     <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
## 1 0.00137 roc_auc binary     0.643     1      NA Preprocessor1_Model12

Plot the ROC curve:

lr_auc <- 
  lr_res %>% 
  collect_predictions(parameters = lr_best) %>% 
  roc_curve(Converted,  `.pred_not converted`) %>% 
  mutate(model = "Logistic Regression")

autoplot(lr_auc)
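The corresponding AUC estimate can also be computed numerically from the same predictions. A sketch:

```r
# Sketch: numeric AUC for the selected penalty value
lr_res %>%
  collect_predictions(parameters = lr_best) %>%
  roc_auc(Converted, `.pred_not converted`)
```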

Random Forest Model

Activate parallel computing:

cores <- parallel::detectCores()

Specify the model:

rf_mod <- 
  rand_forest(mtry = tune(), min_n = tune(), trees = 1000) %>% 
  set_engine("ranger", num.threads = cores) %>% 
  set_mode("classification")

Specify the recipe:

rf_recipe <- 
  recipe(Converted ~ ., data = gym_training) 

Build the workflow:

rf_workflow <- 
  workflow() %>% 
  add_model(rf_mod) %>% 
  add_recipe(rf_recipe)

Train and tune the model:

set.seed(234)
rf_res <- 
  rf_workflow %>% 
  tune_grid(val_set,
            grid = 25,
            control = control_grid(save_pred = TRUE),
            metrics = metric_set(roc_auc))
## i Creating pre-processing data to finalize unknown parameter: mtry

Top 5 performing models:

rf_res %>% 
  show_best(metric = "roc_auc")
## # A tibble: 5 × 8
##    mtry min_n .metric .estimator  mean     n std_err .config              
##   <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
## 1     1    19 roc_auc binary     0.646     1      NA Preprocessor1_Model19
## 2     1     6 roc_auc binary     0.646     1      NA Preprocessor1_Model18
## 3     1     7 roc_auc binary     0.646     1      NA Preprocessor1_Model05
## 4     2    12 roc_auc binary     0.646     1      NA Preprocessor1_Model20
## 5     2    23 roc_auc binary     0.646     1      NA Preprocessor1_Model10

Effect of parameters on area under the ROC curve:

autoplot(rf_res)+theme_light()

Select the parameters of the best performing model:

rf_best <- 
  rf_res %>% 
  select_best(metric = "roc_auc")
rf_best
## # A tibble: 1 × 3
##    mtry min_n .config              
##   <int> <int> <chr>                
## 1     1    19 Preprocessor1_Model19
rf_res %>% 
  collect_predictions()
## # A tibble: 226,825 × 8
##    id         `.pred_not converted` .pred_converted  .row  mtry min_n Converted 
##    <chr>                      <dbl>           <dbl> <int> <int> <int> <fct>     
##  1 validation                 0.887          0.113     11     5    28 not conve…
##  2 validation                 0.875          0.125     14     5    28 not conve…
##  3 validation                 0.843          0.157     22     5    28 not conve…
##  4 validation                 0.956          0.0437    23     5    28 not conve…
##  5 validation                 0.927          0.0726    24     5    28 converted 
##  6 validation                 0.967          0.0327    31     5    28 converted 
##  7 validation                 0.833          0.167     35     5    28 not conve…
##  8 validation                 0.808          0.192     38     5    28 not conve…
##  9 validation                 0.922          0.0779    39     5    28 not conve…
## 10 validation                 0.747          0.253     46     5    28 not conve…
## # ℹ 226,815 more rows
## # ℹ 1 more variable: .config <chr>

Create dataframe for comparison with logistic regression model:

rf_auc <- 
  rf_res %>% 
  collect_predictions(parameters = rf_best) %>% 
  roc_curve(Converted,  `.pred_not converted`) %>% 
  mutate(model = "Random Forest")

Compare ROC for both models:

bind_rows(rf_auc, lr_auc) %>% 
  ggplot(aes(x = 1 - specificity, y = sensitivity, col = model)) + 
  geom_path(lwd = 1.5, alpha = 0.8) +
  geom_abline(lty = 3) + 
  coord_equal() + 
  scale_color_viridis_d(option = "plasma", end = .6)+theme_light()+theme(legend.position = "top")

Performance isn't great for either model; let's go with the random forest for the final fit.

The last fit

For the final fit we repeat the previous steps:

last_rf_mod <- 
  rand_forest(mtry = 1, min_n = 19, trees = 1000) %>% 
  set_engine("ranger", num.threads = cores, importance = "impurity") %>% 
  set_mode("classification")
# the last workflow
last_rf_workflow <- 
  rf_workflow %>% 
  update_model(last_rf_mod)

# the last fit
set.seed(345)
last_rf_fit <- 
  last_rf_workflow %>% 
  last_fit(splits)

last_rf_fit
## # Resampling results
## # Manual resampling 
## # A tibble: 1 × 6
##   splits                id             .metrics .notes   .predictions .workflow 
##   <list>                <chr>          <list>   <list>   <list>       <list>    
## 1 <split [45364/15122]> train/test sp… <tibble> <tibble> <tibble>     <workflow>
last_rf_fit %>% collect_metrics()
## # A tibble: 2 × 4
##   .metric  .estimator .estimate .config             
##   <chr>    <chr>          <dbl> <chr>               
## 1 accuracy binary         0.906 Preprocessor1_Model1
## 2 roc_auc  binary         0.646 Preprocessor1_Model1

Variable importance for conversion:

last_rf_fit %>% 
  pluck(".workflow", 1) %>%   
  extract_fit_parsnip() %>% 
  vip(num_features = 5)+theme_light()

This suggests that the most important variables determining conversion are Member.Value, Age, and Cohort.

Finally we can visualise the ROC curve for the final fitted model:

last_rf_fit %>% 
  collect_predictions() %>% 
  roc_curve(Converted, `.pred_not converted`) %>% 
  autoplot()

Conclusion and Future directions

The report presents some key findings, particularly conversion rates and incremental uplift, and engages in predictive modelling with logistic regression and random forest models. Although predictive performance was modest, the work provides a foundation for future research. Future analysis could test whether differences between other variables are statistically significant, for instance the conversion of ex-members receiving SMS versus email communications, or differences by gender. As for the predictive models, further model types could be explored, hyperparameters tuned more thoroughly, and k-fold cross-validation adopted.
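For instance, k-fold cross-validation could replace the single validation set with only a small change. A sketch, assuming the existing `rf_workflow` and training data:

```r
# Sketch: tune the random forest over 10 cross-validation folds
set.seed(345)
folds <- vfold_cv(gym_training, v = 10, strata = Converted)
rf_cv_res <- rf_workflow %>%
  tune_grid(folds,
            grid = 25,
            control = control_grid(save_pred = TRUE),
            metrics = metric_set(roc_auc))
```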

Session Info

sessionInfo()
## R version 4.3.2 (2023-10-31)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Sonoma 14.4.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: Europe/London
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] ranger_0.15.1      glmnet_4.1-7       Matrix_1.6-1.1     MASS_7.3-60       
##  [5] waterfalls_1.0.0   vip_0.3.2          knitr_1.44         patchwork_1.1.3   
##  [9] yardstick_1.2.0    workflowsets_1.0.1 workflows_1.1.3    tune_1.1.2        
## [13] rsample_1.2.0      recipes_1.0.8      parsnip_1.1.1      modeldata_1.2.0   
## [17] infer_1.0.5        dials_1.2.0        scales_1.2.1       broom_1.0.5       
## [21] tidymodels_1.1.1   lubridate_1.9.3    forcats_1.0.0      stringr_1.5.0     
## [25] dplyr_1.1.3        purrr_1.0.2        readr_2.1.4        tidyr_1.3.0       
## [29] tibble_3.2.1       ggplot2_3.4.4      tidyverse_2.0.0   
## 
## loaded via a namespace (and not attached):
##  [1] gridExtra_2.3       rlang_1.1.1         magrittr_2.0.3     
##  [4] furrr_0.3.1         compiler_4.3.2      vctrs_0.6.4        
##  [7] lhs_1.1.6           crayon_1.5.2        pkgconfig_2.0.3    
## [10] shape_1.4.6         fastmap_1.1.1       backports_1.4.1    
## [13] ellipsis_0.3.2      rmdformats_1.0.4    labeling_0.4.3     
## [16] utf8_1.2.3          rmarkdown_2.25      prodlim_2023.08.28 
## [19] tzdb_0.4.0          xfun_0.40           cachem_1.0.8       
## [22] jsonlite_1.8.7      parallel_4.3.2      R6_2.5.1           
## [25] bslib_0.5.1         stringi_1.7.12      parallelly_1.36.0  
## [28] rpart_4.1.21        jquerylib_0.1.4     Rcpp_1.0.11        
## [31] bookdown_0.36       iterators_1.0.14    future.apply_1.11.0
## [34] pacman_0.5.1        splines_4.3.2       nnet_7.3-19        
## [37] timechange_0.2.0    tidyselect_1.2.0    rstudioapi_0.15.0  
## [40] yaml_2.3.7          timeDate_4022.108   codetools_0.2-19   
## [43] listenv_0.9.0       lattice_0.21-9      withr_2.5.1        
## [46] evaluate_0.22       future_1.33.0       survival_3.5-7     
## [49] pillar_1.9.0        foreach_1.5.2       generics_0.1.3     
## [52] hms_1.1.3           munsell_0.5.0       globals_0.16.2     
## [55] class_7.3-22        glue_1.6.2          tools_4.3.2        
## [58] data.table_1.14.8   gower_1.0.1         grid_4.3.2         
## [61] ipred_0.9-14        colorspace_2.1-0    cli_3.6.1          
## [64] DiceDesign_1.9      fansi_1.0.5         viridisLite_0.4.2  
## [67] lava_1.7.2.1        gtable_0.3.4        GPfit_1.0-8        
## [70] sass_0.4.7          digest_0.6.33       farver_2.1.1       
## [73] htmltools_0.5.6.1   lifecycle_1.0.3     hardhat_1.3.0