Customer Conversion - A CRM Perspective
Objectives:
The objective of this report is to provide a concise overview of re-join campaign performance. The key KPIs considered are conversion rate, uplift versus control, and incremental conversions. In addition, the report applies some predictive analytics techniques to predict conversion.
The sample
The sample contains a few variables; let's take a quick look:
## Rows: 60,486
## Columns: 14
## $ Dummy.GUID <chr> "00004331-1feb-7f48-8a15-3c9ead0f8fcf", "00015…
## $ cancellation.date <date> 2020-10-16, 2019-07-25, 2019-07-21, 2019-12-0…
## $ Gender <fct> MALE, FEMALE, FEMALE, MALE, MALE, FEMALE, MALE…
## $ Age <int> 32, 20, 17, 38, 32, 22, 23, 30, 29, 23, 29, 30…
## $ Cohort <fct> Control, Control, Control, Treated, Control, T…
## $ Member.Value <fct> High, low, low, High, low, High, low, low, Hig…
## $ Campaign.Start.Date <date> 2021-04-05, 2021-04-05, 2021-04-05, 2021-04-0…
## $ Campaign.End.Date <date> 2021-06-01, 2021-06-01, 2021-06-01, 2021-06-0…
## $ Email <int> 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1…
## $ SMS <int> 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1…
## $ Converted <int> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Conversion.Date <date> NA, NA, NA, 2021-04-22, NA, NA, NA, NA, NA, N…
## $ First.Transaction.Value <dbl> 0, 0, 0, 8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Ongoing.Monthly.Amount <dbl> 0.00, 0.00, 0.00, 15.99, 0.00, 0.00, 0.00, 0.0…
Exploratory Data Analysis:
It would be interesting to get an impression of the number of conversions per cohort throughout the duration of the campaign:
# Difference between the number of conversions in the control and treated groups
conversions_cohort <- gym_group %>%
  filter(Converted == 1) %>%
  select(Conversion.Date, Cohort) %>%
  group_by(Conversion.Date, Cohort) %>%
  summarise(count = n())
conversions_cohort %>% ggplot(aes(Conversion.Date, count, color = Cohort)) +
  geom_line() +
  theme_classic() +
  scale_y_continuous(limits = c(0, 300)) +
  ylab("Count") +
  xlab("") +
  ggtitle("Difference in number of conversions between cohorts")
The distribution of age by gender will give us an idea of some of the demographic variables of the sample:
# histogram of age by gender
gym_group %>% ggplot(aes(Age, after_stat(density), fill = Gender)) +
  geom_histogram(alpha = .5, position = "identity", binwidth = 1) +
  scale_x_continuous(limits = c(10, 90)) +
  scale_y_continuous(expand = c(0, 0)) +
  theme_minimal()
KPIs
Treated conversions
The number of conversions in the treated cohort:
treated <- gym_group %>% filter(Cohort == "Treated")
df <- tribble(~Conversion, ~Count, "Conversion", 4211, "No conversion", 38607)
knitr::kable(df)
| Conversion | Count |
|---|---|
| Conversion | 4211 |
| No conversion | 38607 |
Conversion rate of the cohorts:
| Cohort | Conversion_rate |
|---|---|
| Treated | 0.10907 |
| Control | 0.08134 |
Therefore the incremental uplift in conversion rate is 2.77 percentage points (0.02773).
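The arithmetic behind these figures can be sketched in a few lines of base R (the rates are taken from the table above; small rounding differences against the report's unrounded values are expected):

```r
# Conversion rates from the cohort table above
treated_rate <- 0.10907
control_rate <- 0.08134

# Absolute uplift (difference in rates)
uplift <- treated_rate - control_rate        # ~0.0277, i.e. ~2.77 percentage points

# Relative uplift: treated conversion indexed against control
relative <- treated_rate / control_rate      # ~1.34, as in the charts below

round(uplift, 5)
round(relative, 2)
```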
Incremental conversions:
The disparity between the control and treated cohorts is shown below:
df3 <- tribble(~Variable, ~ Rate, "Control", 1.00, "Treated", 1.34)
df3 %>% ggplot(aes(Variable, Rate)) +
  geom_bar(stat = "identity", fill = "midnightblue") +
  theme_classic() +
  geom_hline(yintercept = 1.00, linetype = "dashed", col = "grey") +
  scale_y_continuous(limits = c(0.00, 1.5), expand = c(0, 0)) +
  xlab("") +
  ylab("Percent") +
  ggtitle("Conversion Rate & Incremental Uplift")
An alternative visualisation uses a waterfall chart, making it clearer that the uplift is 0.34:
categories <- c("Control", "Uplift")
amount <- c(1.00, 0.34)
df4 <- data.frame(categories, amount)
waterfall(df4,
calc_total=TRUE,
total_axis_text = "Treated",
total_rect_text_color="black",
total_rect_color="goldenrod1") +
scale_y_continuous(limits = c(0,1.5)) +
labs(title = "Conversion Rate & Incremental Uplift",
y="Percent",
x="") +
  theme_minimal()
In other words, this is the incremental difference in performance between the two cohorts, accounting for the conversion activity that would have taken place anyway, regardless of marketing activity.
Test for statistical significance
We can test for statistical significance to confirm the findings:
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table(gym_group$Converted, gym_group$Cohort)
## X-squared = 80.1, df = 1, p-value <2e-16
We observe a Pearson chi-squared statistic of \(X^2\)(1) = 80.1, corresponding to a p-value < 0.05. We therefore reject the null hypothesis: there is strong evidence of an association between treatment and conversion.
# Check assumptions, which are met with a reasonable sample size, fewer than 20%
# of cells with an expected count less than 5, and no cell with an expected count less than 1.
result <- chisq.test(table(gym_group$Converted, gym_group$Cohort))
result$expected
##
## Control Treated
## 0 16049.8 38896.2
## 1 1618.2 3921.8
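The same comparison can be made with a two-sample proportion test, which also yields a confidence interval for the uplift. The counts below are illustrative: cohort totals are taken from the expected-count table above, and conversions are back-calculated from the reported rates.

```r
# Illustrative counts (reconstructed, not the report's raw data):
# cohort totals from the expected-count table, conversions from the reported rates
conversions <- c(Treated = round(0.10907 * 42818), Control = round(0.08134 * 17668))
totals      <- c(Treated = 42818, Control = 17668)

pt <- prop.test(conversions, totals)
pt$estimate   # the two conversion rates
pt$conf.int   # 95% CI for the difference (the uplift)
```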
Two-Way ANOVA
The difference in conversion between groups can be tested using a two-way ANOVA, in this case across the Cohort and Member.Value groups:
ano <- gym_group %>% dplyr::select(Converted, Cohort, Member.Value)
two.way <- aov(Converted~Cohort+Member.Value, data = ano)
tidy(two.way)
## # A tibble: 3 × 6
## term df sumsq meansq statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Cohort 1 6.69 6.69 81.9 1.48e- 19
## 2 Member.Value 1 85.7 85.7 1049. 3.71e-228
## 3 Residuals 60483 4940. 0.0817 NA NA
A Tukey HSD post-hoc test confirms that the difference within the groups is statistically significant:
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Converted ~ Cohort + Member.Value, data = ano)
##
## $Cohort
## diff lwr upr p adj
## Treated-Control 0.023126 0.018117 0.028134 0
##
## $Member.Value
## diff lwr upr p adj
## low-High -0.076474 -0.081102 -0.071846 0
Groupwise differences
It might be interesting to find out between which pairs of groups the difference is statistically significant:
tukey.plot <- aov(Converted ~ Cohort:Member.Value, data = ano)
tukey.plot.test <- TukeyHSD(tukey.plot)
plot_with_gg <- as.data.frame(tukey.plot.test[["Cohort:Member.Value"]])
plot_with_gg <- rownames_to_column(plot_with_gg)
plot_with_gg %>% ggplot(aes(diff, rowname, color = rowname)) +
  geom_point() +
  geom_errorbar(aes(xmin = lwr, xmax = upr)) +
  theme_minimal() +
  ylab("") +
  xlab("") +
  scale_x_continuous(limits = c(-0.11, 0.05)) +
  theme(legend.position = "none") +
  ggtitle("Difference in conversions (95% confidence interval)")
As the plot shows, there is no statistically significant difference in conversion between low member value customers in the treated and control cohorts.
Predictive Analytics
It is possible to engage in some predictive analytics using machine learning models. For this case, let's try to predict Converted using Gender, Age, Cohort, Member.Value, Email, and SMS as predictors. Let's create a new dataset containing this information:
# create dataframe for model:
for_modelling <- gym_group %>% dplyr::select(Converted, Gender, Age, Cohort, Member.Value, Email, SMS)
for_modelling %>% head()
## Converted Gender Age Cohort Member.Value Email SMS
## 1 0 MALE 32 Control High 1 1
## 2 0 FEMALE 20 Control low 1 0
## 3 0 FEMALE 17 Control low 0 1
## 4 1 MALE 38 Treated High 0 1
## 5 0 MALE 32 Control low 0 1
## 6 0 FEMALE 22 Treated High 1 1
Data splitting and resampling
In order to train and test the model, we need to split the dataset into training and testing sets:
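The split code itself is not shown in the rendered output. A plausible reconstruction with rsample, using the proportions implied by the printed row counts (a stratified 75/25 initial split, then an 80/20 validation split), is sketched below on a toy stand-in dataset; the seed is illustrative, not the report's.

```r
library(rsample)

# Toy stand-in for the real data (the report's gym_group is not reproduced here)
toy <- data.frame(Converted = factor(rep(c("converted", "not converted"),
                                         c(100, 900))))

set.seed(123)  # illustrative seed
splits       <- initial_split(toy, prop = 0.75, strata = Converted)
gym_training <- training(splits)
gym_testing  <- testing(splits)

# Single stratified validation split (0.8/0.2), matching the printed output
val_set <- validation_split(gym_training, prop = 0.80, strata = Converted)
val_set
```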
Create validation set
A single validation set is created to validate performance:
## # Validation Set Split (0.8/0.2) using stratification
## # A tibble: 1 × 2
## splits id
## <list> <chr>
## 1 <split [36291/9073]> validation
Building the models
Logistic Regression
Specifying the model:
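The specification itself is not shown in the output. A plausible spec, assuming a penalised logistic regression via glmnet with only the penalty tuned (mixture = 1, i.e. pure lasso, is an assumption), would be:

```r
library(tidymodels)

# Sketch of the (unshown) model spec: glmnet logistic regression.
# mixture = 1 (lasso) is an assumption; only the penalty is tuned below.
lr_mod <-
  logistic_reg(penalty = tune(), mixture = 1) %>%
  set_engine("glmnet")
lr_mod
```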
Specifying the recipe for the model:
lr_recipe <-
recipe(Converted~.,data = gym_training) %>%
step_dummy(all_nominal(),-all_outcomes()) %>%
step_zv(all_predictors()) %>%
  step_normalize(all_predictors())
Create the workflow:
Create tuning grid:
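The grid code is not shown, but the printed penalty values are consistent with 30 values evenly spaced on the log10 scale between 1e-4 and 1e-1, so the grid was plausibly:

```r
library(tibble)

# 30 penalty values, log-spaced from 1e-4 to 1e-1; this reproduces the
# values printed below (e.g. 0.000127, 0.000161, ..., 0.00728, 0.00924)
lr_reg_grid <- tibble(penalty = 10^seq(-4, -1, length.out = 30))
head(lr_reg_grid$penalty, 5)
```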
## Selecting by penalty
## # A tibble: 5 × 1
## penalty
## <dbl>
## 1 0.0001
## 2 0.000127
## 3 0.000161
## 4 0.000204
## 5 0.000259
Train and tune the model:
lr_res <- lr_workflow %>% tune_grid(val_set,
grid = lr_reg_grid,
control = control_grid(save_pred = TRUE),
metrics = metric_set(roc_auc))
Visualising the effect of the penalty parameter on the area under the ROC curve:
lr_plot <-
lr_res %>%
collect_metrics() %>%
ggplot(aes(x = penalty, y = mean)) +
geom_point() +
geom_line() +
ylab("Area under the ROC Curve") +
scale_x_log10(labels = scales::label_number())+theme_light()
lr_plot
Identify the top performing model:
## # A tibble: 15 × 7
## penalty .metric .estimator mean n std_err .config
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 0.0001 roc_auc binary 0.644 1 NA Preprocessor1_Model01
## 2 0.000127 roc_auc binary 0.644 1 NA Preprocessor1_Model02
## 3 0.000161 roc_auc binary 0.644 1 NA Preprocessor1_Model03
## 4 0.000204 roc_auc binary 0.644 1 NA Preprocessor1_Model04
## 5 0.000259 roc_auc binary 0.644 1 NA Preprocessor1_Model05
## 6 0.000329 roc_auc binary 0.644 1 NA Preprocessor1_Model06
## 7 0.000418 roc_auc binary 0.644 1 NA Preprocessor1_Model07
## 8 0.000530 roc_auc binary 0.644 1 NA Preprocessor1_Model08
## 9 0.000672 roc_auc binary 0.644 1 NA Preprocessor1_Model09
## 10 0.000853 roc_auc binary 0.644 1 NA Preprocessor1_Model10
## 11 0.00108 roc_auc binary 0.643 1 NA Preprocessor1_Model11
## 12 0.00137 roc_auc binary 0.643 1 NA Preprocessor1_Model12
## 13 0.00174 roc_auc binary 0.643 1 NA Preprocessor1_Model13
## 14 0.00728 roc_auc binary 0.643 1 NA Preprocessor1_Model19
## 15 0.00924 roc_auc binary 0.643 1 NA Preprocessor1_Model20
Select the values of the best model and visualise the ROC curve:
## # A tibble: 1 × 7
## penalty .metric .estimator mean n std_err .config
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 0.00137 roc_auc binary 0.643 1 NA Preprocessor1_Model12
Plot the ROC curve:
lr_auc <-
lr_res %>%
collect_predictions(parameters = lr_best) %>%
roc_curve(Converted, `.pred_not converted`) %>%
mutate(model = "Logistic Regression")
autoplot(lr_auc)
Random Forest Model
Activate parallel computing:
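The setup code is not shown; a common pattern (assumed here, and consistent with the `cores` object used in the spec below) is to detect the available cores and hand them to ranger's internal threading:

```r
# Detect available cores and leave one free for the OS; the "-1" is an
# assumption, not necessarily what the report used
cores <- max(1, parallel::detectCores() - 1)
cores
```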
Specify the model:
rf_mod <-
rand_forest(mtry = tune(), min_n = tune(), trees = 1000) %>%
set_engine("ranger", num.threads = cores) %>%
  set_mode("classification")
Specify the recipe:
Build the workflow:
Train and tune the model:
set.seed(234)
rf_res <-
rf_workflow %>%
tune_grid(val_set,
grid = 25,
control = control_grid(save_pred = TRUE),
metrics = metric_set(roc_auc))
## i Creating pre-processing data to finalize unknown parameter: mtry
Top 5 performing models:
## # A tibble: 5 × 8
## mtry min_n .metric .estimator mean n std_err .config
## <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 1 19 roc_auc binary 0.646 1 NA Preprocessor1_Model19
## 2 1 6 roc_auc binary 0.646 1 NA Preprocessor1_Model18
## 3 1 7 roc_auc binary 0.646 1 NA Preprocessor1_Model05
## 4 2 12 roc_auc binary 0.646 1 NA Preprocessor1_Model20
## 5 2 23 roc_auc binary 0.646 1 NA Preprocessor1_Model10
Effect of parameters on area under the ROC curve:
Select the parameters of the best performing model:
## # A tibble: 1 × 3
## mtry min_n .config
## <int> <int> <chr>
## 1 1 19 Preprocessor1_Model19
## # A tibble: 226,825 × 8
## id `.pred_not converted` .pred_converted .row mtry min_n Converted
## <chr> <dbl> <dbl> <int> <int> <int> <fct>
## 1 validation 0.887 0.113 11 5 28 not conve…
## 2 validation 0.875 0.125 14 5 28 not conve…
## 3 validation 0.843 0.157 22 5 28 not conve…
## 4 validation 0.956 0.0437 23 5 28 not conve…
## 5 validation 0.927 0.0726 24 5 28 converted
## 6 validation 0.967 0.0327 31 5 28 converted
## 7 validation 0.833 0.167 35 5 28 not conve…
## 8 validation 0.808 0.192 38 5 28 not conve…
## 9 validation 0.922 0.0779 39 5 28 not conve…
## 10 validation 0.747 0.253 46 5 28 not conve…
## # ℹ 226,815 more rows
## # ℹ 1 more variable: .config <chr>
Create dataframe for comparison with logistic regression model:
rf_auc <-
rf_res %>%
collect_predictions(parameters = rf_best) %>%
roc_curve(Converted, `.pred_not converted`) %>%
  mutate(model = "Random Forest")
Compare ROC for both models:
bind_rows(rf_auc, lr_auc) %>%
ggplot(aes(x = 1 - specificity, y = sensitivity, col = model)) +
geom_path(lwd = 1.5, alpha = 0.8) +
geom_abline(lty = 3) +
coord_equal() +
  scale_color_viridis_d(option = "plasma", end = .6) + theme_light() + theme(legend.position = "top")
Performance isn't great for either model, so let's go with the random forest for the final fit.
The last fit
For the final fit we repeat the previous steps:
last_rf_mod <-
rand_forest(mtry = 1, min_n = 19, trees = 1000) %>%
set_engine("ranger", num.threads = cores, importance = "impurity") %>%
set_mode("classification")
# the last workflow
last_rf_workflow <-
rf_workflow %>%
update_model(last_rf_mod)
# the last fit
set.seed(345)
last_rf_fit <-
last_rf_workflow %>%
last_fit(splits)
last_rf_fit
## # Resampling results
## # Manual resampling
## # A tibble: 1 × 6
## splits id .metrics .notes .predictions .workflow
## <list> <chr> <list> <list> <list> <list>
## 1 <split [45364/15122]> train/test sp… <tibble> <tibble> <tibble> <workflow>
## # A tibble: 2 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 accuracy binary 0.906 Preprocessor1_Model1
## 2 roc_auc binary 0.646 Preprocessor1_Model1
Variable importance for conversion:
last_rf_fit %>%
pluck(".workflow", 1) %>%
pull_workflow_fit() %>%
  vip(num_features = 5) + theme_light()
## Warning: `pull_workflow_fit()` was deprecated in workflows 0.2.3.
## ℹ Please use `extract_fit_parsnip()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
This suggests that the most important variables determining conversion are Member.Value, Age, and Cohort.
Finally we can visualise the ROC curve for the final fitted model:
last_rf_fit %>%
collect_predictions() %>%
  roc_curve(Converted, `.pred_not converted`) %>%
  autoplot()
Conclusion and Future directions
The report presents some key findings, particularly conversion rates and incremental uplift. It also engages in some predictive modelling using logistic regression and random forest models. Although predictive performance was not strong, the work provides a good foundation for future research. Future analysis may test whether the differences between other variables are statistically significant, for instance within and between the conversion of ex-members receiving SMS or email communications, and/or across the gender variable. As for the predictive models, further model types could be developed and the hyperparameters tuned further; in addition, k-fold cross-validation could be adopted.
Session Info
## R version 4.3.2 (2023-10-31)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Sonoma 14.4.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: Europe/London
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ranger_0.15.1 glmnet_4.1-7 Matrix_1.6-1.1 MASS_7.3-60
## [5] waterfalls_1.0.0 vip_0.3.2 knitr_1.44 patchwork_1.1.3
## [9] yardstick_1.2.0 workflowsets_1.0.1 workflows_1.1.3 tune_1.1.2
## [13] rsample_1.2.0 recipes_1.0.8 parsnip_1.1.1 modeldata_1.2.0
## [17] infer_1.0.5 dials_1.2.0 scales_1.2.1 broom_1.0.5
## [21] tidymodels_1.1.1 lubridate_1.9.3 forcats_1.0.0 stringr_1.5.0
## [25] dplyr_1.1.3 purrr_1.0.2 readr_2.1.4 tidyr_1.3.0
## [29] tibble_3.2.1 ggplot2_3.4.4 tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] gridExtra_2.3 rlang_1.1.1 magrittr_2.0.3
## [4] furrr_0.3.1 compiler_4.3.2 vctrs_0.6.4
## [7] lhs_1.1.6 crayon_1.5.2 pkgconfig_2.0.3
## [10] shape_1.4.6 fastmap_1.1.1 backports_1.4.1
## [13] ellipsis_0.3.2 rmdformats_1.0.4 labeling_0.4.3
## [16] utf8_1.2.3 rmarkdown_2.25 prodlim_2023.08.28
## [19] tzdb_0.4.0 xfun_0.40 cachem_1.0.8
## [22] jsonlite_1.8.7 parallel_4.3.2 R6_2.5.1
## [25] bslib_0.5.1 stringi_1.7.12 parallelly_1.36.0
## [28] rpart_4.1.21 jquerylib_0.1.4 Rcpp_1.0.11
## [31] bookdown_0.36 iterators_1.0.14 future.apply_1.11.0
## [34] pacman_0.5.1 splines_4.3.2 nnet_7.3-19
## [37] timechange_0.2.0 tidyselect_1.2.0 rstudioapi_0.15.0
## [40] yaml_2.3.7 timeDate_4022.108 codetools_0.2-19
## [43] listenv_0.9.0 lattice_0.21-9 withr_2.5.1
## [46] evaluate_0.22 future_1.33.0 survival_3.5-7
## [49] pillar_1.9.0 foreach_1.5.2 generics_0.1.3
## [52] hms_1.1.3 munsell_0.5.0 globals_0.16.2
## [55] class_7.3-22 glue_1.6.2 tools_4.3.2
## [58] data.table_1.14.8 gower_1.0.1 grid_4.3.2
## [61] ipred_0.9-14 colorspace_2.1-0 cli_3.6.1
## [64] DiceDesign_1.9 fansi_1.0.5 viridisLite_0.4.2
## [67] lava_1.7.2.1 gtable_0.3.4 GPfit_1.0-8
## [70] sass_0.4.7 digest_0.6.33 farver_2.1.1
## [73] htmltools_0.5.6.1 lifecycle_1.0.3 hardhat_1.3.0