Customer Lifetime Value - A Supermarket Case Study

Author

Nils Indreiten

Project Objective

The objective of this project is to explore several methods for determining customer lifetime value in the context of a supermarket business. The dataset is from Kaggle and is concerned with supermarket transactions. First, the data is explored in a brief exploratory data analysis. Second, a basic descriptive method for determining customer lifetime value is applied. Third, a probabilistic approach is adopted. Finally, we address the task using machine learning, framing the problem in terms of both classification and regression, and relate the results back to the business context.

Exploratory Data Analysis

Understanding the Data

Let’s take a brief look at the dataset:

Show the code
transactions_df = transactions_df.rename(columns={'Customer ID':'customer_id','Purchase Date':'purchase_date'})

transactions_df.glimpse()
<class 'pandas.core.frame.DataFrame'>: 14059 rows of 6 columns
Transaction:       int64             [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, ...
customer_id:       int64             [7223, 7841, 8374, 9619, 1900, 6696 ...
purchase_date:     datetime64[ns]    [Timestamp('2011-12-18 00:00:00'),  ...
Units Sold:        int64             [5, 5, 3, 4, 4, 3, 4, 6, 1, 2, 3, 5 ...
Revenue:           float64           [2738.0, 1490.0, 552.0, 444.0000000 ...
Product Category:  object            ['Snack Foods', 'Vegetables', 'Snac ...

The dataset contains a handful of columns. Transaction and customer_id are unique identifiers for a transaction and for the customer who made it. purchase_date is a timestamp denoting when the customer transacted. Finally, we have the composition of the transaction: the units sold, the revenue and the product category associated with it. What is the total revenue across the whole time range of the dataset?

Show the code
df = transactions_df.copy()
total_revenue = df['Revenue'].sum()
print(f'Total Revenue: ${total_revenue:,.2f}')
Total Revenue: $18,283,043.00

Next, we break down how this roughly $18.3M is distributed over time at a monthly level:

Show the code
df[['purchase_date','Revenue']] \
.summarize_by_time(
    date_column='purchase_date',
    value_column='Revenue',
    agg_func='sum',
    freq = 'M'
) \
.plot_timeseries('purchase_date','Revenue')

Figure 1. Revenue Over Time.

It looks like there was a peak in sales in March of 2012, after which the sales trend was somewhat steady. Towards the end of 2012 and the beginning of 2013 sales increased drastically, then settled into a steady trend before dropping substantially in December 2013.

A Descriptive Customer Lifetime Value Approach

We can build a very simple customer lifetime value approach using recency, frequency and monetary value, or RFM for short. Recency captures the temporal element of the customer's behaviour, frequency how often they transact, and monetary value typically refers to the average sale value across all of the customer's transactions. Aggregating at the customer ID level, we get the following output for the RFM logic:

Show the code
customer_sales_1_df = df \
.groupby(['customer_id']) \
.agg(
    time_diff = ('purchase_date',lambda x: (x.max()-x.min()).days),
    frequency = ('Transaction','nunique'),
    total_sales = ('Revenue','sum'),
    avg_sales = ('Revenue','mean')
) \
.reset_index()
customer_sales_1_df
customer_id time_diff frequency total_sales avg_sales
0 3 0 1 2930.0 2930.000000
1 8 132 4 6472.0 1618.000000
2 9 40 2 5166.0 2583.000000
3 20 602 3 5927.0 1975.666667
4 24 92 2 870.0 435.000000
... ... ... ... ... ...
5399 10273 167 4 5179.0 1294.750000
5400 10274 0 1 1152.0 1152.000000
5401 10277 31 2 2224.0 1112.000000
5402 10278 0 1 598.0 598.000000
5403 10280 134 4 5641.0 1410.250000

5404 rows × 5 columns

Here time_diff denotes the number of days between the customer's oldest and most recent transactions, frequency the number of transactions, and total_sales and avg_sales the total amount the customer has spent and the average they spend per transaction, respectively. The business has provided us with some constraints: a 15% profit margin, an average customer lifetime of 5 years, and a definition of churn as a customer who has only transacted once. With this in mind we can calculate the customer lifetime value using the following formula:

CLV = (average sales * average purchase frequency) / (max(tenure) / 365) * customer lifetime * profit margin

Show the code
profit_margin = 0.15
customer_lifetime = 5 # years

summary_stats['customer_lifetime_value'] = (summary_stats['average_sales'] * summary_stats['average_purchase_freq'] / (summary_stats['max_days'] / 365) * customer_lifetime) * profit_margin

summary_stats = pd.DataFrame([summary_stats])
summary_stats
average_sales average_purchase_freq churn_rate max_days customer_lifetime_value
0 1278.744766 2.601591 0.42228 734 1240.740697

The output suggests that we can expect to make about $1,240 in profit across the lifetime of the average customer (5 years, as specified by the business constraints). We divided by the tenure in years (max_days / 365) to get a per-year figure and then multiplied by the lifetime and profit margin to estimate the profit we expect across that average lifetime.
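
The construction of summary_stats is not shown above; the following is a minimal sketch of how it could be built from customer_sales_1_df (the column names and churn definition here are assumptions based on the output table), together with the arithmetic behind the CLV figure:

# Sketch only: assumed construction of the aggregate summary_stats dictionary
summary_stats = {
    'average_sales': customer_sales_1_df['avg_sales'].mean(),          # ~1278.74
    'average_purchase_freq': customer_sales_1_df['frequency'].mean(),  # ~2.60
    'churn_rate': (customer_sales_1_df['frequency'] == 1).mean(),      # ~0.42
    'max_days': customer_sales_1_df['time_diff'].max()                 # 734
}

# Worked arithmetic for the CLV formula above:
# 1278.74 * 2.6016 / (734 / 365) * 5 * 0.15 ≈ 1240.74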

We can expand on this method to get a more nuanced breakdown by splitting the customer base into cohorts. We do this by the period in which the customer started transacting at the supermarket, using the month-year combination. The table below applies the RFM methodology and provides a customer lifetime value at the cohort level:

Show the code
profit_margin = 0.15
customer_lifetime = 5 

df['Min_Purchase_Date'] = df\
    .groupby('customer_id')['purchase_date']\
    .transform('min')

df['Min_Purchase_Date'] = df['Min_Purchase_Date'].dt.to_period('M')

df = df.rename(columns={'Min_Purchase_Date':'start_month'})

cohort_df = df\
    .groupby(['start_month','customer_id'])\
    .agg(
        total_sales=('Revenue','sum'),
        time_days = ('purchase_date', lambda x: (x.max() - x.min()).days),
        frequency = ('Transaction','nunique'),
        avg_sales = ('Revenue','mean')
    )\
    .reset_index()

# Summarise the cohort:
summary_2_df = cohort_df \
.groupby('start_month') \
.agg(
    cohort_size = ('customer_id','nunique'),
    average_sales = ('avg_sales','mean'),
    average_purchase_freq = ('frequency','mean'),
    churn_rate = ('frequency', lambda x: 1 - (x > 1).sum() / len(x)),
    max_days = ('time_days','max')
) \
.reset_index()
summary_2_df['clv_lifetime_value'] = (summary_2_df['average_sales'] * summary_2_df['average_purchase_freq'] * customer_lifetime / (summary_2_df['max_days'] / 365)) * profit_margin

summary_2_df.round(2)
start_month cohort_size average_sales average_purchase_freq churn_rate max_days clv_lifetime_value
0 2011-12 34 1205.79 2.76 0.26 734 1243.31
1 2012-01 226 1315.77 4.29 0.17 715 2162.18
2 2012-02 267 1277.05 3.43 0.21 688 1745.15
3 2012-03 339 1291.23 3.55 0.27 664 1890.67
4 2012-04 292 1266.72 3.11 0.29 620 1737.27
5 2012-05 240 1238.58 2.81 0.32 602 1584.07
6 2012-06 205 1320.91 2.81 0.33 554 1833.95
7 2012-07 180 1259.57 2.74 0.33 522 1809.18
8 2012-08 171 1277.13 2.47 0.37 491 1757.22
9 2012-09 167 1305.62 2.15 0.41 459 1673.92
10 2012-10 179 1285.86 2.16 0.42 425 1790.67
11 2012-11 157 1270.72 1.92 0.47 413 1614.81
12 2012-12 221 1250.61 2.68 0.37 372 2469.42
13 2013-01 348 1297.40 3.92 0.30 352 3954.77
14 2013-02 378 1321.35 3.22 0.32 325 3589.24
15 2013-03 285 1281.85 2.75 0.36 289 3344.39
16 2013-04 294 1285.82 2.13 0.47 259 2898.38
17 2013-05 259 1270.59 2.05 0.50 233 3066.30
18 2013-06 218 1216.60 1.65 0.64 195 2812.58
19 2013-07 213 1320.94 1.54 0.63 159 3512.82
20 2013-08 168 1363.67 1.46 0.66 126 4320.65
21 2013-09 183 1160.15 1.36 0.74 109 3964.53
22 2013-10 158 1244.49 1.18 0.83 51 7906.05
23 2013-11 170 1297.01 1.21 0.82 53 8078.43
24 2013-12 52 1108.22 1.15 0.87 26 13463.41

The output above emphasizes the idea that customers who started transacting at the same time form groups with similar habits. In addition, different cohorts have different churn rates and lifetime values. For instance, consider the bottom three cohorts, who started transacting in 2013-10, 2013-11 or 2013-12. They all have a substantially higher customer lifetime value relative to the average; however, all three also have very high churn rates (80%+). From a business point of view this suggests these are customers we want to keep engaged so that they continue generating value, especially given their high likelihood of churning. Why might this be the case? The figure below visualizes the customer lifetime value over the maximum tenure of each cohort.

Show the code
sns.lineplot(data=summary_2_df,x="max_days",y="clv_lifetime_value")
plt.show()

Figure 2. Customer Lifetime Value Over Tenure.

Overall it seems that customers with a shorter tenure (at the point of this data extraction) have a higher customer lifetime value. This likely indicates that a higher proportion of customers transact for large amounts in the early stages of their tenure. The trend then decreases very rapidly, peaks again somewhere between 300-400 days, and continues to decrease after that.

Probabilistic Models

We can build upon the descriptive approach using probabilistic models. By employing probabilistic methods, we can predict future customer lifetime value over a specific horizon. We pass the transaction data to the summary_data_from_transaction_data() function from the lifetimes library, which summarizes the dataset at the customer ID level using RFM principles:

Show the code
summary_3_df = lf.utils.summary_data_from_transaction_data(
    df,
    customer_id_col    = 'customer_id',
    datetime_col       = 'purchase_date',
    monetary_value_col = 'Revenue',
)
summary_3_df
frequency recency T monetary_value
customer_id
3 0.0 0.0 274.0 0.000000
8 3.0 132.0 222.0 1211.666667
9 1.0 40.0 91.0 944.000000
20 2.0 602.0 733.0 2331.500000
24 1.0 92.0 140.0 372.000000
... ... ... ... ...
10273 3.0 167.0 390.0 1364.333333
10274 0.0 0.0 330.0 0.000000
10277 1.0 31.0 73.0 1109.000000
10278 0.0 0.0 253.0 0.000000
10280 3.0 134.0 192.0 1500.333333

5404 rows × 4 columns

The output should be familiar as it follows a similar logic to the previous section. In this case T refers to the customer's tenure (the time between their first transaction and the end of the observation period), recency to the time between their first and most recent transactions, and monetary_value to the average value of their repeat transactions (the first purchase is excluded, which is why it differs from avg_sales in the previous section).
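
A quick sanity check of the repeat-purchase logic (a sketch only; customer 8 is simply an example row from the table above):

# Sketch: recompute monetary_value manually for one customer.
# lifetimes aggregates transactions per day (freq='D' by default),
# then averages the revenue of all days except the first.
cust = df[df['customer_id'] == 8]
daily_revenue = cust.groupby(cust['purchase_date'].dt.date)['Revenue'].sum().sort_index()
print(daily_revenue.iloc[1:].mean())  # should be close to the 1211.67 shown above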

The first model we will build is a BG/NBD model, which predicts a customer's repeat purchases. It should be noted that it only models future transaction counts and the probability that a customer is still active; it says nothing about transaction value.

Show the code
# We need to remove zeros:
summary_3_df = summary_3_df[summary_3_df['monetary_value'] > 0]

bfg = lf.BetaGeoFitter(penalizer_coef=0.15)
bfg.fit(summary_3_df['frequency'], summary_3_df['recency'], summary_3_df['T'])
bfg.summary
coef se(coef) lower 95% bound upper 95% bound
r 0.895053 0.018981 0.857850 0.932256
alpha 87.940675 3.152234 81.762297 94.119053
a 0.105262 0.008900 0.087818 0.122706
b 0.300551 0.020964 0.259461 0.341641

We can further analyse the outputs from the model taking into account historical frequency and recency:

Show the code
summary_3_df['probability_alive'] = bfg.conditional_probability_alive(summary_3_df['frequency'], summary_3_df['recency'], summary_3_df['T'])

plot_probability_alive_matrix(bfg,cmap='viridis')
plt.show()

Figure 3. Is the Customer Alive?

For the most part, as a customer's recency (the time between their first and most recent transactions) increases and they have bought multiple times, the probability of the customer being alive is high. This holds except for roughly 0-2.5 historical transactions, where the probability of being alive falls to nearly 50%. Similarly, if a customer has both a low historical frequency and a low recency, the probability of being alive is also high.

In contrast, if we consider a customer with a historical frequency of 5 and a recency of 200 days (the difference between their first and most recent transactions), the probability of that customer being alive is very low.
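
To make that concrete, a small sketch (the value of T is an assumed illustration, not something taken from the dataset):

# Sketch: probability that a hypothetical customer is still "alive",
# with 5 repeat purchases, a recency of 200 days and an assumed
# observation period (T) of 730 days.
p_alive = bfg.conditional_probability_alive(frequency=5, recency=200, T=730)
print(f'P(alive) = {p_alive:.2f}')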

Next, we can use a gamma-gamma model to predict the average transaction value by customer.

Show the code
ggf = lf.GammaGammaFitter(penalizer_coef=0.1)
ggf.fit(summary_3_df['frequency'], summary_3_df['monetary_value'])

summary_3_df['predicted_avg_sales'] = ggf.conditional_expected_average_profit(summary_3_df['frequency'], summary_3_df['monetary_value'])

summary_3_df['predicted_clv_3mo'] = ggf.customer_lifetime_value(
bfg,
summary_3_df['frequency'],
summary_3_df['recency'],
summary_3_df['T'],
summary_3_df['monetary_value'],
time=3,
freq='D',
discount_rate=0.01)

summary_3_df['predicted_clv_3mo'] = summary_3_df['predicted_clv_3mo']*profit_margin
summary_3_df.round(2)
frequency recency T monetary_value probability_alive predicted_avg_sales predicted_clv_3mo
customer_id
8 3.0 132.0 222.0 1211.67 0.85 1735.12 241.07
9 1.0 40.0 91.0 944.00 0.60 9900.54 799.81
20 2.0 602.0 733.0 2331.50 0.88 4257.53 173.70
24 1.0 92.0 140.0 372.00 0.65 3906.76 267.97
26 2.0 141.0 451.0 1350.50 0.51 2466.45 88.07
... ... ... ... ... ... ... ...
10266 1.0 221.0 367.0 488.00 0.58 5122.28 160.17
10271 2.0 231.0 370.0 789.00 0.81 1441.29 96.50
10273 3.0 167.0 390.0 1364.33 0.65 1953.69 136.01
10277 1.0 31.0 73.0 1109.00 0.62 11629.51 1066.33
10280 3.0 134.0 192.0 1500.33 0.90 2148.40 347.96

3115 rows × 7 columns

By setting a 3 month horizon we are able to predict each customer's lifetime value over the next three months. This calculation also accounts for churn, so the prediction should not be extrapolated further into the future; beyond the horizon the value may well be lower, given the business constraints applied. Furthermore, this gives us an indication of how much we can afford to spend per customer and the potential impact on profitability.
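
As a complementary view (a sketch using the same fitted models, not part of the original output), the BG/NBD model can also estimate how many repeat purchases each customer is expected to make over the same 90 day window:

# Sketch: expected number of repeat purchases over the next 90 days,
# using the already-fitted BG/NBD model (bfg).
summary_3_df['predicted_purchases_90d'] = bfg.conditional_expected_number_of_purchases_up_to_time(
    90,
    summary_3_df['frequency'],
    summary_3_df['recency'],
    summary_3_df['T']
)
summary_3_df[['predicted_purchases_90d', 'predicted_clv_3mo']].head()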

Machine Learning

Whilst the probabilistic methods are useful, they don't provide a way to understand the relationships between the variables. Machine learning methods allow us to predict and also to explore how the variables drive the outcomes. In this context we can frame the problem in two main ways: as either a classification or a regression problem.

First, let's partition the data into training and test sets for our models. We do so by specifying a 90 day cut-off: the last 90 days of the dataset become the test set, assigned to temporal_out_df:

Show the code
n_days = 90
max_date = df['purchase_date'].max()
cut_off = max_date - pd.to_timedelta(n_days, unit='D')


temporal_in_df = df[df['purchase_date'] < cut_off]

temporal_out_df = df[df['purchase_date'] >= cut_off]\
.query('customer_id in @temporal_in_df.customer_id')

temporal_out_df
Transaction customer_id purchase_date Units Sold Revenue Product Category start_month
12114 12115 4467 2013-10-02 3 3167.0 Kitchen Products 2012-04
12115 12116 7185 2013-10-02 2 1161.0 Snack Foods 2013-05
12116 12117 9352 2013-10-02 4 744.0 Cold Remedies 2013-03
12117 12118 5588 2013-10-02 4 1991.0 Vegetables 2013-05
12118 12119 2439 2013-10-02 4 740.0 Electrical 2013-01
... ... ... ... ... ... ... ...
14052 14053 7234 2013-12-28 5 680.0 Side Dishes 2012-02
14053 14054 2032 2013-12-29 5 1450.0 Paper Products 2011-12
14054 14055 9102 2013-12-29 3 964.0 Baking Goods 2012-06
14057 14058 6153 2013-12-31 2 400.0 Dairy 2012-04
14058 14059 3656 2013-12-31 5 2553.0 Electrical 2013-04

1498 rows × 7 columns

Next, we create the target variables: the total 90 day sales value by customer ID and a flag for whether they purchased in the 90 day period. We also create the recency, frequency and monetary features, as well as some additional features: counts of transactions in the last month and last two weeks, and the total sales value over the last two weeks:

Show the code
targets_df = temporal_out_df[['customer_id','purchase_date','Revenue']]\
.groupby('customer_id')\
.sum('Revenue')\
.rename({'Revenue':'sales_90_value'},axis=1)\
.assign(sales_90_flag=1)

# Create independent variables in the 
# temporal in data:

# Recency First:
max_date = temporal_in_df['purchase_date'].max()

recency_features_df = temporal_in_df \
[['customer_id','purchase_date']] \
.groupby('customer_id') \
.apply(
    lambda x: int((x['purchase_date'].max() - max_date) / pd.to_timedelta(1, "D"))
) \
.to_frame() \
.set_axis(["recency"], axis=1)

# Frequency:

frequency_features_df = temporal_in_df \
[['customer_id','purchase_date']] \
.groupby('customer_id') \
.count() \
.set_axis(['frequency'], axis=1)

# Moneytary:

moneytary_features = temporal_in_df \
.groupby('customer_id') \
.aggregate(
    {
        'Revenue':["sum","mean"]
    }
) \
.set_axis(['Revenue_sum','Revenue_mean'],axis=1)

# Getting creative with other features:

# Transacted within last month

cutoff_28d = cut_off - pd.to_timedelta(28, unit='D')

transactions_last_month_df = temporal_in_df[['customer_id','purchase_date']] \
.drop_duplicates() \
.query('purchase_date >= @cutoff_28d') \
.groupby('customer_id') \
.size() \
.to_frame() \
.set_axis(['transactions_last_month'],axis=1)

# Last 2 weeks transaction

cutoff_14d = cut_off - pd.to_timedelta(14, unit='D')

transactions_last_2_weeks_df = temporal_in_df[['customer_id','purchase_date']] \
.drop_duplicates() \
.query('purchase_date >= @cutoff_14d') \
.groupby('customer_id') \
.size() \
.to_frame() \
.set_axis(['transactions_last_2_weeks'],axis=1)

# Spend in the last 2 weeks

sales_last_2_weeks = temporal_in_df[['customer_id','purchase_date','Revenue']] \
.drop_duplicates() \
.query('purchase_date >= @cutoff_14d') \
.groupby('customer_id') \
.sum('Revenue') \
.set_axis(['sales_last_2_weeks'],axis=1)

# Bringing everything together

features_df = pd.concat(
    [   recency_features_df,
        frequency_features_df,
        moneytary_features,
        transactions_last_month_df,
        transactions_last_2_weeks_df,
        sales_last_2_weeks
    ],
    axis=1
)\
.merge(
    targets_df,
    left_index=True,
    right_index=True,
    how='left'
)\
.fillna(0)

features_df
recency frequency Revenue_sum Revenue_mean transactions_last_month transactions_last_2_weeks sales_last_2_weeks sales_90_value sales_90_flag
customer_id
3 -183 1 2930.0 2930.000000 0.0 0.0 0.0 0.0 0.0
8 -54 3 4614.0 1538.000000 0.0 0.0 0.0 1858.0 1.0
9 0 1 4222.0 4222.000000 1.0 1.0 4222.0 944.0 1.0
20 -40 3 5927.0 1975.666667 0.0 0.0 0.0 0.0 0.0
24 -49 1 498.0 498.000000 0.0 0.0 0.0 372.0 1.0
... ... ... ... ... ... ... ... ... ...
10271 -48 3 2779.0 926.333333 0.0 0.0 0.0 0.0 0.0
10273 -132 4 5179.0 1294.750000 0.0 0.0 0.0 0.0 0.0
10274 -239 1 1152.0 1152.000000 0.0 0.0 0.0 0.0 0.0
10278 -162 1 598.0 598.000000 0.0 0.0 0.0 0.0 0.0
10280 -32 3 2693.0 897.666667 0.0 0.0 0.0 2948.0 1.0

5028 rows × 9 columns

This data frame forms the foundation for our machine learning models. First, we consider a regression problem.

Predicting the 90 Day Customer Sales Value - A Regression Problem

Now that we have our data frame and a continuous target variable (sales_90_value), we can create a regression model to try to predict the 90 day value of a customer. Using the PyCaret library we can easily set up a model specification and get the outputs from a 10-fold cross-validated model:

Show the code
reg_setup = reg.setup(
data=features_df.drop('sales_90_flag',axis=1),
target= 'sales_90_value',
train_size=0.8,
normalize=True,
session_id=123,
verbose=True,
log_experiment=False
)

xgb_reg_model = reg.create_model('xgboost',verbose=True,round=2)

reg_predictions_df = reg.predict_model(xgb_reg_model, data = features_df,round=2) \
    .sort_values('prediction_label', ascending = False)
  Description Value
0 Session id 123
1 Target sales_90_value
2 Target type Regression
3 Original data shape (5028, 8)
4 Transformed data shape (5028, 8)
5 Transformed train set shape (4022, 8)
6 Transformed test set shape (1006, 8)
7 Numeric features 7
8 Preprocess True
9 Imputation type simple
10 Numeric imputation mean
11 Categorical imputation mode
12 Normalize True
13 Normalize method zscore
14 Fold Generator KFold
15 Fold Number 10
16 CPU Jobs -1
17 Use GPU False
18 Log Experiment False
19 Experiment Name reg-default-name
20 USI f057
  MAE MSE RMSE R2 RMSLE MAPE
Fold            
0 545.76 803952.75 896.63 -0.28 4.76 0.97
1 641.73 1340662.50 1157.87 -0.19 4.82 0.94
2 632.47 1174455.25 1083.72 -0.10 4.89 1.05
3 629.62 1489372.38 1220.40 -0.12 4.96 0.85
4 587.92 1069070.25 1033.96 -0.05 4.89 0.83
5 573.85 1196977.38 1094.06 -0.12 4.66 0.77
6 545.08 1018096.81 1009.01 -0.05 4.86 0.75
7 600.80 983871.62 991.90 -0.11 4.91 0.83
8 642.33 1251685.75 1118.79 -0.08 4.97 0.80
9 549.94 912833.44 955.42 -0.30 4.89 0.78
Mean 594.95 1124097.81 1056.18 -0.14 4.86 0.86
Std 38.09 196342.78 92.67 0.08 0.09 0.09
  Model MAE MSE RMSE R2 RMSLE MAPE
0 Extreme Gradient Boosting 325.85 366740.81 605.59 0.63 4.37 0.48

Scored against the full feature set, the xgboost model achieves an MAE of $325.85 and an RMSE of $605.59, although the 10-fold cross-validation metrics above are considerably weaker (mean MAE of $594.95 and a negative mean R²).
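
One thing worth checking at this point (a quick sketch; it anticipates a point raised again in the conclusions) is how many customers receive a negative predicted 90 day value, since negative spend has limited business sense:

# Sketch: count customers with a negative predicted 90-day sales value
negative_preds = reg_predictions_df.query('prediction_label < 0')
print(f'{len(negative_preds)} customers with a negative predicted 90 day value')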

Predicting Whether a Customer Will Purchase in the Next 90 Days - a Classification Problem

Next let’s specify an xgboost classification model using the same library:

Show the code
clf_setup = clf.setup(
    data=features_df.drop('sales_90_value',axis=1),
    target='sales_90_flag',
    train_size=0.8,
    session_id=123,
    verbose=True,
    log_experiment=False,
)

xgb_clf_model = clf.create_model('xgboost',round=2)

clf_predictions_df = clf.predict_model(xgb_clf_model, data = features_df, raw_score=True,round=2) \
.sort_values('prediction_score_1', ascending = False)

# clf_setup.plot_model(xgb_clf_model, plot='auc')
  Description Value
0 Session id 123
1 Target sales_90_flag
2 Target type Binary
3 Original data shape (5028, 8)
4 Transformed data shape (5028, 8)
5 Transformed train set shape (4022, 8)
6 Transformed test set shape (1006, 8)
7 Numeric features 7
8 Preprocess True
9 Imputation type simple
10 Numeric imputation mean
11 Categorical imputation mode
12 Fold Generator StratifiedKFold
13 Fold Number 10
14 CPU Jobs -1
15 Use GPU False
16 Log Experiment False
17 Experiment Name clf-default-name
18 USI 78de
  Accuracy AUC Recall Prec. F1 Kappa MCC
Fold              
0 0.77 0.58 0.15 0.39 0.22 0.12 0.14
1 0.78 0.55 0.15 0.41 0.22 0.12 0.14
2 0.81 0.69 0.27 0.57 0.37 0.27 0.30
3 0.79 0.58 0.19 0.52 0.28 0.19 0.22
4 0.74 0.59 0.14 0.26 0.18 0.04 0.05
5 0.79 0.62 0.26 0.48 0.34 0.22 0.24
6 0.78 0.63 0.19 0.46 0.27 0.17 0.19
7 0.77 0.60 0.20 0.40 0.27 0.15 0.16
8 0.79 0.58 0.17 0.47 0.25 0.15 0.18
9 0.78 0.62 0.15 0.43 0.23 0.13 0.16
Mean 0.78 0.60 0.19 0.44 0.26 0.16 0.18
Std 0.02 0.04 0.04 0.08 0.05 0.06 0.06
  Model Accuracy AUC Recall Prec. F1 Kappa MCC
0 Extreme Gradient Boosting 0.87 0.89 0.46 0.86 0.60 0.53 0.57

The 10-fold cross-validated model achieves a mean accuracy of 0.78; scored against the full feature set it reaches 0.87 accuracy with an AUC of 0.89, although recall on the positive class remains modest.
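
Given the relatively low recall, a confusion matrix is a useful next diagnostic (a sketch using PyCaret's plotting API; not part of the original output):

# Sketch: inspect where the classifier's errors fall
clf.plot_model(xgb_clf_model, plot='confusion_matrix')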

Interpreting Machine Learning Outputs

It would be interesting to understand the drivers of the respective models, i.e. what drives the total 90 day customer spend of the regression model and equally what drives whether a customer will purchase within that 90 day window. First, we consider the regression model:

Show the code
reg.interpret_model(xgb_reg_model)

Figure 4. Regression Model SHAP Values.

The most important feature influencing the 90 day sales value is how frequently the customer has transacted: the more often a customer has shopped, the higher the 90 day sales value. The next most important feature is the total amount spent prior to the 90 day period, followed by the average transaction revenue, although the direction of the effect is less clear for the average revenue variable.

Next, let's consider the classification model:

Show the code
clf.interpret_model(xgb_clf_model)

Figure 5. Classification Model SHAP Values.

In contrast to the regression model, the most important feature for the classification model is the customer's recency, with higher values leading to a higher probability of transacting within the next 90 days. As in the regression example, total revenue and average revenue are the next most important variables. The most important feature from the regression model (frequency) is now only the fourth most important, following a similar pattern in terms of its effect on the dependent variable.

Relating the Findings Back to the Business Context

Let’s try to summarize the findings up until now. How much total revenue are we predicting?

Show the code
total_pred_rev = reg_predictions_df \
['prediction_label'] \
.sum()
print(f'Total Predicted Revenue: ${total_pred_rev:,.2f}')
Total Predicted Revenue: $1,898,346.75

For the next 90 days we are predicting revenue of roughly $1.9M. Next, we take the 20 customers predicted to generate the most revenue over the prediction period and explore their basket composition. If we want to increase the purchase frequency of those customers, we might consider the following:

Show the code
top_20_custs = reg_predictions_df.head(20).index.to_list()

transactions_df \
.query('customer_id in @top_20_custs') \
.groupby('Product Category') \
.size() \
.to_frame() \
.set_axis(['count'], axis = 1) \
.sort_values('count', ascending = False) \
.head(10)
count
Product Category
Vegetables 21
Dairy 16
Snack Foods 16
Fruit 14
Breakfast Foods 12
Baking Goods 12
Meat 10
Pain Relievers 9
Electrical 9
Specialty 7

By overlaying the product information we can see which product categories should be marketed to those customers in order to increase purchase frequency; the top five categories are vegetables, dairy, snack foods, fruit and breakfast foods. Similarly, if our objective is to increase the size of these customers' transactions, we might consider the following breakdown:

Show the code
transactions_df \
    [['customer_id', 'Product Category', 'Revenue']] \
    .query('customer_id in @top_20_custs') \
    .groupby('Product Category') \
    [['Revenue']] \
    .sum() \
    .sort_values('Revenue', ascending = False) \
    .head(10)
Revenue
Product Category
Vegetables 33129.0
Fruit 25198.0
Snack Foods 24952.0
Dairy 23978.0
Electrical 17432.0
Pain Relievers 16664.0
Meat 16032.0
Kitchen Products 15359.0
Baking Goods 14290.0
Candy 12817.0

This shows which product categories generate the most revenue among these top customers. With this information, depending on the business objective and how the 90 day horizon is framed, we have a solid foundation for justifying the respective marketing strategies.

Finally, let’s consider the impact of a 10% increase in revenue over a year:

Show the code
total_pred_rev = reg_predictions_df \
['prediction_label'] \
.sum()
(total_pred_rev*0.10*4).round(2)
759338.7

Treating the 90 day horizon as roughly one quarter, a 10% uplift on the predicted quarterly revenue would translate into approximately $759,339 of additional revenue per year.
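
The same arithmetic extends naturally to other uplift assumptions (a small sketch; the uplift rates are purely hypothetical):

# Sketch: annualized incremental revenue for a few hypothetical uplift rates,
# treating the 90-day prediction window as one quarter (hence x4)
for uplift in (0.05, 0.10, 0.15):
    print(f'{uplift:.0%} uplift -> ${total_pred_rev * uplift * 4:,.2f} per year')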

Conclusions and Future Research Directions

This project explored a time series dataset of supermarket transactions between December 2011 and December 2013. The data exists at the customer transaction level and is thus well suited to customer lifetime value analysis. Three main approaches to calculating customer lifetime value were explored: a descriptive approach, a probabilistic approach and a machine learning approach. Underpinning all of these methods are the RFM (recency, frequency, monetary) principles.

The machine learning approach was expanded upon by interpreting the model outputs, enabling a better understanding of the drivers of the 90 day sales revenue and of the probability of purchasing within 90 days. The importance of the variables varied depending on the framing of the problem (classification or regression): for predicting the 90 day sales revenue the most important variable was frequency, whereas for predicting whether the customer would purchase within the next 90 days it was recency. Focusing on the top 20 customers predicted to generate the most revenue, we identified which product categories to focus on, depending on whether the aim is to increase purchase frequency or basket value.

Future iterations of this analysis may wish to expand the dataset in two notable respects. The first is the time range: analysing a longer time range would make the findings in this project more robust. The second is to increase the granularity of the product information to the SKU level. This would make the analysis far more valuable to the business, enabling marketers to identify which specific products, rather than product categories, are best to engage with to achieve certain outcomes. Furthermore, some of the regression model's predictions indicate that some customers will generate negative revenue (less than 0). Further analysis is required to understand what causes the model to predict these values and to calibrate it so that it does not predict negative values, which have limited business sense. Finally, the analysis could be made more robust by adding other types of information, whether customer demographics or promotional activity over the time range of the data.

Nevertheless, this analysis provides some useful and promising insights into the lifetime value of the supermarket's customers.