E-commerce Data Analysis Part Three - RFM Customer Segmentation

Introduction:

In Part Three of the E-Commerce data series, we will create customer segments based on the RFM method and the Jenks natural breaks algorithm. Before jumping into the analytical part, we will follow the same data-cleaning steps as in the Part One article. In this part on RFM customer segmentation, an RFM score will be assigned to each customer.

We will use this RFM score to decide which segment each customer belongs to:

  • Engaged
  • Not-Engaged
  • Best

For RFM clustering, instead of using k-means, we will use the Jenks natural breaks algorithm. For the purposes of this article, I will use the jenkspy package from Matthieu Viry.

Table of Contents:

  • Data Wrangling
  • RFM Method
    • Recency (When was the customer’s most recent purchase?)
    • Frequency (How many times has the customer purchased online from the platform?)
    • Monetary (How much did the customer spend online during this time period?)
  • Calculate RFM Scores
  • Data Visualization - 2D Scatter Plot

Data Wrangling - Exploratory Data Analysis

# Setting up the environment
import pandas as pd
import numpy as np
import datetime as dt
import matplotlib.pyplot as plt
from operator import attrgetter
import seaborn as sns
import matplotlib.colors as mcolors
# Import Data
data = pd.read_csv("OnlineRetail.csv",engine="python")

# Change InvoiceDate data type
data['InvoiceDate'] = pd.to_datetime(data['InvoiceDate'])
# Normalize InvoiceDate to midnight, keeping only the date part.
data['date']=data['InvoiceDate'].dt.normalize()

# Summarize the null values in the dataframe.
print(data.isnull().sum())
# Missing values percentage
missing_percentage = data.isnull().sum() / data.shape[0] * 100

# How often is a record without a description also missing CustomerID or UnitPrice?
data[data.Description.isnull()].CustomerID.isnull().value_counts()
data[data.Description.isnull()].UnitPrice.isnull().value_counts()
InvoiceNo           0
StockCode           0
Description      1454
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     135080
Country             0
date                0
dtype: int64

False    1454
Name: UnitPrice, dtype: int64

It’s better to change the $InvoiceDate$ column into the $datetime64[ns]$ type for future analysis of monthly revenue growth. In this project, we will ignore the hours and minutes within $InvoiceDate$; only the day, month, and year will be stored in the dataframe.

Around 25% of the rows in the $CustomerID$ column are missing values, which raises a question mark over the implementation of the digital tracking system for any business. We would normally look into this in more detail, but in this case we have no visibility into those records. Let’s continue to explore the missing values in the $Description$ and $UnitPrice$ columns.
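As a quick check of where the 25% figure comes from, we can print the missing_percentage series computed in the setup code above:

# Missing-value percentage per column, highest first
print(missing_percentage.sort_values(ascending=False).round(2))
# CustomerID is around 25% missing; Description around 0.3%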

data.loc[data.CustomerID.isnull(), ["UnitPrice", "Quantity"]].describe()
UnitPrice Quantity
count 135080.000000 135080.000000
mean 8.076577 1.995573
std 151.900816 66.696153
min -11062.060000 -9600.000000
25% 1.630000 1.000000
50% 3.290000 1.000000
75% 5.450000 3.000000
max 17836.460000 5568.000000

It is fair to say the ‘UnitPrice’ and ‘Quantity’ columns show extreme outliers. Since we need historical purchase prices and quantities to make sales forecasts later, these outliers can be disruptive, so let us drop the null values. Besides dropping all the missing values in the Description column, what about hidden missing values, like the string ‘nan’ instead of a true NaN? In the next steps we will drop any rows containing ‘nan’ or empty (‘’) strings.

Data Wrangling - Data Cleaning

# Can we find any hidden null values ("nan" strings) in the Description column?
data.loc[data.Description.isnull()==False, "lowercase_descriptions"] =  data.loc[
    data.Description.isnull()==False,"Description"].apply(lambda l: l.lower())

data.lowercase_descriptions.dropna().apply(
    lambda l: np.where("nan" in l, True, False)).value_counts()

# How about "" in Description?
data.lowercase_descriptions.dropna().apply(
    lambda l: np.where("" == l, True, False)).value_counts()
False    540455
Name: lowercase_descriptions, dtype: int64
# Replace descriptions containing "nan" with None (a true NaN)
data.loc[data.lowercase_descriptions.isnull()==False, "lowercase_descriptions"] = data.loc[
    data.lowercase_descriptions.isnull()==False, "lowercase_descriptions"
].apply(lambda l: np.where("nan" in l, None, l))

# Verify all 'nan' strings were converted to NaN
data.lowercase_descriptions.dropna().apply(lambda l: np.where("nan" in l, True, False)).value_counts()
False    539724
Name: lowercase_descriptions, dtype: int64
# Preview the dataframe after filtering out null values and hidden null values
data.loc[(data.CustomerID.isnull()==False)&(data.lowercase_descriptions.isnull()==False)].info()

# data without any null values
data_clean = data.loc[(data.CustomerID.isnull()==False)&(data.lowercase_descriptions.isnull()==False)].copy()
data_clean.isnull().sum().sum()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 406223 entries, 0 to 541908
Data columns (total 10 columns):
InvoiceNo                 406223 non-null object
StockCode                 406223 non-null object
Description               406223 non-null object
Quantity                  406223 non-null int64
InvoiceDate               406223 non-null datetime64[ns]
UnitPrice                 406223 non-null float64
CustomerID                406223 non-null float64
Country                   406223 non-null object
date                      406223 non-null datetime64[ns]
lowercase_descriptions    406223 non-null object
dtypes: datetime64[ns](2), float64(2), int64(1), object(5)
memory usage: 34.1+ MB

0
# Keep only invoices dated from 2010-12-01 to 2011-11-30.
data_clean = data_clean[data_clean.InvoiceDate.between('12-01-2010', '11-30-2011')]

RFM Method - Recency

import jenkspy
from scipy import stats
from scipy.stats import norm, skew

# Recency
rfm_data = data_clean.copy()
rfm_user = pd.DataFrame(data_clean['CustomerID'].unique())
rfm_user.columns = ['CustomerID']

recency_max_date = data_clean.groupby('CustomerID').InvoiceDate.max().reset_index()
recency_max_date.columns = ['CustomerID','MaxDate']

recency_max_date['Recency'] = (recency_max_date['MaxDate'].max() - recency_max_date['MaxDate']).dt.days
rfm_user = pd.merge(rfm_user, recency_max_date[['CustomerID','Recency']], on='CustomerID')

# Jenks optimization or natural breaks
breaks = jenkspy.jenks_breaks(rfm_user['Recency'], nb_class=4)
rfm_user['Recency_Cluster'] = pd.cut(rfm_user['Recency'], bins=breaks, labels=['class_1', 'class_2', 'class_3', 'class_4'], include_lowest=True)

rfm_user
CustomerID Recency Recency_Cluster
0 17850.0 292 class_4
1 13047.0 21 class_1
2 12583.0 8 class_1
3 13748.0 85 class_2
4 15100.0 320 class_4
... ... ... ...
4321 16479.0 0 class_1
4322 13428.0 0 class_1
4323 15790.0 0 class_1
4324 13521.0 0 class_1
4325 14349.0 0 class_1

4326 rows × 3 columns

# Visualization
data =  rfm_user.Recency
plt.figure(figsize=(16,5))

plt.subplot(1,2,1)
sns.distplot(data, bins=30)
(mu, sigma) = norm.fit(data)
print( '\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))
plt.legend(['Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)],loc='best')
plt.ylabel('Frequency')
plt.title('Recency Distribution (All Customers)')

plt.subplot(1,2,2)
sns.boxplot(data)
plt.title('Recency Boxplot (All Customers)')
plt.show()

stats.skew(data)
rfm_user.groupby('Recency_Cluster')['Recency'].describe()
 mu = 90.27 and sigma = 99.10

count mean std min 25% 50% 75% max
Recency_Cluster
class_1 2167.0 17.795570 13.145651 0.0 7.0 15.0 27.00 47.0
class_2 983.0 77.385554 23.060994 48.0 57.0 71.0 96.00 128.0
class_3 648.0 181.538580 32.047710 130.0 154.0 179.0 208.00 240.0
class_4 528.0 299.731061 39.306363 242.0 265.0 296.0 342.25 363.0

From the distribution chart and boxplot, we know the average recency across all customers is about 90 days and the standard deviation is about 99 days. If we look into the details of the class_1 recency cluster, which accounts for around 50% of customers, the average recency is only about 18 days.
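The roughly 50% share can be verified with a quick value_counts on the cluster column (run before the clusters are re-ordered below):

# Share of customers in each recency cluster
print(rfm_user['Recency_Cluster'].value_counts(normalize=True).round(2))
# class_1 holds about half of all customers (2,167 of 4,326)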

We have added one function to our code: order_cluster(). The Jenks-based pd.cut step assigns clusters as categorical labels, but we need ordered numeric cluster values to calculate the RFM score in a later section. The order_cluster() function does this for us, and our new dataframe looks much neater:

#function for ordering cluster numbers
def order_cluster(cluster_field_name, target_field_name,df,ascending):
    new_cluster_field_name = 'new_' + cluster_field_name
    df_new = df.groupby(cluster_field_name)[target_field_name].mean().reset_index()
    df_new = df_new.sort_values(by=target_field_name,ascending=ascending).reset_index(drop=True)
    df_new['index'] = df_new.index
    df_final = pd.merge(df,df_new[[cluster_field_name,'index']], on=cluster_field_name)
    df_final = df_final.drop([cluster_field_name],axis=1)
    df_final = df_final.rename(columns={"index":cluster_field_name})
    return df_final

rfm_user = order_cluster('Recency_Cluster', 'Recency', rfm_user,False)
rfm_user.groupby('Recency_Cluster')['Recency'].describe()
count mean std min 25% 50% 75% max
Recency_Cluster
0 528.0 299.731061 39.306363 242.0 265.0 296.0 342.25 363.0
1 648.0 181.538580 32.047710 130.0 154.0 179.0 208.00 240.0
2 983.0 77.385554 23.060994 48.0 57.0 71.0 96.00 128.0
3 2167.0 17.795570 13.145651 0.0 7.0 15.0 27.00 47.0

RFM Method - Frequency

# Frequency
rfm_frequency = data_clean.groupby('CustomerID').InvoiceDate.count().reset_index()
rfm_frequency.columns = ['CustomerID','Frequency']
rfm_user = pd.merge(rfm_user, rfm_frequency, on='CustomerID')

# Jenks optimization or natural breaks
breaks = jenkspy.jenks_breaks(rfm_user['Frequency'], nb_class=4)
rfm_user['Frequency_Cluster'] = pd.cut(rfm_user['Frequency'], bins=breaks, labels=['class_1', 'class_2', 'class_3', 'class_4'], include_lowest=True)
rfm_user = order_cluster('Frequency_Cluster','Frequency',rfm_user,False)

# Visualization
data =  rfm_user.Frequency
plt.figure(figsize=(16,5))

plt.subplot(1,2,1)
sns.distplot(data, bins=30)
(mu, sigma) = norm.fit(data)
print( '\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))
plt.legend(['Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)],loc='best')
plt.ylabel('Frequency')
plt.title('Frequency Distribution (All Customers)')

plt.subplot(1,2,2)
sns.boxplot(data)
plt.title('Frequency Boxplot (All Customers)')
plt.show()

stats.skew(data)

rfm_user.groupby('Frequency_Cluster')['Frequency'].describe()
 mu = 89.32 and sigma = 219.66

count mean std min 25% 50% 75% max
Frequency_Cluster
0 4.0 5492.500000 1436.706999 4332.0 4523.25 5070.0 6039.25 7498.0
1 23.0 1322.260870 489.405968 838.0 991.50 1077.0 1538.00 2644.0
2 454.0 321.682819 126.384678 186.0 221.00 283.5 389.50 814.0
3 3845.0 48.887906 44.178125 1.0 15.00 33.0 72.00 185.0

However, we can easily spot outliers that are extremely high on frequency and spent a large amount of money. This kind of abnormal activity needs more research, and this group of customers might have big potential. One possible explanation, though, is disingenuous orders created by bots to inflate the numbers.
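Before deciding how to treat them, it is worth pulling these accounts out for a closer look. Below is a minimal sketch; the 2,000-purchase threshold is an assumption that mirrors the cutoff used in the visualization section later on.

# Inspect the extreme-frequency customers (possible bot traffic); threshold is an assumption
suspects = rfm_user[rfm_user['Frequency'] > 2000]
print(suspects[['CustomerID', 'Recency', 'Frequency']].sort_values('Frequency', ascending=False))

# Count distinct invoices and products for these accounts in the cleaned transaction data
print(data_clean[data_clean['CustomerID'].isin(suspects['CustomerID'])]
      .groupby('CustomerID')[['InvoiceNo', 'StockCode']].nunique())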

RFM Method - Monetary

# Revenue
# Create Revenue Column
data_clean['Revenue'] = data_clean['UnitPrice'] * data_clean['Quantity']


rfm_revenue = data_clean.groupby('CustomerID').Revenue.sum().reset_index()
rfm_user = pd.merge(rfm_user, rfm_revenue[['CustomerID', 'Revenue']], on='CustomerID')

# Jenks optimization or natural breaks
breaks = jenkspy.jenks_breaks(rfm_user['Revenue'], nb_class=4)
rfm_user['Revenue_Cluster'] = pd.cut(rfm_user['Revenue'], bins=breaks, labels=['class_1', 'class_2', 'class_3', 'class_4'], include_lowest=True)
rfm_user = order_cluster('Revenue_Cluster','Revenue',rfm_user,False)

# Visualization
data =  rfm_user.Revenue
plt.figure(figsize=(16,5))

plt.subplot(1,2,1)
sns.distplot(data, bins=30)
(mu, sigma) = norm.fit(data)
print( '\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))
plt.legend(['Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)],loc='best')
plt.ylabel('Frequency')
plt.title('Revenue Distribution (All Customers)')

plt.subplot(1,2,2)
sns.boxplot(data)
plt.title('Revenue Boxplot (All Customers)')
plt.show()

stats.skew(data)

rfm_user.groupby('Revenue_Cluster')['Revenue'].describe()
 mu = 1827.76 and sigma = 7936.05

count mean std min 25% 50% 75% max
Revenue_Cluster
0 3.0 232830.956667 42177.542186 185919.77 215436.3600 244952.950 256286.5500 267620.15
1 4.0 110415.265000 20565.272334 80676.46 104636.6875 118139.640 123918.2175 124705.32
2 33.0 35337.012727 13585.641527 18769.32 25079.6000 30722.530 49957.4800 60490.10
3 4286.0 1306.718804 1947.475884 -4287.63 287.4475 622.925 1504.5175 17588.26

Calculate RFM Score

import numpy as np
rfm_user['RFM_Score'] = rfm_user['Recency_Cluster']+rfm_user['Frequency_Cluster']+rfm_user['Revenue_Cluster']
rfm_user.groupby('RFM_Score').agg({'Recency': 'mean', 'Frequency': 'mean', 'Revenue': 'mean'})
Recency Frequency Revenue
RFM_Score
4 3.500000 3775.000000 196162.735000
5 66.888889 2231.777778 83110.871111
6 293.765683 44.636531 1030.785775
7 167.531557 76.617111 1546.853844
8 58.872645 126.197438 2195.281365
9 19.113676 64.904212 1264.753480

Customer Segments - Best, Engaged, Not Engaged

rfm_user['Segment'] = 'Not-Engaged'
rfm_user.loc[rfm_user['RFM_Score'] > 6,'Segment'] = 'Engaged' 
rfm_user.loc[rfm_user['RFM_Score'] > 7,'Segment'] = 'Best' 

# Create a segment summary (customer count and average revenue per segment)
ltv_cluster = rfm_user.copy()
seg = ltv_cluster.groupby('Segment')['Revenue'].describe().reset_index()
seg = seg.iloc[:,[0,1,2]]
seg.columns = ['Segment', 'Customers', 'Revenue(avg)']
seg['Segments'] = pd.Series(['Best', 'Engaged', 'Not Engaged'])
segments = seg[['Segments', 'Customers', 'Revenue(avg)']]
segments
Segments Customers Revenue(avg)
0 Best 3060.0 1668.286325
1 Engaged 713.0 1546.853844
2 Not Engaged 553.0 3072.349367

Data Visualization

# Exclude CustomerID with Extreme-High Frequency Purchase records / Bot Traffic
rfm_user = rfm_user[rfm_user['Frequency']<2000]
plot = rfm_user.copy()

# Treat return orders (negative revenue) as zero for plotting
plot.loc[plot['Revenue'] < 0, 'Revenue'] = 0


# 2D Visualization
sns.set(style="white")
sns.relplot(x="Recency", y="Frequency", hue="Segment", size="Revenue",sizes=(40, 400), alpha=.5, palette="muted",height=6, aspect=2, data=plot)
sns.despine(left=True, bottom=True);

Conclusion:

In the scatter plot above, the size of each dot indicates how much a customer spent shopping online on the platform. In general, we can see that most customers in the Best group purchased online within the last 150 days with a frequency lower than 1,000 purchases, while Not-Engaged customers made their most recent purchases more than 250 days ago.