E-commerce Data Analysis Part Three - RFM Customer Segmentation

Introduction:

In Part Three of the E-Commerce data series, we will create customer segments based on the RFM method and the Jenks natural breaks algorithm. Before jumping into the analytical part, we will follow the same data-cleaning steps as in the Part One article. In this part on RFM customer segmentation, an RFM score will be assigned to each customer.

We will use this RFM score to decide which segment each customer belongs to:

  • Engaged
  • Not-Engaged
  • Best

For RFM clustering, instead of using k-means, we will use the Jenks natural breaks algorithm. For the purposes of this article, I will use the jenkspy package from Matthieu Viry.

Table of Contents:

  • Data Wrangling
  • RFM Method
    • Recency (When was the customer’s most recent purchase?)
    • Frequency (How many times has the customer purchased online from the platform?)
    • Monetary (How much did the customer spend online during this time period?)
  • Calculate RFM Scores
  • Data Visualization - 2D Scatter Plot

Data Wrangling - Exploratory Data Analysis

# Setting up the environment
import pandas as pd
import numpy as np
import datetime as dt
import matplotlib.pyplot as plt
from operator import attrgetter
import seaborn as sns
import matplotlib.colors as mcolors
# Import Data
data = pd.read_csv("OnlineRetail.csv",engine="python")

# Change InvoiceDate data type
data['InvoiceDate'] = pd.to_datetime(data['InvoiceDate'])
# Normalize InvoiceDate to midnight, keeping only the date part.
data['date']=data['InvoiceDate'].dt.normalize()

# Summarize the null values in the dataframe.
print(data.isnull().sum())
# Missing values percentage
missing_percentage = data.isnull().sum() / data.shape[0] * 100

# How often is a record without a description also missing CustomerID or UnitPrice?
data[data.Description.isnull()].CustomerID.isnull().value_counts()
data[data.Description.isnull()].UnitPrice.isnull().value_counts()
InvoiceNo           0
StockCode           0
Description      1454
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     135080
Country             0
date                0
dtype: int64

False    1454
Name: UnitPrice, dtype: int64

It’s better to change the $InvoiceDate$ column into the $datetime64[ns]$ type for future analysis of monthly revenue growth. In this project, we will ignore the hours and minutes within $InvoiceDate$; only the day, month, and year will be stored in the dataframe.

Around 25% of the rows in the $CustomerID$ column are missing values, which raises a question mark over the implementation of the digital tracking system for any business. We would normally look into this in more detail, but in this case we have no visibility into those records. Let’s continue to explore the missing values in the $Description$ and $UnitPrice$ columns.
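As a quick check of where the 25% figure comes from, we can print the missing_percentage series computed in the setup code above:

# Missing-value percentage per column, highest first
print(missing_percentage.sort_values(ascending=False).round(2))
# CustomerID is around 25% missing; Description around 0.3%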

data.loc[data.CustomerID.isnull(), ["UnitPrice", "Quantity"]].describe()
UnitPrice Quantity
count 135080.000000 135080.000000
mean 8.076577 1.995573
std 151.900816 66.696153
min -11062.060000 -9600.000000
25% 1.630000 1.000000
50% 3.290000 1.000000
75% 5.450000 3.000000
max 17836.460000 5568.000000

It is fair to say the ‘UnitPrice’ and ‘Quantity’ columns show extreme outliers. Since we need historical purchase prices and quantities to make sales forecasts later, these outliers can be disruptive, so let us drop the null values. Besides dropping all the missing values in the Description column, what about hidden missing values, like the string ‘nan’ instead of a true NaN? In the next steps we will drop any rows containing ‘nan’ or empty (‘’) strings.

Data Wrangling - Data Cleaning

# Can we find any hidden null values ("nan" strings) in the Description column?
data.loc[data.Description.isnull()==False, "lowercase_descriptions"] =  data.loc[
    data.Description.isnull()==False,"Description"].apply(lambda l: l.lower())

data.lowercase_descriptions.dropna().apply(
    lambda l: np.where("nan" in l, True, False)).value_counts()

# How about "" in Description?
data.lowercase_descriptions.dropna().apply(
    lambda l: np.where("" == l, True, False)).value_counts()
False    540455
Name: lowercase_descriptions, dtype: int64
# Replace descriptions containing "nan" with None (a true NaN)
data.loc[data.lowercase_descriptions.isnull()==False, "lowercase_descriptions"] = data.loc[
    data.lowercase_descriptions.isnull()==False, "lowercase_descriptions"
].apply(lambda l: np.where("nan" in l, None, l))

# Verify all 'nan' strings were converted to NaN
data.lowercase_descriptions.dropna().apply(lambda l: np.where("nan" in l, True, False)).value_counts()
False    539724
Name: lowercase_descriptions, dtype: int64
# Preview the dataframe after filtering out null values and hidden null values
data.loc[(data.CustomerID.isnull()==False)&(data.lowercase_descriptions.isnull()==False)].info()

# data without any null values
data_clean = data.loc[(data.CustomerID.isnull()==False)&(data.lowercase_descriptions.isnull()==False)].copy()
data_clean.isnull().sum().sum()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 406223 entries, 0 to 541908
Data columns (total 10 columns):
InvoiceNo                 406223 non-null object
StockCode                 406223 non-null object
Description               406223 non-null object
Quantity                  406223 non-null int64
InvoiceDate               406223 non-null datetime64[ns]
UnitPrice                 406223 non-null float64
CustomerID                406223 non-null float64
Country                   406223 non-null object
date                      406223 non-null datetime64[ns]
lowercase_descriptions    406223 non-null object
dtypes: datetime64[ns](2), float64(2), int64(1), object(5)
memory usage: 34.1+ MB

0
# Keep only invoices dated from 2010-12-01 to 2011-11-30.
data_clean = data_clean[data_clean.InvoiceDate.between('12-01-2010', '11-30-2011')]

RFM Method - Recency

import jenkspy
from scipy import stats
from scipy.stats import norm, skew

# Recency
rfm_data = data_clean.copy()
rfm_user = pd.DataFrame(data_clean['CustomerID'].unique())
rfm_user.columns = ['CustomerID']

recency_max_date = data_clean.groupby('CustomerID').InvoiceDate.max().reset_index()
recency_max_date.columns = ['CustomerID','MaxDate']

recency_max_date['Recency'] = (recency_max_date['MaxDate'].max() - recency_max_date['MaxDate']).dt.days
rfm_user = pd.merge(rfm_user, recency_max_date[['CustomerID','Recency']], on='CustomerID')

# Jenks optimization or natural breaks
breaks = jenkspy.jenks_breaks(rfm_user['Recency'], nb_class=4)
rfm_user['Recency_Cluster'] = pd.cut(rfm_user['Recency'], bins=breaks, labels=['class_1', 'class_2', 'class_3', 'class_4'], include_lowest=True)

rfm_user
CustomerID Recency Recency_Cluster
0 17850.0 292 class_4
1 13047.0 21 class_1
2 12583.0 8 class_1
3 13748.0 85 class_2
4 15100.0 320 class_4
... ... ... ...
4321 16479.0 0 class_1
4322 13428.0 0 class_1
4323 15790.0 0 class_1
4324 13521.0 0 class_1
4325 14349.0 0 class_1

4326 rows × 3 columns

# Visualization
data =  rfm_user.Recency
plt.figure(figsize=(16,5))

plt.subplot(1,2,1)
sns.distplot(data, bins=30)
(mu, sigma) = norm.fit(data)
print( '\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))
plt.legend(['Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)],loc='best')
plt.ylabel('Frequency')
plt.title('Recency Distribution (All Customers)')

plt.subplot(1,2,2)
sns.boxplot(data)
plt.title('Recency Boxplot (All Customers)')
plt.show()

stats.skew(data)
rfm_user.groupby('Recency_Cluster')['Recency'].describe()
 mu = 90.27 and sigma = 99.10

count mean std min 25% 50% 75% max
Recency_Cluster
class_1 2167.0 17.795570 13.145651 0.0 7.0 15.0 27.00 47.0
class_2 983.0 77.385554 23.060994 48.0 57.0 71.0 96.00 128.0
class_3 648.0 181.538580 32.047710 130.0 154.0 179.0 208.00 240.0
class_4 528.0 299.731061 39.306363 242.0 265.0 296.0 342.25 363.0

From the distribution chart and boxplot, we know the average recency across all customers is about 90 days and the standard deviation is about 99 days. If we look into the details of the class_1 recency cluster, which accounts for around 50% of customers, the average recency is only about 18 days.
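The roughly 50% share can be verified with a quick value_counts on the cluster column (run before the clusters are re-ordered below):

# Share of customers in each recency cluster
print(rfm_user['Recency_Cluster'].value_counts(normalize=True).round(2))
# class_1 holds about half of all customers (2,167 of 4,326)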

We have added one function to our code: order_cluster(). The Jenks-based pd.cut step assigns clusters as categorical labels, but we need ordered numeric cluster values to calculate the RFM score in a later section. The order_cluster() function does this for us, and our new dataframe looks much neater:

#function for ordering cluster numbers
def order_cluster(cluster_field_name, target_field_name,df,ascending):
    new_cluster_field_name = 'new_' + cluster_field_name
    df_new = df.groupby(cluster_field_name)[target_field_name].mean().reset_index()
    df_new = df_new.sort_values(by=target_field_name,ascending=ascending).reset_index(drop=True)
    df_new['index'] = df_new.index
    df_final = pd.merge(df,df_new[[cluster_field_name,'index']], on=cluster_field_name)
    df_final = df_final.drop([cluster_field_name],axis=1)
    df_final = df_final.rename(columns={"index":cluster_field_name})
    return df_final

rfm_user = order_cluster('Recency_Cluster', 'Recency', rfm_user,False)
rfm_user.groupby('Recency_Cluster')['Recency'].describe()
count mean std min 25% 50% 75% max
Recency_Cluster
0 528.0 299.731061 39.306363 242.0 265.0 296.0 342.25 363.0
1 648.0 181.538580 32.047710 130.0 154.0 179.0 208.00 240.0
2 983.0 77.385554 23.060994 48.0 57.0 71.0 96.00 128.0
3 2167.0 17.795570 13.145651 0.0 7.0 15.0 27.00 47.0

RFM Method - Frequency

# Frequency
rfm_frequency = data_clean.groupby('CustomerID').InvoiceDate.count().reset_index()
rfm_frequency.columns = ['CustomerID','Frequency']
rfm_user = pd.merge(rfm_user, rfm_frequency, on='CustomerID')

# Jenks optimization or natural breaks
breaks = jenkspy.jenks_breaks(rfm_user['Frequency'], nb_class=4)
rfm_user['Frequency_Cluster'] = pd.cut(rfm_user['Frequency'], bins=breaks, labels=['class_1', 'class_2', 'class_3', 'class_4'], include_lowest=True)
rfm_user = order_cluster('Frequency_Cluster','Frequency',rfm_user,False)

# Visualization
data =  rfm_user.Frequency
plt.figure(figsize=(16,5))

plt.subplot(1,2,1)
sns.distplot(data, bins=30)
(mu, sigma) = norm.fit(data)
print( '\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))
plt.legend(['Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)],loc='best')
plt.ylabel('Frequency')
plt.title('Frequency Distribution (All Customers)')

plt.subplot(1,2,2)
sns.boxplot(data)
plt.title('Frequency Boxplot (All Customers)')
plt.show()

stats.skew(data)

rfm_user.groupby('Frequency_Cluster')['Frequency'].describe()
 mu = 89.32 and sigma = 219.66

count mean std min 25% 50% 75% max
Frequency_Cluster
0 4.0 5492.500000 1436.706999 4332.0 4523.25 5070.0 6039.25 7498.0
1 23.0 1322.260870 489.405968 838.0 991.50 1077.0 1538.00 2644.0
2 454.0 321.682819 126.384678 186.0 221.00 283.5 389.50 814.0
3 3845.0 48.887906 44.178125 1.0 15.00 33.0 72.00 185.0

However, we can easily spot outliers that are extremely high on frequency and spent a large amount of money. This kind of abnormal activity needs more research, and this group of customers might have big potential. One possible explanation, though, is disingenuous orders created by bots to inflate the numbers.
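Before deciding how to treat them, it is worth pulling these accounts out for a closer look. Below is a minimal sketch; the 2,000-purchase threshold is an assumption that mirrors the cutoff used in the visualization section later on.

# Inspect the extreme-frequency customers (possible bot traffic); threshold is an assumption
suspects = rfm_user[rfm_user['Frequency'] > 2000]
print(suspects[['CustomerID', 'Recency', 'Frequency']].sort_values('Frequency', ascending=False))

# Count distinct invoices and products for these accounts in the cleaned transaction data
print(data_clean[data_clean['CustomerID'].isin(suspects['CustomerID'])]
      .groupby('CustomerID')[['InvoiceNo', 'StockCode']].nunique())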

RFM Method - Monetary

# Revenue
# Create Revenue Column
data_clean['Revenue'] = data_clean['UnitPrice'] * data_clean['Quantity']


rfm_revenue = data_clean.groupby('CustomerID').Revenue.sum().reset_index()
rfm_user = pd.merge(rfm_user, rfm_revenue[['CustomerID', 'Revenue']], on='CustomerID')

# Jenks optimization or natural breaks
breaks = jenkspy.jenks_breaks(rfm_user['Revenue'], nb_class=4)
rfm_user['Revenue_Cluster'] = pd.cut(rfm_user['Revenue'], bins=breaks, labels=['class_1', 'class_2', 'class_3', 'class_4'], include_lowest=True)
rfm_user = order_cluster('Revenue_Cluster','Revenue',rfm_user,False)

# Visualization
data =  rfm_user.Revenue
plt.figure(figsize=(16,5))

plt.subplot(1,2,1)
sns.distplot(data, bins=30)
(mu, sigma) = norm.fit(data)
print( '\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))
plt.legend(['Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)],loc='best')
plt.ylabel('Frequency')
plt.title('Revenue Distribution (All Customers)')

plt.subplot(1,2,2)
sns.boxplot(data)
plt.title('Revenue Boxplot (All Customers)')
plt.show()

stats.skew(data)

rfm_user.groupby('Revenue_Cluster')['Revenue'].describe()
 mu = 1827.76 and sigma = 7936.05

count mean std min 25% 50% 75% max
Revenue_Cluster
0 3.0 232830.956667 42177.542186 185919.77 215436.3600 244952.950 256286.5500 267620.15
1 4.0 110415.265000 20565.272334 80676.46 104636.6875 118139.640 123918.2175 124705.32
2 33.0 35337.012727 13585.641527 18769.32 25079.6000 30722.530 49957.4800 60490.10
3 4286.0 1306.718804 1947.475884 -4287.63 287.4475 622.925 1504.5175 17588.26

Calculate RFM Score

import numpy as np
rfm_user['RFM_Score'] = rfm_user['Recency_Cluster']+rfm_user['Frequency_Cluster']+rfm_user['Revenue_Cluster']
rfm_user.groupby('RFM_Score').agg({'Recency': 'mean', 'Frequency': 'mean', 'Revenue': 'mean'})
Recency Frequency Revenue
RFM_Score
4 3.500000 3775.000000 196162.735000
5 66.888889 2231.777778 83110.871111
6 293.765683 44.636531 1030.785775
7 167.531557 76.617111 1546.853844
8 58.872645 126.197438 2195.281365
9 19.113676 64.904212 1264.753480

Customer Segments - Best, Engaged, Not Engaged

rfm_user['Segment'] = 'Not-Engaged'
rfm_user.loc[rfm_user['RFM_Score'] > 6,'Segment'] = 'Engaged' 
rfm_user.loc[rfm_user['RFM_Score'] > 7,'Segment'] = 'Best' 

# Create a segment summary (customer count and average revenue per segment)
ltv_cluster = rfm_user.copy()
seg = ltv_cluster.groupby('Segment')['Revenue'].describe().reset_index()
seg = seg.iloc[:,[0,1,2]]
seg.columns = ['Segment', 'Customers', 'Revenue(avg)']
seg['Segments'] = pd.Series(['Best', 'Engaged', 'Not Engaged'])
segments = seg[['Segments', 'Customers', 'Revenue(avg)']]
segments
Segments Customers Revenue(avg)
0 Best 3060.0 1668.286325
1 Engaged 713.0 1546.853844
2 Not Engaged 553.0 3072.349367

Data Visualization

# Exclude CustomerID with Extreme-High Frequency Purchase records / Bot Traffic
rfm_user = rfm_user[rfm_user['Frequency']<2000]
plot = rfm_user.copy()

# Treat return orders (negative revenue) as zero for plotting
plot.loc[plot['Revenue'] < 0, 'Revenue'] = 0


# 2D Visualization
sns.set(style="white")
sns.relplot(x="Recency", y="Frequency", hue="Segment", size="Revenue",sizes=(40, 400), alpha=.5, palette="muted",height=6, aspect=2, data=plot)
sns.despine(left=True, bottom=True);

Conclusion:

In the scatter plot above, the size of each dot indicates how much a customer spent shopping online on the platform. In general, we can see that most customers in the Best group purchased online within the last 150 days with a frequency lower than 1,000 purchases, while Not-Engaged customers made their most recent purchases more than 250 days ago.