E-commerce Data Analysis Part Two - Customer Cohort Analysis
12 Nov 2020
Introduction:
In Part Two of the E-commerce Data Series, we will focus mainly on customer cohort analysis. In Part One we already went through the data cleaning process, so we follow the same steps here. There are many benefits to adopting cohort analysis; in my opinion, the main goal of cohort analysis is to form a better understanding of how customer behavior affects the business.
In this e-commerce case, we would like to know:
- How much revenue do new customers bring to the business compared to existing customers?
- What is the customer retention rate for different cohort groups?
Why are these two questions so important to this business? Because it is more expensive to acquire a new customer than to keep a warm customer who has already purchased from the business. We would love to see our customer base expand over time and recurring revenue increase.
Table of Contents:
- Data Wrangling
- New vs. Existing Customers
- Cohort Analysis
Data Wrangling - Exploratory Data Analysis
# Setting up the environment
import pandas as pd
import numpy as np
import datetime as dt
import matplotlib.pyplot as plt
from operator import attrgetter
import seaborn as sns
import matplotlib.colors as mcolors
# Import Data
data = pd.read_csv("OnlineRetail.csv",engine="python")
data.info()
# Change InvoiceDate data type
data['InvoiceDate'] = pd.to_datetime(data['InvoiceDate'])
# Normalize InvoiceDate to midnight (keep the date, drop the time of day).
data['date']=data['InvoiceDate'].dt.normalize()
# Summarize the null values in the dataframe.
print(data.isnull().sum())
# Missing values percentage
missing_percentage = data.isnull().sum() / data.shape[0] * 100
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
InvoiceNo 541909 non-null object
StockCode 541909 non-null object
Description 540455 non-null object
Quantity 541909 non-null int64
InvoiceDate 541909 non-null object
UnitPrice 541909 non-null float64
CustomerID 406829 non-null float64
Country 541909 non-null object
dtypes: float64(2), int64(1), object(5)
memory usage: 33.1+ MB
InvoiceNo 0
StockCode 0
Description 1454
Quantity 0
InvoiceDate 0
UnitPrice 0
CustomerID 135080
Country 0
date 0
dtype: int64
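As a quick aside, the missing_percentage series computed above can also be printed to see each column's share of missing values; roughly 25% of rows lack a CustomerID (135,080 of 541,909). A minimal sketch, not part of the original output:
# Show the percentage of missing values per column, largest first
print(missing_percentage.sort_values(ascending=False).round(2))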
data.loc[data.CustomerID.isnull(), ["UnitPrice", "Quantity"]].describe()
 | UnitPrice | Quantity |
---|---|---|
count | 135080.000000 | 135080.000000 |
mean | 8.076577 | 1.995573 |
std | 151.900816 | 66.696153 |
min | -11062.060000 | -9600.000000 |
25% | 1.630000 | 1.000000 |
50% | 3.290000 | 1.000000 |
75% | 5.450000 | 3.000000 |
max | 17836.460000 | 5568.000000 |
It is fair to say the 'UnitPrice' and 'Quantity' columns show extreme outliers. Since we need historical purchase prices and quantities to forecast future sales, these outliers can be disruptive, so let us drop the null values. Besides dropping all the missing values in the Description column, what about hidden values, such as the string 'nan' instead of a true 'NaN'? In the following steps we will drop any rows containing 'nan' or empty strings.
Data Cleaning - Dropping Rows With Missing Values
# Can we find any hidden null values ("nan" strings) in Description?
data.loc[data.Description.isnull()==False, "lowercase_descriptions"] = data.loc[
data.Description.isnull()==False,"Description"].apply(lambda l: l.lower())
data.lowercase_descriptions.dropna().apply(
lambda l: np.where("nan" in l, True, False)).value_counts()
False 539724
True 731
Name: lowercase_descriptions, dtype: int64
# How about "" in Description?
data.lowercase_descriptions.dropna().apply(
lambda l: np.where("" == l, True, False)).value_counts()
False 540455
Name: lowercase_descriptions, dtype: int64
# Replace descriptions containing "nan" with None, which pandas treats as NaN
data.loc[data.lowercase_descriptions.isnull()==False, "lowercase_descriptions"] = data.loc[
data.lowercase_descriptions.isnull()==False, "lowercase_descriptions"
].apply(lambda l: np.where("nan" in l, None, l))
# Verify that all the 'nan' strings were converted to NaN
data.lowercase_descriptions.dropna().apply(lambda l: np.where("nan" in l, True, False)).value_counts()
False 539724
Name: lowercase_descriptions, dtype: int64
# Preview the data after excluding null values and hidden null values
data.loc[(data.CustomerID.isnull()==False)&(data.lowercase_descriptions.isnull()==False)].info()
# data without any null values
data_clean = data.loc[(data.CustomerID.isnull()==False)&(data.lowercase_descriptions.isnull()==False)].copy()
data_clean.isnull().sum().sum()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 406223 entries, 0 to 541908
Data columns (total 10 columns):
InvoiceNo 406223 non-null object
StockCode 406223 non-null object
Description 406223 non-null object
Quantity 406223 non-null int64
InvoiceDate 406223 non-null datetime64[ns]
UnitPrice 406223 non-null float64
CustomerID 406223 non-null float64
Country 406223 non-null object
date 406223 non-null datetime64[ns]
lowercase_descriptions 406223 non-null object
dtypes: datetime64[ns](2), float64(2), int64(1), object(5)
memory usage: 34.1+ MB
0
New vs. Existing Customers
In this project, we define Existing customer records as those whose purchase YearMonth is later than the customer's FirstConversionYearMonth. In other words, an existing customer has purchase history with an InvoiceDate later than their first purchase month. Since we only have part of the December 2011 data, a lower-than-average monthly revenue number for Dec 2011 is expected.
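As a quick sanity check (not in the original notebook), we can confirm the date range of the cleaned data; the records appear to run from 1 Dec 2010 to 9 Dec 2011, so December 2011 is only a partial month:
# Confirm the time span covered by the cleaned data
print(data_clean['InvoiceDate'].min(), data_clean['InvoiceDate'].max())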
# Create Revenue Column
data_clean['Revenue'] = data_clean['UnitPrice'] * data_clean['Quantity']
# Create YearMonth Column
data_clean['YearMonth'] = data_clean['InvoiceDate'].apply(lambda date: date.year*100+date.month)
# New Vs. Existing
first_conversion = data_clean.groupby('CustomerID').agg({'InvoiceDate': lambda x: x.min()}).reset_index()
first_conversion.columns = ['CustomerID','FirstConversion']
# Convert the FirstConversion timestamps into 'YearMonth'-format numbers.
first_conversion['FirstConversionYearMonth'] = first_conversion['FirstConversion'].apply\
(lambda date: date.year*100 + date.month)
first_conversion.head()
 | CustomerID | FirstConversion | FirstConversionYearMonth |
---|---|---|---|
0 | 12346.0 | 2011-01-18 10:01:00 | 201101 |
1 | 12347.0 | 2010-12-07 14:57:00 | 201012 |
2 | 12348.0 | 2010-12-16 19:09:00 | 201012 |
3 | 12349.0 | 2011-11-21 09:51:00 | 201111 |
4 | 12350.0 | 2011-02-02 16:01:00 | 201102 |
# Merge first_conversion with data_clean
df_customer = pd.merge(data_clean, first_conversion, on='CustomerID')
df_customer.head()
 | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | date | lowercase_descriptions | Revenue | YearMonth | FirstConversion | FirstConversionYearMonth |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850.0 | United Kingdom | 2010-12-01 | white hanging heart t-light holder | 15.30 | 201012 | 2010-12-01 08:26:00 | 201012 |
1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 2010-12-01 | white metal lantern | 20.34 | 201012 | 2010-12-01 08:26:00 | 201012 |
2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850.0 | United Kingdom | 2010-12-01 | cream cupid hearts coat hanger | 22.00 | 201012 | 2010-12-01 08:26:00 | 201012 |
3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 2010-12-01 | knitted union flag hot water bottle | 20.34 | 201012 | 2010-12-01 08:26:00 | 201012 |
4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 2010-12-01 | red woolly hottie white heart. | 20.34 | 201012 | 2010-12-01 08:26:00 | 201012 |
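# Flag every record as 'New' by default, then mark records whose purchase month
# is later than the customer's first-conversion month as 'Existing'.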
df_customer['customer_type'] = 'New'
df_customer.loc[df_customer['YearMonth']>df_customer['FirstConversionYearMonth'],'customer_type']='Existing'
df_customer.loc[df_customer['customer_type']=='Existing']
df_customer_revenue = df_customer.groupby(['customer_type','YearMonth']).agg({'Revenue':'sum'}).reset_index()
sns.catplot(x="YearMonth", y="Revenue", data=df_customer_revenue, kind="point", aspect=2.5, \
palette="YlGnBu",hue= "customer_type")
plt.ylabel('Revenue')
plt.xlabel('')
sns.despine(left=True, bottom=True)
plt.title('Monthly Revenue (New vs Existing)');
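The point plot gives a visual comparison; if we also want the numbers, the same df_customer_revenue frame can be pivoted to show each month's revenue split and the share contributed by existing customers. A small sketch building on the variables above (the pivot and the share column are my own additions, not part of the original notebook):
# Pivot monthly revenue by customer type and compute the existing-customer share
revenue_pivot = df_customer_revenue.pivot(index='YearMonth',
                                          columns='customer_type',
                                          values='Revenue').fillna(0)
revenue_pivot['ExistingShare'] = revenue_pivot['Existing'] / \
                                 (revenue_pivot['Existing'] + revenue_pivot['New'])
print(revenue_pivot.round(2))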
Cohort Analysis:
In this project, we define a cohort group as the customers whose first online purchase falls within the same month. To translate this idea into a cohort analysis, we need to group records by 'CustomerID' and 'InvoiceDate'.
# create two variables: month of order and cohort
data_clean['order_month'] = data_clean['InvoiceDate'].dt.to_period('M')
data_clean['cohort'] = data_clean.groupby('CustomerID')['InvoiceDate'] \
.transform('min') \
.dt.to_period('M')
data_clean
 | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | date | lowercase_descriptions | Revenue | YearMonth | order_month | cohort |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850.0 | United Kingdom | 2010-12-01 | white hanging heart t-light holder | 15.30 | 201012 | 2010-12 | 2010-12 |
1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 2010-12-01 | white metal lantern | 20.34 | 201012 | 2010-12 | 2010-12 |
2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850.0 | United Kingdom | 2010-12-01 | cream cupid hearts coat hanger | 22.00 | 201012 | 2010-12 | 2010-12 |
3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 2010-12-01 | knitted union flag hot water bottle | 20.34 | 201012 | 2010-12 | 2010-12 |
4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 2010-12-01 | red woolly hottie white heart. | 20.34 | 201012 | 2010-12 | 2010-12 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
541904 | 581587 | 22613 | PACK OF 20 SPACEBOY NAPKINS | 12 | 2011-12-09 12:50:00 | 0.85 | 12680.0 | France | 2011-12-09 | pack of 20 spaceboy napkins | 10.20 | 201112 | 2011-12 | 2011-08 |
541905 | 581587 | 22899 | CHILDREN'S APRON DOLLY GIRL | 6 | 2011-12-09 12:50:00 | 2.10 | 12680.0 | France | 2011-12-09 | children's apron dolly girl | 12.60 | 201112 | 2011-12 | 2011-08 |
541906 | 581587 | 23254 | CHILDRENS CUTLERY DOLLY GIRL | 4 | 2011-12-09 12:50:00 | 4.15 | 12680.0 | France | 2011-12-09 | childrens cutlery dolly girl | 16.60 | 201112 | 2011-12 | 2011-08 |
541907 | 581587 | 23255 | CHILDRENS CUTLERY CIRCUS PARADE | 4 | 2011-12-09 12:50:00 | 4.15 | 12680.0 | France | 2011-12-09 | childrens cutlery circus parade | 16.60 | 201112 | 2011-12 | 2011-08 |
541908 | 581587 | 22138 | BAKING SET 9 PIECE RETROSPOT | 3 | 2011-12-09 12:50:00 | 4.95 | 12680.0 | France | 2011-12-09 | baking set 9 piece retrospot | 14.85 | 201112 | 2011-12 | 2011-08 |
406223 rows × 14 columns
# Count the number of unique customers for every cohort / order-month pair.
df_cohort = data_clean.groupby(['cohort', 'order_month']) \
.agg(n_customers=('CustomerID', 'nunique')) \
.reset_index(drop=False)
# Add the period number (months since first purchase); attrgetter('n') extracts the integer month count from each period offset.
df_cohort['period_number'] = (df_cohort.order_month - df_cohort.cohort).\
apply(attrgetter('n'))
df_cohort.head(5)
 | cohort | order_month | n_customers | period_number |
---|---|---|---|---|
0 | 2010-12 | 2010-12 | 948 | 0 |
1 | 2010-12 | 2011-01 | 362 | 1 |
2 | 2010-12 | 2011-02 | 317 | 2 |
3 | 2010-12 | 2011-03 | 367 | 3 |
4 | 2010-12 | 2011-04 | 341 | 4 |
# pivot the data into a form of the matrix
cohort_pivot = df_cohort.pivot_table(index='cohort',
columns = 'period_number',
values = 'n_customers')
cohort_pivot
cohort / period_number | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2010-12 | 948.0 | 362.0 | 317.0 | 367.0 | 341.0 | 376.0 | 360.0 | 336.0 | 336.0 | 374.0 | 354.0 | 474.0 | 260.0 |
2011-01 | 421.0 | 101.0 | 119.0 | 102.0 | 138.0 | 126.0 | 110.0 | 108.0 | 131.0 | 146.0 | 155.0 | 63.0 | NaN |
2011-02 | 380.0 | 94.0 | 73.0 | 106.0 | 102.0 | 94.0 | 97.0 | 107.0 | 98.0 | 119.0 | 34.0 | NaN | NaN |
2011-03 | 440.0 | 84.0 | 112.0 | 96.0 | 102.0 | 78.0 | 116.0 | 105.0 | 127.0 | 39.0 | NaN | NaN | NaN |
2011-04 | 299.0 | 68.0 | 66.0 | 63.0 | 62.0 | 71.0 | 69.0 | 78.0 | 25.0 | NaN | NaN | NaN | NaN |
2011-05 | 279.0 | 66.0 | 48.0 | 48.0 | 60.0 | 68.0 | 74.0 | 29.0 | NaN | NaN | NaN | NaN | NaN |
2011-06 | 235.0 | 49.0 | 44.0 | 64.0 | 58.0 | 79.0 | 24.0 | NaN | NaN | NaN | NaN | NaN | NaN |
2011-07 | 191.0 | 40.0 | 39.0 | 44.0 | 52.0 | 22.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2011-08 | 167.0 | 42.0 | 42.0 | 42.0 | 23.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2011-09 | 298.0 | 89.0 | 97.0 | 36.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2011-10 | 352.0 | 93.0 | 46.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2011-11 | 321.0 | 43.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2011-12 | 41.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
# divide by the cohort size (month 0) to obtain retention as %
cohort_size = cohort_pivot.iloc[:,0]
retention_matrix = cohort_pivot.divide(cohort_size, axis = 0)
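Before plotting, it can be handy to summarize the matrix numerically, for example the average retention across cohorts for each period number. A minimal sketch, not part of the original analysis:
# Average retention per period across all cohorts (NaN cells are ignored)
print(retention_matrix.mean(axis=0).round(3))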
# plot the retention matrix
with sns.axes_style("white"):
    fig, ax = plt.subplots(1, 2, figsize=(12, 8), sharey=True,
                           gridspec_kw={'width_ratios': [1, 11]})

    # retention matrix
    sns.heatmap(retention_matrix,
                mask=retention_matrix.isnull(),
                annot=True,
                fmt='.0%',
                cmap='RdYlGn',
                ax=ax[1])
    ax[1].set_title('Monthly Cohorts: User Retention', fontsize=16)
    ax[1].set(xlabel='# of periods',
              ylabel='')

    # cohort size
    cohort_size_df = pd.DataFrame(cohort_size).rename(columns={0: 'cohort_size'})
    white_cmap = mcolors.ListedColormap(['white'])
    sns.heatmap(cohort_size_df,
                annot=True,
                cbar=False,
                fmt='g',
                cmap=white_cmap,
                ax=ax[0])

    fig.tight_layout()