E-commerce Data Analysis Part Two - Customer Cohort Analysis

Introduction:

In E-commerce Data Series Part Two, we will focus on customer cohort analysis. In Part One we already went through the data cleaning process, so we will follow the same steps here. There are many benefits to adopting cohort analysis; in my opinion, the main goal is to form a better understanding of how customer behavior affects the business.

In this e-commerce case, we would like to know:

  1. How much revenue do new customers bring to the business compared to existing customers?
  2. What is the customer retention rate for different cohort groups?

Why are these two questions so important to this business? Because it is more expensive to acquire a new customer than to keep a warm customer who has already purchased from the business. We would love to see our customer base expand over time and recurring revenue increase.
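To make the retention idea concrete before computing it from real data, here is a minimal sketch with made-up numbers (both counts below are hypothetical):

# Toy retention arithmetic (hypothetical counts, not from the dataset)
cohort_size = 500        # customers acquired in month 0
returned_month_1 = 180   # of those, customers who purchased again in month 1
retention_month_1 = returned_month_1 / cohort_size
print(f"Month-1 retention: {retention_month_1:.0%}")   # Month-1 retention: 36%

The cohort analysis at the end of this post computes exactly this ratio, for every cohort and every month since first purchase.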

Table of Contents:

  • Data Wrangling
  • New vs. Existing Customers
  • Cohort Analysis

Data Wrangling - Exploratory Data Analysis

# Setting up the environment
import pandas as pd
import numpy as np
import datetime as dt
import matplotlib.pyplot as plt
from operator import attrgetter
import seaborn as sns
import matplotlib.colors as mcolors

# Import Data
data = pd.read_csv("OnlineRetail.csv", engine="python")
data.info()

# Change InvoiceDate data type
data['InvoiceDate'] = pd.to_datetime(data['InvoiceDate'])
# Normalize InvoiceDate so every timestamp falls on midnight of its day
data['date'] = data['InvoiceDate'].dt.normalize()

# Summarize the null values in the dataframe.
print(data.isnull().sum())
# Missing values percentage
missing_percentage = data.isnull().sum() / data.shape[0] * 100

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
InvoiceNo      541909 non-null object
StockCode      541909 non-null object
Description    540455 non-null object
Quantity       541909 non-null int64
InvoiceDate    541909 non-null object
UnitPrice      541909 non-null float64
CustomerID     406829 non-null float64
Country        541909 non-null object
dtypes: float64(2), int64(1), object(5)
memory usage: 33.1+ MB
InvoiceNo           0
StockCode           0
Description      1454
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     135080
Country             0
date                0
dtype: int64
data.loc[data.CustomerID.isnull(), ["UnitPrice", "Quantity"]].describe()
           UnitPrice       Quantity
count  135080.000000  135080.000000
mean        8.076577       1.995573
std       151.900816      66.696153
min    -11062.060000   -9600.000000
25%         1.630000       1.000000
50%         3.290000       1.000000
75%         5.450000       3.000000
max     17836.460000    5568.000000

It is fair to say the 'UnitPrice' and 'Quantity' columns show extreme outliers. Since we need historical purchase prices and quantities to forecast future sales, these outliers can be disruptive, so let us drop the null values. Besides dropping all the missing values in the Description column, what about hidden missing values, such as the string 'nan' instead of a true NaN? In the following steps we will drop any rows containing 'nan' or empty strings.
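As a quick side check, an IQR-based rule gives a sense of how extreme those values are. This is only a sketch on the same data frame, not part of the cleaning pipeline:

# Rough IQR-based outlier flag for UnitPrice (a sketch, not applied to the data)
q1, q3 = data["UnitPrice"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = (data["UnitPrice"] < q1 - 1.5 * iqr) | (data["UnitPrice"] > q3 + 1.5 * iqr)
print(f"Flagged {outliers.sum()} potential UnitPrice outliers.")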

Data Cleaning - Dropping Rows With Missing Values

# Can we find any hidden null values ("nan" strings) in Description?
data.loc[data.Description.isnull()==False, "lowercase_descriptions"] =  data.loc[
    data.Description.isnull()==False,"Description"].apply(lambda l: l.lower())

data.lowercase_descriptions.dropna().apply(
    lambda l: np.where("nan" in l, True, False)).value_counts()
False    539724
True        731
Name: lowercase_descriptions, dtype: int64
# How about "" in Description?
data.lowercase_descriptions.dropna().apply(
    lambda l: np.where("" == l, True, False)).value_counts()
False    540455
Name: lowercase_descriptions, dtype: int64
# Transform "nan" toward "NaN"
data.loc[data.lowercase_descriptions.isnull()==False, "lowercase_descriptions"] = data.loc[
    data.lowercase_descriptions.isnull()==False, "lowercase_descriptions"
].apply(lambda l: np.where("nan" in l, None, l))

# Verify that all the "nan" strings were converted to NaN
data.lowercase_descriptions.dropna().apply(lambda l: np.where("nan" in l, True, False)).value_counts()
False    539724
Name: lowercase_descriptions, dtype: int64
# Inspect the data after excluding null values and hidden null values
data.loc[(data.CustomerID.isnull()==False)&(data.lowercase_descriptions.isnull()==False)].info()

# data without any null values
data_clean = data.loc[(data.CustomerID.isnull()==False)&(data.lowercase_descriptions.isnull()==False)].copy()
data_clean.isnull().sum().sum()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 406223 entries, 0 to 541908
Data columns (total 10 columns):
InvoiceNo                 406223 non-null object
StockCode                 406223 non-null object
Description               406223 non-null object
Quantity                  406223 non-null int64
InvoiceDate               406223 non-null datetime64[ns]
UnitPrice                 406223 non-null float64
CustomerID                406223 non-null float64
Country                   406223 non-null object
date                      406223 non-null datetime64[ns]
lowercase_descriptions    406223 non-null object
dtypes: datetime64[ns](2), float64(2), int64(1), object(5)
memory usage: 34.1+ MB

0
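As a final sanity check, it is worth confirming how much data the cleaning removed; a minimal sketch using the frames defined above:

# Sanity check (a sketch): how many rows did the cleaning drop?
n_dropped = len(data) - len(data_clean)
print(f"Dropped {n_dropped} of {len(data)} rows ({n_dropped / len(data):.1%}).")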

New vs. Existing Customers

In this project, we define Existing customers as those records whose YearMonth is later than the customer's FirstConversionYearMonth. In other words, an existing customer has purchase history with an InvoiceDate later than their first purchase month. Since we only have part of the December data, a lower-than-average monthly revenue number for Dec 2011 is expected.
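To illustrate the labeling rule before building it on the real data, here is a toy sketch with made-up rows:

# Toy illustration of the New/Existing rule (made-up values)
toy = pd.DataFrame({'YearMonth':                [201012, 201101, 201101],
                    'FirstConversionYearMonth': [201012, 201012, 201101]})
toy['customer_type'] = 'New'
toy.loc[toy['YearMonth'] > toy['FirstConversionYearMonth'], 'customer_type'] = 'Existing'
print(toy)   # only the middle row is labeled Existing

The real implementation below follows the same two steps: find each customer's first conversion, then compare every record's YearMonth against it.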

# Create Revenue Column
data_clean['Revenue'] = data_clean['UnitPrice'] * data_clean['Quantity']

# Create YearMonth Column 
data_clean['YearMonth'] = data_clean['InvoiceDate'].apply(lambda date: date.year*100+date.month)

# New Vs. Existing
first_conversion = data_clean.groupby('CustomerID').agg({'InvoiceDate': lambda x: x.min()}).reset_index()
first_conversion.columns = ['CustomerID','FirstConversion']

# Convert the FirstConversion timestamps into the same numeric YearMonth format
first_conversion['FirstConversionYearMonth'] = first_conversion['FirstConversion'].apply\
(lambda date: date.year*100 + date.month)

first_conversion.head()
   CustomerID     FirstConversion  FirstConversionYearMonth
0     12346.0 2011-01-18 10:01:00                    201101
1     12347.0 2010-12-07 14:57:00                    201012
2     12348.0 2010-12-16 19:09:00                    201012
3     12349.0 2011-11-21 09:51:00                    201111
4     12350.0 2011-02-02 16:01:00                    201102
# Merge first_conversion with data_clean
df_customer = pd.merge(data_clean, first_conversion, on='CustomerID')
df_customer.head()
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country date lowercase_descriptions Revenue YearMonth FirstConversion FirstConversionYearMonth
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 2010-12-01 08:26:00 2.55 17850.0 United Kingdom 2010-12-01 white hanging heart t-light holder 15.30 201012 2010-12-01 08:26:00 201012
1 536365 71053 WHITE METAL LANTERN 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom 2010-12-01 white metal lantern 20.34 201012 2010-12-01 08:26:00 201012
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 2010-12-01 08:26:00 2.75 17850.0 United Kingdom 2010-12-01 cream cupid hearts coat hanger 22.00 201012 2010-12-01 08:26:00 201012
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom 2010-12-01 knitted union flag hot water bottle 20.34 201012 2010-12-01 08:26:00 201012
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom 2010-12-01 red woolly hottie white heart. 20.34 201012 2010-12-01 08:26:00 201012
# Label each record: 'New' in the first purchase month, 'Existing' afterwards
df_customer['customer_type'] = 'New'
df_customer.loc[df_customer['YearMonth']>df_customer['FirstConversionYearMonth'],'customer_type']='Existing'
# Inspect the rows labeled Existing
df_customer.loc[df_customer['customer_type']=='Existing']

df_customer_revenue = df_customer.groupby(['customer_type','YearMonth']).agg({'Revenue':'sum'}).reset_index()


sns.catplot(x="YearMonth", y="Revenue", data=df_customer_revenue, kind="point", aspect=2.5, \
            palette="YlGnBu",hue= "customer_type")

plt.ylabel('Revenue')
plt.xlabel('')
sns.despine(left=True, bottom=True)
plt.title('Monthly Revenue (New vs Existing)');
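If you prefer a table to a chart, the same numbers can be pivoted for inspection; a small sketch using the df_customer_revenue frame defined above:

# Optional (a sketch): monthly revenue as a table, one column per customer type
revenue_table = df_customer_revenue.pivot(index='YearMonth',
                                          columns='customer_type',
                                          values='Revenue')
print(revenue_table.head())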

Cohort Analysis:

In this project, we define a cohort as the group of customers who made their first online purchase in the same month. To translate this idea into a cohort analysis, we group records by 'CustomerID' and by the month of their first 'InvoiceDate'.

# create two variables: month of order and cohort
data_clean['order_month'] = data_clean['InvoiceDate'].dt.to_period('M')
data_clean['cohort'] = data_clean.groupby('CustomerID')['InvoiceDate'] \
                       .transform('min') \
                       .dt.to_period('M')
data_clean
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country date lowercase_descriptions Revenue YearMonth order_month cohort
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 2010-12-01 08:26:00 2.55 17850.0 United Kingdom 2010-12-01 white hanging heart t-light holder 15.30 201012 2010-12 2010-12
1 536365 71053 WHITE METAL LANTERN 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom 2010-12-01 white metal lantern 20.34 201012 2010-12 2010-12
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 2010-12-01 08:26:00 2.75 17850.0 United Kingdom 2010-12-01 cream cupid hearts coat hanger 22.00 201012 2010-12 2010-12
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom 2010-12-01 knitted union flag hot water bottle 20.34 201012 2010-12 2010-12
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom 2010-12-01 red woolly hottie white heart. 20.34 201012 2010-12 2010-12
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
541904 581587 22613 PACK OF 20 SPACEBOY NAPKINS 12 2011-12-09 12:50:00 0.85 12680.0 France 2011-12-09 pack of 20 spaceboy napkins 10.20 201112 2011-12 2011-08
541905 581587 22899 CHILDREN'S APRON DOLLY GIRL 6 2011-12-09 12:50:00 2.10 12680.0 France 2011-12-09 children's apron dolly girl 12.60 201112 2011-12 2011-08
541906 581587 23254 CHILDRENS CUTLERY DOLLY GIRL 4 2011-12-09 12:50:00 4.15 12680.0 France 2011-12-09 childrens cutlery dolly girl 16.60 201112 2011-12 2011-08
541907 581587 23255 CHILDRENS CUTLERY CIRCUS PARADE 4 2011-12-09 12:50:00 4.15 12680.0 France 2011-12-09 childrens cutlery circus parade 16.60 201112 2011-12 2011-08
541908 581587 22138 BAKING SET 9 PIECE RETROSPOT 3 2011-12-09 12:50:00 4.95 12680.0 France 2011-12-09 baking set 9 piece retrospot 14.85 201112 2011-12 2011-08

406223 rows × 14 columns

# add total number of customers for every time period. 
df_cohort = data_clean.groupby(['cohort', 'order_month']) \
              .agg(n_customers=('CustomerID', 'nunique')) \
              .reset_index(drop=False)

# add an indicator for periods (months since first purchase);
# subtracting two monthly Periods yields a month offset whose .n attribute
# is the integer number of months between them
df_cohort['period_number'] = (df_cohort.order_month - df_cohort.cohort).\
                              apply(attrgetter('n'))
df_cohort.head(5)
    cohort order_month  n_customers  period_number
0  2010-12     2010-12          948              0
1  2010-12     2011-01          362              1
2  2010-12     2011-02          317              2
3  2010-12     2011-03          367              3
4  2010-12     2011-04          341              4
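If the attrgetter trick feels opaque, an equivalent computation (a sketch) derives the month difference directly from the period components:

# Equivalent month arithmetic without attrgetter (a sketch)
df_cohort['period_number'] = (
    (df_cohort.order_month.dt.year - df_cohort.cohort.dt.year) * 12
    + (df_cohort.order_month.dt.month - df_cohort.cohort.dt.month)
)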
# pivot the data into a form of the matrix
cohort_pivot = df_cohort.pivot_table(index='cohort',
                                     columns = 'period_number',
                                     values = 'n_customers')
cohort_pivot
period_number      0      1      2      3      4      5      6      7      8      9     10     11     12
cohort
2010-12        948.0  362.0  317.0  367.0  341.0  376.0  360.0  336.0  336.0  374.0  354.0  474.0  260.0
2011-01        421.0  101.0  119.0  102.0  138.0  126.0  110.0  108.0  131.0  146.0  155.0   63.0    NaN
2011-02        380.0   94.0   73.0  106.0  102.0   94.0   97.0  107.0   98.0  119.0   34.0    NaN    NaN
2011-03        440.0   84.0  112.0   96.0  102.0   78.0  116.0  105.0  127.0   39.0    NaN    NaN    NaN
2011-04        299.0   68.0   66.0   63.0   62.0   71.0   69.0   78.0   25.0    NaN    NaN    NaN    NaN
2011-05        279.0   66.0   48.0   48.0   60.0   68.0   74.0   29.0    NaN    NaN    NaN    NaN    NaN
2011-06        235.0   49.0   44.0   64.0   58.0   79.0   24.0    NaN    NaN    NaN    NaN    NaN    NaN
2011-07        191.0   40.0   39.0   44.0   52.0   22.0    NaN    NaN    NaN    NaN    NaN    NaN    NaN
2011-08        167.0   42.0   42.0   42.0   23.0    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN
2011-09        298.0   89.0   97.0   36.0    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN
2011-10        352.0   93.0   46.0    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN
2011-11        321.0   43.0    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN
2011-12         41.0    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN
# divide by the cohort size (month 0) to obtain retention as %
cohort_size = cohort_pivot.iloc[:,0]
retention_matrix = cohort_pivot.divide(cohort_size, axis = 0)
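Before plotting, a quick aggregate view (a sketch): the mean retention per period across cohorts; the NaN cells from younger cohorts are skipped automatically by mean.

# Average retention per period across all cohorts (NaNs excluded by mean)
avg_retention = retention_matrix.mean(axis=0)
print(avg_retention.round(3))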
# plot the retention matrix
with sns.axes_style("white"):
    fig, ax = plt.subplots(1, 2, figsize=(12, 8), sharey=True, \
                           gridspec_kw={'width_ratios': [1, 11]})
    
    # retention matrix
    sns.heatmap(retention_matrix, 
                mask=retention_matrix.isnull(), 
                annot=True, 
                fmt='.0%', 
                cmap='RdYlGn', 
                ax=ax[1])
    ax[1].set_title('Monthly Cohorts: User Retention', fontsize=16)
    ax[1].set(xlabel='# of periods',
              ylabel='')

    
    # cohort size
    cohort_size_df = pd.DataFrame(cohort_size).rename(columns={0: 'cohort_size'})
    white_cmap = mcolors.ListedColormap(['white'])
    sns.heatmap(cohort_size_df, 
                annot=True, 
                cbar=False, 
                fmt='g', 
                cmap=white_cmap, 
                ax=ax[0])
    
    fig.tight_layout()