E-commerce Data Analysis Part One - Data Wrangling & Monthly Revenue
10 Nov 2020

Introduction:
In this e-commerce data series, we will analyze sales data from a UK online retailer. Part one focuses mainly on data wrangling, then moves on to calculating the monthly revenue growth rate.
Table of Contents:
- Exploratory Data Analysis
- Data Cleaning
- Monthly Revenue Growth
Prepare for Take-Off
# Setting up the environment
import pandas as pd
import numpy as np
import datetime as dt
import matplotlib.pyplot as plt
from operator import attrgetter
import seaborn as sns
import matplotlib.colors as mcolors
Exploratory Data Analysis
# Import Data
data = pd.read_csv("OnlineRetail.csv",engine="python")
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
InvoiceNo 541909 non-null object
StockCode 541909 non-null object
Description 540455 non-null object
Quantity 541909 non-null int64
InvoiceDate 541909 non-null object
UnitPrice 541909 non-null float64
CustomerID 406829 non-null float64
Country 541909 non-null object
dtypes: float64(2), int64(1), object(5)
memory usage: 33.1+ MB
It’s better to change the $InvoiceDate$ column into the $datetime64[ns]$ type for the later analysis of monthly revenue growth. In this project, we will ignore the hours and minutes within $InvoiceDate$; only the day, month, and year will be stored in the dataframe.
# Change InvoiceDate data type
data['InvoiceDate'] = pd.to_datetime(data['InvoiceDate'])
# Normalize InvoiceDate to midnight, keeping only the date part
data['date']=data['InvoiceDate'].dt.normalize()
# Check Data Type
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 9 columns):
InvoiceNo 541909 non-null object
StockCode 541909 non-null object
Description 540455 non-null object
Quantity 541909 non-null int64
InvoiceDate 541909 non-null datetime64[ns]
UnitPrice 541909 non-null float64
CustomerID 406829 non-null float64
Country 541909 non-null object
date 541909 non-null datetime64[ns]
dtypes: datetime64[ns](2), float64(2), int64(1), object(4)
memory usage: 37.2+ MB
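As a side note, when the raw timestamp layout is known up front, passing an explicit format to pd.to_datetime avoids per-row format inference and is noticeably faster on half a million rows. A minimal sketch, assuming timestamps like 12/1/2010 8:26 (the format string below is an assumption, not confirmed from the file):
# Sketch: explicit-format parsing; the '%m/%d/%Y %H:%M' layout is assumed.
raw = pd.read_csv("OnlineRetail.csv", engine="python")
raw['InvoiceDate'] = pd.to_datetime(raw['InvoiceDate'], format='%m/%d/%Y %H:%M')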
# Summarize the null values in the dataframe.
print(data.isnull().sum())
# Missing values percentage
missing_percentage = data.isnull().sum() / data.shape[0] * 100
missing_percentage
InvoiceNo 0
StockCode 0
Description 1454
Quantity 0
InvoiceDate 0
UnitPrice 0
CustomerID 135080
Country 0
date 0
dtype: int64
InvoiceNo 0.000000
StockCode 0.000000
Description 0.268311
Quantity 0.000000
InvoiceDate 0.000000
UnitPrice 0.000000
CustomerID 24.926694
Country 0.000000
date 0.000000
dtype: float64
Around 25% of the rows in the $CustomerID$ column are missing values, which raises questions about how this business implemented its digital tracking system. We would normally investigate further, but in this case we have no visibility into those records. Let’s continue exploring the missing values in the $Description$ and $UnitPrice$ columns.
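Before moving on, one quick way to look for structure in that missingness is to check which countries the CustomerID-less rows come from; a small sketch (output not shown):
# Sketch: where do the rows with a missing CustomerID originate?
data[data.CustomerID.isnull()].Country.value_counts().head()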
data[data.Description.isnull()].head()
| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | date |
|---|---|---|---|---|---|---|---|---|---|
| 622 | 536414 | 22139 | NaN | 56 | 2010-12-01 11:52:00 | 0.0 | NaN | United Kingdom | 2010-12-01 |
| 1970 | 536545 | 21134 | NaN | 1 | 2010-12-01 14:32:00 | 0.0 | NaN | United Kingdom | 2010-12-01 |
| 1971 | 536546 | 22145 | NaN | 1 | 2010-12-01 14:33:00 | 0.0 | NaN | United Kingdom | 2010-12-01 |
| 1972 | 536547 | 37509 | NaN | 1 | 2010-12-01 14:33:00 | 0.0 | NaN | United Kingdom | 2010-12-01 |
| 1987 | 536549 | 85226A | NaN | 1 | 2010-12-01 14:34:00 | 0.0 | NaN | United Kingdom | 2010-12-01 |
# How often is a record without a description also missing its CustomerID and UnitPrice?
data[data.Description.isnull()].CustomerID.isnull().value_counts()
True 1454
Name: CustomerID, dtype: int64
data[data.Description.isnull()].UnitPrice.isnull().value_counts()
False 1454
Name: UnitPrice, dtype: int64
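The head above also hints that the description-less rows carry a zero unit price; a one-line check (a sketch, output not shown):
# Sketch: are the description-less rows all zero-priced?
(data[data.Description.isnull()].UnitPrice == 0).value_counts()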
data[data.CustomerID.isnull()].head()
| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | date |
|---|---|---|---|---|---|---|---|---|---|
| 622 | 536414 | 22139 | NaN | 56 | 2010-12-01 11:52:00 | 0.00 | NaN | United Kingdom | 2010-12-01 |
| 1443 | 536544 | 21773 | DECORATIVE ROSE BATHROOM BOTTLE | 1 | 2010-12-01 14:32:00 | 2.51 | NaN | United Kingdom | 2010-12-01 |
| 1444 | 536544 | 21774 | DECORATIVE CATS BATHROOM BOTTLE | 2 | 2010-12-01 14:32:00 | 2.51 | NaN | United Kingdom | 2010-12-01 |
| 1445 | 536544 | 21786 | POLKADOT RAIN HAT | 4 | 2010-12-01 14:32:00 | 0.85 | NaN | United Kingdom | 2010-12-01 |
| 1446 | 536544 | 21787 | RAIN PONCHO RETROSPOT | 2 | 2010-12-01 14:32:00 | 1.66 | NaN | United Kingdom | 2010-12-01 |
data.loc[data.CustomerID.isnull(), ["UnitPrice", "Quantity"]].describe()
| | UnitPrice | Quantity |
|---|---|---|
| count | 135080.000000 | 135080.000000 |
| mean | 8.076577 | 1.995573 |
| std | 151.900816 | 66.696153 |
| min | -11062.060000 | -9600.000000 |
| 25% | 1.630000 | 1.000000 |
| 50% | 3.290000 | 1.000000 |
| 75% | 5.450000 | 3.000000 |
| max | 17836.460000 | 5568.000000 |
It is fair to say the $UnitPrice$ and $Quantity$ columns show extreme outliers. Since we need historical purchase prices and quantities to forecast future sales, these outliers can be disruptive, so let us drop the null values. Besides dropping all the missing values in the $Description$ column, what about hidden missing values, such as the string ‘nan’ instead of a true NaN? In the following steps we will also drop any rows containing ‘nan’ or empty strings ‘’.
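For illustration, a rough 1.5*IQR fence makes the “extreme outliers” claim concrete; a sketch over the CustomerID-less rows (we drop these rows rather than trim them):
# Sketch: count UnitPrice values outside the classic 1.5*IQR fences
sub = data.loc[data.CustomerID.isnull(), 'UnitPrice']
q1, q3 = sub.quantile([0.25, 0.75])
iqr = q3 - q1
n_out = ((sub < q1 - 1.5 * iqr) | (sub > q3 + 1.5 * iqr)).sum()
print(f"{n_out} of {len(sub)} rows fall outside the IQR fences")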
Data Cleaning - Dropping Rows With Missing Values
# Can we find any hidden null values, i.e. "nan" strings, in Description?
data.loc[data.Description.isnull()==False, "lowercase_descriptions"] = data.loc[
data.Description.isnull()==False,"Description"].apply(lambda l: l.lower())
data.lowercase_descriptions.dropna().apply(
lambda l: np.where("nan" in l, True, False)).value_counts()
False 539724
True 731
Name: lowercase_descriptions, dtype: int64
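Before converting them, it is worth eyeballing a few of the 731 flagged rows; note that this substring test would also flag a legitimate description that merely contains “nan”. A quick peek (a sketch, output not shown):
# Sketch: inspect rows whose lowercased description contains "nan"
data[data.lowercase_descriptions.str.contains("nan", na=False)].head()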
# How about "" in Description?
data.lowercase_descriptions.dropna().apply(
lambda l: np.where("" == l, True, False)).value_counts()
False 540455
Name: lowercase_descriptions, dtype: int64
# Transform "nan" toward "NaN"
data.loc[data.lowercase_descriptions.isnull()==False, "lowercase_descriptions"] = data.loc[
data.lowercase_descriptions.isnull()==False, "lowercase_descriptions"
].apply(lambda l: np.where("nan" in l, None, l))
# Verified all the 'nan' changed into 'Nan'
data.lowercase_descriptions.dropna().apply(lambda l: np.where("nan" in l, True, False)).value_counts()
False 539724
Name: lowercase_descriptions, dtype: int64
# Preview the data after filtering out null and hidden-null rows
data.loc[(data.CustomerID.isnull()==False)&(data.lowercase_descriptions.isnull()==False)].info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 406223 entries, 0 to 541908
Data columns (total 10 columns):
InvoiceNo 406223 non-null object
StockCode 406223 non-null object
Description 406223 non-null object
Quantity 406223 non-null int64
InvoiceDate 406223 non-null datetime64[ns]
UnitPrice 406223 non-null float64
CustomerID 406223 non-null float64
Country 406223 non-null object
date 406223 non-null datetime64[ns]
lowercase_descriptions 406223 non-null object
dtypes: datetime64[ns](2), float64(2), int64(1), object(5)
memory usage: 34.1+ MB
# data without any null values
data_clean = data.loc[(data.CustomerID.isnull()==False)&(data.lowercase_descriptions.isnull()==False)].copy()
data_clean.isnull().sum().sum()
0
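As a final sanity check (a small sketch), we can assert that the clean frame contains neither true NaNs nor leftover "nan" strings:
# Sketch: the clean frame should have no NaNs and no hidden "nan" strings
assert data_clean.isnull().sum().sum() == 0
assert not data_clean.lowercase_descriptions.str.contains("nan").any()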
Monthly Revenue Growth
# Create Revenue Column
data_clean['Revenue'] = data_clean['UnitPrice'] * data_clean['Quantity']
# Create YearMonth Column
data_clean['YearMonth'] = data_clean['InvoiceDate'].apply(lambda date: date.year*100+date.month)
data_clean.head(n=5)
| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | date | lowercase_descriptions | Revenue | YearMonth |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850.0 | United Kingdom | 2010-12-01 | white hanging heart t-light holder | 15.30 | 201012 |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 2010-12-01 | white metal lantern | 20.34 | 201012 |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850.0 | United Kingdom | 2010-12-01 | cream cupid hearts coat hanger | 22.00 | 201012 |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 2010-12-01 | knitted union flag hot water bottle | 20.34 | 201012 |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 2010-12-01 | red woolly hottie white heart. | 20.34 | 201012 |
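As an aside, pandas’ Period type offers an equivalent monthly key to the integer $YearMonth$; a small sketch (the $InvoiceMonth$ column name here is just for illustration):
# Sketch: a monthly Period key, e.g. Period('2010-12', 'M'), instead of
# the integer 201012; both group rows by calendar month identically.
data_clean['InvoiceMonth'] = data_clean['InvoiceDate'].dt.to_period('M')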
# Monthly revenue using a pivot table
df_monthly = pd.pivot_table(data_clean, index=['YearMonth'], values="Revenue", aggfunc=[np.sum])
print(df_monthly)
sum
Revenue
YearMonth
201012 554314.170
201101 474834.980
201102 435859.050
201103 578790.310
201104 425132.101
201105 647536.030
201106 607671.010
201107 573614.931
201108 615532.500
201109 930559.562
201110 973348.740
201111 1130196.310
201112 342039.000
# Group by the 'YearMonth' column
df_monthly_more = data_clean.groupby('YearMonth').agg({'CustomerID':lambda x: x.nunique(), 'Quantity': 'sum',\
'Revenue': 'sum',}).reset_index()
print(df_monthly_more)
YearMonth CustomerID Quantity Revenue
0 201012 948.0 296153 554314.170
1 201101 783.0 269219 474834.980
2 201102 798.0 262381 435859.050
3 201103 1020.0 343284 578790.310
4 201104 899.0 278050 425132.101
5 201105 1079.0 367411 647536.030
6 201106 1051.0 356715 607671.010
7 201107 993.0 363027 573614.931
8 201108 980.0 386080 615532.500
9 201109 1302.0 536963 930559.562
10 201110 1425.0 568919 973348.740
11 201111 1711.0 668568 1130196.310
12 201112 685.0 203554 342039.000
# Create MonthlyGrowth Column:
df_monthly_more['MonthlyGrowth'] = df_monthly_more['Revenue'].pct_change()
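Under the hood, pct_change computes growth as $(Revenue_t - Revenue_{t-1}) / Revenue_{t-1}$; a quick manual verification (a sketch):
# Sketch: verify pct_change by hand, e.g. Jan 2011:
# (474834.98 - 554314.17) / 554314.17 ≈ -0.1434
manual = df_monthly_more['Revenue'].diff() / df_monthly_more['Revenue'].shift(1)
assert np.allclose(manual.dropna(), df_monthly_more['MonthlyGrowth'].dropna())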
sns.set_style("whitegrid")
sns.catplot(x="YearMonth", y="Revenue", data=df_monthly_more, kind="point", aspect=2.5, color="#95a5a6")
plt.ylabel('Revenue')
plt.xlabel('')
sns.despine(left=True, bottom=True)
df_monthly_more.style.format({"CustomerID": "{:.0f}", \
                              "Quantity": "{:.0f}",\
                              "Revenue": "${:.0f}",\
                              "MonthlyGrowth": "{:.2f}"})\
    .bar(subset=["Revenue"], color='lightgreen')\
    .bar(subset=["CustomerID"], color='#FFA07A')
| | YearMonth | CustomerID | Quantity | Revenue | MonthlyGrowth |
|---|---|---|---|---|---|
| 0 | 201012 | 948 | 296153 | $554314 | nan |
| 1 | 201101 | 783 | 269219 | $474835 | -0.14 |
| 2 | 201102 | 798 | 262381 | $435859 | -0.08 |
| 3 | 201103 | 1020 | 343284 | $578790 | 0.33 |
| 4 | 201104 | 899 | 278050 | $425132 | -0.27 |
| 5 | 201105 | 1079 | 367411 | $647536 | 0.52 |
| 6 | 201106 | 1051 | 356715 | $607671 | -0.06 |
| 7 | 201107 | 993 | 363027 | $573615 | -0.06 |
| 8 | 201108 | 980 | 386080 | $615533 | 0.07 |
| 9 | 201109 | 1302 | 536963 | $930560 | 0.51 |
| 10 | 201110 | 1425 | 568919 | $973349 | 0.05 |
| 11 | 201111 | 1711 | 668568 | $1130196 | 0.16 |
| 12 | 201112 | 685 | 203554 | $342039 | -0.70 |
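To pull the standout months out programmatically (a small sketch):
# Sketch: months with more than 50% month-over-month revenue growth
df_monthly_more.loc[df_monthly_more['MonthlyGrowth'] > 0.5, ['YearMonth', 'MonthlyGrowth']]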
Conclusion:
The monthly revenue fluctuates over time, and we see a sharp decline in Dec 2011. Since we do not have a full month of data for December 2011, its revenue is much lower than in previous months. The line graph also shows a drop in the spring of 2011. These fluctuations may be caused by credit card debt accumulated during the previous Christmas season; customers may also need to file their taxes in spring. In general, e-commerce revenue is booming for this retailer: revenue trends upward from 2010 to 2011, with a major breakthrough of 51% month-over-month growth in September.
The next question we need to ask: how many customers decide to come back and purchase from this retailer again?
In the next post, we will look into customer cohorts and retention rates.