Customer clustering

This part of the pipeline clusters customers based on their purchase history, using the KMeans algorithm from the scikit-learn library. The number of clusters is determined by the user.

customerClustering.clusterRFM(orders_top25: DataFrame) → DataFrame

Takes the order dataset and creates RFM (recency, frequency, monetary) features per CustomerID. It then clusters the customers on these features, trying cluster counts between 3 and 10 and keeping the one with the best Silhouette score. Returns the customer dataset with assigned clusters.

Parameters:

orders_top25 (pd.DataFrame) – The DataFrame of orders from the top 25% of customers.

Returns:

The DataFrame including assigned clusters based on purchase behavior per CustomerID.

Return type:

pd.DataFrame
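
The following is a minimal sketch of the approach described above, not the actual implementation of clusterRFM. It assumes the input has the columns "OrderNumber", "OrderDate", "CustomerID" and "NetRevenue"; the function names build_rfm and cluster_rfm_sketch, the output column names, the feature scaling with StandardScaler, and the choice of the day after the last order as the recency reference date are all illustrative assumptions.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler


def build_rfm(orders: pd.DataFrame) -> pd.DataFrame:
    """Aggregate orders into one row of RFM features per CustomerID (hypothetical helper)."""
    # Assumption: recency is measured from the day after the latest order in the data.
    snapshot = orders["OrderDate"].max() + pd.Timedelta(days=1)
    rfm = orders.groupby("CustomerID").agg(
        LastOrder=("OrderDate", "max"),
        Frequency=("OrderNumber", "nunique"),
        Monetary=("NetRevenue", "sum"),
    ).reset_index()
    rfm["Recency"] = (snapshot - rfm["LastOrder"]).dt.days
    return rfm[["CustomerID", "Recency", "Frequency", "Monetary"]]


def cluster_rfm_sketch(orders: pd.DataFrame, k_range=range(3, 11), seed=0) -> pd.DataFrame:
    """Cluster customers on RFM features, keeping the k with the best Silhouette score."""
    rfm = build_rfm(orders)
    # Assumption: features are standardized so no single one dominates the distances.
    X = StandardScaler().fit_transform(rfm[["Recency", "Frequency", "Monetary"]])

    best_score, best_labels = -1.0, None
    for k in k_range:  # 3..10 inclusive, as described above
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        score = silhouette_score(X, labels)
        if score > best_score:
            best_score, best_labels = score, labels

    rfm["Cluster"] = best_labels
    return rfm
```

Note that the Silhouette score is only defined when there are more samples than clusters, so this sketch needs at least 11 distinct customers to evaluate k = 10.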

customerClustering.getTop25PercentCustomers(df: DataFrame) → DataFrame

Takes the complete dataset and creates a subset containing the top 25% most valuable customers, ranked by their share of the total NetRevenue.

Parameters:

df (pd.DataFrame) – The preprocessed DataFrame. Required columns are “OrderNumber”, “OrderDate”, “CustomerID”, “NetRevenue”.

Returns:

The subset DataFrame.

Return type:

pd.DataFrame
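
One way such a subset could be computed is sketched below. This is an illustration under assumptions, not the actual implementation: it reads "top 25%" as the top quarter of customers by count after ranking them by their share of total NetRevenue, and the function name top_quarter_customers_sketch and the fraction parameter are hypothetical.

```python
import math

import pandas as pd


def top_quarter_customers_sketch(df: pd.DataFrame, fraction: float = 0.25) -> pd.DataFrame:
    """Keep all orders of the highest-revenue `fraction` of customers (hypothetical helper)."""
    # Total revenue per customer and each customer's share of the overall total.
    revenue = df.groupby("CustomerID")["NetRevenue"].sum()
    share = revenue / revenue.sum()

    # Assumption: round up so that at least one customer is always kept.
    n_keep = max(1, math.ceil(len(share) * fraction))
    top_ids = share.nlargest(n_keep).index

    # Return the original order rows for the selected customers.
    return df[df["CustomerID"].isin(top_ids)].copy()
```

A different reading, where customers are added until they account for a given share of revenue, would use a cumulative sum over the sorted shares instead of a fixed count.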