Customer clustering

This part of the pipeline clusters customers based on their purchase history, using the KMeans algorithm from the scikit-learn library. The number of clusters is determined by the user.

customerClustering.clusterRFM(orders_top25: DataFrame) → DataFrame

Takes the order dataset and creates RFM (recency, frequency, monetary) features per customer ID. It then clusters the customers on these features, choosing the optimal number of clusters between 3 and 10 by Silhouette score. Returns the customer dataset with assigned clusters.

Parameters: orders_top25 (pd.DataFrame) – The orders DataFrame restricted to the top 25 percent of customers.
Returns: The DataFrame including assigned clusters based on purchase behavior per CustomerID.
Return type: pd.DataFrame
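
The RFM-then-cluster procedure described for clusterRFM can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: the function name cluster_rfm_sketch, the output column names (Recency, Frequency, Monetary, Cluster), the use of the latest order date as the recency reference, and the standardization step are all assumptions. The input column names are taken from the columns listed for getTop25PercentCustomers.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler


def cluster_rfm_sketch(orders: pd.DataFrame, k_min: int = 3, k_max: int = 10) -> pd.DataFrame:
    """Hypothetical sketch of RFM clustering; not the clusterRFM source."""
    # Assumption: recency is counted in days back from the latest order date.
    reference_date = orders["OrderDate"].max()
    rfm = orders.groupby("CustomerID").agg(
        Recency=("OrderDate", lambda d: (reference_date - d.max()).days),
        Frequency=("OrderNumber", "nunique"),
        Monetary=("NetRevenue", "sum"),
    )

    # Assumption: features are standardized so no single one dominates the distance.
    features = StandardScaler().fit_transform(rfm)

    # Try every k in [k_min, k_max] and keep the labels with the best Silhouette score.
    best_score, best_labels = -1.0, None
    for k in range(k_min, k_max + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(features)
        score = silhouette_score(features, labels)
        if score > best_score:
            best_score, best_labels = score, labels

    rfm["Cluster"] = best_labels
    return rfm.reset_index()
```

The sketch needs more customers than k_max, since the Silhouette score is only defined when there are fewer clusters than samples.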

customerClustering.getTop25PercentCustomers(df: DataFrame) → DataFrame

Takes the complete dataset and creates a subset containing the top 25% most valuable customers, ranked by their share of the total NetRevenue.

Parameters: df (pd.DataFrame) – The preprocessed DataFrame. Required columns are “OrderNumber”, “OrderDate”, “CustomerID”, “NetRevenue”.
Returns: The subset DataFrame.
Return type: pd.DataFrame
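
One plausible reading of the top-25% selection is sketched below. It is an illustration only: the function name top_25_percent_sketch and the fraction parameter are hypothetical, and it assumes that "top 25%" means the top quarter of customers by count (rounded up) after ranking them by revenue share, and that the returned subset keeps all order rows of those customers.

```python
import math

import pandas as pd


def top_25_percent_sketch(df: pd.DataFrame, fraction: float = 0.25) -> pd.DataFrame:
    """Hypothetical sketch; not the getTop25PercentCustomers source."""
    # Revenue per customer, expressed as a share of the overall NetRevenue.
    revenue_share = df.groupby("CustomerID")["NetRevenue"].sum() / df["NetRevenue"].sum()

    # Assumption: keep the top quarter of customers by count, rounded up.
    n_keep = math.ceil(len(revenue_share) * fraction)
    top_customers = revenue_share.nlargest(n_keep).index

    # Keep every order row that belongs to one of the selected customers.
    return df[df["CustomerID"].isin(top_customers)].copy()
```

A different reading, such as keeping the customers who together account for a fixed share of revenue, would need a cumulative sum over the sorted shares instead of a simple count.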