Customer clustering
This part of the pipeline clusters customers based on their purchase history, using the KMeans algorithm from the scikit-learn library. The number of clusters is determined by the user.
- customerClustering.clusterRFM(orders_top25: DataFrame) → DataFrame
Takes the order dataset and creates RFM (recency, frequency, monetary) features per customer ID. It then clusters the customers on these features, choosing the number of clusters between 3 and 10 that gives the best Silhouette score. Returns the customer dataset with assigned clusters.
- Parameters:
orders_top25 (pd.DataFrame) – The DataFrame containing the orders of the top 25% of customers.
- Returns:
The DataFrame including assigned clusters based on purchase behavior per CustomerID.
- Return type:
pd.DataFrame
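
The description of clusterRFM can be illustrated with a minimal sketch. This is not the module's actual implementation: the helper names build_rfm and cluster_rfm_sketch, the output columns "Recency", "Frequency", "Monetary" and "Cluster", the feature scaling step, and the choice of reference date for recency are all assumptions made for illustration. Only the input column names, the 3 to 10 range, and the use of KMeans with the Silhouette score come from the documentation.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler


def build_rfm(orders: pd.DataFrame) -> pd.DataFrame:
    """Aggregate orders into one row of RFM features per CustomerID."""
    # Assumption: recency is counted in days from the day after the last order.
    snapshot = orders["OrderDate"].max() + pd.Timedelta(days=1)
    rfm = orders.groupby("CustomerID").agg(
        Recency=("OrderDate", lambda d: (snapshot - d.max()).days),
        Frequency=("OrderNumber", "nunique"),
        Monetary=("NetRevenue", "sum"),
    )
    return rfm.reset_index()


def cluster_rfm_sketch(orders: pd.DataFrame, k_range=range(3, 11), seed: int = 0) -> pd.DataFrame:
    """Fit KMeans for each k in k_range and keep the labels with the highest Silhouette score."""
    rfm = build_rfm(orders)
    # Assumption: features are standardized so no single RFM dimension dominates.
    X = StandardScaler().fit_transform(rfm[["Recency", "Frequency", "Monetary"]])

    best_score, best_labels = -1.0, None
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        score = silhouette_score(X, labels)
        if score > best_score:
            best_score, best_labels = score, labels

    rfm["Cluster"] = best_labels
    return rfm


# Synthetic orders, generated only to exercise the sketch.
rng = np.random.default_rng(0)
n_orders = 300
orders = pd.DataFrame({
    "OrderNumber": np.arange(n_orders),
    "OrderDate": pd.Timestamp("2023-01-01")
    + pd.to_timedelta(rng.integers(0, 365, n_orders), unit="D"),
    "CustomerID": rng.integers(0, 60, n_orders),
    "NetRevenue": rng.gamma(2.0, 50.0, n_orders),
})
result = cluster_rfm_sketch(orders)
```

The Silhouette score needs at least two clusters and fewer clusters than samples, so a sketch like this only works when the input contains more than ten distinct customers.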
- customerClustering.getTop25PercentCustomers(df: DataFrame) → DataFrame
Takes the complete dataset and creates a subset containing the top 25% most valuable customers, based on their share of the total NetRevenue.
- Parameters:
df (pd.DataFrame) – The preprocessed DataFrame. Required columns are “OrderNumber”, “OrderDate”, “CustomerID”, “NetRevenue”.
- Returns:
The subset DataFrame.
- Return type:
pd.DataFrame
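
One plausible reading of getTop25PercentCustomers is sketched below. It assumes that customers are ranked by their summed NetRevenue, that the top quarter of customers (rounded up) is kept, and that all order rows of those customers are returned. The function name top_quarter_customers and the share parameter are hypothetical, and the real function may define "top 25%" differently, for example by cumulative revenue share.

```python
import math

import pandas as pd


def top_quarter_customers(df: pd.DataFrame, share: float = 0.25) -> pd.DataFrame:
    """Return all orders of the customers in the top `share` by total NetRevenue."""
    # Total revenue per customer, highest first.
    revenue = (
        df.groupby("CustomerID")["NetRevenue"].sum().sort_values(ascending=False)
    )
    # Assumption: round up and always keep at least one customer.
    n_keep = max(1, math.ceil(len(revenue) * share))
    top_ids = revenue.index[:n_keep]
    return df[df["CustomerID"].isin(top_ids)].copy()


# Small made-up dataset with one order per customer.
df = pd.DataFrame({
    "OrderNumber": range(1, 9),
    "OrderDate": pd.to_datetime(["2023-01-01"] * 8),
    "CustomerID": list("ABCDEFGH"),
    "NetRevenue": [10, 20, 30, 40, 50, 60, 70, 80],
})
subset = top_quarter_customers(df)
```

The returned subset keeps the original order-level columns, so it could be passed on to a function that expects order rows, such as clusterRFM.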