Preprocessing

The preprocessing.py file contains the main preprocessing logic: loading the initial data provided by AAG, merging it, cleaning it, adding basic extra features, and saving the result as a final cleaned Parquet file.

preprocessing.addFeatures(df: DataFrame) → DataFrame

Adds new features to the DataFrame.

Parameters:

df (pd.DataFrame) – The DataFrame to process.

Returns:

The DataFrame with the new features.

Return type:

pd.DataFrame

preprocessing.blendPostalCodes(df: DataFrame, plz_path: str, nomi) → DataFrame

Blends the PostalCode column with external postal-code data to correct inconsistencies. Source: https://www.suche-postleitzahl.org/

Parameters:
  • df (pd.DataFrame) – The DataFrame to blend.

  • plz_path (str) – The path to the external PostalCodes data.

Returns:

The blended DataFrame.

Return type:

pd.DataFrame
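
As a rough illustration of what blending for blendPostalCodes could look like, the sketch below left-joins a reference table onto the data and lets the reference values override inconsistent entries. It is not the module's implementation. The column names plz, ort, and bundesland for the external table, and City and State for the DataFrame, are assumptions, as is the CSV format. The nomi argument is not used here because the source does not describe it.

```python
import io

import pandas as pd


def blend_postal_codes_sketch(df: pd.DataFrame, plz_table: pd.DataFrame) -> pd.DataFrame:
    """Override City/State with reference values wherever the postal code is known."""
    # Keep one reference row per postal code and rename to avoid column clashes.
    ref = plz_table.drop_duplicates("plz").rename(
        columns={"plz": "PostalCode", "ort": "RefCity", "bundesland": "RefState"}
    )
    out = df.merge(ref, on="PostalCode", how="left")
    # Prefer the reference value; fall back to the original when there is no match.
    out["City"] = out["RefCity"].fillna(out["City"])
    out["State"] = out["RefState"].fillna(out["State"])
    return out.drop(columns=["RefCity", "RefState"])


# Hypothetical reference data; read postal codes as strings to keep leading zeros.
csv_text = "plz,ort,bundesland\n01067,Dresden,Sachsen\n10115,Berlin,Berlin\n"
plz_table = pd.read_csv(io.StringIO(csv_text), dtype={"plz": str})

df = pd.DataFrame(
    {
        "PostalCode": ["01067", "10115", "99999"],
        "City": ["Dresdn", "Berlin", "Nowhere"],
        "State": [None, "Berlin", "Unknown"],
    }
)
blended = blend_postal_codes_sketch(df, plz_table)
```

Rows whose postal code is missing from the reference table keep their original values, so the join never discards data.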

preprocessing.finalCleaning(df: DataFrame) → DataFrame

Cleans the DataFrame by filling in missing states and dropping unnecessary columns.

Parameters:

df (pd.DataFrame) – The DataFrame to clean.

Returns:

The cleaned DataFrame.

Return type:

pd.DataFrame
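
One way the two steps of finalCleaning might be realised is sketched below: missing states are filled from other rows that share the same postal code, then helper columns are dropped. How the module actually fills states is not documented here, so this strategy is an assumption. The State column name and the tmp column are likewise hypothetical.

```python
import pandas as pd


def final_cleaning_sketch(df: pd.DataFrame, drop_cols: list[str]) -> pd.DataFrame:
    """Fill missing State values per PostalCode, then drop the given columns."""
    # Lookup: postal code -> first known state for that postal code.
    known = (
        df.dropna(subset=["State"])
        .drop_duplicates("PostalCode")
        .set_index("PostalCode")["State"]
    )
    filled = df["State"].fillna(df["PostalCode"].map(known))
    # errors="ignore" tolerates columns that are already absent.
    return df.assign(State=filled).drop(columns=drop_cols, errors="ignore")


df = pd.DataFrame(
    {
        "PostalCode": ["10115", "10115", "80331"],
        "State": ["Berlin", None, "Bayern"],
        "tmp": [1, 2, 3],
    }
)
cleaned = final_cleaning_sketch(df, drop_cols=["tmp"])
```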

preprocessing.initialCleaning(df: DataFrame) → DataFrame

Cleans the DataFrame by renaming columns, dropping rows with missing values, filtering PostalCode, converting OrderDate to datetime, and adding a Season column.

Parameters:

df (pd.DataFrame) – The DataFrame to clean.

Returns:

The cleaned DataFrame.

Return type:

pd.DataFrame
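
The steps listed for initialCleaning can be sketched roughly as follows. This is a minimal illustration and not the module's code. The original column names PLZ and Bestelldatum, the five-digit postal-code filter, and the meteorological season boundaries are all assumptions.

```python
import pandas as pd


def season_of(month: int) -> str:
    """Map a month number to a meteorological season (assumed definition)."""
    if month in (12, 1, 2):
        return "Winter"
    if month in (3, 4, 5):
        return "Spring"
    if month in (6, 7, 8):
        return "Summer"
    return "Autumn"


def initial_cleaning_sketch(df: pd.DataFrame) -> pd.DataFrame:
    # 1. Rename columns (hypothetical source names).
    df = df.rename(columns={"PLZ": "PostalCode", "Bestelldatum": "OrderDate"})
    # 2. Drop rows with missing values in the columns used below.
    df = df.dropna(subset=["PostalCode", "OrderDate"])
    # 3. Keep only postal codes that consist of exactly five digits.
    df = df[df["PostalCode"].astype(str).str.fullmatch(r"\d{5}")]
    # 4. Convert OrderDate to datetime.
    df = df.assign(OrderDate=pd.to_datetime(df["OrderDate"]))
    # 5. Derive the Season column from the order month.
    df = df.assign(Season=df["OrderDate"].dt.month.map(season_of))
    return df.reset_index(drop=True)


raw = pd.DataFrame(
    {
        "PLZ": ["10115", "1234", None, "80331"],
        "Bestelldatum": ["2023-01-15", "2023-07-01", "2023-03-03", "2023-10-20"],
    }
)
cleaned = initial_cleaning_sketch(raw)
```

In this example the row with the four-digit code and the row with the missing code are removed, leaving two rows.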

preprocessing.mergeOnKunden(df_rechnung: DataFrame, df_kunden: DataFrame) → DataFrame

Merges the Rechnungen and Kunden DataFrames on the Kunde_Verkauf_SK column.

Parameters:
  • df_rechnung (pd.DataFrame) – The Rechnungen DataFrame.

  • df_kunden (pd.DataFrame) – The Kunden DataFrame.

Returns:

The merged DataFrame.

Return type:

pd.DataFrame
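
A merge like the one mergeOnKunden describes might look like the sketch below. Only the key column Kunde_Verkauf_SK comes from the description above. The left join and the Betrag and Ort columns are assumptions made for illustration.

```python
import pandas as pd


def merge_on_kunden_sketch(df_rechnung: pd.DataFrame, df_kunden: pd.DataFrame) -> pd.DataFrame:
    """Attach customer attributes to each invoice row via Kunde_Verkauf_SK."""
    # A left join keeps every invoice, even if no customer record matches (assumption).
    return df_rechnung.merge(df_kunden, on="Kunde_Verkauf_SK", how="left")


df_rechnung = pd.DataFrame({"Kunde_Verkauf_SK": [1, 2, 2], "Betrag": [10.0, 20.0, 5.0]})
df_kunden = pd.DataFrame({"Kunde_Verkauf_SK": [1, 2], "Ort": ["Berlin", "Hamburg"]})
merged = merge_on_kunden_sketch(df_rechnung, df_kunden)
```

With how="left", the result has one row per invoice. An inner join would instead silently drop invoices without a matching customer.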