Helpers

The helpers.py file contains helper functions that are reused across the pipeline: loading data, saving data, and other utilities. Keeping these functions in a separate module makes the main pipeline code cleaner and easier to read.

helpers.convertDataToParquet(df: DataFrame, name: str) → None

Converts the given DataFrame to a parquet file with the given name.

Parameters:
  • df (pd.DataFrame) – The DataFrame to convert.

  • name (str) – The name of the parquet file.

helpers.fillMissingStates(row: Series) → str

Fills in the missing state for a single DataFrame row based on its PostalCode.

Parameters:

row (pd.Series) – The row to process.

Returns:

The state name.

Return type:

str
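
The row-wise fill described above could look roughly like the sketch below. It is not the actual implementation of fillMissingStates: the "State" column name, the POSTAL_CODE_TO_STATE table, the "Unknown" fallback, and both function names are assumptions made for illustration. Only the PostalCode column is taken from the description.

```python
import math

# Hypothetical lookup table; where the real pipeline gets its
# postal-code-to-state data is not documented here.
POSTAL_CODE_TO_STATE = {10115: "Berlin", 20095: "Hamburg", 80331: "Bayern"}


def is_missing(value) -> bool:
    """Return True for None, NaN, or an empty/whitespace-only string."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and not value.strip()


def fill_missing_state(row) -> str:
    """Keep an existing state; otherwise look it up via the row's PostalCode.

    `row` only needs to support row["..."] access, so a pd.Series or a
    plain dict both work. "State" is an assumed column name.
    """
    state = row["State"]
    if not is_missing(state):
        return state
    return POSTAL_CODE_TO_STATE.get(int(row["PostalCode"]), "Unknown")
```

With pandas, a function of this shape could be applied row by row, for example via df.apply(fill_missing_state, axis=1), and the result assigned back to the state column.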

helpers.findNearestState(postal_code)

Finds the nearest state based on the given postal code.

Parameters:

postal_code (int) – The postal code to find the nearest state for.

Returns:

The state name.

Return type:

str
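
One way to read "nearest" is "the known postal code that is numerically closest to the given one". The sketch below follows that reading. It is an assumption, not the documented behaviour of findNearestState, and the KNOWN_POSTAL_CODES table and the function name are hypothetical. Numeric closeness is only a rough proxy for geographic closeness.

```python
# Hypothetical reference table of postal codes with known states.
KNOWN_POSTAL_CODES = {
    10115: "Berlin",
    20095: "Hamburg",
    50667: "Nordrhein-Westfalen",
    80331: "Bayern",
}


def find_nearest_state(postal_code: int) -> str:
    """Return the state of the numerically closest known postal code."""
    nearest = min(KNOWN_POSTAL_CODES, key=lambda known: abs(known - postal_code))
    return KNOWN_POSTAL_CODES[nearest]
```

On a tie, min() returns the first candidate in the table's insertion order, so a real implementation may want an explicit tie-breaking rule.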

helpers.getClosestMatch(row: Series, postal_code_to_cities: dict) → str

Returns the closest match for the row's city among the cities listed for its postal code, using fuzzy string matching.

Parameters:
  • row (pd.Series) – The row to process.

  • postal_code_to_cities (dict) – The dictionary containing postal codes as keys and a list of possible cities as values.

Returns:

The closest match for the city.

Return type:

str
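
A minimal sketch of this kind of fuzzy lookup is shown below, using difflib.get_close_matches from the Python standard library. Which fuzzy-matching library helpers.py actually uses is not stated here, and the "City" and "PostalCode" column names, the 0.6 cutoff, the function name, and the fallback to the original city are all assumptions.

```python
from difflib import get_close_matches


def get_closest_match(row, postal_code_to_cities: dict) -> str:
    """Return the candidate city most similar to the row's city.

    `row` only needs row["..."] access (pd.Series or dict). If the postal
    code is unknown or no candidate is similar enough, the original city
    is returned unchanged.
    """
    city = row["City"]
    candidates = postal_code_to_cities.get(row["PostalCode"], [])
    # n=1 keeps only the best match; cutoff is the minimum similarity (0..1).
    matches = get_close_matches(city, candidates, n=1, cutoff=0.6)
    return matches[0] if matches else city
```

Note that get_close_matches is case-sensitive, so normalising case on both sides before comparing may improve the results.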

helpers.loadInitialData(rechnung_path: str, kunden_path: str) → tuple

Loads the initial data from AAG and returns one DataFrame each for Rechnungen (invoices) and Kunden (customers).

Parameters:
  • rechnung_path (str) – The path to the Rechnungen data.

  • kunden_path (str) – The path to the Kunden data.

Returns:

A tuple containing the Rechnungen and Kunden DataFrames.

Return type:

tuple

helpers.loadParquetFile(path: str) → DataFrame

Loads a parquet file from the given path and returns it as a pandas DataFrame.

Parameters:

path (str) – The path to the parquet file.

Returns:

The parquet file as a pandas DataFrame.

Return type:

pd.DataFrame