
Helper functions for dealing with everyday issues: column display, column matching, pandas, ...


Helper Functions Library

Authors

  • Ricardo A. Collado
  • Olumide Akinola

Overview

This repository is a library of helper functions aimed at simplifying common data handling, processing, and analysis tasks in Python. Below is a brief overview of the main methods available in each helper module:

  • src/helper/basic_helper.py

    • clean_text_columns(): Cleans text columns in a dataframe by removing unwanted characters and fixing formatting issues.
    • shuffle_data(): Randomly shuffles lists or dictionaries to ensure non-sequential ordering.
    • load_config(): Loads configurations from a file for subsequent use in other modules.
    • remove_gremlins_from_text_columns(): Removes gremlin characters (invisible or mis-encoded characters) from text columns in a dataframe.
    • remove_nan_from_text_columns(): Replaces NaNs in DataFrame with a specified character.
    • text_column_cleaning_pipeline(): Applies a pipeline of text column cleaning functions to a dataframe.
    • shuffle_dictionary(): Shuffles a dictionary using a provided random number generator.
    • shuffle_list(): Shuffles a list using a provided random number generator.
    • flatten_list_of_pairs(): Flattens a list of pairs into a set containing all elements.
    • remove_empty_columns(): Removes empty columns from a dataframe.
    • remove_columns_with_few_values(): Removes columns with a small number of non-NaN values.
    • convert_cols_to_str(): Converts specified columns to string type.
    • full_data_cleaning_pipeline(): Applies a pipeline of data cleaning functions to a dataframe.
    • find_duplicated_columns(): Identifies all pairs of columns that are duplicates of each other.
    • find_column_indices(): Finds the index values of all columns with a specific name.
    • col_names_to_unique_capital_snake(): Converts column names to uppercase snake_case and makes them unique.
    • col_names_to_unique_names(): Makes column names unique without changing capitalization or inter-word spacing.
    • remove_duplicated_columns(): Removes duplicate columns from a dataframe.
    • remove_duplicates_preserve_order(): Removes duplicates from a list while preserving the original order.
    • compare_dates_min(): Compares two dates and returns the earlier of the two.
    • compare_dates_max(): Compares two dates and returns the later of the two.
    • compare_dates_max_no_zeroes(): Compares three dates and returns the latest, disregarding zero-valued dates.
  • src/helper/column_matching.py

    • get_column_mapping_with_score(): Obtains the best column match using fuzzy matching.
  • src/helper/count_lines.py

    • count_py_lines(): Counts the lines of Python code in .py files within a specified starting folder.
    • count_ipynb_lines(): Counts the lines of code in Jupyter Notebook (.ipynb) files within a given folder.
    • count_lines_of_code(): Counts the total lines of code in .py and .ipynb files within a given folder.
  • src/helper/date_helper.py

    • generate_date_list(): Generates a list of dates from start to end with a specified increment in months.
    • generate_start_end_date_lists(): Generates a list of start and end dates for a series of backtests.
  • src/helper/display_helper.py

    • print_elements(): Prints each element of a list, iterable, or dictionary.
    • display_in_multiple_cols(): Displays a long list across multiple columns.
    • display_df_cols(): Displays the columns of a dataframe in a formatted manner.
    • display_df_cols_as_list(): Displays the columns of a dataframe as a list.
    • display_all(): Displays all rows and columns of a DataFrame.
    • display_full_df(): Displays a DataFrame with all rows and columns.
    • format_dataframe(): Formats the display of a DataFrame.
    • display_formatted_df(): Displays a formatted DataFrame.
    • display_entire_dataframe(): Displays the entire DataFrame without any truncation.
    • display_entire_formatted_df(): Displays the entire DataFrame without any truncation and with formatting.
    • display_formatted_series(): Displays a formatted Series as a one-column DataFrame.
    • display_entire_series(): Displays the entire Series as a one-column DataFrame without any truncation.
    • display_entire_formatted_series(): Displays the entire Series without any truncation and with formatting.
  • src/helper/os_helper.py

    • get_next_number(): Gets the next sequential number for a numbered file in a folder.
    • add_suffix_with_date_to_name(): Appends the current date to a given file name.
    • get_numbered_file_name_with_path(): Generates a new file name with a sequential number and its full path.
    • get_numbered_file_name_with_date_and_path(): Generates a new file name with a sequential number, optional date, and its full path.
    • create_named_folder(): Creates a named folder if it does not exist.
  • src/helper/pandas_helper.py

    • load_sheets_into_dict(): Loads all sheets from an Excel file into a dictionary.
  • src/helper/pattern.py

    • find_best_match(): Finds the best match for the input name in the name list.
    • get_variable_name(): Gets the name of a variable.
    • snake_case_to_capitalized_words(): Converts snake case to capitalized words.
    • convert_snake_case_to_words(): Converts snake_case to all caps separated by a separator.
    • convert_to_snake_case(): Converts a string to all CAPS snake_case without removing numbers.
    • camel_case_to_snake_case(): Converts camel case to snake case.
    • to_file_system_camel_case(): Converts a filename to a filesystem-safe camelCase string.
    • to_file_system_snake_case(): Converts a filename to a filesystem-safe snake_case string.
    • col_names_to_screaming_snake(): Normalizes the column names of a DataFrame to screaming snake case.
  • src/helper/plot_helper.py

    • plot_ontime_delayed_orders(): Plots and analyzes on-time vs delayed orders for a specific part number.
    • plot_partial_full_orders(): Plots and analyzes partial vs full orders for a specific part number.
    • plot_quality_analysis_graphs(): Plots and saves various quality analysis graphs for a specific part number.
  • src/helper/search_hardcoded_paths.py

    • locate_file_hardcoded_paths(): Searches a single file for hardcoded path patterns.
    • locate_hardcoded_paths_recursively(): Recursively searches through directories for patterns in Python files.
  • src/helper/stat_features.py

    • ma_proc_df(): Applies a rolling window operation on the DataFrame over a specified period.
    • ewm_proc_df(): Applies an exponential weighted moving window operation to a DataFrame.
    • get_ma_stats_features(): Gets statistics for a moving window on a DataFrame.
    • get_ewm_stats_features(): Gets statistics for an exponential weighted moving window on a DataFrame.
    • add_stats_features(): Computes statistical features for a given DataFrame.
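As a quick illustration, here is a minimal sketch of how two of the list utilities above might behave, based on their descriptions; these are hypothetical re-implementations, and the actual signatures in src/helper/basic_helper.py may differ:

```python
# Hypothetical re-implementations illustrating the documented behavior;
# the real functions live in src/helper/basic_helper.py and may differ.

def remove_duplicates_preserve_order(items):
    # dict keys preserve insertion order (Python 3.7+), so this drops
    # duplicates while keeping the first occurrence of each element.
    return list(dict.fromkeys(items))

def flatten_list_of_pairs(pairs):
    # Collect every element from each (a, b) pair into a single set.
    return {element for pair in pairs for element in pair}

print(remove_duplicates_preserve_order([3, 1, 3, 2, 1]))  # [3, 1, 2]
print(flatten_list_of_pairs([(1, 2), (2, 3)]))            # {1, 2, 3}
```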

Below is a list of the primary configuration and setup files in this repository:

Source Files

File Description
src/helper/__init__.py Initializes the helper module.
src/helper/basic_helper.py Contains helper functions for data analysis such as cleaning text columns, shuffling lists/dictionaries, and loading configuration files.
src/helper/column_matching.py Provides functions for matching columns between different dataframes using fuzzy matching.
src/helper/display_helper.py Includes functions for displaying dataframes and series in a formatted manner.
src/helper/search_hardcoded_paths.py Searches for hardcoded file paths in Python scripts.
src/helper/count_lines.py Provides functions to count lines of Python and Jupyter Notebook code in a given folder.
src/data_loader.py Loads and preprocesses datasets for analysis.
src/config.py Manages configuration and parameter loading for the project.
src/logger.py Sets up and configures logging for debugging and monitoring purposes.
src/model.py Contains model definitions and related functions for machine learning tasks.
src/main.py The main entry point that ties together data loading, processing, and analysis workflows.

Data Files

File Description
data/SalesForce-Opportunity Final(in).csv The main dataset used for the Salesforce POC project.

Notebooks

File Description
notebook/salesforce.ipynb Jupyter notebook for analyzing the Salesforce data and applying various data processing and analysis techniques.

Configuration Files

File Description
.flake8 Configuration file for Flake8, a linting tool for Python.
.gitignore Specifies files and directories to be ignored by Git.
.markdownlint.json Configuration file for MarkdownLint, a tool for enforcing markdown style rules.
.pre-commit-config.yaml Configuration file for pre-commit hooks, running checks before commits.
.pylintrc Configuration file for Pylint, a static code analysis tool for Python.
.python-version Specifies the Python version used for the project.
copilot_commit_instructions.md Instructions for using GitHub Copilot to generate commit messages.
mkdocs.yml Configuration file for MkDocs, a static site generator for documentation.
mypy.ini Configuration file for Mypy, a static type checker for Python.
poetry.lock Lock file for Poetry, specifying exact dependency versions.
pyproject.toml Configuration file for Poetry with project dependencies and settings.

README

File Description
README.md This file, providing an overview of the project, its structure, and its contents.

Poetry Dependency Management

This project uses Poetry for dependency management. Poetry provides a way to declare, manage, and install the dependencies of Python projects, ensuring you have the right stack everywhere.

Installing and Updating Poetry

To install Poetry:

curl -sSL https://install.python-poetry.org | python3 -

To update Poetry:

poetry self update

Managing Dependencies

To install the main dependencies together with all optional extras:

poetry install --all-extras

To install only specific dependency groups:

# Install only the main dependencies
poetry install

# Install main dependencies and dev dependencies
poetry install --with dev

# Install main dependencies and documentation dependencies
poetry install --with doc

# Install multiple specific groups
poetry install --with dev,doc

To update dependencies:

# Update all dependencies
poetry update

# Update a specific package
poetry update package-name

Dependency Groups

The project has the following dependency groups defined in the pyproject.toml file:

Group Description
main (default) Core dependencies required for running the project's main functionality
dev Development dependencies including linting tools, type checkers, and Jupyter tools
package Python packaging tools for distribution and package management
doc Documentation generation tools including MkDocs and related plugins

Using This Repository as a Git Submodule

To include this helper library in another Python project as a Git submodule, follow these steps:

  1. Add the submodule

    git submodule add <REPO-URL> path/to/submodule
  2. Initialize and Update

    git submodule init
    git submodule update
  3. Set up your Python environment

    • Install and configure pyenv to manage your Python versions.

    • Create or select the intended Python version with pyenv:

      pyenv install 3.9.16
      pyenv local 3.9.16
  4. Install the submodule's dependencies (using Poetry)

    cd path/to/submodule
    poetry install
  5. Update the parent project's pyproject.toml

    Add the submodule path to the tool.poetry.dependencies section to ensure the submodule is included in the parent project's environment:

    [tool.poetry.dependencies]
    python = "^3.9"
    helper = { path = "path/to/submodule" }
  6. Install the parent project's dependencies

    poetry install

Example Usage

Once you've completed these steps, you can import and use the helper functions throughout your main Python project. Ensure that any IDE settings (for example, in VS Code) use the same Python environment to avoid module resolution issues.

Here is an example of how to call a function from the submodule in your parent project:

# filepath: /path/to/parent_project/src/main.py

from helper.src.helper.basic_helper import clean_text_columns

# Example usage of the clean_text_columns function
dataframe = ...  # your pandas DataFrame
cleaned_dataframe = clean_text_columns(dataframe)

By following these steps, you can seamlessly integrate the helper functions library into your main Python project as a Git submodule.
