- Ricardo A. Collado
- Olumide Akinola
This repository is a library of helper functions aimed at simplifying common data handling, processing, and analysis tasks in Python. Below is a brief overview of the main methods available in each helper module:
`src/helper/basic_helper.py`:

- `clean_text_columns()`: Cleans text columns in a dataframe by removing unwanted characters and fixing formatting issues.
- `shuffle_data()`: Randomly shuffles lists or dictionaries to ensure non-sequential ordering.
- `load_config()`: Loads configurations from a file for subsequent use in other modules.
- `remove_gremlins_from_text_columns()`: Removes gremlins (stray non-printable characters) from text columns in a dataframe.
- `remove_nan_from_text_columns()`: Replaces NaNs in a DataFrame with a specified character.
- `text_column_cleaning_pipeline()`: Applies a pipeline of text column cleaning functions to a dataframe.
- `shuffle_dictionary()`: Shuffles a dictionary using a provided random number generator.
- `shuffle_list()`: Shuffles a list using a provided random number generator.
- `flatten_list_of_pairs()`: Flattens a list of pairs into a set containing all elements.
- `remove_empty_columns()`: Removes empty columns from a dataframe.
- `remove_columns_with_few_values()`: Removes columns with a small number of non-NaN values.
- `convert_cols_to_str()`: Converts specified columns to string type.
- `full_data_cleaning_pipeline()`: Applies a pipeline of data cleaning functions to a dataframe.
- `find_duplicated_columns()`: Identifies all pairs of columns that are duplicates of each other.
- `find_column_indices()`: Finds the index values of all columns with a specific name.
- `col_names_to_unique_capital_snake()`: Converts column names to uppercase snake_case and makes them unique.
- `col_names_to_unique_names()`: Makes column names unique without changing capitalization or inter-word spacing.
- `remove_duplicated_columns()`: Removes duplicate columns from a dataframe.
- `remove_duplicates_preserve_order()`: Removes duplicates from a list while preserving the original order.
- `compare_dates_min()`: Compares two dates and returns the earlier one.
- `compare_dates_max()`: Compares two dates and returns the later one.
- `compare_dates_max_no_zeroes()`: Compares three dates and returns the latest, ignoring zero values.
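As an example, the shuffle helpers take an explicit random number generator, which keeps results reproducible. A minimal sketch, assuming a `shuffle_list(items, rng)` signature (the actual parameters may differ):

```python
import random

from helper.src.helper.basic_helper import shuffle_list

rng = random.Random(42)  # seeded generator so the shuffle is reproducible

items = ["north", "south", "east", "west"]
shuffled = shuffle_list(items, rng)  # assumed signature: (list, rng)
print(shuffled)
```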
`src/helper/column_matching.py`:

- `get_column_mapping_with_score()`: Obtains the best column match using fuzzy matching.
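A sketch of how the fuzzy matcher might be used to map a messy column name onto a known schema; the call signature and return shape below are assumptions, not the confirmed API:

```python
from helper.src.helper.column_matching import get_column_mapping_with_score

known_columns = ["CUSTOMER_NAME", "ORDER_DATE", "PART_NUMBER"]

# Assumed call: (name to match, candidate names) -> (best match, similarity score)
match, score = get_column_mapping_with_score("customer nme", known_columns)
print(match, score)
```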
`src/helper/count_lines.py`:

- `count_py_lines()`: Counts the lines of Python code in `.py` files within a specified starting folder.
- `count_ipynb_lines()`: Counts the lines of code in Jupyter Notebook (`.ipynb`) files within a given folder.
- `count_lines_of_code()`: Counts the total lines of code in `.py` and `.ipynb` files within a given folder.
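For instance, to gauge the size of this repository's source tree (a sketch, assuming each function takes just the starting folder path, as the descriptions suggest; the exact parameters may differ):

```python
from helper.src.helper.count_lines import (
    count_ipynb_lines,
    count_lines_of_code,
    count_py_lines,
)

# Assumed: each function walks the folder and returns a line count
print("Python lines:  ", count_py_lines("src"))
print("Notebook lines:", count_ipynb_lines("notebook"))
print("Total lines:   ", count_lines_of_code("."))
```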
Date range helpers:

- `generate_date_list()`: Generates a list of dates from start to end with a specified increment in months.
- `generate_start_end_date_lists()`: Generates a list of start and end dates for a series of backtests.
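These are handy for building backtest windows. A hypothetical call, where both the import path and the `(start, end, months)` parameter order are assumptions:

```python
from datetime import date

# Import path is a placeholder; adjust to the module that holds the date helpers.
from helper.src.helper.basic_helper import generate_date_list

# Assumed call: start date, end date, increment in months
dates = generate_date_list(date(2020, 1, 1), date(2021, 1, 1), 3)
print(dates)  # e.g. quarterly dates between 2020-01-01 and 2021-01-01
```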
`src/helper/display_helper.py`:

- `print_elements()`: Prints each element of a list, iterable, or dictionary.
- `display_in_multiple_cols()`: Displays a long list across multiple columns.
- `display_df_cols()`: Displays the columns of a dataframe in a formatted manner.
- `display_df_cols_as_list()`: Displays the columns of a dataframe as a list.
- `display_all()`: Displays all rows and columns of a DataFrame.
- `display_full_df()`: Displays a DataFrame with all rows and columns.
- `format_dataframe()`: Formats the display of a DataFrame.
- `display_formatted_df()`: Displays a formatted DataFrame.
- `display_entire_dataframe()`: Displays the entire DataFrame without any truncation.
- `display_entire_formatted_df()`: Displays the entire DataFrame without any truncation and with formatting.
- `display_formatted_series()`: Displays a formatted Series as a one-column DataFrame.
- `display_entire_series()`: Displays the entire Series as a one-column DataFrame without any truncation.
- `display_entire_formatted_series()`: Displays the entire Series without any truncation and with formatting.
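For example, to inspect a wide DataFrame without pandas truncating the output (a sketch, assuming `display_full_df` takes the DataFrame as its only required argument):

```python
import pandas as pd

from helper.src.helper.display_helper import display_full_df

df = pd.DataFrame({f"COL_{i}": range(50) for i in range(40)})

# Assumed to lift pandas' row/column display limits before rendering
display_full_df(df)
```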
File naming helpers:

- `get_next_number()`: Gets the next sequential number for a file in a folder.
- `add_suffix_with_date_to_name()`: Appends the current date to a given file name.
- `get_numbered_file_name_with_path()`: Generates a new file name with a sequential number and its full path.
- `get_numbered_file_name_with_date_and_path()`: Generates a new file name with a sequential number, optional date, and its full path.
- `create_named_folder()`: Creates a named folder if it does not exist.
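A hypothetical use of the numbered-file helpers when saving versioned reports; the import path and the `(folder, base name, extension)` parameters are assumptions:

```python
# Import path is a placeholder; adjust to the module that holds the file helpers.
from helper.src.helper.basic_helper import (
    create_named_folder,
    get_numbered_file_name_with_path,
)

create_named_folder("reports")  # assumed: creates the folder if it is missing

# Assumed call: folder, base name, extension -> e.g. "reports/summary_003.csv"
path = get_numbered_file_name_with_path("reports", "summary", ".csv")
print(path)
```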
Excel helpers:

- `load_sheets_into_dict()`: Loads all sheets from an Excel file into a dictionary.
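A sketch of loading a multi-sheet workbook; the import path and the file name are placeholders, and the return type is assumed to be a dict of DataFrames:

```python
# Import path is a placeholder; adjust to the module that holds the Excel helpers.
from helper.src.helper.basic_helper import load_sheets_into_dict

# "workbook.xlsx" is a placeholder file name
sheets = load_sheets_into_dict("workbook.xlsx")

# Assumed return: {sheet name: DataFrame}
for name, df in sheets.items():
    print(name, df.shape)
```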
Naming and case-conversion helpers:

- `find_best_match()`: Finds the best match for the input name in the name list.
- `get_variable_name()`: Gets the name of a variable.
- `snake_case_to_capitalized_words()`: Converts snake_case to capitalized words.
- `convert_snake_case_to_words()`: Converts snake_case to all-caps words separated by a separator.
- `convert_to_snake_case()`: Converts a string to all-caps snake_case without removing numbers.
- `camel_case_to_snake_case()`: Converts camelCase to snake_case.
- `to_file_system_camel_case()`: Converts a filename to a filesystem-safe camelCase string.
- `to_file_system_snake_case()`: Converts a filename to a filesystem-safe snake_case string.
- `col_names_to_screaming_snake()`: Normalizes the column names of a DataFrame to SCREAMING_SNAKE_CASE.
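For example, converting identifiers between naming conventions (a sketch; the import path is a placeholder and each function is assumed to take a single string):

```python
# Import path is a placeholder; adjust to the module that holds the naming helpers.
from helper.src.helper.basic_helper import (
    camel_case_to_snake_case,
    snake_case_to_capitalized_words,
)

print(camel_case_to_snake_case("orderShipDate"))           # expected: order_ship_date
print(snake_case_to_capitalized_words("order_ship_date"))  # expected: Order Ship Date
```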
Order analysis plotting helpers:

- `plot_ontime_delayed_orders()`: Plots and analyzes on-time vs. delayed orders for a specific part number.
- `plot_partial_full_orders()`: Plots and analyzes partial vs. full orders for a specific part number.
- `plot_quality_analysis_graphs()`: Plots and saves various quality analysis graphs for a specific part number.
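These plotting functions are tied to the project's order data. A hypothetical call, where the import path, the DataFrame layout, and the `(orders, part_number)` parameters are all assumptions:

```python
import pandas as pd

# Import path is a placeholder; adjust to the module that holds the plot helpers.
from helper.src.helper.basic_helper import plot_ontime_delayed_orders

# Hypothetical order data; the real functions presumably expect the project's schema
orders = pd.DataFrame({
    "PART_NUMBER": ["PN-1", "PN-1", "PN-2"],
    "ON_TIME": [True, False, True],
})
plot_ontime_delayed_orders(orders, "PN-1")  # assumed: (DataFrame, part number)
```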
`src/helper/search_hardcoded_paths.py`:

- `locate_file_hardcoded_paths()`: Searches for patterns in a file.
- `locate_hardcoded_paths_recursively()`: Recursively searches through directories for patterns in Python files.
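For example, to audit a codebase for hardcoded paths before sharing it (assuming the function takes a root folder to scan; it may also accept the patterns to search for):

```python
from helper.src.helper.search_hardcoded_paths import locate_hardcoded_paths_recursively

# Assumed call: scan every Python file under the given root for path-like patterns
locate_hardcoded_paths_recursively("src")
```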
Moving-window statistics helpers:

- `ma_proc_df()`: Applies a rolling window operation on a DataFrame over a specified period.
- `ewm_proc_df()`: Applies an exponentially weighted moving window operation to a DataFrame.
- `get_ma_stats_features()`: Gets statistics for a moving window on a DataFrame.
- `get_ewm_stats_features()`: Gets statistics for an exponentially weighted moving window on a DataFrame.
- `add_stats_features()`: Computes statistical features for a given DataFrame.
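A sketch of deriving rolling statistics from a numeric series; the import path and the `(df, window)` parameters are assumptions:

```python
import pandas as pd

# Import path is a placeholder; adjust to the module that holds the stats helpers.
from helper.src.helper.basic_helper import get_ma_stats_features

prices = pd.DataFrame({"PRICE": [10.0, 11.5, 9.8, 12.1, 13.0, 12.4]})

# Assumed call: DataFrame plus a window length in periods
features = get_ma_stats_features(prices, 3)
print(features)
```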
Below is a list of the primary configuration and setup files in this repository:
| File | Description |
|---|---|
| `src/helper/__init__.py` | Initializes the helper module. |
| `src/helper/basic_helper.py` | Contains helper functions for data analysis such as cleaning text columns, shuffling lists/dictionaries, and loading configuration files. |
| `src/helper/column_matching.py` | Provides functions for matching columns between different dataframes using fuzzy matching. |
| `src/helper/display_helper.py` | Includes functions for displaying dataframes and series in a formatted manner. |
| `src/helper/search_hardcoded_paths.py` | Searches for hardcoded file paths in Python scripts. |
| `src/helper/count_lines.py` | Provides functions to count lines of Python and Jupyter Notebook code in a given folder. |
| `src/data_loader.py` | Loads and preprocesses datasets for analysis. |
| `src/config.py` | Manages configuration and parameter loading for the project. |
| `src/logger.py` | Sets up and configures logging for debugging and monitoring purposes. |
| `src/model.py` | Contains model definitions and related functions for machine learning tasks. |
| `src/main.py` | The main entry point that ties together data loading, processing, and analysis workflows. |
| File | Description |
|---|---|
| `data/SalesForce-Opportunity Final(in).csv` | The main dataset used for the Salesforce POC project. |
| File | Description |
|---|---|
| `notebook/salesforce.ipynb` | Jupyter notebook for analyzing the Salesforce data and applying various data processing and analysis techniques. |
| File | Description |
|---|---|
| `.flake8` | Configuration file for Flake8, a linting tool for Python. |
| `.gitignore` | Specifies files and directories to be ignored by Git. |
| `.markdownlint.json` | Configuration file for MarkdownLint, a tool for enforcing markdown style rules. |
| `.pre-commit-config.yaml` | Configuration file for pre-commit hooks, running checks before commits. |
| `.pylintrc` | Configuration file for Pylint, a static code analysis tool for Python. |
| `.python-version` | Specifies the Python version used for the project. |
| `copilot_commit_instructions.md` | Instructions for using GitHub Copilot to generate commit messages. |
| `mkdocs.yml` | Configuration file for MkDocs, a static site generator for documentation. |
| `mypy.ini` | Configuration file for Mypy, a static type checker for Python. |
| `poetry.lock` | Lock file for Poetry, specifying exact dependency versions. |
| `pyproject.toml` | Configuration file for Poetry with project dependencies and settings. |
| File | Description |
|---|---|
| `README.md` | This file, providing an overview of the project, its structure, and its contents. |
This project uses Poetry for dependency management. Poetry provides a way to declare, manage and install dependencies of Python projects, ensuring you have the right stack everywhere.
To install Poetry:

```bash
curl -sSL https://install.python-poetry.org | python3 -
```

To update Poetry:

```bash
poetry self update
```

To install all dependencies (including optional groups):

```bash
poetry install --all-extras
```

To install only specific dependency groups:

```bash
# Install only the main dependencies
poetry install

# Install main dependencies and dev dependencies
poetry install --with dev

# Install main dependencies and documentation dependencies
poetry install --with doc

# Install multiple specific groups
poetry install --with dev,doc
```

To update dependencies:

```bash
# Update all dependencies
poetry update

# Update a specific package
poetry update package-name
```

The project has the following dependency groups defined in the `pyproject.toml` file:
| Group | Description |
|---|---|
| `default` | Core dependencies required for running the project's main functionality |
| `dev` | Development dependencies including linting tools, type checkers, and Jupyter tools |
| `package` | Python packaging tools for distribution and package management |
| `doc` | Documentation generation tools including MkDocs and related plugins |
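For reference, this is roughly how Poetry declares optional dependency groups in `pyproject.toml`; the group names match the table above, but the package entries are illustrative, not the project's actual pins:

```toml
[tool.poetry.dependencies]
# core dependencies required at runtime
python = "^3.9"

[tool.poetry.group.dev]
optional = true

[tool.poetry.group.dev.dependencies]
# illustrative entries only
mypy = "*"
pylint = "*"

[tool.poetry.group.doc]
optional = true

[tool.poetry.group.doc.dependencies]
mkdocs = "*"
```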
To include this helper library in another Python project as a Git submodule, follow these steps:
- Add the submodule

  ```bash
  git submodule add <REPO-URL> path/to/submodule
  ```

- Initialize and update

  ```bash
  git submodule init
  git submodule update
  ```

- Set up your Python environment

  - Install and configure pyenv to manage your Python versions.
  - Create or select the intended Python version with pyenv:

    ```bash
    pyenv install 3.9.16
    pyenv local 3.9.16
    ```

- Install the submodule's dependencies (using Poetry)

  ```bash
  cd path/to/submodule
  poetry install
  ```

- Update the parent project's `pyproject.toml`

  Add the submodule path to the `tool.poetry.dependencies` section to ensure the submodule is included in the parent project's environment:

  ```toml
  [tool.poetry.dependencies]
  python = "^3.9"
  helper = { path = "path/to/submodule" }
  ```

- Install the parent project's dependencies

  ```bash
  poetry install
  ```
Once you've completed these steps, you can import and use the helper functions throughout your main Python project. Ensure that any IDE settings (for example, in VS Code) use the same Python environment to avoid module resolution issues.
Here is an example of how to call a function from the submodule in your parent project:
```python
# filepath: /path/to/parent_project/src/main.py
from helper.src.helper.basic_helper import clean_text_columns

# Example usage of the clean_text_columns function
dataframe = ...  # your pandas DataFrame
cleaned_dataframe = clean_text_columns(dataframe)
```

By following these steps, you can seamlessly integrate the helper functions library into your main Python project as a Git submodule.