- Ricardo A. Collado
- Olumide Akinola
This repository is a library of helper functions aimed at simplifying common data handling, processing, and analysis tasks in Python. Below is a brief overview of the main methods available in each helper module:
`src/helper/basic_helper.py`:

- `clean_text_columns()`: Cleans text columns in a dataframe by removing unwanted characters and fixing formatting issues.
- `shuffle_data()`: Randomly shuffles lists or dictionaries to ensure non-sequential ordering.
- `load_config()`: Loads configurations from a file for subsequent use in other modules.
- `remove_gremlins_from_text_columns()`: Removes gremlins (stray non-printable characters) from text columns in a dataframe.
- `remove_nan_from_text_columns()`: Replaces NaNs in a DataFrame with a specified character.
- `text_column_cleaning_pipeline()`: Applies a pipeline of text column cleaning functions to a dataframe.
- `shuffle_dictionary()`: Shuffles a dictionary using a provided random number generator.
- `shuffle_list()`: Shuffles a list using a provided random number generator.
- `flatten_list_of_pairs()`: Flattens a list of pairs into a set containing all elements.
- `remove_empty_columns()`: Removes empty columns from a dataframe.
- `remove_columns_with_few_values()`: Removes columns with a small number of non-NaN values.
- `convert_cols_to_str()`: Converts specified columns to string type.
- `full_data_cleaning_pipeline()`: Applies a pipeline of data cleaning functions to a dataframe.
- `find_duplicated_columns()`: Identifies all pairs of columns that are duplicates of each other.
- `find_column_indices()`: Finds the index values of all columns with a specific name.
- `col_names_to_unique_capital_snake()`: Converts column names to uppercase snake_case and makes them unique.
- `col_names_to_unique_names()`: Makes column names unique without changing capitalization or inter-word spacing.
- `remove_duplicated_columns()`: Removes duplicate columns from a dataframe.
- `remove_duplicates_preserve_order()`: Removes duplicates from a list while preserving the original order.
- `compare_dates_min()`: Compares two dates and returns the earlier one.
- `compare_dates_max()`: Compares two dates and returns the later one.
- `compare_dates_max_no_zeroes()`: Compares three dates and returns the latest, ignoring zero values.
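As an example, the shuffle helpers take an explicit random number generator, which keeps results reproducible. A minimal sketch, assuming a `shuffle_list(items, rng)` signature (the actual parameters may differ):

```python
import random

from helper.src.helper.basic_helper import shuffle_list

rng = random.Random(42)  # seeded generator so the shuffle is reproducible

items = ["north", "south", "east", "west"]
shuffled = shuffle_list(items, rng)  # assumed signature: (list, rng)
print(shuffled)
```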
`src/helper/column_matching.py`:

- `get_column_mapping_with_score()`: Obtains the best column match using fuzzy matching.
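A sketch of how the fuzzy matcher might be used to map a messy column name onto a known schema; the call signature and return shape below are assumptions, not the confirmed API:

```python
from helper.src.helper.column_matching import get_column_mapping_with_score

known_columns = ["CUSTOMER_NAME", "ORDER_DATE", "PART_NUMBER"]

# Assumed call: (name to match, candidate names) -> (best match, similarity score)
match, score = get_column_mapping_with_score("customer nme", known_columns)
print(match, score)
```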
`src/helper/count_lines.py`:

- `count_py_lines()`: Counts the lines of Python code in `.py` files within a specified starting folder.
- `count_ipynb_lines()`: Counts the lines of code in Jupyter Notebook (`.ipynb`) files within a given folder.
- `count_lines_of_code()`: Counts the total lines of code in `.py` and `.ipynb` files within a given folder.
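For instance, to gauge the size of this repository's source tree (a sketch, assuming each function takes just the starting folder path, as the descriptions suggest; the exact parameters may differ):

```python
from helper.src.helper.count_lines import (
    count_ipynb_lines,
    count_lines_of_code,
    count_py_lines,
)

# Assumed: each function walks the folder and returns a line count
print("Python lines:  ", count_py_lines("src"))
print("Notebook lines:", count_ipynb_lines("notebook"))
print("Total lines:   ", count_lines_of_code("."))
```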
Date range helpers:

- `generate_date_list()`: Generates a list of dates from start to end with a specified increment in months.
- `generate_start_end_date_lists()`: Generates a list of start and end dates for a series of backtests.
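These are handy for building backtest windows. A hypothetical call, where both the import path and the `(start, end, months)` parameter order are assumptions:

```python
from datetime import date

# Import path is a placeholder; adjust to the module that holds the date helpers.
from helper.src.helper.basic_helper import generate_date_list

# Assumed call: start date, end date, increment in months
dates = generate_date_list(date(2020, 1, 1), date(2021, 1, 1), 3)
print(dates)  # e.g. quarterly dates between 2020-01-01 and 2021-01-01
```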
`src/helper/display_helper.py`:

- `print_elements()`: Prints each element of a list, iterable, or dictionary.
- `display_in_multiple_cols()`: Displays a long list across multiple columns.
- `display_df_cols()`: Displays the columns of a dataframe in a formatted manner.
- `display_df_cols_as_list()`: Displays the columns of a dataframe as a list.
- `display_all()`: Displays all rows and columns of a DataFrame.
- `display_full_df()`: Displays a DataFrame with all rows and columns.
- `format_dataframe()`: Formats the display of a DataFrame.
- `display_formatted_df()`: Displays a formatted DataFrame.
- `display_entire_dataframe()`: Displays the entire DataFrame without any truncation.
- `display_entire_formatted_df()`: Displays the entire DataFrame without any truncation and with formatting.
- `display_formatted_series()`: Displays a formatted Series as a one-column DataFrame.
- `display_entire_series()`: Displays the entire Series as a one-column DataFrame without any truncation.
- `display_entire_formatted_series()`: Displays the entire Series without any truncation and with formatting.
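For example, to inspect a wide DataFrame without pandas truncating the output (a sketch, assuming `display_full_df` takes the DataFrame as its only required argument):

```python
import pandas as pd

from helper.src.helper.display_helper import display_full_df

df = pd.DataFrame({f"COL_{i}": range(50) for i in range(40)})

# Assumed to lift pandas' row/column display limits before rendering
display_full_df(df)
```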
File naming helpers:

- `get_next_number()`: Gets the next sequential number for a file in a folder.
- `add_suffix_with_date_to_name()`: Appends the current date to a given file name.
- `get_numbered_file_name_with_path()`: Generates a new file name with a sequential number and its full path.
- `get_numbered_file_name_with_date_and_path()`: Generates a new file name with a sequential number, optional date, and its full path.
- `create_named_folder()`: Creates a named folder if it does not exist.
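A hypothetical use of the numbered-file helpers when saving versioned reports; the import path and the `(folder, base name, extension)` parameters are assumptions:

```python
# Import path is a placeholder; adjust to the module that holds the file helpers.
from helper.src.helper.basic_helper import (
    create_named_folder,
    get_numbered_file_name_with_path,
)

create_named_folder("reports")  # assumed: creates the folder if it is missing

# Assumed call: folder, base name, extension -> e.g. "reports/summary_003.csv"
path = get_numbered_file_name_with_path("reports", "summary", ".csv")
print(path)
```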
Excel helpers:

- `load_sheets_into_dict()`: Loads all sheets from an Excel file into a dictionary.
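A sketch of loading a multi-sheet workbook; the import path and the file name are placeholders, and the return type is assumed to be a dict of DataFrames:

```python
# Import path is a placeholder; adjust to the module that holds the Excel helpers.
from helper.src.helper.basic_helper import load_sheets_into_dict

# "workbook.xlsx" is a placeholder file name
sheets = load_sheets_into_dict("workbook.xlsx")

# Assumed return: {sheet name: DataFrame}
for name, df in sheets.items():
    print(name, df.shape)
```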
Naming and case-conversion helpers:

- `find_best_match()`: Finds the best match for the input name in the name list.
- `get_variable_name()`: Gets the name of a variable.
- `snake_case_to_capitalized_words()`: Converts snake_case to capitalized words.
- `convert_snake_case_to_words()`: Converts snake_case to all-caps words separated by a separator.
- `convert_to_snake_case()`: Converts a string to all-caps snake_case without removing numbers.
- `camel_case_to_snake_case()`: Converts camelCase to snake_case.
- `to_file_system_camel_case()`: Converts a filename to a filesystem-safe camelCase string.
- `to_file_system_snake_case()`: Converts a filename to a filesystem-safe snake_case string.
- `col_names_to_screaming_snake()`: Normalizes the column names of a DataFrame to SCREAMING_SNAKE_CASE.
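For example, converting identifiers between naming conventions (a sketch; the import path is a placeholder and each function is assumed to take a single string):

```python
# Import path is a placeholder; adjust to the module that holds the naming helpers.
from helper.src.helper.basic_helper import (
    camel_case_to_snake_case,
    snake_case_to_capitalized_words,
)

print(camel_case_to_snake_case("orderShipDate"))           # expected: order_ship_date
print(snake_case_to_capitalized_words("order_ship_date"))  # expected: Order Ship Date
```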
Order analysis plotting helpers:

- `plot_ontime_delayed_orders()`: Plots and analyzes on-time vs. delayed orders for a specific part number.
- `plot_partial_full_orders()`: Plots and analyzes partial vs. full orders for a specific part number.
- `plot_quality_analysis_graphs()`: Plots and saves various quality analysis graphs for a specific part number.
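These plotting functions are tied to the project's order data. A hypothetical call, where the import path, the DataFrame layout, and the `(orders, part_number)` parameters are all assumptions:

```python
import pandas as pd

# Import path is a placeholder; adjust to the module that holds the plot helpers.
from helper.src.helper.basic_helper import plot_ontime_delayed_orders

# Hypothetical order data; the real functions presumably expect the project's schema
orders = pd.DataFrame({
    "PART_NUMBER": ["PN-1", "PN-1", "PN-2"],
    "ON_TIME": [True, False, True],
})
plot_ontime_delayed_orders(orders, "PN-1")  # assumed: (DataFrame, part number)
```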
`src/helper/search_hardcoded_paths.py`:

- `locate_file_hardcoded_paths()`: Searches for patterns in a file.
- `locate_hardcoded_paths_recursively()`: Recursively searches through directories for patterns in Python files.
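For example, to audit a codebase for hardcoded paths before sharing it (assuming the function takes a root folder to scan; it may also accept the patterns to search for):

```python
from helper.src.helper.search_hardcoded_paths import locate_hardcoded_paths_recursively

# Assumed call: scan every Python file under the given root for path-like patterns
locate_hardcoded_paths_recursively("src")
```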
Moving-window statistics helpers:

- `ma_proc_df()`: Applies a rolling window operation on a DataFrame over a specified period.
- `ewm_proc_df()`: Applies an exponentially weighted moving window operation to a DataFrame.
- `get_ma_stats_features()`: Gets statistics for a moving window on a DataFrame.
- `get_ewm_stats_features()`: Gets statistics for an exponentially weighted moving window on a DataFrame.
- `add_stats_features()`: Computes statistical features for a given DataFrame.
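A sketch of deriving rolling statistics from a numeric series; the import path and the `(df, window)` parameters are assumptions:

```python
import pandas as pd

# Import path is a placeholder; adjust to the module that holds the stats helpers.
from helper.src.helper.basic_helper import get_ma_stats_features

prices = pd.DataFrame({"PRICE": [10.0, 11.5, 9.8, 12.1, 13.0, 12.4]})

# Assumed call: DataFrame plus a window length in periods
features = get_ma_stats_features(prices, 3)
print(features)
```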
Below is a list of the primary configuration and setup files in this repository:
| File | Description |
|---|---|
| `src/helper/__init__.py` | Initializes the helper module. |
| `src/helper/basic_helper.py` | Contains helper functions for data analysis such as cleaning text columns, shuffling lists/dictionaries, and loading configuration files. |
| `src/helper/column_matching.py` | Provides functions for matching columns between different dataframes using fuzzy matching. |
| `src/helper/display_helper.py` | Includes functions for displaying dataframes and series in a formatted manner. |
| `src/helper/search_hardcoded_paths.py` | Searches for hardcoded file paths in Python scripts. |
| `src/helper/count_lines.py` | Provides functions to count lines of Python and Jupyter Notebook code in a given folder. |
| `src/data_loader.py` | Loads and preprocesses datasets for analysis. |
| `src/config.py` | Manages configuration and parameter loading for the project. |
| `src/logger.py` | Sets up and configures logging for debugging and monitoring purposes. |
| `src/model.py` | Contains model definitions and related functions for machine learning tasks. |
| `src/main.py` | The main entry point that ties together data loading, processing, and analysis workflows. |
| File | Description |
|---|---|
| `data/SalesForce-Opportunity Final(in).csv` | The main dataset used for the Salesforce POC project. |
| File | Description |
|---|---|
| `notebook/salesforce.ipynb` | Jupyter notebook for analyzing the Salesforce data and applying various data processing and analysis techniques. |
| File | Description |
|---|---|
| `.flake8` | Configuration file for Flake8, a linting tool for Python. |
| `.gitignore` | Specifies files and directories to be ignored by Git. |
| `.markdownlint.json` | Configuration file for MarkdownLint, a tool for enforcing markdown style rules. |
| `.pre-commit-config.yaml` | Configuration file for pre-commit hooks, running checks before commits. |
| `.pylintrc` | Configuration file for Pylint, a static code analysis tool for Python. |
| `.python-version` | Specifies the Python version used for the project. |
| `copilot_commit_instructions.md` | Instructions for using GitHub Copilot to generate commit messages. |
| `mkdocs.yml` | Configuration file for MkDocs, a static site generator for documentation. |
| `mypy.ini` | Configuration file for Mypy, a static type checker for Python. |
| `poetry.lock` | Lock file for Poetry, specifying exact dependency versions. |
| `pyproject.toml` | Configuration file for Poetry with project dependencies and settings. |
| File | Description |
|---|---|
| `README.md` | This file, providing an overview of the project, its structure, and its contents. |
This project uses Poetry for dependency management. Poetry provides a way to declare, manage and install dependencies of Python projects, ensuring you have the right stack everywhere.
To install Poetry:

```bash
curl -sSL https://install.python-poetry.org | python3 -
```

To update Poetry:

```bash
poetry self update
```

To install all dependencies (including optional groups):

```bash
poetry install --all-extras
```

To install only specific dependency groups:

```bash
# Install only the main dependencies
poetry install

# Install main dependencies and dev dependencies
poetry install --with dev

# Install main dependencies and documentation dependencies
poetry install --with doc

# Install multiple specific groups
poetry install --with dev,doc
```

To update dependencies:

```bash
# Update all dependencies
poetry update

# Update a specific package
poetry update package-name
```

The project has the following dependency groups defined in the `pyproject.toml` file:
| Group | Description |
|---|---|
| `default` | Core dependencies required for running the project's main functionality |
| `dev` | Development dependencies including linting tools, type checkers, and Jupyter tools |
| `package` | Python packaging tools for distribution and package management |
| `doc` | Documentation generation tools including MkDocs and related plugins |
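For reference, this is roughly how Poetry declares optional dependency groups in `pyproject.toml`; the group names match the table above, but the package entries are illustrative, not the project's actual pins:

```toml
[tool.poetry.dependencies]
# core dependencies required at runtime
python = "^3.9"

[tool.poetry.group.dev]
optional = true

[tool.poetry.group.dev.dependencies]
# illustrative entries only
mypy = "*"
pylint = "*"

[tool.poetry.group.doc]
optional = true

[tool.poetry.group.doc.dependencies]
mkdocs = "*"
```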
To include this helper library in another Python project as a Git submodule, follow these steps:
- Add the submodule

  ```bash
  git submodule add <REPO-URL> path/to/submodule
  ```

- Initialize and update

  ```bash
  git submodule init
  git submodule update
  ```

- Set up your Python environment

  - Install and configure pyenv to manage your Python versions.
  - Create or select the intended Python version with pyenv:

    ```bash
    pyenv install 3.9.16
    pyenv local 3.9.16
    ```

- Install the submodule's dependencies (using Poetry)

  ```bash
  cd path/to/submodule
  poetry install
  ```

- Update the parent project's `pyproject.toml`

  Add the submodule path to the `tool.poetry.dependencies` section to ensure the submodule is included in the parent project's environment:

  ```toml
  [tool.poetry.dependencies]
  python = "^3.9"
  helper = { path = "path/to/submodule" }
  ```

- Install the parent project's dependencies

  ```bash
  poetry install
  ```
Once you've completed these steps, you can import and use the helper functions throughout your main Python project. Ensure that any IDE settings (for example, in VS Code) use the same Python environment to avoid module resolution issues.
Here is an example of how to call a function from the submodule in your parent project:
```python
# filepath: /path/to/parent_project/src/main.py
from helper.src.helper.basic_helper import clean_text_columns

# Example usage of the clean_text_columns function
dataframe = ...  # your pandas DataFrame
cleaned_dataframe = clean_text_columns(dataframe)
```

By following these steps, you can seamlessly integrate the helper functions library into your main Python project as a Git submodule.