Lots of updates! #1

Open
greg-randall wants to merge 29 commits into sudo-sein:main from greg-randall:main

@greg-randall

  1. Vendor Grouping and Normalization
  • Fuzzy Matching: Implemented merge_similar_descriptions using difflib. This automatically groups
    slight variations of vendor names (e.g., "Netflix" and "Netflix.com") without needing hardcoded
    rules.
  • Text Normalization: Added a normalize_description function to standardize vendor names by removing
    common location suffixes, "Transfer" prefixes, and special characters.
  • Amount Clustering: Added logic to group transactions from the same vendor that differ slightly in
    price (e.g., due to currency fluctuations). This allows the tool to treat them as a single
    subscription type while still distinguishing them from significant outliers (like a one-off
    downpayment vs. a monthly fee).
  2. Configuration and Filtering
  • CLI Arguments: Replaced hardcoded values with command-line arguments using argparse. Users can now
    customize:
    • --threshold: The sensitivity for grouping similar transaction amounts.
    • --recency-days: A filter to only show subscriptions active within a specific timeframe
      (defaulting to the last 90 days of the dataset).
    • --min-transaction-amount / --max-transaction-amount: filters to target specific expense
      ranges.
    • --debug: A flag for verbose output.
  • Ignore List: Added support for an external ignore file (--ignore-file), allowing users to exclude
    specific vendors (e.g., grocery stores) from the analysis.
  3. Data Compatibility
  • Dynamic Column Detection: The script now identifies columns based on a list of standard
    multilingual headers (English, Spanish, German, Polish, etc.) rather than fixed indices.
  • YNAB Support: Added specific handling for YNAB export formats, including detecting Outflow columns
    and converting them to the expected negative values for expenses.
  • Currency Parsing: Updated clean_amount to handle both US (1,234.56) and European (1.234,56)
    decimal formats.
  4. Reporting
  • Cost Estimation: Added a "Yearly Cost" calculation.
  • Sorting: The output is now sorted by the estimated yearly cost to highlight the largest expenses
    first.
  • Formatting: Financial values are now consistently formatted to two decimal places.
  5. Documentation
  • README: Updated the documentation to include a "CSV File Format" section, a guide to the new "How
    it works" logic, and a full table of the new command-line usage arguments.
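
As an illustration of the Currency Parsing item above, here is a minimal sketch of how a function like clean_amount could accept both US (1,234.56) and European (1.234,56) formats. It is not the code from this PR: the separator rule (whichever of "," or "." appears last is the decimal mark) and the handling of a lone comma are assumptions made for the example.

```python
import re


def clean_amount(raw: str) -> float:
    """Parse a money string written in US or European style (illustrative sketch)."""
    # Keep only digits, separators and the minus sign; drops "$", "€", spaces, etc.
    text = re.sub(r"[^\d,.\-]", "", raw.strip())
    last_comma = text.rfind(",")
    last_dot = text.rfind(".")

    if last_comma != -1 and last_dot != -1:
        # Both separators present: assume the one that appears last is the decimal mark.
        if last_comma > last_dot:
            text = text.replace(".", "").replace(",", ".")  # European: 1.234,56
        else:
            text = text.replace(",", "")                    # US: 1,234.56
    elif last_comma != -1:
        # Only commas: assume a decimal comma when a single comma is followed
        # by one or two digits (12,99); otherwise treat commas as thousands marks.
        digits_after = len(text) - last_comma - 1
        if text.count(",") == 1 and digits_after in (1, 2):
            text = text.replace(",", ".")
        else:
            text = text.replace(",", "")
    # Only dots (or no separator): left as-is, so "1.234" parses as 1.234.
    return float(text)
```

A value such as "1.234" with no comma stays ambiguous under this rule and is read as a decimal number; a real implementation may want a locale hint for that case.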

- Refactor find_data_start to use dynamic standard_columns
- Update read_csv to use comma separator for YNAB compatibility
- Enable auto-language detection for column translation
- Add logic to negate Outflow amounts
- Remove duplicate code blocks
- Calculate 'Yearly_Cost' based on monthly amount * 12
- Sort output by 'Yearly_Cost' ascending (expenses are negative, so the most expensive come first)
- Add 'cluster_amounts' function to group transaction amounts within 10% similarity
- Apply clustering before identifying subscriptions
- Import numpy for mean calculation
- Allows detecting price-adjusted subscriptions (e.g. price hikes) as a single subscription
- Introduce argparse to make the amount clustering threshold configurable via command-line (--threshold or -t)
- Set default clustering threshold to 15% as requested
- Update cluster_amounts function signature to accept the passed threshold
- Add 'normalize_description' to utils.py to standardize biller names (uppercase, remove location/noise, merge common patterns like MDC*TALQUIN and Paul's Termite)
- Update 'interpret.py' to group by 'Description' only, calculating mean amounts for groups
- Remove previous clustering logic as grouping by normalized description subsumes it
- Effectively merges recurring expenses from the same vendor even if amounts vary (e.g. utility bills)
- Use DataFrame.to_string(float_format='{:.2f}'.format) for precise output formatting
- Ensures consistent display of monetary values
- Add '--recency-days' argument (default 90) to filter out inactive subscriptions
- Filters based on the dataset's latest transaction date, not the current system date, to verify historical patterns accurately
- Removed specific logic for 'MDC*TALQUIN', 'Paul\'s Termite', and 'TRUIST' from normalize_description to avoid overfitting and PII usage.
- Retained generic text normalization (uppercase, location removal, special char cleanup).
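
The '--recency-days' behaviour described above (measuring recency against the dataset's latest transaction date rather than the system clock) could look roughly like the sketch below. The function name filter_recent_vendors is hypothetical; the 'Date' and 'Description' column names follow the CSV format this PR documents.

```python
import pandas as pd


def filter_recent_vendors(df: pd.DataFrame, recency_days: int = 90) -> pd.DataFrame:
    """Keep only vendors whose last transaction is recent (illustrative sketch)."""
    dates = pd.to_datetime(df["Date"])
    # Anchor the window on the newest row in the data, not on today's date,
    # so an old export is still analysed relative to its own time span.
    cutoff = dates.max() - pd.Timedelta(days=recency_days)
    # Latest transaction date per vendor, broadcast back to every row.
    last_seen = dates.groupby(df["Description"]).transform("max")
    return df[last_seen >= cutoff]
```

With this approach a vendor keeps all of its rows as long as its most recent charge falls inside the window, which preserves the history needed to detect the recurring pattern.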
…uping

- Added 'merge_similar_descriptions' function using 'difflib' to cluster vendor names.
- Replaces hardcoded regex logic with a data-driven approach (prefix + sequence ratio > 0.7).
- Ensures variations like 'MDC TALQUIN ELE...' and 'MDC TALQUIN ELECTRIC...' are grouped under a single vendor without PII in code.
- Fix NameError by defining merge_similar_descriptions and importing difflib.
- Completes the fuzzy matching feature integration.
- Add '--min-transaction-amount' (default 10.0) and '--max-transaction-amount' (default 10000.0) arguments to argparse.
- Update the filtering logic for subscription candidates to use these new configurable bounds.
- This allows identifying large recurring expenses like mortgages which were previously excluded by a low upper bound.
- Re-introduced 'cluster_amounts' logic to group transactions by amount similarity (default 15% threshold).
- Applied amount clustering *after* vendor description grouping.
- Updated 'get_subscription_candidates' to group by both Description and Amount.
- Effectively separates recurring monthly payments (e.g., mortgage) from one-off outliers (e.g., downpayments) under the same vendor.
- Updated 'get_subscription_candidates' to dynamically handle the column structure when grouping by both 'Description' and 'Amount'.
- Removed temporary debug prints from 'cluster_amounts'.
- Ensures that 'Movement Mortgage' payments are correctly separated from downpayments, with the outlier properly filtered out based on transaction count.
- Rewrote 'interpret.py' to ensure correct column handling when grouping by multiple keys.
- 'get_subscription_candidates' now robustly handles 7 columns (Description + Amount + aggs) by dropping the redundant average column.
- Removed debug prints.
- Verified that 'Movement Mortgage' downpayment outlier is successfully separated and filtered out, leaving only the recurring subscription.
- Modified 'cluster_amounts' to return only the modified 'Amount' Series instead of the full DataFrame slice.
- This conforms to recommended pandas patterns for 'groupby().apply()' when modifying a single column, effectively silencing the DeprecationWarning.
- Ensures correct behavior and forward compatibility with future pandas versions.
- Renamed 'cluster_amounts' to '_cluster_amounts_series' and refactored it to operate on a pandas Series.
- Switched from 'groupby().apply()' to 'groupby().transform()' for amount clustering.
- This approach correctly updates the 'Amount' column group-wise, preserves DataFrame structure, and effectively silences the DeprecationWarning.
- Ensures forward compatibility and improved performance for this operation.
- Created 'ignore_subscriptions.example.txt' with sample vendors.
- Added 'ignore_subscriptions.txt' to .gitignore for custom user ignores.
- Updated 'interpret.py' to load ignore patterns from '--ignore-file' (default 'ignore_subscriptions.txt').
- Implemented filtering logic to exclude transactions where the normalized description matches any ignored pattern.
- Fixed AttributeError by properly adding the '--ignore-file' argument to the argparse parser.
- Ensures the ignore logic can access the specified file path.
- Implemented bucketing by first character in 'merge_similar_descriptions' to shrink the candidates compared in each iteration from N to roughly N/26 (assuming names spread evenly across first letters).
- Added a length-based heuristic check to skip expensive 'difflib.SequenceMatcher' calculations if the maximum possible ratio is below the threshold.
- Significantly reduces processing time for large datasets with many unique vendor descriptions.
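
To make the length-based heuristic above concrete, here is a hedged sketch of fuzzy vendor merging with difflib. It is a simplified stand-in, not this PR's merge_similar_descriptions: it uses only the sequence ratio with the 0.7 threshold mentioned earlier (the prefix check is omitted), and the helper max_possible_ratio and the shortest-name-wins ordering are assumptions for the example. The bound works because SequenceMatcher.ratio() equals 2*M/T, where M (matched characters) cannot exceed the shorter string's length and T is the combined length.

```python
from difflib import SequenceMatcher


def max_possible_ratio(a: str, b: str) -> float:
    """Upper bound on SequenceMatcher.ratio() derived from the lengths alone."""
    total = len(a) + len(b)
    return 1.0 if total == 0 else 2.0 * min(len(a), len(b)) / total


def merge_similar_descriptions(names, threshold: float = 0.7) -> dict:
    """Map each vendor name to a representative name (illustrative sketch)."""
    representatives = []
    mapping = {}
    # Shortest names first, so the shortest variant becomes the representative.
    for name in sorted(set(names), key=lambda n: (len(n), n)):
        for rep in representatives:
            # Cheap check: skip the expensive comparison when the lengths
            # alone make it impossible to exceed the threshold.
            if max_possible_ratio(name, rep) <= threshold:
                continue
            if SequenceMatcher(None, name, rep).ratio() > threshold:
                mapping[name] = rep
                break
        else:
            # No existing representative was close enough: start a new group.
            representatives.append(name)
            mapping[name] = name
    return mapping
```

The standard library exposes the same length-only bound as SequenceMatcher.real_quick_ratio(); the standalone helper is used here only to avoid constructing a matcher before the cheap check.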
…ng accuracy

- Removed first-character bucketing from 'merge_similar_descriptions' as it prevented matching variations with different prefixes (e.g., 'COFBNDRCT' vs 'SP COFBNDRCT').
- Retained the length-based heuristic optimization to maintain reasonable performance gains.
- Restores the detection of 'COFBNDRCT' subscriptions.
- Documented all new command-line arguments (--threshold, --recency-days, --min/max-transaction-amount, --ignore-file, --debug).
- Added section on how to use the ignore file.
- Added 'How It Works' section explaining the normalization, fuzzy matching, and clustering pipeline.
- Detailed the expected CSV columns: Date, Description, and Amount.
- Listed all recognized variations for each column, directly from utils.py's standard_columns.
- Explained the script's automatic column translation and unification capabilities.
- Improved clarity for users preparing input data.