Lots of updates! by greg-randall · Pull Request #1 · sudo-sein/subscription-finder

greg-randall · 2025-12-16T16:50:47Z

Vendor Grouping and Normalization

Fuzzy Matching: Implemented merge_similar_descriptions using difflib. This automatically groups
slight variations of vendor names (e.g., "Netflix" and "Netflix.com") without needing hardcoded
rules.
Text Normalization: Added a normalize_description function to standardize vendor names by removing
common location suffixes, "Transfer" prefixes, and special characters.
Amount Clustering: Added logic to group transactions from the same vendor that differ slightly in
price (e.g., due to currency fluctuations). This allows the tool to treat them as a single
subscription type while still distinguishing them from significant outliers (like a one-off
downpayment vs. a monthly fee).
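
The text normalization step described above could look roughly like the sketch below. Only the function name normalize_description and the three kinds of cleanup (location suffixes, "Transfer" prefixes, special characters) come from this PR; the regular expressions, and in particular the crude "trailing two-letter code" rule for locations, are assumptions for illustration and not the code in utils.py.

```python
import re


def normalize_description(description: str) -> str:
    """Illustrative normalizer; the patterns here are assumptions."""
    text = description.upper()
    # Drop a leading "TRANSFER", optionally followed by TO/FROM and a colon.
    text = re.sub(r"^TRANSFER\b\s*(?:(?:TO|FROM)\b)?[\s:]*", "", text)
    # Replace anything that is not a letter, digit, or space with a space.
    text = re.sub(r"[^A-Z0-9 ]+", " ", text).strip()
    # Assumed location rule: strip a trailing two-letter code such as "NY".
    text = re.sub(r"\s+[A-Z]{2}$", "", text)
    # Collapse runs of whitespace left behind by the steps above.
    return re.sub(r"\s+", " ", text).strip()


print(normalize_description("Transfer to: Spotify* Premium  NY"))
print(normalize_description("Netflix.com"))
```

A rule this blunt will also clip legitimate two-letter endings, so a real implementation would likely need a more careful list of location patterns.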

Configuration and Filtering

CLI Arguments: Replaced hardcoded values with command-line arguments using argparse. Users can now
customize:
- --threshold: The sensitivity for grouping similar transaction amounts.
- --recency-days: A filter to only show subscriptions active within a specific timeframe
  (defaulting to the last 90 days of the dataset).
- --min-transaction-amount / --max-transaction-amount: Lower and upper bounds on the
  transaction amounts considered, to target specific expense ranges.
- --debug: A flag for verbose output.
Ignore List: Added support for an external ignore file (--ignore-file), allowing users to exclude
specific vendors (e.g., grocery stores) from the analysis.
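
As a rough sketch, the options listed above might be declared with argparse as follows. The flag names and default values (15% threshold, 90 days, 10.0 and 10000.0 bounds, ignore_subscriptions.txt) are taken from the commit messages in this PR; the build_parser helper, the help strings, and the choice to express the threshold as a fraction (0.15) are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical helper; the real parser in interpret.py may differ."""
    parser = argparse.ArgumentParser(
        description="Find recurring subscriptions in a transaction export."
    )
    parser.add_argument("-t", "--threshold", type=float, default=0.15,
                        help="relative tolerance for grouping similar amounts")
    parser.add_argument("--recency-days", type=int, default=90,
                        help="only keep subscriptions seen this recently")
    parser.add_argument("--min-transaction-amount", type=float, default=10.0,
                        help="ignore transactions smaller than this")
    parser.add_argument("--max-transaction-amount", type=float, default=10000.0,
                        help="ignore transactions larger than this")
    parser.add_argument("--ignore-file", default="ignore_subscriptions.txt",
                        help="file listing vendors to exclude")
    parser.add_argument("--debug", action="store_true",
                        help="print verbose output")
    return parser


# Parse an explicit argument list so the example runs without a real CLI call.
args = build_parser().parse_args(["--threshold", "0.2", "--recency-days", "30"])
print(args.threshold, args.recency_days, args.max_transaction_amount, args.debug)
```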

Data Compatibility

Dynamic Column Detection: The script now identifies columns based on a list of standard
multilingual headers (English, Spanish, German, Polish, etc.) rather than fixed indices.
YNAB Support: Added specific handling for YNAB export formats, including detecting Outflow columns
and converting them to the expected negative values for expenses.
Currency Parsing: Updated clean_amount to handle both US (1,234.56) and European (1.234,56)
decimal formats.
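
One way to handle both formats is to treat whichever separator appears last as the decimal mark. The sketch below illustrates that idea; it reuses the name clean_amount from this PR, but the body is an assumption and not the implementation in utils.py. In particular, the rule for strings containing only commas (exactly three trailing digits means a thousands separator) is a guess at reasonable behavior.

```python
import re


def clean_amount(raw: str) -> float:
    """Illustrative parser for US (1,234.56) and European (1.234,56) amounts."""
    # Keep only digits, separators, and the minus sign (drops "$", "€", spaces).
    text = re.sub(r"[^\d,.\-]", "", raw.strip())
    last_dot, last_comma = text.rfind("."), text.rfind(",")
    if last_dot != -1 and last_comma != -1:
        if last_comma > last_dot:
            # European style: dots group thousands, comma is the decimal mark.
            text = text.replace(".", "").replace(",", ".")
        else:
            # US style: commas group thousands.
            text = text.replace(",", "")
    elif last_comma != -1:
        if len(text) - last_comma - 1 == 3:
            # Assumed: "1,234" is a thousands separator, not a decimal.
            text = text.replace(",", "")
        else:
            text = text.replace(",", ".")
    return float(text)


for sample in ["1,234.56", "1.234,56", "-12,50", "$1,234"]:
    print(sample, "->", clean_amount(sample))
```

A string with only dots, such as "1.234", is read as a decimal here, so European thousands without a decimal part remain ambiguous under this sketch.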

Reporting

Cost Estimation: Added a "Yearly Cost" calculation (monthly amount * 12).
Sorting: The output is now sorted by the estimated yearly cost to highlight the largest expenses
first.
Formatting: Financial values are now consistently formatted to two decimal places.
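
The reporting steps can be sketched on a small hand-made table. The column name Yearly_Cost and the to_string(float_format=...) call are mentioned in the commit messages of this PR; the sample vendors and amounts are invented for illustration. Because expenses are stored as negative values, an ascending sort puts the most expensive subscription first.

```python
import pandas as pd

# Invented sample data; expenses are negative, as elsewhere in this PR.
subs = pd.DataFrame({
    "Description": ["NETFLIX", "GYM", "CLOUD STORAGE"],
    "Amount": [-15.49, -40.0, -2.99],
})

# Estimate the yearly cost from the monthly amount.
subs["Yearly_Cost"] = subs["Amount"] * 12

# Ascending order on negative numbers lists the largest expense first.
subs = subs.sort_values("Yearly_Cost", ascending=True)

# Render every float with exactly two decimal places.
report = subs.to_string(index=False, float_format="{:.2f}".format)
print(report)
```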

Documentation

README: Updated the documentation to include a "CSV File Format" section, a guide to the new "How
it works" logic, and a full table of the new command-line usage arguments.

- Refactor find_data_start to use dynamic standard_columns
- Update read_csv to use comma separator for YNAB compatibility
- Enable auto-language detection for column translation
- Add logic to negate Outflow amounts
- Remove duplicate code blocks

- Calculate 'Yearly_Cost' based on monthly amount * 12
- Sort output by 'Yearly_Cost' ascending (expenses are negative, so the most expensive come first)

- Add 'cluster_amounts' function to group transaction amounts within 10% similarity
- Apply clustering before identifying subscriptions
- Import numpy for mean calculation
- Allows detecting price-adjusted subscriptions (e.g. price hikes) as a single subscription

- Introduce argparse to make the amount clustering threshold configurable via command-line (--threshold or -t)
- Set default clustering threshold to 15% as requested
- Update cluster_amounts function signature to accept the passed threshold

- Add 'normalize_description' to utils.py to standardize biller names (uppercase, remove location/noise, merge common patterns like MDC*TALQUIN and Paul's Termite)
- Update 'interpret.py' to group by 'Description' only, calculating mean amounts for groups
- Remove previous clustering logic as grouping by normalized description subsumes it
- Effectively merges recurring expenses from the same vendor even if amounts vary (e.g. utility bills)

- Use DataFrame.to_string(float_format='{:.2f}'.format) for precise output formatting
- Ensures consistent display of monetary values

- Add '--recency-days' argument (default 90) to filter out inactive subscriptions
- Filters based on the dataset's latest transaction date, not the current system date, to verify historical patterns accurately
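
A minimal sketch of this kind of filter is shown below. The 90-day default and the choice to anchor on the dataset's latest date come from the commit message; the filter_recent name, the column names, and the sample rows are assumptions for illustration.

```python
import pandas as pd


def filter_recent(df: pd.DataFrame, recency_days: int = 90) -> pd.DataFrame:
    """Keep vendors whose last transaction is within recency_days of the
    newest date in the data (hypothetical helper, not from interpret.py)."""
    latest = df["Date"].max()  # anchor on the dataset, not on today's date
    cutoff = latest - pd.Timedelta(days=recency_days)
    # Latest transaction date per vendor, aligned to the original rows.
    last_seen = df.groupby("Description")["Date"].transform("max")
    return df[last_seen >= cutoff]


transactions = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2024-01-01", "2024-05-01", "2024-06-01", "2023-11-15", "2024-01-15"]
    ),
    "Description": ["NETFLIX", "NETFLIX", "NETFLIX", "OLD GYM", "OLD GYM"],
    "Amount": [-15.49, -15.49, -15.49, -40.0, -40.0],
})

recent = filter_recent(transactions)
print(sorted(recent["Description"].unique()))
```

Anchoring on the latest date in the file means an export from last year is still analyzed as of the time it was made.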

- Removed specific logic for 'MDC*TALQUIN', 'Paul\'s Termite', and 'TRUIST' from normalize_description to avoid overfitting and PII usage.
- Retained generic text normalization (uppercase, location removal, special char cleanup).

…uping
- Added 'merge_similar_descriptions' function using 'difflib' to cluster vendor names.
- Replaces hardcoded regex logic with a data-driven approach (prefix + sequence ratio > 0.7).
- Ensures variations like 'MDC TALQUIN ELE...' and 'MDC TALQUIN ELECTRIC...' are grouped under a single vendor without PII in code.
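
The sketch below shows one possible reading of "prefix + sequence ratio > 0.7": two names are merged when one is a prefix of the other or when difflib's SequenceMatcher ratio exceeds 0.7. That interpretation, the greedy shortest-first ordering, and the dictionary return value are assumptions; only the function name, the use of difflib, and the 0.7 threshold come from this PR.

```python
from difflib import SequenceMatcher


def merge_similar_descriptions(descriptions, threshold=0.7):
    """Map each description to a canonical vendor name (illustrative only)."""
    canonical = []  # representatives chosen so far
    mapping = {}
    # Visit shorter names first so the shortest variant becomes the canonical one.
    for desc in sorted(set(descriptions), key=lambda s: (len(s), s)):
        for rep in canonical:
            is_prefix = desc.startswith(rep)
            is_close = SequenceMatcher(None, rep, desc).ratio() > threshold
            if is_prefix or is_close:
                mapping[desc] = rep
                break
        else:
            # No existing representative is close enough: start a new group.
            canonical.append(desc)
            mapping[desc] = desc
    return mapping


groups = merge_similar_descriptions(["NETFLIX", "NETFLIX.COM", "SPOTIFY"])
print(groups)
```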

- Fix NameError by defining merge_similar_descriptions and importing difflib.
- Completes the fuzzy matching feature integration.

- Add '--min-transaction-amount' (default 10.0) and '--max-transaction-amount' (default 10000.0) arguments to argparse.
- Update the filtering logic for subscription candidates to use these new configurable bounds.
- This allows identifying large recurring expenses like mortgages which were previously excluded by a low upper bound.

- Re-introduced 'cluster_amounts' logic to group transactions by amount similarity (default 15% threshold).
- Applied amount clustering *after* vendor description grouping.
- Updated 'get_subscription_candidates' to group by both Description and Amount.
- Effectively separates recurring monthly payments (e.g., mortgage) from one-off outliers (e.g., downpayments) under the same vendor.

- Updated 'get_subscription_candidates' to dynamically handle the column structure when grouping by both 'Description' and 'Amount'.
- Removed temporary debug prints from 'cluster_amounts'.
- Ensures that 'Movement Mortgage' payments are correctly separated from downpayments, with the outlier properly filtered out based on transaction count.

- Rewrote 'interpret.py' to ensure correct column handling when grouping by multiple keys.
- 'get_subscription_candidates' now robustly handles 7 columns (Description + Amount + aggs) by dropping the redundant average column.
- Removed debug prints.
- Verified that 'Movement Mortgage' downpayment outlier is successfully separated and filtered out, leaving only the recurring subscription.

- Modified 'cluster_amounts' to return only the modified 'Amount' Series instead of the full DataFrame slice.
- This conforms to recommended pandas patterns for 'groupby().apply()' when modifying a single column, effectively silencing the DeprecationWarning.
- Ensures correct behavior and forward compatibility with future pandas versions.

… apply" This reverts commit f0ccf76.

- Renamed 'cluster_amounts' to '_cluster_amounts_series' and refactored it to operate on a pandas Series.
- Switched from 'groupby().apply()' to 'groupby().transform()' for amount clustering.
- This approach correctly updates the 'Amount' column group-wise, preserves DataFrame structure, and effectively silences the DeprecationWarning.
- Ensures forward compatibility and improved performance for this operation.
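
To illustrate the groupby().transform() pattern, here is a sketch of a Series-based clustering helper. The name _cluster_amounts_series and the 15% default are from this PR; the clustering rule itself (walk the sorted amounts and start a new cluster whenever a value is more than the threshold away from the current cluster mean) is an assumption, as is the sample data.

```python
import pandas as pd


def _cluster_amounts_series(amounts: pd.Series, threshold: float = 0.15) -> pd.Series:
    """Replace each amount with the mean of its cluster (illustrative rule)."""
    result = amounts.astype(float).copy()
    clusters = []  # each cluster is a list of index labels
    for label, value in amounts.sort_values().items():
        if clusters:
            center = amounts.loc[clusters[-1]].mean()
            if abs(value - center) <= threshold * abs(center):
                clusters[-1].append(label)
                continue
        clusters.append([label])
    for labels in clusters:
        result.loc[labels] = amounts.loc[labels].mean()
    return result  # same index as the input, as transform() expects


df = pd.DataFrame({
    "Description": ["MORTGAGE"] * 4 + ["NETFLIX"] * 2,
    "Amount": [-1500.0, -1510.0, -1490.0, -20000.0, -15.49, -17.99],
})

# transform() returns a Series aligned with df, so it can be assigned back directly.
df["Amount"] = df.groupby("Description")["Amount"].transform(
    lambda s: _cluster_amounts_series(s, threshold=0.15)
)
print(df)
```

In this sample the three regular mortgage payments collapse to one amount while the one-off -20000.00 payment stays in its own cluster, so a later count-based filter can drop it.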

- Created 'ignore_subscriptions.example.txt' with sample vendors.
- Added 'ignore_subscriptions.txt' to .gitignore for custom user ignores.
- Updated 'interpret.py' to load ignore patterns from '--ignore-file' (default 'ignore_subscriptions.txt').
- Implemented filtering logic to exclude transactions where the normalized description matches any ignored pattern.
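
The ignore logic might be sketched as follows. The file format assumed here (one pattern per line, blank lines and '#' comments skipped), the uppercase substring match, and the helper names load_ignore_patterns and is_ignored are all assumptions; the PR only states that patterns are loaded from the file and compared against the normalized description.

```python
from pathlib import Path


def load_ignore_patterns(path: str) -> list[str]:
    """Read ignore patterns from a text file (assumed format: one per line)."""
    file = Path(path)
    if not file.exists():
        return []  # a missing ignore file simply means nothing is ignored
    patterns = []
    for line in file.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            patterns.append(line.upper())
    return patterns


def is_ignored(normalized_description: str, patterns: list[str]) -> bool:
    """True if any pattern occurs inside the (uppercase) description."""
    return any(pattern in normalized_description for pattern in patterns)


# In-memory patterns so the example does not depend on a file on disk.
patterns = ["KROGER", "WHOLE FOODS"]
print(is_ignored("KROGER 123", patterns), is_ignored("NETFLIX", patterns))
```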

- Fixed AttributeError by properly adding the '--ignore-file' argument to the argparse parser.
- Ensures the ignore logic can access the specified file path.

- Implemented bucketing by first character in 'merge_similar_descriptions' to reduce search space from O(N) to O(N/26) for each iteration.
- Added a length-based heuristic check to skip expensive 'difflib.SequenceMatcher' calculations if the maximum possible ratio is below the threshold.
- Significantly reduces processing time for large datasets with many unique vendor descriptions.

…ng accuracy
- Removed first-character bucketing from 'merge_similar_descriptions' as it prevented matching variations with different prefixes (e.g., 'COFBNDRCT' vs 'SP COFBNDRCT').
- Retained the length-based heuristic optimization to maintain reasonable performance gains.
- Restores the detection of 'COFBNDRCT' subscriptions.
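
The length-based heuristic can be sketched as below. SequenceMatcher.ratio() is 2*M/T, where M is the number of matching characters and T the combined length, and M can never exceed the shorter string's length, so 2*min(len)/T is an upper bound on the ratio. The helper names max_possible_ratio and is_similar are hypothetical, and the real check in merge_similar_descriptions may differ.

```python
from difflib import SequenceMatcher


def max_possible_ratio(a: str, b: str) -> float:
    """Upper bound on SequenceMatcher.ratio() derived from lengths alone."""
    total = len(a) + len(b)
    return 2.0 * min(len(a), len(b)) / total if total else 1.0


def is_similar(a: str, b: str, threshold: float = 0.7) -> bool:
    # Cheap check first: if even a perfect overlap cannot beat the threshold,
    # skip the expensive comparison entirely.
    if max_possible_ratio(a, b) <= threshold:
        return False
    return SequenceMatcher(None, a, b).ratio() > threshold


print(is_similar("COFBNDRCT", "SP COFBNDRCT"))
print(is_similar("AB", "ABCDEFGHIJKL"))
```

difflib exposes the same bound as SequenceMatcher.real_quick_ratio(), which could be used in place of the hand-written helper.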

- Documented all new command-line arguments (--threshold, --recency-days, --min/max-transaction-amount, --ignore-file, --debug).
- Added section on how to use the ignore file.
- Added 'How It Works' section explaining the normalization, fuzzy matching, and clustering pipeline.

- Detailed the expected CSV columns: Date, Description, and Amount.
- Listed all recognized variations for each column, directly from utils.py's standard_columns.
- Explained the script's automatic column translation and unification capabilities.
- Improved clarity for users preparing input data.

greg-randall added 29 commits December 16, 2025 09:40

added a few more words for the description/amount

e46934d

exit if no heading row found

0035603

fix(utils): enhance clean_amount to handle US/EU currency formats

fde0a98

feat(interpret): sort subscriptions by estimated yearly cost

c99b2ba

feat(interpret): format Amount and Yearly_Cost to two decimal places

09a2a06

feat(interpret): filter subscriptions by recency

1b279f5

fix(interpret): add missing merge_similar_descriptions function

ff1d266

Revert "fix(interpret): silence DeprecationWarning in cluster_amounts…

53163b4

… apply" This reverts commit f0ccf76.

fix(interpret): add missing --ignore-file argument definition

f960052

adding ignores functionality

d692e5f

formatting

28744ae

greg-randall wants to merge 29 commits into sudo-sein:main from greg-randall:main