- Refactor find_data_start to use dynamic standard_columns - Update read_csv to use comma separator for YNAB compatibility - Enable auto-language detection for column translation - Add logic to negate Outflow amounts - Remove duplicate code blocks
- Calculate 'Yearly_Cost' as the monthly amount * 12 - Sort output by 'Yearly_Cost' ascending (expenses are negative, so the most expensive expenses come first)
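A minimal sketch of the yearly-cost step, assuming expenses are already stored as negative amounts. The sample vendors and values are made up for illustration and are not taken from the PR.

```python
import pandas as pd

# Hypothetical monthly averages per vendor; expenses are negative.
candidates = pd.DataFrame({
    "Description": ["STREAMING", "MORTGAGE", "GYM"],
    "Amount": [-15.0, -1800.0, -40.0],
})

# Yearly cost is the monthly amount times 12.
candidates["Yearly_Cost"] = candidates["Amount"] * 12

# Ascending order puts the most negative (most expensive) rows first.
ranked = candidates.sort_values("Yearly_Cost", ascending=True).reset_index(drop=True)
print(ranked)
```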
- Add 'cluster_amounts' function to group transaction amounts within 10% similarity - Apply clustering before identifying subscriptions - Import numpy for mean calculation - Allows detecting price-adjusted subscriptions (e.g. price hikes) as a single subscription
- Introduce argparse to make the amount clustering threshold configurable via command-line (--threshold or -t) - Set default clustering threshold to 15% as requested - Update cluster_amounts function signature to accept the passed threshold
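The clustering idea can be sketched without pandas. This is an illustrative greedy version, not the code from the PR: the real 'cluster_amounts' uses numpy, and both the clustering strategy and the choice to express 15% as `0.15` are assumptions here.

```python
import argparse
from statistics import mean


def cluster_amounts(amounts, threshold=0.15):
    """Replace each amount with the mean of its cluster.

    Greedy sketch: walk the amounts in order of magnitude and keep adding to
    the current cluster while the next value is within `threshold` (relative)
    of the cluster's mean magnitude.
    """
    clusters = []  # each cluster is a list of positions in `amounts`
    for i in sorted(range(len(amounts)), key=lambda k: abs(amounts[k])):
        if clusters:
            center = mean(abs(amounts[j]) for j in clusters[-1])
            if abs(abs(amounts[i]) - center) <= threshold * center:
                clusters[-1].append(i)
                continue
        clusters.append([i])

    result = list(amounts)
    for members in clusters:
        center = mean(amounts[j] for j in members)
        for j in members:
            result[j] = center
    return result


def build_parser():
    """CLI sketch: only the --threshold / -t option is shown."""
    parser = argparse.ArgumentParser(description="Detect recurring expenses")
    parser.add_argument("-t", "--threshold", type=float, default=0.15,
                        help="relative amount similarity used for clustering")
    return parser


args = build_parser().parse_args(["-t", "0.15"])
# A price hike from 9.99 to 10.99 stays in one cluster; 500.00 does not.
clustered = cluster_amounts([-9.99, -10.99, -500.0, -10.49], args.threshold)
print(clustered)
```

Lowering the threshold (for example `-t 0.01`) leaves all four sample amounts in separate clusters.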
- Add 'normalize_description' to utils.py to standardize biller names (uppercase, remove location/noise, merge common patterns like MDC*TALQUIN and Paul's Termite) - Update 'interpret.py' to group by 'Description' only, calculating mean amounts for groups - Remove previous clustering logic as grouping by normalized description subsumes it - Effectively merges recurring expenses from the same vendor even if amounts vary (e.g. utility bills)
- Use DataFrame.to_string(float_format='{:.2f}'.format) for precise output formatting
- Ensures consistent display of monetary values
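For reference, a small example of what the `float_format` argument does. The DataFrame contents are invented for the demonstration.

```python
import pandas as pd

# Hypothetical result table with unrounded floats.
report = pd.DataFrame({
    "Description": ["STREAMING"],
    "Amount": [-10.489999],
    "Yearly_Cost": [-125.879988],
})

# Every float column is rendered with exactly two decimal places.
text = report.to_string(index=False, float_format="{:.2f}".format)
print(text)
```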
- Add '--recency-days' argument (default 90) to filter out inactive subscriptions - Filters based on the dataset's latest transaction date, not the current system date, to verify historical patterns accurately
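A sketch of the recency rule using only the standard library. The function name `filter_recent` and the dict-of-dates input are hypothetical; the point is that the cutoff is anchored to the newest date in the data rather than to today.

```python
from datetime import date, timedelta


def filter_recent(last_seen, recency_days=90):
    """Keep vendors whose latest transaction falls within `recency_days`
    of the newest transaction date in the dataset."""
    if not last_seen:
        return {}
    latest = max(last_seen.values())          # dataset's latest date, not date.today()
    cutoff = latest - timedelta(days=recency_days)
    return {vendor: seen for vendor, seen in last_seen.items() if seen >= cutoff}


# Hypothetical "last transaction" dates per vendor.
last_seen = {
    "GYM": date(2023, 6, 1),
    "OLD MAGAZINE": date(2022, 12, 1),
    "STREAMING": date(2023, 6, 30),
}
active = filter_recent(last_seen, recency_days=90)
print(sorted(active))
```

Because the cutoff is relative to the data, an export from years ago still yields the subscriptions that were active at the end of that export.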
- Removed specific logic for 'MDC*TALQUIN', 'Paul's Termite', and 'TRUIST' from normalize_description to avoid overfitting and PII usage. - Retained generic text normalization (uppercase, location removal, special char cleanup).
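The generic normalization might look roughly like the sketch below. The exact patterns in utils.py are not shown on this page, so every regular expression here is an assumption, and the trailing two-letter-code rule is deliberately crude.

```python
import re


def normalize_description(text):
    """Generic cleanup sketch: uppercase, strip a 'Transfer' prefix,
    replace special characters, drop a trailing two-letter region code."""
    text = text.upper().strip()
    # Assumed pattern: "TRANSFER TO: ..." / "TRANSFER FROM ..." prefixes.
    text = re.sub(r"^TRANSFER\s+(?:TO|FROM)?\s*:?\s*", "", text)
    # Replace anything that is not a letter, digit, or space.
    text = re.sub(r"[^A-Z0-9 ]+", " ", text)
    # Collapse runs of whitespace.
    text = re.sub(r"\s+", " ", text).strip()
    # Assumed location suffix: a trailing two-letter code such as "NY".
    text = re.sub(r"\s+[A-Z]{2}$", "", text)
    return text


print(normalize_description("Spotify*Premium NY"))
print(normalize_description("Transfer to: Savings #12"))
```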
…uping - Added 'merge_similar_descriptions' function using 'difflib' to cluster vendor names. - Replaces hardcoded regex logic with a data-driven approach (prefix + sequence ratio > 0.7). - Ensures variations like 'MDC TALQUIN ELE...' and 'MDC TALQUIN ELECTRIC...' are grouped under a single vendor without PII in code.
- Fix NameError by defining merge_similar_descriptions and importing difflib. - Completes the fuzzy matching feature integration.
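A sketch of how such a difflib-based merge could work. The commit only says "prefix + sequence ratio > 0.7", so reading that as "one name is a prefix of the other, or the ratio exceeds 0.7" is an assumption, as is picking the shortest name as the canonical one.

```python
from difflib import SequenceMatcher


def merge_similar_descriptions(descriptions, min_ratio=0.7):
    """Map every description to a canonical vendor name.

    Names are visited shortest first; a name joins an existing canonical
    name if it starts with it or if the SequenceMatcher ratio is high enough.
    """
    canonical = []
    mapping = {}
    for desc in sorted(set(descriptions), key=lambda s: (len(s), s)):
        match = next(
            (c for c in canonical
             if desc.startswith(c)
             or SequenceMatcher(None, c, desc).ratio() > min_ratio),
            None,
        )
        if match is None:
            canonical.append(desc)
            match = desc
        mapping[desc] = match
    return mapping


# Made-up vendor names for illustration.
names = ["NETFLIX", "NETFLIX.COM", "SPOTIFY",
         "MDC TALQUIN ELE", "MDC TALQUIN ELECTRIC"]
mapping = merge_similar_descriptions(names)
print(mapping)
```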
- Add '--min-transaction-amount' (default 10.0) and '--max-transaction-amount' (default 10000.0) arguments to argparse. - Update the filtering logic for subscription candidates to use these new configurable bounds. - This allows identifying large recurring expenses like mortgages which were previously excluded by a low upper bound.
- Re-introduced 'cluster_amounts' logic to group transactions by amount similarity (default 15% threshold). - Applied amount clustering *after* vendor description grouping. - Updated 'get_subscription_candidates' to group by both Description and Amount. - Effectively separates recurring monthly payments (e.g., mortgage) from one-off outliers (e.g., downpayments) under the same vendor.
- Updated 'get_subscription_candidates' to dynamically handle the column structure when grouping by both 'Description' and 'Amount'. - Removed temporary debug prints from 'cluster_amounts'. - Ensures that 'Movement Mortgage' payments are correctly separated from downpayments, with the outlier properly filtered out based on transaction count.
- Rewrote 'interpret.py' to ensure correct column handling when grouping by multiple keys. - 'get_subscription_candidates' now robustly handles 7 columns (Description + Amount + aggs) by dropping the redundant average column. - Removed debug prints. - Verified that 'Movement Mortgage' downpayment outlier is successfully separated and filtered out, leaving only the recurring subscription.
- Modified 'cluster_amounts' to return only the modified 'Amount' Series instead of the full DataFrame slice. - This conforms to recommended pandas patterns for 'groupby().apply()' when modifying a single column, effectively silencing the DeprecationWarning. - Ensures correct behavior and forward compatibility with future pandas versions.
… apply" This reverts commit f0ccf76.
- Renamed 'cluster_amounts' to '_cluster_amounts_series' and refactored it to operate on a pandas Series. - Switched from 'groupby().apply()' to 'groupby().transform()' for amount clustering. - This approach correctly updates the 'Amount' column group-wise, preserves DataFrame structure, and effectively silences the DeprecationWarning. - Ensures forward compatibility and improved performance for this operation.
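The `transform` pattern can be sketched as follows. The body of `_cluster_amounts_series` is an illustrative greedy clustering, not the PR's implementation, and the minimum count of 2 used to drop outliers is a hypothetical value. The sketch assumes the DataFrame index is unique.

```python
import pandas as pd


def _cluster_amounts_series(amounts, threshold=0.15):
    """Return a Series with the same index in which amounts that lie within
    `threshold` (relative) of a cluster's mean are replaced by that mean."""
    result = amounts.astype(float).copy()
    members = []  # index labels of the cluster being built
    for label in amounts.abs().sort_values().index:
        if members:
            center = amounts.loc[members].abs().mean()
            if abs(abs(amounts.loc[label]) - center) > threshold * center:
                # Close the current cluster and start a new one.
                result.loc[members] = amounts.loc[members].mean()
                members = []
        members.append(label)
    if members:
        result.loc[members] = amounts.loc[members].mean()
    return result


# Made-up transactions: two monthly payments and one downpayment.
df = pd.DataFrame({
    "Description": ["MORTGAGE", "MORTGAGE", "MORTGAGE", "GYM", "GYM"],
    "Amount": [-1800.0, -1820.0, -30000.0, -40.0, -40.0],
})

# transform() returns a result aligned with the original rows,
# so it can be assigned straight back to the column.
df["Amount"] = df.groupby("Description")["Amount"].transform(
    lambda s: _cluster_amounts_series(s, threshold=0.15)
)

# Group by vendor and clustered amount; single-occurrence outliers drop out.
counts = df.groupby(["Description", "Amount"]).size()
recurring = counts[counts >= 2]
print(recurring)
```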
- Created 'ignore_subscriptions.example.txt' with sample vendors. - Added 'ignore_subscriptions.txt' to .gitignore for custom user ignores. - Updated 'interpret.py' to load ignore patterns from '--ignore-file' (default 'ignore_subscriptions.txt'). - Implemented filtering logic to exclude transactions where the normalized description matches any ignored pattern.
- Fixed AttributeError by properly adding the '--ignore-file' argument to the argparse parser. - Ensures the ignore logic can access the specified file path.
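One way the ignore file could be loaded and applied is sketched below. The file format (one pattern per line, `#` comments, case-insensitive substring match) and the helper names are assumptions; the page does not show the actual format.

```python
import os
import tempfile
from pathlib import Path


def load_ignore_patterns(path):
    """Read one pattern per line; skip blank lines and '#' comments.
    A missing file simply means nothing is ignored."""
    file = Path(path)
    if not file.exists():
        return []
    patterns = []
    for line in file.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            patterns.append(line.upper())
    return patterns


def is_ignored(normalized_description, patterns):
    """True if any pattern occurs inside the (already uppercased) description."""
    return any(pattern in normalized_description for pattern in patterns)


# Demonstration with a temporary ignore file.
with tempfile.TemporaryDirectory() as tmp:
    ignore_path = os.path.join(tmp, "ignore_subscriptions.txt")
    Path(ignore_path).write_text("# groceries\nkroger\n\nWalmart\n", encoding="utf-8")
    patterns = load_ignore_patterns(ignore_path)
    missing = load_ignore_patterns(os.path.join(tmp, "does_not_exist.txt"))

print(patterns)
print(is_ignored("KROGER 1234", patterns), is_ignored("NETFLIX COM", patterns))
```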
- Implemented bucketing by first character in 'merge_similar_descriptions' to cut the candidates compared in each iteration from N to roughly N/26 (when names are spread across initial letters). - Added a length-based heuristic check to skip expensive 'difflib.SequenceMatcher' calculations if the maximum possible ratio is below the threshold. - Significantly reduces processing time for large datasets with many unique vendor descriptions.
…ng accuracy - Removed first-character bucketing from 'merge_similar_descriptions' as it prevented matching variations with different prefixes (e.g., 'COFBNDRCT' vs 'SP COFBNDRCT'). - Retained the length-based heuristic optimization to maintain reasonable performance gains. - Restores the detection of 'COFBNDRCT' subscriptions.
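The length-based check relies on how `SequenceMatcher.ratio()` is defined: 2·M / T, where M is the number of matched characters and T the combined length. M can never exceed the shorter string's length, which gives a cheap upper bound (the same value `real_quick_ratio()` returns). The helper names below are illustrative, not the PR's.

```python
from difflib import SequenceMatcher


def max_possible_ratio(a, b):
    """Upper bound on SequenceMatcher.ratio() derived from lengths alone."""
    total = len(a) + len(b)
    return 1.0 if total == 0 else 2.0 * min(len(a), len(b)) / total


def is_similar(a, b, min_ratio=0.7):
    """Run the expensive comparison only if the bound allows a match."""
    if max_possible_ratio(a, b) <= min_ratio:
        return False
    return SequenceMatcher(None, a, b).ratio() > min_ratio


# Different prefixes still match, which first-character bucketing prevented.
print(is_similar("COFBNDRCT", "SP COFBNDRCT"))
# Very different lengths are rejected without running the full comparison.
print(is_similar("GYM", "MDC TALQUIN ELECTRIC"))
```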
- Documented all new command-line arguments (--threshold, --recency-days, --min/max-transaction-amount, --ignore-file, --debug). - Added section on how to use the ignore file. - Added 'How It Works' section explaining the normalization, fuzzy matching, and clustering pipeline.
- Detailed the expected CSV columns: Date, Description, and Amount. - Listed all recognized variations for each column, directly from utils.py's standard_columns. - Explained the script's automatic column translation and unification capabilities. - Improved clarity for users preparing input data.
slight variations of vendor names (e.g., "Netflix" and "Netflix.com") without needing hardcoded
rules.
common location suffixes, "Transfer" prefixes, and special characters.
price (e.g., due to currency fluctuations). This allows the tool to treat them as a single
subscription type while still distinguishing them from significant outliers (like a one-off
downpayment vs. a monthly fee).
customize:
(defaulting to the last 90 days of the dataset).
ranges.
specific vendors (e.g., grocery stores) from the analysis.
multilingual headers (English, Spanish, German, Polish, etc.) rather than fixed indices.
and converting them to the expected negative values for expenses.
decimal formats.
first.
it works" logic, and a full table of the new command-line usage arguments.