This document is a living overview of the codebase's architecture. It is meant to give agents a fast, working understanding of the project so they can navigate it and contribute effectively from day one. Update this document as the codebase evolves.
This section gives a high-level overview of the project's directory and file structure, grouped by architectural layer or major functional area. Use it to locate files quickly and to understand how the codebase is organized and how concerns are separated.
whatsthedamage/
├── config/ # Configuration files
│ └── gunicorn_conf.py # Gunicorn production configuration
├── docs/ # Project documentation
│ ├── calculator_pattern_example.py
│ └── scripts/README.md # ML documentation
├── src/whatsthedamage/ # Main source code
│ ├── api/ # REST API endpoints
│ │ ├── v2/ # API v2 endpoints and schemas
│ │ │ ├── endpoints.py # API v2 processing endpoints
│ │ │ └── schema.py # API response schemas
│ │ ├── docs.py # API documentation
│ │ └── helpers.py # API helper functions
│ ├── config/ # Configuration classes
│ │ ├── config.py # Central configuration
│ │ ├── config.yml.default # Default configuration template
│ │ ├── dt_models.py # Data models for API responses
│ │ ├── exclusions.json # ExclusionService configuration
│ │ ├── flask_config.py # Flask-specific configuration
│ │ ├── ml_config.py # ML configuration
│ │ └── text_config.py # TextCorrectionService configuration
│ ├── controllers/ # Request handling
│ │ ├── cli_controller.py # CLI argument parsing
│ │ ├── ml_cli.py # ML CLI interface
│ │ ├── routes.py # Web routes
│ │ └── routes_helpers.py # Web route helpers
│ ├── models/ # Data models and processing
│ │ ├── api_models.py # API models
│ │ ├── csv_file_handler.py # CSV file parsing
│ │ ├── csv_processor.py # CSV processing orchestrator
│ │ ├── csv_row.py # Transaction row model
│ │ ├── dt_calculators.py # Calculator pattern implementations
│ │ ├── dt_models.py # DataTable related models
│ │ ├── dt_response_builder.py # DataTables response builder
│ │ ├── machine_learning.py # ML model training and inference
│ │ ├── row_enrichment.py # Regex-based categorization
│ │ ├── row_enrichment_ml.py # ML-based categorization
│ │ ├── row_filter.py # Date filtering
│ │ ├── rows_processor.py # Main processing pipeline
│ │ └── statistical_algorithms.py # Statistical analysis
│ ├── scripts/ # placeholder
│ ├── services/ # Business logic services
│ │ ├── cache_service.py # Caching service
│ │ ├── configuration_service.py # Configuration loading
│ │ ├── data_formatting_service.py # Output formatting (deprecated)
│ │ ├── drilldown_service.py # Drilldown functionality
│ │ ├── exclusion_service.py # Exclusion handling
│ │ ├── file_upload_service.py # File upload handling
│ │ ├── id_mapping_service.py # ID mapping for secure URLs
│ │ ├── ml_service.py # ML business logic orchestration
│ │ ├── processing_service.py # Core processing service
│ │ ├── response_builder_service.py # Response construction (deprecated)
│ │ ├── response_formatting_service.py # Unified formatting & response building
│ │ ├── service_container.py # Service container factory
│ │ ├── session_service.py # Web session management
│ │ ├── smote_service.py # SMOTE synthetic data generation
│ │ ├── statistical_analysis_service.py # Statistical analysis
│ │ ├── text_correction_service.py # Text cleaning for ML
│ │ └── validation_service.py # File validation
│ ├── static/ # Backend static assets
│ ├── utils/ # Utility functions
│ │ ├── data_loader.py # Data loading utils for Machine Learning
│ │ ├── date_converter.py # Date parsing/formatting
│ │ ├── flask_locale.py # Flask localization
│ │ ├── logging.py # Centralized logging utils
│ │ ├── validation.py # Validation utilities
│ │ └── version.py # Version management
│ ├── view/ # Presentation layer
│ │ ├── frontend/ # TypeScript frontend
│ │ │ ├── src/ # Frontend sources
│ │ │ │ ├── main.ts # Main entry point
│ │ │ │ ├── js/ # TypeScript modules
│ │ │ │ ├── css/ # CSS files
│ │ │ │ └── types/ # Type definitions
│ │ │ ├── package.json # npm dependencies
│ │ │ ├── vite.config.js # Vite configuration
│ │ │ └── public/ # Public assets
│ │ ├── static/ # Flask static files
│ │ │ └── dist/ # Frontend build output
│ │ ├── templates/ # Jinja2 templates
│ │ ├── forms.py # Flask forms
│ │ └── row_printer.py # Console output formatting
│ └── uploads/ # File uploads
├── tests/ # Backend tests
│ ├── services/ # Service layer tests
│ └── ... # Other test files
├── .github/ # GitHub configurations
├── .gitignore # Git ignore patterns
├── API.md # REST API documentation
├── ARCHITECTURE.md # This document
├── CONTRIBUTING.md # Contribution guidelines
├── LICENSE # License information
├── Makefile # Build automation
├── PRODUCT.md # Product information
├── README.md # Project overview
├── pyproject.toml # Python project metadata
├── requirements.txt # Python dependencies
├── requirements-dev.txt # Development dependencies
└── requirements-web.txt # Web-specific dependencies
The diagram below shows the major components and how they interact. All three interfaces (CLI, Web, REST API) route requests to the same ProcessingService, which delegates CSV parsing to the model layer.
[User] <--> [CLI Interface] <--> [ProcessingService] <--> [CSVProcessor] <--> [CsvFileHandler]
   |
   +--> [Web Interface] <--> [Flask App] <--> [ProcessingService]
   |
   +--> [REST API v2] <--> [ProcessingService]
The system follows a layered architecture with clear separation of concerns:
- Presentation Layer: CLI, Web, and REST API interfaces
- Service Layer: Business logic services (ProcessingService, ValidationService, etc.)
- Model Layer: Data processing and domain logic (CSVProcessor, RowsProcessor, etc.)
- Configuration Layer: Centralized configuration management
- Utility Layer: Cross-cutting concerns (localization, date handling)
Name: Web Application
Description: The main user interface for interacting with whatsthedamage, allowing users to upload CSV files, configure processing options, and view transaction analysis results. The web interface uses server-side rendering with Flask templates and progressive enhancement with TypeScript.
Technologies: Flask (Jinja2 templates), TypeScript, Vite, Bootstrap, DataTables
Deployment: Flask development server (make web), Gunicorn for production
Name: Processing Service
Description: Core business logic service that orchestrates CSV transaction processing. This service handles file parsing, transaction categorization (regex or ML), filtering, and aggregation. It provides a unified interface used by all delivery mechanisms (CLI, Web, API).
Technologies: Python, Flask extensions for dependency injection
Deployment: Part of the Flask application
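The pipeline described above (categorize, filter, aggregate) can be sketched roughly as follows. This is a minimal illustration, not the actual ProcessingService API: the `Transaction` dataclass, the `PATTERNS` table, and the `categorize` and `process` functions are all hypothetical names, and only the regex path is shown.

```python
import re
from collections import defaultdict
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Transaction:
    """Hypothetical stand-in for a parsed CSV row."""
    booked: date
    partner: str
    amount: float
    category: str = "Other"


# Hypothetical patterns; the real ones would come from the configuration.
PATTERNS = {
    "Grocery": re.compile(r"supermarket|grocer", re.IGNORECASE),
    "Transport": re.compile(r"taxi|railway", re.IGNORECASE),
}


def categorize(tx: Transaction) -> Transaction:
    """Assign the first category whose pattern matches the partner field."""
    for category, pattern in PATTERNS.items():
        if pattern.search(tx.partner):
            tx.category = category
            break
    return tx


def process(rows, start: Optional[date] = None, end: Optional[date] = None) -> dict:
    """Filter rows by date, categorize them, and sum amounts per category."""
    totals = defaultdict(float)
    for tx in rows:
        if start is not None and tx.booked < start:
            continue
        if end is not None and tx.booked > end:
            continue
        totals[categorize(tx).category] += tx.amount
    return dict(totals)
```

Because every interface would call one function like `process`, the CLI, web, and API layers stay thin and produce the same results for the same input.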
Name: Validation Service
Description: Handles file validation including type checking, size limits, and content integrity verification. Ensures uploaded files meet requirements before processing.
Technologies: Python
Deployment: Part of the Flask application
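A minimal sketch of the three checks named above (type, size, content), using only the standard library. The allowed extension set, the 10 MiB limit, the NUL-byte heuristic, and the `validate_upload` function are assumptions made for illustration; the real ValidationService may apply different rules.

```python
from pathlib import Path

ALLOWED_EXTENSIONS = {".csv"}        # assumption
MAX_SIZE_BYTES = 10 * 1024 * 1024    # assumption: 10 MiB limit


def validate_upload(path: Path) -> list[str]:
    """Return a list of problems; an empty list means the file passed."""
    errors: list[str] = []
    if path.suffix.lower() not in ALLOWED_EXTENSIONS:
        errors.append(f"unsupported extension: {path.suffix or '(none)'}")
    if not path.is_file():
        errors.append("file does not exist")
        return errors
    size = path.stat().st_size
    if size == 0:
        errors.append("file is empty")
    elif size > MAX_SIZE_BYTES:
        errors.append("file exceeds size limit")
    else:
        # Cheap content check: text exports should not contain NUL bytes.
        with path.open("rb") as fh:
            if b"\x00" in fh.read(4096):
                errors.append("file looks binary")
    return errors
```

Returning a list of messages rather than raising on the first failure lets the caller report every problem in one response.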
Name: Configuration Service
Description: Manages loading and access to configuration settings including CSV format definitions, attribute mappings, enrichment patterns, and categories.
Technologies: Python, YAML parsing
Deployment: Part of the Flask application
Name: Response Formatting Service
Description: Unified service combining data formatting and response building capabilities. Formats processed transaction data for various output targets including console, HTML, CSV, and JSON. Supports the unified DataTablesResponse format for web and API interfaces. This service merges the functionality of the previous DataFormattingService and ResponseBuilderService to reduce cognitive complexity and ensure consistent formatting across all interfaces.
Technologies: Python
Deployment: Part of the Flask application
Key Features:
- Multiple output formats: HTML tables, CSV strings, JSON, plain text
- DataTablesResponse formatting for web and API interfaces
- Currency formatting with locale support
- Template preparation for Jinja2 rendering
- Error response building for consistent API error handling
- Account-aware formatting with secure ID handling
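To illustrate how one summary structure could feed several output targets, here is a small sketch. The `summary` shape (month -> category -> amount) and the function names `format_amount`, `to_csv`, `to_json`, and `error_response` are hypothetical and are not the ResponseFormattingService API. The currency formatting is locale-independent for simplicity.

```python
import csv
import io
import json


def format_amount(value: float, currency: str = "EUR") -> str:
    """Format with a thousands separator and two decimals."""
    return f"{value:,.2f} {currency}"


def to_csv(summary: dict) -> str:
    """Render categories as rows and months as columns."""
    months = sorted(summary)
    categories = sorted({c for per_month in summary.values() for c in per_month})
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["category", *months])
    for category in categories:
        writer.writerow(
            [category, *(f"{summary[m].get(category, 0.0):.2f}" for m in months)]
        )
    return buf.getvalue()


def to_json(summary: dict) -> str:
    """Serialize the same structure for API consumers."""
    return json.dumps(summary, sort_keys=True)


def error_response(code: int, message: str) -> dict:
    """Build an error payload with one consistent shape."""
    return {"error": {"code": code, "message": message}}
```

Keeping every formatter behind one service means a change to, say, decimal handling happens in one place for all interfaces.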
Name: ID Mapping Service
Description: Provides secure URL generation and mapping between internal IDs and user-facing identifiers. This service enables safe drilldown functionality by creating non-predictable URLs for accessing specific transaction details. Uses CacheService for storage to comply with existing architectural patterns.
Technologies: Python, Flask-Caching integration
Deployment: Part of the Flask application
Key Features:
- Secure mapping between account numbers and IDs
- Category name/ID mapping for URL safety
- Month timestamp/ID mapping for time-based drilldown
- Cache-backed storage for performance
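The mapping idea can be sketched as a two-way lookup keyed by random tokens. This is an assumption-laden illustration: the `IdMapper` class, its key layout, and the use of a plain dict in place of CacheService are all hypothetical.

```python
import secrets
from typing import Optional


class IdMapper:
    """Maps sensitive values to random URL-safe tokens and back."""

    def __init__(self, cache: Optional[dict] = None):
        # A plain dict stands in for the cache backend in this sketch.
        self._cache = {} if cache is None else cache

    def get_id(self, kind: str, value: str) -> str:
        """Return a stable, non-guessable token for (kind, value)."""
        key = f"fwd:{kind}:{value}"
        if key not in self._cache:
            token = secrets.token_urlsafe(16)
            self._cache[key] = token
            self._cache[f"rev:{kind}:{token}"] = value
        return self._cache[key]

    def resolve(self, kind: str, token: str) -> Optional[str]:
        """Return the original value, or None for an unknown token."""
        return self._cache.get(f"rev:{kind}:{token}")
```

Because tokens are random rather than derived from the value, a URL reveals nothing about the account number, category, or month behind it.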
Name: Drilldown Service
Description: Enables detailed transaction analysis by providing drilldown capabilities. This service allows users to explore specific transactions, categories, or time periods in greater detail through secure URL-based access.
Technologies: Python
Deployment: Part of the Flask application
Name: ML Service
Description: Core business logic service for machine learning operations. Orchestrates model training, prediction, and evaluation. Provides a unified interface for ML operations including hyperparameter tuning, confidence calibration, and SMOTE support.
Technologies: Python, scikit-learn, joblib
Deployment: Part of the Flask application
Name: SMOTE Service
Description: Handles synthetic data generation for rare categories using Synthetic Minority Oversampling Technique. Identifies imbalanced classes and generates synthetic samples to improve model performance on underrepresented categories.
Technologies: Python, imbalanced-learn
Deployment: Part of the Flask application
Key Features:
- Automatic detection of imbalanced classes
- Configurable SMOTE parameters via ML configuration
- Integration with MLService for model training
- Support for rare category enhancement
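The detection step can be illustrated without any ML dependency. The sketch below flags classes that are small in absolute terms or small relative to the largest class. The `find_rare_classes` function and both thresholds are assumptions. The oversampling itself would be delegated to imbalanced-learn and is not shown.

```python
from collections import Counter


def find_rare_classes(labels, min_samples: int = 5, min_ratio: float = 0.1) -> list:
    """Return classes with too few samples, absolutely or relative to the largest class."""
    counts = Counter(labels)
    if not counts:
        return []
    largest = max(counts.values())
    return sorted(
        label
        for label, n in counts.items()
        if n < min_samples or n / largest < min_ratio
    )
```

For example, with 50 samples of one class, 20 of another, and 3 of a third, only the third class falls below both default thresholds' limits and would be a candidate for synthetic samples.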
Name: Text Correction Service
Description: Provides ML-specific text cleaning and preprocessing for partner field values. Ensures consistent text normalization between training and inference phases, including unicode normalization, payment provider removal, and suffix cleaning.
Technologies: Python, regex
Deployment: Part of the Flask application
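A rough sketch of the three cleaning steps (unicode normalization, payment provider removal, suffix cleaning). The provider and suffix lists and the `clean_partner` function are hypothetical; the real patterns live in the project's text configuration.

```python
import re
import unicodedata

# Hypothetical patterns, for illustration only.
PROVIDER_PREFIX = re.compile(r"^(paypal|sumup|izettle)\s*\*\s*", re.IGNORECASE)
COMPANY_SUFFIX = re.compile(r"[\s,]+(kft|zrt|ltd|gmbh|inc)\.?$", re.IGNORECASE)


def clean_partner(text: str) -> str:
    """Normalize a partner string the same way for training and inference."""
    text = unicodedata.normalize("NFKC", text).strip()
    text = PROVIDER_PREFIX.sub("", text)
    text = COMPANY_SUFFIX.sub("", text)
    return re.sub(r"\s+", " ", text).strip().lower()
```

Routing both training and inference through one function is what keeps the two phases consistent: a model trained on cleaned text never sees raw text at prediction time.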
Name: File Uploads and Processing Results
Type: File system storage
Purpose: Stores uploaded CSV files, configuration files, and processing results. The system uses temporary file storage for uploaded files and caching for processed results.
Key Files/Directories:
- src/whatsthedamage/uploads/: Temporary file uploads
- src/whatsthedamage/static/: ML models and metadata
- Session-based caching for processed results
Name: Web Session Management
Type: Flask session storage
Purpose: Manages user session state between requests for the web interface, including file uploads and processing results.
Service Name: Random Forest Model with Confidence Calibration
Purpose: Provides ML-based transaction categorization as an alternative to regex-based categorization. The model is trained on historical transaction data and includes advanced features like confidence calibration, SMOTE for rare categories, and multi-CPU training support.
Integration Method: joblib model loading (security warning: only use trusted models)
Key Features:
- Random Forest classifier with 200 estimators
- Confidence calibration using CalibratedClassifierCV
- SMOTE support for handling imbalanced datasets
- Multi-CPU parallel processing
- Confidence threshold for categorization
- Comprehensive metrics and evaluation
Model Files:
- model-rf-v6alpha_en.joblib: Trained model with calibration
- model-rf-v6alpha_en.manifest.json: Training metadata and parameters
- model-rf-v6alpha_en.testdata.json: Test data for validation
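The confidence threshold mentioned under Key Features can be sketched in plain Python. The idea is to accept the top prediction only when its calibrated probability is high enough and otherwise fall back to a default category. The `pick_category` function, the 0.6 threshold, and the "Other" fallback are assumptions. With scikit-learn, the per-class probabilities would typically come from the classifier's `predict_proba` method.

```python
def pick_category(
    probabilities: dict,
    threshold: float = 0.6,
    fallback: str = "Other",
) -> tuple:
    """Return (category, confidence), using the fallback below the threshold."""
    if not probabilities:
        return fallback, 0.0
    category, confidence = max(probabilities.items(), key=lambda kv: kv[1])
    if confidence < threshold:
        return fallback, confidence
    return category, confidence
```

Calibration matters here because an uncalibrated score of 0.6 does not necessarily mean the prediction is right 60% of the time, which would make any fixed threshold hard to interpret.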
Service Name: gettext/i18n
Purpose: Provides localization support for English and Hungarian languages.
Integration Method: Python gettext module with locale files
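A minimal sketch of loading a translator with the standard `gettext` module. The `get_translator` helper, the `messages` domain, and the `locale` directory are assumptions about layout, not the project's actual values.

```python
import gettext


def get_translator(lang: str, localedir: str = "locale", domain: str = "messages"):
    """Return a gettext function for `lang`, falling back to the source strings."""
    translation = gettext.translation(
        domain, localedir=localedir, languages=[lang], fallback=True
    )
    return translation.gettext
```

With `fallback=True`, a missing catalog yields the untranslated message instead of raising, so the application still works when a locale file is absent.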
Cloud Provider: Self-hosted or any cloud provider
Key Services Used:
- Flask development server for local development
- Gunicorn for production deployment
- Vite for frontend bundling and optimization
CI/CD Pipeline: Makefile-based automation with commands like:
- make dev: Set up development environment
- make test: Run tests
- make web: Run Flask development server
- make vite-build: Build production frontend assets
Monitoring & Logging:
- Structured Logging System: Comprehensive logging with configurable levels (DEBUG, INFO, WARN, ERROR), output destinations (stdout or file), and formats (text or JSON)
- CLI Configuration: Command line arguments --log-level, --log-output, and --log-format for runtime logging configuration
- Default Configuration: WARN level logging to stdout for both CLI and web interfaces
- Context Support: Structured logging with contextual information support via LoggerAdapter
- File Output: Optional file-based logging with automatic fallback to stdout on errors
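The logging behavior listed above (JSON or text format, contextual fields via LoggerAdapter, fallback to stdout when the file cannot be opened) could look roughly like this with the standard `logging` module. The names `JsonFormatter`, `ContextAdapter`, and `build_logger` are hypothetical and do not mirror `utils/logging.py`.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Emit each record as one JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        context = getattr(record, "context", None)
        if context:
            payload["context"] = context
        return json.dumps(payload)


class ContextAdapter(logging.LoggerAdapter):
    """Attach the adapter's extra dict to every record as `context`."""

    def process(self, msg, kwargs):
        kwargs.setdefault("extra", {})["context"] = self.extra
        return msg, kwargs


def build_logger(name: str, level: str = "WARNING",
                 output: str = "stdout", fmt: str = "text") -> logging.Logger:
    """Configure a logger; fall back to stdout if the log file cannot be opened."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    logger.handlers.clear()
    try:
        handler = (logging.StreamHandler(sys.stdout) if output == "stdout"
                   else logging.FileHandler(output))
    except OSError:
        handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(
        JsonFormatter() if fmt == "json"
        else logging.Formatter("%(levelname)s %(name)s %(message)s")
    )
    logger.addHandler(handler)
    logger.propagate = False
    return logger
```

A caller would wrap the logger once, for example `ContextAdapter(build_logger("app", fmt="json"), {"request_id": "r1"})`, and every subsequent message carries that context without repeating it.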
Authentication: Not applicable (local tool, no user accounts)
Authorization: Not applicable (local tool)
Data Encryption: Not applicable (local file processing)
Key Security Tools/Practices:
- Input validation for all user and file inputs
- File type and content verification
- Secure file handling with proper cleanup
- Never logging sensitive data (account numbers, personal info)
- Resource management with prompt file handle closing
- Error handling without exposing internal errors
Known Security Issues:
- joblib model loading can execute arbitrary code (only use trusted models)
- File uploads require validation of MIME types and extensions
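One common mitigation for the joblib issue is to verify a model file's checksum before loading it. The sketch below assumes the expected digest is obtained from a trusted source; whether the project's manifest file stores such a digest is not stated here, and `sha256_of` and `verify_model` are hypothetical helpers.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash the file in chunks so large models are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_model(path: Path, expected_sha256: str) -> bool:
    """Return True only if the file matches the trusted digest."""
    return hmac.compare_digest(sha256_of(path), expected_sha256.lower())


# Only after verify_model(...) returns True would the file be passed to joblib.load.
```

A checksum proves the file is the one that was published, not that the published file is safe, so it complements rather than replaces the "only use trusted models" rule.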
Local Setup Instructions: See CONTRIBUTING.md or README.md
Testing Frameworks:
- pytest for backend unit and integration tests
- Vitest for frontend tests
Code Quality Tools:
- ruff for Python linting and formatting
- mypy for Python type checking
- ESLint for JavaScript/TypeScript linting
Build Tools:
- Makefile for workflow automation
- Vite for frontend bundling
- npm for frontend dependency management
Known Architectural Debts:
- Migrate from monolith to separate backend and frontend repositories.
Planned Major Changes:
- Migrate from memory-based caching to a more robust solution
- Enhance ML model management and security
- Improve error handling and user feedback
- Add more statistical analysis features
- Support additional CSV formats and banks
Significant Future Features:
- Event-driven architecture for real-time updates
- Enhanced API capabilities for third-party integrations
- Mobile application support
- Additional localization languages
Recent Architectural Improvements:
- Service consolidation: Merged DataFormattingService and ResponseBuilderService into unified ResponseFormattingService
- Improved dependency injection patterns with standardized service container
- Enhanced IdMappingService to use CacheService for consistency
- Simplified service registration and usage across CLI and web contexts
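The service container idea can be sketched as a registry of lazy factories that hands out shared instances. This is an illustration under assumptions: the `ServiceContainer` class below and its `register`/`get` methods are hypothetical and are not the contents of `service_container.py`.

```python
from typing import Any, Callable


class ServiceContainer:
    """Registers factories and creates each service once, on first use."""

    def __init__(self) -> None:
        self._factories: dict = {}
        self._instances: dict = {}

    def register(self, name: str, factory: Callable[["ServiceContainer"], Any]) -> None:
        """Register (or replace) the factory for a service name."""
        self._factories[name] = factory
        self._instances.pop(name, None)

    def get(self, name: str) -> Any:
        """Return the shared instance, building it on first access."""
        if name not in self._instances:
            self._instances[name] = self._factories[name](self)
        return self._instances[name]
```

Since each factory receives the container, a service can pull in its own dependencies, for example `container.register("id_mapping", lambda c: IdMapping(c.get("cache")))` with a hypothetical `IdMapping` class. The same registration code can then serve both the CLI and web contexts.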
Project Name: whatsthedamage
Repository URL: https://github.com/abalage/whatsthedamage
Primary Contact/Team: Balage Abalage
Date of Last Update: 2026-04-07
CLI: Command Line Interface - The command-line tool for processing transactions
CSV: Comma-Separated Values - The file format used for bank transaction exports
ML: Machine Learning - Production-ready feature for transaction categorization using Random Forest with confidence calibration and SMOTE support
DataTablesResponse: Unified response format containing processed transaction data with aggregation by category and time period
Calculator Pattern: Extensibility pattern allowing custom transaction calculations beyond built-in categorization