refactor: decompose TableDetector._detect_text_based_table (grade E, score 32) #98

@longieirl

Summary

TableDetector._detect_text_based_table has a cyclomatic complexity score of 32 (grade E), the highest of any function in the codebase. It is temporarily excluded from the Xenon CI gate (introduced in #88) so that the gate can be enabled without forcing an unsafe, blind refactor.

This issue tracks the work needed to remove that exclusion.

Exit condition

The --exclude flag for packages/parser-core/src/bankstatements_core/analysis/table_detector.py must be removed from the Xenon CI step in .github/workflows/ci.yml once this issue is resolved.

Why not now

The function is a PDF heuristic hotspot tightly coupled to the pdfplumber word-coordinate API. Decomposing it without targeted characterisation tests carries a high regression risk: its observed behaviour must be pinned before it can be changed safely.
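To illustrate what "pinning observed behaviour" could look like, here is a minimal sketch of a characterisation harness. It is not the project's test code: `stand_in_detector`, `make_word`, `CASES`, `characterise` and `check_against_golden` are hypothetical names, and the real method's signature and return type are assumed, since they are not shown in this issue. The word dicts use the `text`, `x0`, `x1`, `top` and `bottom` keys that pdfplumber's `extract_words()` produces.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

Word = Dict[str, Any]
BBox = Tuple[float, float, float, float]
Detector = Callable[[List[Word]], Optional[BBox]]


def make_word(text: str, x0: float, top: float,
              width: float = 30.0, height: float = 10.0) -> Word:
    """Build a synthetic word dict with pdfplumber-style coordinate keys."""
    return {"text": text, "x0": x0, "x1": x0 + width,
            "top": top, "bottom": top + height}


def stand_in_detector(words: List[Word]) -> Optional[BBox]:
    """HYPOTHETICAL stand-in for TableDetector._detect_text_based_table.

    The real method's logic is not reproduced here; this only returns the
    bounding box of all words so the harness has something to pin.
    """
    if not words:
        return None
    return (
        min(w["x0"] for w in words),
        min(w["top"] for w in words),
        max(w["x1"] for w in words),
        max(w["bottom"] for w in words),
    )


# Named inputs, one per branch we want to pin (only two shown here).
CASES: Dict[str, List[Word]] = {
    "empty_words": [],
    "two_rows": [
        make_word("Date", 10, 100), make_word("Amount", 200, 100),
        make_word("01/01", 10, 115), make_word("9.99", 200, 115),
    ],
}


def characterise(detector: Detector,
                 cases: Dict[str, List[Word]]) -> Dict[str, Optional[BBox]]:
    """Record whatever the detector currently returns for each named case."""
    return {name: detector(words) for name, words in cases.items()}


# Step 1: record the current behaviour once, before any refactoring.
GOLDEN = characterise(stand_in_detector, CASES)


def check_against_golden(detector: Detector) -> bool:
    """Step 2: after each refactoring step, the outputs must be unchanged."""
    return characterise(detector, CASES) == GOLDEN
```

In a real test suite the recorded outputs would more likely live in a fixture file or in literal assertions, so that they survive across runs; the in-memory `GOLDEN` dict only keeps the sketch self-contained.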

Required sequence

  1. Write characterisation tests for _detect_text_based_table

    • Cover the main branching paths: empty words, column-coverage threshold, text-density threshold, word-gap detection
    • Use real or synthetic pdfplumber word objects as fixtures
    • Tests must pass before any structural changes are made
  2. Decompose the function into cohesive sub-functions

    • Candidate extractions: column-coverage check, density check, word-gap scan, boundary decision
    • Each sub-function should have a single responsibility and be independently testable
  3. Remove the Xenon exclusion

    • Delete --exclude packages/parser-core/src/bankstatements_core/analysis/table_detector.py from the CI step
    • Confirm gate passes with the new structure
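As a rough illustration of step 2, the candidate extractions could take a shape like the sketch below. Everything in it is an assumption made for illustration: the function names, the signatures, the threshold values and the order of the checks are hypothetical and do not reflect the current implementation, which is not shown in this issue. Only the word-dict keys (`x0`, `x1`, `top`, `bottom`) follow pdfplumber's `extract_words()` output.

```python
from collections import defaultdict
from typing import Any, Dict, List, Optional, Tuple

Word = Dict[str, Any]
BBox = Tuple[float, float, float, float]

# Illustrative placeholder thresholds, NOT the project's real values.
MIN_COLUMN_COVERAGE = 0.3
MIN_TEXT_DENSITY = 0.001
MIN_WORD_GAP = 20.0


def make_word(text: str, x0: float, top: float,
              width: float = 40.0, height: float = 10.0) -> Word:
    """Synthetic word dict with pdfplumber-style coordinate keys."""
    return {"text": text, "x0": x0, "x1": x0 + width,
            "top": top, "bottom": top + height}


def column_coverage(words: List[Word], page_width: float, bins: int = 10) -> float:
    """Fraction of equal-width vertical bands containing a word's left edge."""
    hit = {min(int(w["x0"] / page_width * bins), bins - 1) for w in words}
    return len(hit) / bins


def text_density(words: List[Word], page_width: float, page_height: float) -> float:
    """Summed word-box area as a fraction of the page area."""
    area = sum((w["x1"] - w["x0"]) * (w["bottom"] - w["top"]) for w in words)
    return area / (page_width * page_height)


def has_wide_gap(words: List[Word], min_gap: float) -> bool:
    """True if any row has two neighbouring words at least min_gap apart."""
    rows: Dict[int, List[Word]] = defaultdict(list)
    for w in words:
        rows[round(w["top"])].append(w)  # group words sharing a baseline
    for row in rows.values():
        row.sort(key=lambda w: w["x0"])
        for left, right in zip(row, row[1:]):
            if right["x0"] - left["x1"] >= min_gap:
                return True
    return False


def bounding_box(words: List[Word]) -> BBox:
    """Boundary decision: the smallest box enclosing every word."""
    return (
        min(w["x0"] for w in words),
        min(w["top"] for w in words),
        max(w["x1"] for w in words),
        max(w["bottom"] for w in words),
    )


def detect_text_based_table(words: List[Word], page_width: float,
                            page_height: float) -> Optional[BBox]:
    """Orchestrator: each early return corresponds to one extracted check."""
    if not words:
        return None
    if column_coverage(words, page_width) < MIN_COLUMN_COVERAGE:
        return None
    if text_density(words, page_width, page_height) < MIN_TEXT_DENSITY:
        return None
    if not has_wide_gap(words, MIN_WORD_GAP):
        return None
    return bounding_box(words)
```

With this shape the orchestrator is a flat sequence of guard clauses and each helper can be tested on its own, which is the property step 2 asks for. Whether the real heuristics separate this cleanly can only be judged once the characterisation tests from step 1 are in place.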
