From 5910cb3ca5700c4bc026da908c47652cb89b6879 Mon Sep 17 00:00:00 2001 From: Gordon Leong Date: Tue, 10 Mar 2026 01:14:50 +0800 Subject: [PATCH] Add scotch-parser semantic comparison and steal plan --- .../scotch-parser-semantic-comparison.md | 175 ++++++++++++++++++ 1 file changed, 175 insertions(+) create mode 100644 docs/internal/planning/scotch-parser-semantic-comparison.md diff --git a/docs/internal/planning/scotch-parser-semantic-comparison.md b/docs/internal/planning/scotch-parser-semantic-comparison.md new file mode 100644 index 00000000..35fae793 --- /dev/null +++ b/docs/internal/planning/scotch-parser-semantic-comparison.md @@ -0,0 +1,175 @@ +# scotch-parser v0.1: semantic tree capability comparison (edgartools 5.19 vs analyzed sec-parser) + +## 1) Executive comparison (implementation-first) + +- edgartools already has a production parser pipeline (`HTMLParser -> HTMLPreprocessor -> DocumentBuilder -> postprocessor`) with typed nodes and configurable strategies; this is a stronger base engine than reusing sec-parser architecture. +- edgartools has explicit hybrid section detection (TOC-first, then heading, then pattern), while sec-parser analysis describes section detection mainly as regex + style pipelines. +- edgartools has richer built-in form coverage in section patterns (10-K/10-Q/20-F/8-K) and includes issuer-specific cross-reference index handling (e.g., GE/Citigroup style filings). +- edgartools table parsing is materially ahead of the analyzed sec-parser baseline: row/header inference, period header heuristics, and coarse table typing already exist. +- sec-parser still contributes useful deterministic heuristics as *enrichment passes*: stable hierarchical IDs, sentence segmentation tuned for SEC abbreviations, prose-table stubs, and footnote linking. +- sec-parser’s title-level weakness (first-style-wins) should **not** be imported; edgartools already uses multi-detector voting and confidence. +- scotch-parser should treat edgartools `Document/Node` output as the canonical base tree, then apply idempotent enrichment passes that only add metadata/new nodes, never mutate raw source text. +- Keep current sanitize/stamp baseline; only add minimal optional deltas where evidence anchoring requires finer granularity (table cells and footnote markers). +- Highest ROI now is not “new parser logic”; it is deterministic post-parse enrichment and sidecar generation that maps node IDs to stamped DOM anchors. +- For v0.1, avoid ML-dependent classification; keep everything regex/style/position/statistics-based so artifacts are reproducible for diffing. +- For downstream reader + card pipeline, the key missing artifact is a strong `sidecar.json` (TOC hierarchy, node paths, table metadata, diagnostics, anchor locators). +- Recommendation: do **not** copy sec-parser pipeline classes; reimplement only selected heuristics as small pure functions/passes under `packages/scotch-parser/`. + +## 2) Gap & steal list + +| Heuristic / concept from sec-parser analysis | Category | edgartools 5.19 status | Steal? | scotch-parser pass (if YES) | Inputs | Deterministic rules | Outputs | Risks/failure mode | Minimal tests | +|---|---|---|---|---|---|---|---|---|---| +| Stable hierarchical IDs (`TopSection -> Title -> text index`) | Item boundary / normalization | **Partially covered** (sections exist, but no global stable IR ID contract for every node/sentence/table cell) | YES | `assign_semantic_ids_pass` | Parsed node tree + stamped DOM map | Canonical slug + ordinal scheme per parent; never use runtime object identity | `node_id`, `parent_id`, `path`, `dom_locator` | ID churn if sibling ordering changes unexpectedly | (1) Re-run deterministic id assignment twice on same doc => byte-identical IDs. (2) Insert unrelated node in another branch => unaffected branch IDs unchanged. | +| Two-pass heading level scoring by style rank | Heading detection hierarchy | **Already covered / better covered** (multi-detector + confidence voting) | NO | — | — | — | — | sec-parser approach can regress on styled disclaimers | N/A | +| Regex-hardening for Part/Item boundary normalization | Item boundary / normalization | **Partially covered** (header patterns + form-specific section extractor) | YES | `item_header_normalization_pass` | Heading nodes + section names | Normalize variants (`ITEM 1A`, `Item 1A.`, punctuation/em dash noise), map to canonical `item_1a` | Canonical item metadata on nodes + diagnostics | False positives from inline references | (1) Main header `ITEM 7—...` normalizes to `item_7`. (2) Sentence "See Item 7" does not become section header. | +| Introductory/pre-item bucket | Item boundary / normalization | **Partially covered** (sections exist, but pre-item explicit bucket not first-class in IR) | YES | `intro_bucket_pass` | Ordered top-level nodes + first item boundary | Everything before first canonical part/item tagged `introductory` | Synthetic Intro node + member references | Misclassify if filing has missing item headers | (1) Filing with cover + TOC + item headers => intro captured. (2) Filing starting directly with Item 1 => empty intro node absent. | +| Block merge across irrelevant/page artifacts | Block segmentation | **Partially covered** (postprocessor has cleanup; legacy merger ideas exist) | YES | `block_coalescing_pass` | Paragraph/text blocks + artifact markers | Merge adjacent text blocks separated only by page number/header noise | Fewer, larger `Block` nodes with provenance list | Over-merge across real headings | (1) `para + page# + para` merges. (2) `para + heading + para` does not merge. | +| Sentence boundary detection with SEC abbreviations | Block segmentation | **Not covered as required output artifact** | YES | `sentence_split_pass` | Clean block text + optional abbreviation list | pysbd/custom rules: protect `Inc.`, `No.`, `U.S.`, note refs `No. 3` | `Sentence` nodes with `sentence_id`, `section_id` | Split errors around legal abbreviations | (1) `Apple Inc. reported...` stays one sentence split point correct. (2) `See Note 3. Revenue...` not split after `Note 3.` incorrectly. | +| Table type classification (financial / TOC / exhibit / layout / reference) | Table extraction | **Partially covered** (`TableProcessor._detect_table_type` exists but coarse) | YES | `table_type_refinement_pass` | TableNode + local context blocks/headers | Use caption/header keywords + numeric density + nearby section labels | Refined `table_type`, confidence, reason codes | Borderline tables misclassified | (1) Balance-sheet style table => `financial_statement`. (2) index/page table => `toc`. | +| Prose-table context classifier (stub vs substantive) | Table extraction | **Not covered** | YES | `table_context_linking_pass` | Table node + prev/next text blocks | Detect stub cues (`see table below`, `following table`) vs substantive context length/content | `context_mode` (`stub`, `substantive`, `mixed`) + linked block IDs | Boilerplate phrases may be ambiguous | (1) Short reference phrase marked `stub`. (2) Multi-sentence discussion with figures marked `substantive`. | +| Table caption vs heading disambiguation | Table normalization | **Partially covered** | YES | `table_caption_disambiguation_pass` | Heading + immediate following table | If short heading directly precedes table and matches caption-like pattern, relabel as table caption | Heading demoted; table caption set | Can hide real subsection heading | (1) "Table 1..." before table => caption. (2) "Risk Factors" before table remains heading. | +| TOC table detection improvements (Pg./PAGE variants, structure cues) | Table extraction | **Partially covered** (TOC analyzer exists) | YES | `toc_table_strengthening_pass` | Table text matrix + anchor links | Case-insensitive page token + row pattern (`Item + page#`) + link density | Table tagged `toc_candidate`, score | TOC-like exhibit lists false positives | (1) TOC with `Pg.` detected. (2) Non-TOC numeric table rejected. | +| Footnote marker + body linking | Footnote detection/linking | **Partially covered** (ix footnotes extractable, but semantic node linking to prose/table refs is limited) | YES | `footnote_linking_pass` | DOM anchors + stamped ids + text/table cells | Match superscript markers to footnote blocks by normalized symbol/number order and local scope | `Footnote` nodes + backlinks from refs | Duplicate symbols reused per table | (1) Numeric superscript links to matching footnote text. (2) Asterisk markers with table-local scope link correctly. | +| Cross-reference detection (`see Note X`, `Item Y`) | Other high leverage | **Partially covered** (ranking utilities have pattern detection, not sidecar link graph output) | YES | `cross_reference_graph_pass` | Sentence nodes + section map + table/footnote index | Regex patterns resolve to canonical section/node IDs where available | `cross_refs[]` edges in sidecar | Ambiguous references without target | (1) `see Item 7` resolves to `item_7`. (2) unresolved target emits diagnostic only. | +| Page header/page number statistical filtering | Block segmentation | **Partially covered** (builder has page-number filtering heuristics) | NO for v0.1 | — | — | Already in base; add only if regressions observed | — | Duplicate logic can conflict with base parser cleanup | N/A | +| Exhibit boundary classifier | Known patterns | **Partially covered** (8-K/10-Q section patterns include exhibits) | LATER | `exhibit_boundary_pass` | Section headings + regex in tail | `EXHIBIT X.X` boundary + materiality rules | Exhibit subtree metadata | Filing-specific edge cases | (1) 99.1 marked material. (2) 31/32 marked boilerplate. | +| Structural boilerplate hash lookup | Other | **Not covered in parser core** | NO for v0.1 | — | — | Better as downstream service over sentence corpus | — | Asset maintenance burden | N/A | +| Parsed output caching by content hash | Other | **Partially covered** (cache infra exists, but sidecar/artifact cache for scotch not defined) | YES (later) | `artifact_cache_pass` | html hash + config hash | deterministic key for semantic tree + sidecar | cache hit/miss metadata | stale cache if versioning weak | (1) same input/config => hit. (2) config change => miss. | +| Issuer-specific cross-reference index parsing (GE/Citigroup style) | Special issuer handling | **Already covered** (`cross_reference_index.py`) | NO (reuse edgartools directly) | — | — | Use existing detection/parser as upstream signal | — | Reimplementation risk with no added value | N/A | + +## 3) Priorities + +### Top 5 steals for v0.1 (highest ROI for reader + card pipeline) + +1. `assign_semantic_ids_pass` (enables stable node/sentence/table references and diffability). +2. `sentence_split_pass` (required for card inputs and parquet/duckdb sentence store). +3. `table_context_linking_pass` (improves reader UX and avoids duplicate prose/table cards). +4. `table_type_refinement_pass` (better table UX routing and indexing). +5. `footnote_linking_pass` (high-signal accounting context; critical for evidence provenance). + +### Next 5 later + +1. `cross_reference_graph_pass` +2. `table_caption_disambiguation_pass` +3. `toc_table_strengthening_pass` +4. `item_header_normalization_pass` +5. `exhibit_boundary_pass` + +### Explicit do-not-steal (for now) + +- sec-parser architecture/pipeline framework itself. +- first-occurrence style ranking for heading levels. +- parser-embedded boilerplate-hash asset management. +- any non-deterministic/ML-first classification in core parsing path. + +## 4) Minimal sanitize/stamp deltas required? + +Baseline answer: **mostly no changes required** for first pass implementation. + +Recommended minimal deltas (optional flags, default OFF): + +1. **Table evidence granularity delta (smallest possible):** in stamping, add optional cell stamping (`td`, `th`) only when `--stamp-table-cells` is enabled. + - Why: `footnote_linking_pass` and `table_context_linking_pass` need precise table-cell provenance for sidecar locators. + - Minimality: do not alter existing table-level stamping behavior; add separate opt-in branch. + +2. **Footnote anchor delta (smallest possible):** in stamping, add optional stamp to superscript/reference-like inline nodes (`sup`, small anchor refs) outside script/table exclusions when `--stamp-footnote-refs` is enabled. + - Why: deterministic `ref -> footnote` linking requires stable anchors on both ends. + - Minimality: no sanitizer rewrite, no global inline stamping. + +No sanitize changes are required to implement the top-5 passes listed above unless a specific filing class shows unresolved unsafe inline URI/style artifacts beyond current sanitizer scope. + +## 5) Implementation plan (`packages/scotch-parser/`) + +```text +packages/scotch-parser/ + pyproject.toml + scotch_parser/ + __init__.py + pipeline.py # orchestrates parse + passes + models/ + ir_nodes.py # Document/Item/Heading/Block/Sentence/Table/Cell/Footnote + sidecar.py # sidecar schema dataclasses + adapters/ + edgartools_adapter.py # parse_html + section/table/headings extraction adapters + stamping_adapter.py # map data-src-id anchors into locator index + passes/ + assign_semantic_ids.py + sentence_split.py + table_type_refinement.py + table_context_linking.py + footnote_linking.py + cross_reference_graph.py # later + item_header_normalization.py # later + serializers/ + sidecar_json.py + parquet_writer.py + duckdb_writer.py + diagnostics/ + quality_metrics.py + rule_trace.py + tests/ + fixtures/ + filings/ + test_assign_semantic_ids.py + test_sentence_split.py + test_table_context_linking.py + test_table_type_refinement.py + test_footnote_linking.py +``` + +### Sequencing + +1. **Acquire + sanitize + stamp** (existing external scripts stay authoritative). +2. **Base parse via edgartools** (`edgar.documents.parse_html`) and extract sections/headings/tables from `Document`. +3. **Build initial IR tree** (`Document -> Item/Heading/Block/Table`). +4. **Run enrichment passes in fixed order:** + `item_header_normalization (later)` -> `assign_semantic_ids` -> `table_type_refinement` -> `table_context_linking` -> `footnote_linking` -> `sentence_split` -> `cross_reference_graph (later)`. +5. **Emit artifacts:** `semantic_tree.json`, `sidecar.json`, sentence parquet/duckdb. + +### How to call edgartools as base engine + +- Use `parse_html(html, config)` once. +- Consume: + - `document.sections` for Item/Part boundaries + - `document.headings` for hierarchy candidates + - `document.tables` for table nodes and metadata + - `document.text()` only for fallback diagnostics, not as primary structural source + +### Validation/scoring loop + +Compute deterministic metrics per filing and fail pipeline if thresholds breached: + +- `% blocks with node_id` (target 100%) +- `% sentences with parent block_id + section_id` (target 100%) +- table typing coverage (non-`general` share) +- footnote link precision on golden fixtures +- unresolved cross-reference rate +- anchor resolution rate (`node_id -> data-src-id`) + +### Mapping IR node ids to DOM anchors + +- Primary key: stamped `data-src-id` map built from stamped HTML. +- For each IR node, store: + - `primary_locator`: best `data-src-id` + - `fallback_locator`: xpath/css path + - `source_offsets`: optional char offsets within block text +- Table/Cell and Footnote nodes use optional stamping deltas only when enabled; otherwise attach table-level locator and diagnostic `locator_granularity=table`. + +## 6) Test plan (fixtures + metrics) + +### Fixtures to add + +- 10-K with classic TOC + normal Item headings +- 10-Q with repeated item numbers across Part I/II +- 10-K with dense financial tables and superscript footnotes +- issuer with cross-reference index style (GE/Citigroup-like) +- filing with page header/footer artifacts in middle of section + +### Programmatic checks + +- Determinism check: run full pipeline twice, compare `sidecar.json` hash. +- ID stability check under irrelevant node insertion fixture. +- Sentence splitter regression suite for SEC abbreviation edge cases. +- Table context confusion matrix (`stub` vs `substantive`) on labeled fixtures. +- Footnote link exact-match accuracy and unresolved-link diagnostics. +- Section hierarchy integrity: every sentence must resolve to exactly one section path.