Minimal, machine-first ingestion pipeline with one OCR pass per page, then two downstream products:
- Highlight-triggered excerpt notes for Obsidian.
- Canonical full-book corpus in JSONL for future retrieval/RAG tooling.
The canonical source of truth is corpus/books/<book_id>/pages.jsonl. Downstream outputs are derived from this corpus and run artifacts. No re-OCR per excerpt.
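As a minimal sketch of how downstream tooling might consume the canonical corpus, the snippet below reads pages.jsonl one record at a time. It assumes only that the file is standard JSONL (one JSON object per line); the helper name `iter_pages` is hypothetical, and no particular record fields are assumed.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_pages(corpus_root: str, book_id: str) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of pages.jsonl."""
    # Path layout follows the README: corpus/books/<book_id>/pages.jsonl
    path = Path(corpus_root) / "books" / book_id / "pages.jsonl"
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # tolerate blank lines between records
                yield json.loads(line)
```

Streaming line by line keeps memory use flat for long books, which is one reason JSONL suits a full-book corpus.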
- Use Python 3.11+.
- Install Python dependencies:
python -m pip install -r requirements.txt
- Install external OCR binary:
  - Tesseract must be installed and available on PATH.
  - Verify:
tesseract --version
If pytesseract or opencv-python-headless is missing, the CLI fails with explicit dependency errors.
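A preflight check along these lines can be sketched with the standard library alone. This is an illustration, not the CLI's actual implementation: the function name `missing_dependencies` is hypothetical. Note that the opencv-python-headless package is imported as `cv2`.

```python
import importlib.util
import shutil


def missing_dependencies() -> list[str]:
    """Return human-readable names of dependencies that cannot be found."""
    missing = []
    # The tesseract binary must be resolvable on PATH.
    if shutil.which("tesseract") is None:
        missing.append("tesseract (binary on PATH)")
    # Import names can differ from pip names: opencv-python-headless provides 'cv2'.
    for module, package in (
        ("pytesseract", "pytesseract"),
        ("cv2", "opencv-python-headless"),
    ):
        if importlib.util.find_spec(module) is None:
            missing.append(package)
    return missing


if __name__ == "__main__":
    problems = missing_dependencies()
    if problems:
        raise SystemExit("Missing dependencies: " + ", ".join(problems))
    print("All dependencies found.")
```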
- Pipeline defaults: configs/pipeline.yaml
- Book config: configs/books/sample_book.yaml
- Note template: templates/obsidian_note.md
sample_book.yaml includes:
- book_id
- bibliographic metadata (title, creator, year, publisher_studio, format)
- scans_path
- vault_out_path
- note defaults (note_type, note_status, note_version, YAML_schema_version, register, tags)
CLI entrypoint:
python -m ingest --help
Phase 1: OCR spine (canonical corpus)
python -m ingest ocr `
--book configs/books/sample_book.yaml `
--out corpus `
--runs runs `
--max-pages 3
Phase 2: Highlight detection
python -m ingest detect-highlights `
--book configs/books/sample_book.yaml `
--runs runs `
--run-id <run_id> `
--max-pages 3
Phase 3: Span selection (highlights -> OCR lines + context)
python -m ingest make-spans `
--book configs/books/sample_book.yaml `
--runs runs `
--run-id <run_id> `
--corpus corpus `
--k-before 2 `
--k-after 2 `
--max-pages 3
Phase 4: Emit Obsidian notes (+ optional sidecar JSON)
python -m ingest emit-obsidian `
--book configs/books/sample_book.yaml `
--runs runs `
--run-id <run_id> `
--corpus corpus `
--vault runs/<run_id>/obsidian_staging `
--sidecar-json `
--max-pages 3
emit-obsidian and export-book-text both use the same deterministic token+layout-aware text renderer by default.
This rendering does not modify canonical raw OCR evidence in pages.jsonl.
Use --no-clean-text on either command to keep raw OCR line breaks.
Optional corpus export:
python -m ingest export-book-text `
--book configs/books/sample_book.yaml `
--out corpus `
--format txt
Markdown export:
python -m ingest export-book-text `
--book configs/books/sample_book.yaml `
--out corpus `
--format md
Relevant commands support:
- --dry-run
- --overwrite {never|if_same_run|always} (default: never)
- --max-pages N
- --run-id <id>
- --no-clean-text (emit-obsidian, export-book-text)
Overwrite semantics are fail-closed by default:
- never: fail if output already exists.
- if_same_run: allow overwrite only under runs/<run_id>/....
- always: allow replacement.
Canonical corpus (corpus/.../pages.jsonl) remains fail-closed unless --overwrite always.
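The overwrite rules above can be sketched as a single decision function. This is an illustrative reading of the documented semantics, not the pipeline's own code; `may_write` and its parameters are hypothetical names.

```python
from pathlib import Path

POLICIES = {"never", "if_same_run", "always"}


def may_write(target: Path, policy: str, run_dir: Path) -> bool:
    """Return True if writing `target` is allowed under `policy`."""
    if policy not in POLICIES:
        raise ValueError(f"unknown overwrite policy: {policy!r}")
    if not target.exists():
        return True  # nothing to overwrite, so every policy allows the write
    if policy == "always":
        return True
    if policy == "if_same_run":
        # Only files inside runs/<run_id>/ may be replaced.
        return target.resolve().is_relative_to(run_dir.resolve())
    return False  # "never": fail closed when the output already exists
```

Under this reading, an existing pages.jsonl in the corpus directory lies outside the run directory, so only `always` permits replacing it.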
corpus/
  books/
    <book_id>/
      pages.jsonl
      book.txt
      book.md
runs/
  <run_id>/
    book_<book_id>/
      page_0001/
        page_text.json
        page_overlay.png
        highlight_mask.png
        highlight_candidates.json
        highlights_overlay.png
        spans.json
        spans_overlay.png
    obsidian_staging/
      <book_id>/
        <note>.md
        <note>.span.json
No new YAML frontmatter keys are introduced.
Emitter uses only existing schema keys:
- uuid
- note_version
- YAML_schema_version
- note_type
- note_status
- tags
- format
- title
- creator
- year
- publisher_studio
- register
Provenance fields such as page number, line ids, run id, bbox coordinates, and config hash are written in:
- note body under ## Source
- optional sidecar JSON <note>.span.json
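One way to picture this separation is a helper that routes each field either to frontmatter or to provenance. The sketch below is an assumption about how such a split could be done, not the emitter's actual code; `split_note_fields` and the provenance key names in the usage example (`page`, `run_id`) are hypothetical.

```python
# The only keys allowed in YAML frontmatter, as listed in this README.
FRONTMATTER_KEYS = (
    "uuid", "note_version", "YAML_schema_version", "note_type",
    "note_status", "tags", "format", "title", "creator", "year",
    "publisher_studio", "register",
)


def split_note_fields(fields: dict) -> tuple[dict, dict]:
    """Split fields into (frontmatter, provenance) by the schema key list."""
    frontmatter = {k: fields[k] for k in FRONTMATTER_KEYS if k in fields}
    # Everything else is provenance: it belongs in the note body or sidecar.
    provenance = {k: v for k, v in fields.items() if k not in FRONTMATTER_KEYS}
    return frontmatter, provenance


if __name__ == "__main__":
    fm, prov = split_note_fields(
        {"title": "Sample", "year": 1999, "page": 12, "run_id": "demo_run"}
    )
    print(fm)
    print(prov)
```

Keeping the allowed keys in one tuple makes it hard for a provenance field to leak into frontmatter by accident.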
Use:
scripts/dev_smoke_test.ps1 -BookConfig configs/books/sample_book.yaml
The script runs OCR -> highlight detection -> span generation -> note emission with --max-pages 3.
Frontmatter rendering smoke check:
python scripts/frontmatter_smoke_check.py
The pipeline supports tuning OCR, highlight detection, QA, and span selection settings through:
- Named Scenarios - Pre-configured tuning profiles for common use cases
- CLI Overrides - Direct command-line control of individual settings
- YAML Configuration - Base settings in configs/pipeline.yaml
Configuration precedence: built-in defaults < pipeline YAML < CLI overrides.
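The precedence rule can be illustrated as a layered merge in which later layers win. This is a minimal sketch assuming flat key/value settings; the function `resolve_settings` and the setting keys used here are hypothetical, and the real configuration may be nested.

```python
def resolve_settings(defaults: dict, pipeline_yaml: dict, cli_overrides: dict) -> dict:
    """Merge three layers; later layers take precedence over earlier ones."""
    merged = dict(defaults)
    merged.update(pipeline_yaml)
    # CLI flags left unset arrive as None and must not clobber lower layers.
    merged.update({k: v for k, v in cli_overrides.items() if v is not None})
    return merged


if __name__ == "__main__":
    settings = resolve_settings(
        defaults={"ocr_psm": 6, "ocr_language": "eng"},
        pipeline_yaml={"ocr_psm": 4},
        cli_overrides={"ocr_psm": 3, "ocr_language": None},
    )
    print(settings)  # {'ocr_psm': 3, 'ocr_language': 'eng'}
```

Filtering out `None` matters because argument parsers typically report unset flags as `None`, which would otherwise erase values from the YAML layer.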
To compare OCR results across multiple scenarios on the same sample pages:
scripts/tune_sample.ps1 -MaxPages 3
This runs all default scenarios (baseline, conservative_text, messy_scan_rescue, highlight_sensitive) and outputs separate runs for comparison.
Run specific scenarios:
scripts/tune_sample.ps1 -Scenarios @("baseline", "messy_scan_rescue") -MaxPages 5
Each scenario produces complete artifacts in runs/<timestamp>_<scenario>/:
- page_text.json - OCR results
- page_overlay.png - OCR visualization
- highlight_mask.png - Highlight detection mask
- highlights_overlay.png - Highlight visualization
- spans.json - Text spans with context
- spans_overlay.png - Span visualization
Named scenarios:
- baseline: Current defaults (no overrides)
- conservative_text: Tighter QA thresholds, fewer false positives
- messy_scan_rescue: More forgiving for noisy/low-confidence scans
- highlight_sensitive: Permissive highlight detection for faint markers
Apply a scenario to any command with --scenario:
python -m ingest ocr `
--book configs/books/sample_book.yaml `
--scenario messy_scan_rescue `
--out corpus `
--runs runs `
--max-pages 3
Override individual settings directly on the command line. CLI overrides take precedence over both scenario and YAML settings.
OCR Settings:
- --ocr-psm <int> - Page segmentation mode (default: 6)
- --ocr-language <str> - Language code (default: "eng")
- --ocr-line-y-tolerance-px <int> - Line grouping Y tolerance (default: 14)
Highlight Settings:
- --highlight-min-area <int> - Minimum area threshold (default: 120)
- --highlight-kernel-size <int> - Morphological kernel size (default: 5)
- --highlight-edge-margin-px <int> - Edge margin filter (default: 25)
- --highlight-max-hw-ratio <float> - Max height/width ratio (default: 3.0)
- --highlight-max-height-frac <float> - Max height fraction (default: 0.15)
QA Settings:
- --qa-min-avg-word-conf <float> - Min average word confidence (default: 58.0)
- --qa-max-garbage-ratio <float> - Max garbage ratio (default: 0.22)
- --qa-max-pipe-ratio <float> - Max pipe character ratio (default: 0.04)
- --qa-min-alpha-ratio <float> - Min alpha character ratio (default: 0.45)
Span Settings:
- --span-min-overlap-frac <float> - Min overlap fraction (default: 0.02)
- --span-min-x-overlap-px <int> - Min X overlap in pixels (default: 40)
- --span-max-overlap-lines <int> - Max overlap lines (default: 8)
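To give a feel for what the QA settings listed above control, here is a rough sketch of a page-level quality gate using the documented default thresholds. The metric definitions are assumptions made for illustration (ratios are computed over non-whitespace characters, and the garbage ratio is left out because its definition is not given here); `QAThresholds` and `page_passes_qa` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QAThresholds:
    # Defaults mirror the documented CLI defaults.
    min_avg_word_conf: float = 58.0
    max_pipe_ratio: float = 0.04
    min_alpha_ratio: float = 0.45


def page_passes_qa(
    text: str, word_confs: list[float], t: QAThresholds = QAThresholds()
) -> bool:
    """Return True if the page meets every threshold (assumed metric definitions)."""
    chars = [c for c in text if not c.isspace()]
    if not chars or not word_confs:
        return False  # an empty page cannot pass
    avg_conf = sum(word_confs) / len(word_confs)
    pipe_ratio = chars.count("|") / len(chars)  # stray '|' often signals OCR noise
    alpha_ratio = sum(c.isalpha() for c in chars) / len(chars)
    return (
        avg_conf >= t.min_avg_word_conf
        and pipe_ratio <= t.max_pipe_ratio
        and alpha_ratio >= t.min_alpha_ratio
    )
```

Lowering `min_avg_word_conf`, as the messy-scan example does with 45.0, lets noisier pages through this kind of gate.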
Start with a scenario and tweak individual settings:
python -m ingest ocr `
--book configs/books/sample_book.yaml `
--scenario messy_scan_rescue `
--ocr-psm 3 `
--qa-min-avg-word-conf 45.0 `
--out corpus `
--runs runs `
--max-pages 3
Or override without a scenario:
python -m ingest detect-highlights `
--book configs/books/sample_book.yaml `
--highlight-min-area 80 `
--highlight-kernel-size 3 `
--runs runs `
--run-id <run_id> `
--max-pages 3
- PDF ingestion support (v0 currently supports image folders only).
- Better paragraph/reading-order line grouping.
- More robust highlight color modeling for difficult scans.