Summary
Evaluate and decide the long-term extraction architecture for raged, with Docling as a leading candidate for document parsing/extraction.
Context
Current extraction responsibilities are split:
- API-side MIME-specific extraction (HTML/PDF/text/etc.)
- Async enrichment worker for tier-2/tier-3 metadata/entity extraction
This has worked so far, but coverage gaps and maintenance effort both grow as the range of input formats widens.
Problem
We currently own format-specific extraction logic and parser edge-case handling in-house. This creates:
- ongoing maintenance burden,
- uneven format support,
- duplicated effort versus mature document parsing projects.
Proposal (RFC)
Adopt a pluggable extraction architecture and evaluate Docling as the primary extraction backend for complex document types.
Draft direction
- Keep enrichment queue/worker contracts stable.
- Introduce extraction provider abstraction at ingestion boundary.
- Start with optional Docling provider behind feature flag/config.
- Keep existing extractor as fallback during migration.
- Define rollout gates (quality, latency, cost, operability).
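A minimal sketch of how the draft direction could fit together, assuming a Python-side ingestion boundary. Every name here (`ExtractionProvider`, `ExtractionRouter`, `LegacyExtractor`, `DoclingProvider`, `primary_enabled`) is hypothetical and not taken from the raged codebase. The Docling call itself is injected as a plain callable, so the sketch assumes nothing about Docling's actual API.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, Optional, Protocol


@dataclass
class ExtractionResult:
    text: str
    provider: str
    metadata: dict = field(default_factory=dict)


class ExtractionProvider(Protocol):
    """Hypothetical provider contract at the ingestion boundary."""
    name: str

    def supports(self, mime_type: str) -> bool: ...
    def extract(self, data: bytes, mime_type: str) -> ExtractionResult: ...


class LegacyExtractor:
    """Stand-in for the existing in-house extractor (assumed to accept any MIME type)."""
    name = "legacy"

    def supports(self, mime_type: str) -> bool:
        return True

    def extract(self, data: bytes, mime_type: str) -> ExtractionResult:
        text = data.decode("utf-8", errors="replace")
        return ExtractionResult(text=text, provider=self.name)


class DoclingProvider:
    """Placeholder provider; `convert` is injected and would wrap the real Docling call."""
    name = "docling"

    def __init__(self, convert: Callable[[bytes, str], str], mime_types: Iterable[str]):
        self._convert = convert
        self._mime_types = set(mime_types)

    def supports(self, mime_type: str) -> bool:
        return mime_type in self._mime_types

    def extract(self, data: bytes, mime_type: str) -> ExtractionResult:
        return ExtractionResult(text=self._convert(data, mime_type), provider=self.name)


class ExtractionRouter:
    """Routes to the primary provider when the flag is on, else to the fallback."""

    def __init__(self, primary: Optional[ExtractionProvider],
                 fallback: ExtractionProvider, primary_enabled: bool):
        self.primary = primary
        self.fallback = fallback
        self.primary_enabled = primary_enabled

    def extract(self, data: bytes, mime_type: str) -> ExtractionResult:
        use_primary = (self.primary_enabled and self.primary is not None
                       and self.primary.supports(mime_type))
        if use_primary:
            try:
                return self.primary.extract(data, mime_type)
            except Exception as exc:
                # Primary failed: fall back and record why, for observability.
                result = self.fallback.extract(data, mime_type)
                result.metadata["fallback_reason"] = repr(exc)
                return result
        return self.fallback.extract(data, mime_type)
```

Because the router returns the same `ExtractionResult` shape regardless of provider, the enrichment queue/worker contracts would not need to change while the flag is toggled per environment or per MIME family.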
Scope
In scope
- extraction/parsing layer boundaries and provider model,
- migration strategy from current extractor to provider-based model,
- deployment model (in-process vs sidecar/service),
- observability and failure handling expectations.
Out of scope
- replacing tier-2/tier-3 semantic enrichment logic,
- immediate full rewrite of ingestion pipeline.
Decision Criteria
- Extraction quality across representative MIME types
- Throughput/latency under current ingestion load
- Operational complexity (dependencies, packaging, runtime footprint)
- Failure modes and graceful fallback behavior
- Security/compliance constraints for local-only processing
Suggested Deliverables
- Architecture decision (ADR/RFC conclusion)
- Minimal spike implementation with feature flag
- Benchmark + quality report against current extractor
- Rollout plan with fallback and abort criteria
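One possible shape for the benchmark and quality report, as a rough sketch only. The function names and the quality proxy (`token_recall`, the share of reference tokens that appear in the extracted text) are assumptions made for illustration; the RFC does not define a quality metric, and a real report would likely need format-aware measures for tables and layout.

```python
import math
import time
from typing import Callable, Sequence, Tuple


def token_recall(extracted: str, reference: str) -> float:
    """Fraction of distinct reference tokens present in the extracted text."""
    ref_tokens = set(reference.split())
    if not ref_tokens:
        return 1.0
    return len(ref_tokens & set(extracted.split())) / len(ref_tokens)


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile; avoids interpolation on small samples."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def benchmark(extract: Callable[[bytes], str],
              samples: Sequence[Tuple[bytes, str]]) -> dict:
    """Run one extractor over (document bytes, reference text) pairs."""
    if not samples:
        raise ValueError("benchmark needs at least one sample")
    latencies, recalls = [], []
    for data, reference in samples:
        start = time.perf_counter()
        text = extract(data)
        latencies.append(time.perf_counter() - start)
        recalls.append(token_recall(text, reference))
    return {
        "p50_s": percentile(latencies, 50),
        "p95_s": percentile(latencies, 95),
        "mean_recall": sum(recalls) / len(recalls),
    }
```

Running `benchmark` once per extractor over the same representative sample set would give directly comparable numbers for the current extractor and a Docling-backed one.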
Open Questions
- Should Docling run as a separate service or inside worker/API runtime?
- Which MIME families should migrate first (PDF, Office, HTML, images)?
- What quality bar and performance SLOs must be met before defaulting to Docling?
- How should extracted structure (tables/layout) map into existing chunking flow?
Acceptance
This RFC is complete when we have:
- a selected target architecture,
- an agreed migration plan,
- explicit go/no-go criteria for Docling default adoption.
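To illustrate what "explicit go/no-go criteria" could look like once agreed, here is a small sketch that checks measured metrics against named gates. The gate names and the threshold values in the usage note are placeholders, not proposed SLOs; the actual bar is one of the open questions in this RFC.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class RolloutGates:
    """Hypothetical thresholds that must hold before Docling becomes the default."""
    min_quality: float          # e.g. minimum mean quality score, 0..1
    max_p95_latency_s: float    # maximum acceptable p95 extraction latency
    max_error_rate: float       # maximum fraction of documents that fail extraction


@dataclass(frozen=True)
class MeasuredMetrics:
    quality: float
    p95_latency_s: float
    error_rate: float


def evaluate_gates(gates: RolloutGates, metrics: MeasuredMetrics) -> Tuple[bool, List[str]]:
    """Return (go, failed_gates); go is True only if every gate passes."""
    failed = []
    if metrics.quality < gates.min_quality:
        failed.append("quality")
    if metrics.p95_latency_s > gates.max_p95_latency_s:
        failed.append("latency")
    if metrics.error_rate > gates.max_error_rate:
        failed.append("error_rate")
    return (not failed, failed)
```

For example, with placeholder gates `RolloutGates(0.9, 2.0, 0.01)`, a run measuring `MeasuredMetrics(0.95, 3.5, 0.0)` would be a no-go with `["latency"]` as the failed gate, which doubles as a concrete abort reason in the rollout plan.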