RFC: Extraction architecture direction (Docling as primary parser candidate) #100

@mfittko

Summary

Evaluate and decide the long-term extraction architecture for raged, with Docling as a leading candidate for document parsing/extraction.

Context

Current extraction responsibilities are split:

  • API-side MIME-specific extraction (HTML/PDF/text/etc.)
  • Async enrichment worker for tier-2/tier-3 metadata/entity extraction

This has worked, but extraction coverage and maintenance effort are growing with format diversity.

Problem

We currently own format-specific extraction logic and parser edge-case handling in-house. This creates:

  • ongoing maintenance burden,
  • uneven format support,
  • duplicated effort versus mature document parsing projects.

Proposal (RFC)

Adopt a pluggable extraction architecture and evaluate Docling as the primary extraction backend for complex document types.

Draft direction

  1. Keep the enrichment queue/worker contracts stable.
  2. Introduce an extraction provider abstraction at the ingestion boundary.
  3. Start with an optional Docling provider behind a feature flag or config setting.
  4. Keep the existing extractor as a fallback during migration.
  5. Define rollout gates (quality, latency, cost, operability).
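
To make steps 2 to 4 concrete, here is a minimal Python sketch of what a provider boundary with a flag-gated primary and a fallback could look like. Everything in it is an assumption for discussion: `ExtractionProvider`, `FunctionProvider`, `extract_with_fallback`, and the `primary_enabled` flag are hypothetical names, not existing raged or Docling APIs, and the actual Docling adapter is deliberately left out.

```python
import logging
from dataclasses import dataclass
from typing import Callable, Protocol

logger = logging.getLogger("extraction")


@dataclass
class ExtractionResult:
    text: str
    provider: str  # name of the backend that produced the text


class ExtractionProvider(Protocol):
    """Hypothetical provider contract at the ingestion boundary."""

    name: str

    def supports(self, mime_type: str) -> bool: ...

    def extract(self, data: bytes, mime_type: str) -> str: ...


class FunctionProvider:
    """Adapts a plain function to the provider contract (illustrative helper)."""

    def __init__(self, name: str, mime_types: set[str], fn: Callable[[bytes], str]):
        self.name = name
        self._mime_types = mime_types
        self._fn = fn

    def supports(self, mime_type: str) -> bool:
        return mime_type in self._mime_types

    def extract(self, data: bytes, mime_type: str) -> str:
        return self._fn(data)


def extract_with_fallback(
    data: bytes,
    mime_type: str,
    primary: ExtractionProvider,
    fallback: ExtractionProvider,
    primary_enabled: bool,
) -> ExtractionResult:
    """Try the primary provider when the flag is on; otherwise use the fallback."""
    if primary_enabled and primary.supports(mime_type):
        try:
            return ExtractionResult(primary.extract(data, mime_type), primary.name)
        except Exception:
            # Any primary failure is logged and degrades to the existing extractor.
            logger.exception("primary provider %s failed; falling back", primary.name)
    return ExtractionResult(fallback.extract(data, mime_type), fallback.name)
```

Recording the provider name on each result is one way to let the benchmark and observability work attribute quality and latency per backend; whether that belongs in the result or in metrics only is open.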

Scope

In scope

  • extraction/parsing layer boundaries and provider model,
  • migration strategy from current extractor to provider-based model,
  • deployment model (in-process vs sidecar/service),
  • observability and failure handling expectations.

Out of scope

  • replacing tier-2/tier-3 semantic enrichment logic,
  • an immediate full rewrite of the ingestion pipeline.

Decision Criteria

  • Extraction quality across representative MIME types
  • Throughput/latency under current ingestion load
  • Operational complexity (dependencies, packaging, runtime footprint)
  • Failure modes and graceful fallback behavior
  • Security/compliance constraints for local-only processing

Suggested Deliverables

  • Architecture decision (ADR/RFC conclusion)
  • Minimal spike implementation with feature flag
  • Benchmark + quality report against current extractor
  • Rollout plan with fallback and abort criteria

Open Questions

  1. Should Docling run as a separate service or inside the worker/API runtime?
  2. Which MIME families should migrate first (PDF, Office, HTML, images)?
  3. What quality bar and performance SLOs must be met before defaulting to Docling?
  4. How should extracted structure (tables/layout) map into existing chunking flow?
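
As a starting point for question 4, one possible mapping is sketched below: tables stay atomic, headings open a new chunk, and paragraphs are packed up to a size limit. The `Element` type, its `kind` labels, and `chunk_elements` are hypothetical and do not reflect Docling's document model or the current raged chunker; the sketch only illustrates the kind of rule we would need to agree on.

```python
from dataclasses import dataclass


@dataclass
class Element:
    kind: str  # "heading", "paragraph", or "table" (hypothetical labels)
    text: str


def chunk_elements(elements: list[Element], max_chars: int = 200) -> list[str]:
    """Pack structured elements into text chunks without splitting tables."""
    chunks: list[str] = []
    buffer: list[str] = []
    size = 0

    def flush() -> None:
        nonlocal buffer, size
        if buffer:
            chunks.append("\n".join(buffer))
            buffer, size = [], 0

    for el in elements:
        if el.kind == "table":
            # A table becomes its own chunk so rows are never separated.
            flush()
            chunks.append(el.text)
        elif el.kind == "heading":
            # A heading starts a new chunk and stays with the text that follows.
            flush()
            buffer.append(el.text)
            size = len(el.text)
        else:
            # Paragraphs accumulate until the size limit would be exceeded.
            if buffer and size + len(el.text) > max_chars:
                flush()
            buffer.append(el.text)
            size += len(el.text)
    flush()
    return chunks
```

A table larger than `max_chars` is still emitted whole here; whether oversized tables should be split by row is part of the open question.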

Acceptance

This RFC is complete when we have:

  • a selected target architecture,
  • an agreed migration plan,
  • explicit go/no-go criteria for Docling default adoption.
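
The go/no-go criteria could be written down as explicit thresholds that a benchmark report is checked against. The Python sketch below shows one way to do that; the metric names (`quality_score`, `p95_latency_ms`, `fallback_rate`) and the `RolloutGates` and `BenchmarkReport` types are placeholders, and any threshold values would have to come from the benchmark work, not from this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RolloutGates:
    min_quality_score: float   # hypothetical: share of benchmark docs judged acceptable
    max_p95_latency_ms: float  # hypothetical: p95 extraction latency ceiling
    max_fallback_rate: float   # hypothetical: share of requests that fell back


@dataclass(frozen=True)
class BenchmarkReport:
    quality_score: float
    p95_latency_ms: float
    fallback_rate: float


def failed_gates(report: BenchmarkReport, gates: RolloutGates) -> list[str]:
    """Return the names of gates the report fails; an empty list means 'go'."""
    failures: list[str] = []
    if report.quality_score < gates.min_quality_score:
        failures.append("quality")
    if report.p95_latency_ms > gates.max_p95_latency_ms:
        failures.append("latency")
    if report.fallback_rate > gates.max_fallback_rate:
        failures.append("fallback_rate")
    return failures
```

Returning the list of failed gates, instead of a single boolean, keeps the decision auditable: the rollout plan can cite exactly which criterion blocked default adoption.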
