RFC: Extraction architecture direction (Docling as primary parser candidate) #100

## Summary
Evaluate and decide on the long-term extraction architecture for raged, with **Docling** as the leading candidate for document parsing/extraction.

## Context
Current extraction responsibilities are split:
- API-side MIME-specific extraction (HTML/PDF/text/etc.)
- Async enrichment worker for tier-2/tier-3 metadata/entity extraction

This split has worked so far, but the coverage we need and the effort to maintain it both grow with every new format we support.

## Problem
We currently own format-specific extraction logic and parser edge-case handling in-house. This creates:
- ongoing maintenance burden,
- uneven format support,
- duplicated effort versus mature document parsing projects.

## Proposal (RFC)
Adopt a **pluggable extraction architecture** and evaluate **Docling** as the primary extraction backend for complex document types.

### Draft direction
1. Keep the enrichment queue/worker contracts stable.
2. Introduce an extraction provider abstraction at the ingestion boundary.
3. Start with an optional Docling provider behind a feature flag/config option.
4. Keep the existing extractor as a fallback during migration.
5. Define rollout gates (quality, latency, cost, operability).
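
To make steps 2 to 4 concrete, here is a minimal sketch of what a provider abstraction with a flag-gated primary and a fallback could look like. All names (`ExtractionProvider`, `LegacyProvider`, `DoclingProvider`, `FallbackExtractor`, `primary_enabled`) are hypothetical and not taken from the raged codebase. The Docling call uses the library's documented quick-start path (`DocumentConverter().convert(...)` followed by `export_to_markdown()`). Whether Markdown is the right output format is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


@dataclass
class ExtractionResult:
    text: str
    provider: str  # name of the backend that produced the text


class ExtractionProvider(Protocol):
    """Hypothetical contract every extraction backend would implement."""

    name: str

    def extract(self, source: str, mime_type: str) -> ExtractionResult: ...


class LegacyProvider:
    """Wraps the existing extractor, passed in as a plain function (assumption)."""

    name = "legacy"

    def __init__(self, extract_fn: Callable[[str, str], str]) -> None:
        self._extract_fn = extract_fn

    def extract(self, source: str, mime_type: str) -> ExtractionResult:
        return ExtractionResult(self._extract_fn(source, mime_type), self.name)


class DoclingProvider:
    """Optional backend; docling is imported lazily so it stays an optional dependency."""

    name = "docling"

    def extract(self, source: str, mime_type: str) -> ExtractionResult:
        from docling.document_converter import DocumentConverter

        result = DocumentConverter().convert(source)  # source: file path or URL
        return ExtractionResult(result.document.export_to_markdown(), self.name)


class FallbackExtractor:
    """Tries the primary provider when the flag is on, else (or on error) the fallback."""

    def __init__(
        self,
        primary: ExtractionProvider,
        fallback: ExtractionProvider,
        primary_enabled: bool,
    ) -> None:
        self.primary = primary
        self.fallback = fallback
        self.primary_enabled = primary_enabled

    def extract(self, source: str, mime_type: str) -> ExtractionResult:
        if self.primary_enabled:
            try:
                return self.primary.extract(source, mime_type)
            except Exception:
                # Real code should log and emit a metric here before falling back.
                pass
        return self.fallback.extract(source, mime_type)
```

With this shape, the enrichment worker only ever sees an `ExtractionResult`, so the queue/worker contracts stay untouched while the backend behind the flag changes.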

## Scope
### In scope
- extraction/parsing layer boundaries and provider model,
- migration strategy from current extractor to provider-based model,
- deployment model (in-process vs sidecar/service),
- observability and failure handling expectations.

### Out of scope
- replacing tier-2/tier-3 semantic enrichment logic,
- immediate full rewrite of ingestion pipeline.

## Decision Criteria
- Extraction quality across representative MIME types
- Throughput/latency under current ingestion load
- Operational complexity (dependencies, packaging, runtime footprint)
- Failure modes and graceful fallback behavior
- Security/compliance constraints for local-only processing

## Suggested Deliverables
- Architecture decision (ADR/RFC conclusion)
- Minimal spike implementation with feature flag
- Benchmark + quality report against current extractor
- Rollout plan with fallback and abort criteria
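
As a starting point for the benchmark + quality report, a small harness along these lines could run both extractors over the same corpus. This is a sketch only: the `Extractor` signature, the use of hand-checked reference texts, and `difflib` similarity as a quality proxy are all assumptions, and a real report would likely need format-aware metrics (tables, headings, reading order).

```python
import statistics
import time
from difflib import SequenceMatcher
from typing import Callable, Dict, Iterable

# Hypothetical signature: takes a document path/URL, returns extracted text.
Extractor = Callable[[str], str]


def benchmark(
    extractor: Extractor,
    sources: Iterable[str],
    reference_texts: Dict[str, str],
) -> Dict[str, float]:
    """Measure per-document latency and similarity to a reference text."""
    latencies = []
    scores = []
    for source in sources:
        start = time.perf_counter()
        text = extractor(source)
        latencies.append(time.perf_counter() - start)
        # Character-level similarity in [0, 1]; a crude proxy for extraction quality.
        scores.append(SequenceMatcher(None, reference_texts[source], text).ratio())
    return {
        "p50_latency_s": statistics.median(latencies),
        "max_latency_s": max(latencies),
        "mean_similarity": statistics.fmean(scores),
    }
```

Running `benchmark` once per provider on the same `sources` gives directly comparable numbers for the report.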

## Open Questions
1. Should Docling run as a separate service or inside the worker/API runtime?
2. Which MIME families should migrate first (PDF, Office, HTML, images)?
3. What quality bar and performance SLOs must be met before defaulting to Docling?
4. How should extracted structure (tables/layout) map into existing chunking flow?

## Acceptance
This RFC is complete when we have:
- a selected target architecture,
- an agreed migration plan,
- explicit go/no-go criteria for Docling default adoption.
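
One way to keep the go/no-go criteria explicit is to encode them as data and evaluate measured values against them. The sketch below assumes this shape; the `Gate` type and the metric names used with it are hypothetical, and any thresholds are placeholders to be agreed in this RFC, not proposed values.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Gate:
    name: str
    threshold: float
    higher_is_better: bool

    def passes(self, value: float) -> bool:
        if self.higher_is_better:
            return value >= self.threshold
        return value <= self.threshold


def evaluate(gates: List[Gate], measured: Dict[str, float]) -> Tuple[bool, List[str]]:
    """Return (go, failed_gate_names). A missing measurement counts as a failure."""
    failed = [
        gate.name
        for gate in gates
        if gate.name not in measured or not gate.passes(measured[gate.name])
    ]
    return (not failed, failed)
```

Treating a missing measurement as a failure keeps the decision conservative: Docling only becomes the default once every agreed gate has actually been measured and met.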
