Summary
Refactor TMI's content extraction pipeline into a two-layer Source/Extractor architecture, add document access tracking, and implement Google Drive as the proof-of-concept service provider.
Design spec: 2026-04-08-content-providers-design.md
Parent issue: #214 (Phase 1 complete)
Follow-up issue: #249 (Confluence + OneDrive providers, delegated provider infrastructure)
Scope
Infrastructure refactor + Google Drive as the proof-of-concept service provider. Delegated provider infrastructure (token table, encryption, account linking endpoints) and additional providers are tracked in #249.
Implementation Phases
- Source/Extractor refactor: Split existing ContentProvider into ContentSource + ContentExtractor layers; introduce pipeline orchestrator. No behavior change.
- Document access tracking: Add access_status and content_source fields to Document model; URL pattern matcher; creation-time detection; 422 for unconfigured providers.
- Google Drive source: Operator config, service account auth, validate/request access, background poller.
- Timmy session integration: Skip inaccessible documents, skipped_sources in session response, refresh_sources endpoint, request_access endpoint.
- OpenAPI spec updates: New schemas, endpoints, modified schemas.
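The Source/Extractor split in the first phase might look roughly like the sketch below. Only the names ContentSource and ContentExtractor come from this issue; the method names (can_handle, fetch, extract), the Pipeline class, and the in-memory source are hypothetical placeholders for illustration, not the design spec's actual interfaces.

```python
from dataclasses import dataclass
from typing import Protocol


class ContentSource(Protocol):
    """Layer 1: authenticate and fetch raw bytes for a URI."""

    def can_handle(self, uri: str) -> bool: ...

    def fetch(self, uri: str) -> tuple[bytes, str]:
        """Return (raw bytes, content type)."""
        ...


class ContentExtractor(Protocol):
    """Layer 2: turn raw bytes into plain text."""

    def can_handle(self, content_type: str) -> bool: ...

    def extract(self, data: bytes) -> str: ...


@dataclass
class InMemorySource:
    """Hypothetical source backed by a dict, standing in for HTTP or Drive."""

    blobs: dict[str, tuple[bytes, str]]

    def can_handle(self, uri: str) -> bool:
        return uri in self.blobs

    def fetch(self, uri: str) -> tuple[bytes, str]:
        return self.blobs[uri]


class PlainTextExtractor:
    """Hypothetical extractor for text/* content types."""

    def can_handle(self, content_type: str) -> bool:
        return content_type.startswith("text/")

    def extract(self, data: bytes) -> str:
        return data.decode("utf-8")


class Pipeline:
    """Orchestrator: pick a source for the URI, then an extractor for its bytes."""

    def __init__(self, sources, extractors):
        self.sources = list(sources)
        self.extractors = list(extractors)

    def run(self, uri: str) -> str:
        source = next((s for s in self.sources if s.can_handle(uri)), None)
        if source is None:
            raise LookupError(f"no source for {uri}")
        data, content_type = source.fetch(uri)
        extractor = next(
            (e for e in self.extractors if e.can_handle(content_type)), None
        )
        if extractor is None:
            raise LookupError(f"no extractor for {content_type}")
        return extractor.extract(data)
```

With this shape, adding a provider such as Google Drive would mean adding one source that yields bytes, while the existing extractors are reused unchanged.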
Key Design Decisions
- Two-layer pipeline: Sources (auth + fetch bytes) separated from Extractors (bytes → text)
- Two provider categories: Service providers (operator credentials) and Delegated providers (per-user OAuth tokens)
- Google Drive auth: Regular bot account (share-with-account model), least privilege
- Document access tracking: access_status and content_source fields on Document model
- Hybrid validation: Synchronous access check at document creation, async background poller for pending access
- Unconfigured providers: Reject with 422 (clear, actionable error)
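The hybrid validation and unconfigured-provider decisions above can be sketched as follows. This is a minimal illustration under stated assumptions: the URL patterns, the status strings ("accessible", "pending_access", "unknown"), and the function names are all hypothetical; only access_status, content_source, and the 422 response come from this issue.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical URL patterns; the real matcher's rules are not given here.
PROVIDER_PATTERNS = {
    "google_drive": re.compile(r"^https://(docs|drive)\.google\.com/"),
    "confluence": re.compile(r"^https://[^/]+\.atlassian\.net/wiki/"),
}


class UnconfiguredProviderError(Exception):
    """Raised at creation time; the API layer would map this to HTTP 422."""

    status_code = 422


@dataclass
class Document:
    uri: str
    content_source: Optional[str] = None
    access_status: str = "unknown"  # assumed value, not from the spec


def detect_provider(uri):
    """Return the first provider whose pattern matches the URI, else None."""
    for name, pattern in PROVIDER_PATTERNS.items():
        if pattern.search(uri):
            return name
    return None


def create_document(uri, configured, check_access):
    """Synchronous half of hybrid validation, run at document creation."""
    provider = detect_provider(uri)
    if provider is None:
        # No provider pattern matched; left to the generic sources.
        return Document(uri)
    if provider not in configured:
        raise UnconfiguredProviderError(
            f"{provider} is not configured on this server; "
            "ask an operator to enable it"
        )
    status = "accessible" if check_access(uri) else "pending_access"
    return Document(uri, content_source=provider, access_status=status)


def poll_pending(documents, check_access):
    """One pass of the background poller over documents awaiting access."""
    for doc in documents:
        if doc.access_status == "pending_access" and check_access(doc.uri):
            doc.access_status = "accessible"
```

Here check_access is an injected callable, so the same flow can be tested without real Drive credentials; in the sketch it stands in for whatever the Google Drive source uses to validate access.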
Acceptance Criteria
- Existing content providers (HTTP, PDF, direct text, JSON/DFD) work identically after refactor
- Google Drive documents can be added and accessed via service account
- Documents with pending access are tracked and polled
- Unconfigured provider URLs return 422 with actionable message
- Timmy sessions skip inaccessible documents and report what was skipped
- OpenAPI spec updated with new schemas and endpoints
- Unit tests for each new component
- Integration tests for Google Drive access flow
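As an illustration of the session criterion above (skip inaccessible documents and report what was skipped), a unit-testable sketch could look like this. The SessionResponse shape, the loaded_sources field, the "reason" key, and the status strings are assumptions made for the example; only skipped_sources and access_status are named in this issue.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    uri: str
    access_status: str  # e.g. "accessible" or "pending_access" (assumed values)


@dataclass
class SessionResponse:
    loaded_sources: list = field(default_factory=list)  # hypothetical field
    skipped_sources: list = field(default_factory=list)


def build_session(documents, fetch_text):
    """Load accessible documents and record every document that was skipped."""
    response = SessionResponse()
    for doc in documents:
        if doc.access_status != "accessible":
            response.skipped_sources.append(
                {"uri": doc.uri, "reason": doc.access_status}
            )
            continue
        fetch_text(doc.uri)  # stand-in for running the extraction pipeline
        response.loaded_sources.append(doc.uri)
    return response
```

A test can pass a recording stub as fetch_text and assert that it is never called for documents that end up in skipped_sources.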