feat(timmy): content provider infrastructure and Google Drive source #232

@ericfitz

Summary

Refactor TMI's content extraction pipeline into a two-layer Source/Extractor architecture, add document access tracking, and implement Google Drive as the proof-of-concept service provider.

Design spec: 2026-04-08-content-providers-design.md
Parent issue: #214 (Phase 1 complete)
Follow-up issue: #249 (Confluence + OneDrive providers, delegated provider infrastructure)

Scope

Infrastructure refactor + Google Drive as the proof-of-concept service provider. Delegated provider infrastructure (token table, encryption, account linking endpoints) and additional providers are tracked in #249.

Implementation Phases

  1. Source/Extractor refactor — Split existing ContentProvider into ContentSource + ContentExtractor layers; introduce pipeline orchestrator. No behavior change.
  2. Document access tracking — Add access_status and content_source fields to Document model; URL pattern matcher; creation-time detection; 422 for unconfigured providers.
  3. Google Drive source — Operator config, service account auth, validate/request access, background poller.
  4. Timmy session integration — Skip inaccessible documents, skipped_sources in session response, refresh_sources endpoint, request_access endpoint.
  5. OpenAPI spec updates — New schemas, endpoints, modified schemas.
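
The split described in phase 1 can be illustrated with a minimal sketch. It is not the project's implementation: the method names (`can_handle`, `fetch`, `extract`), the `Pipeline` class, and the toy `text:` URI scheme are all assumptions made for illustration. Only the ContentSource and ContentExtractor layer names come from this issue.

```python
from typing import Optional, Protocol


class ContentSource(Protocol):
    """Layer 1 (hypothetical interface): authenticate and fetch raw bytes."""

    def can_handle(self, uri: str) -> bool: ...
    def fetch(self, uri: str) -> tuple[bytes, str]: ...  # (data, content type)


class ContentExtractor(Protocol):
    """Layer 2 (hypothetical interface): turn raw bytes into text."""

    def can_handle(self, content_type: str) -> bool: ...
    def extract(self, data: bytes) -> str: ...


class InlineTextSource:
    """Toy source for the sketch: 'text:<body>' URIs carry their own content."""

    def can_handle(self, uri: str) -> bool:
        return uri.startswith("text:")

    def fetch(self, uri: str) -> tuple[bytes, str]:
        return uri[len("text:"):].encode("utf-8"), "text/plain"


class PlainTextExtractor:
    """Toy extractor for the sketch: decodes UTF-8 plain text."""

    def can_handle(self, content_type: str) -> bool:
        return content_type.startswith("text/plain")

    def extract(self, data: bytes) -> str:
        return data.decode("utf-8")


class Pipeline:
    """Orchestrator: pick the first matching source, then the first matching extractor."""

    def __init__(self, sources: list, extractors: list) -> None:
        self.sources = sources
        self.extractors = extractors

    def run(self, uri: str) -> str:
        source: Optional[ContentSource] = next(
            (s for s in self.sources if s.can_handle(uri)), None
        )
        if source is None:
            raise LookupError(f"no source can handle {uri!r}")
        data, content_type = source.fetch(uri)
        extractor: Optional[ContentExtractor] = next(
            (e for e in self.extractors if e.can_handle(content_type)), None
        )
        if extractor is None:
            raise LookupError(f"no extractor for content type {content_type!r}")
        return extractor.extract(data)
```

In this sketch a new provider such as Google Drive would only add a source, while the existing PDF or JSON/DFD handling would live in extractors that any source can reuse.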

Key Design Decisions

  • Two-layer pipeline: Sources (auth + fetch bytes) separated from Extractors (bytes → text)
  • Two provider categories: Service providers (operator credentials) and Delegated providers (per-user OAuth tokens)
  • Google Drive auth: Regular bot account (share-with-account model), least privilege
  • Document access tracking: access_status and content_source fields on Document model
  • Hybrid validation: Synchronous access check at document creation, async background poller for pending access
  • Unconfigured providers: Reject with 422 (clear, actionable error)
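
The creation-time flow implied by the last three decisions might look roughly like the sketch below. The URL patterns, the `accessible` and `pending_access` status values, the `http` fallback name, and every function and class name are assumptions for illustration. The issue itself specifies only the access_status and content_source fields and the 422 response.

```python
import re
from dataclasses import dataclass
from typing import Callable

# Hypothetical URL patterns; the real matcher may use different rules.
PROVIDER_PATTERNS = {
    "google_drive": re.compile(r"^https://(drive|docs)\.google\.com/"),
    "confluence": re.compile(r"^https://[^/]+\.atlassian\.net/wiki/"),
}


class UnconfiguredProviderError(Exception):
    """Raised when a URL belongs to a provider the operator has not configured."""

    status_code = 422  # the API layer would map this to HTTP 422


@dataclass
class DocumentAccess:
    content_source: str
    access_status: str  # assumed values: "accessible" or "pending_access"


def detect_content_source(url: str) -> str:
    """Return the provider whose pattern matches the URL, else plain 'http'."""
    for name, pattern in PROVIDER_PATTERNS.items():
        if pattern.match(url):
            return name
    return "http"


def classify_new_document(
    url: str,
    configured: set,
    check_access: Callable[[str], bool],
) -> DocumentAccess:
    """Synchronous check at creation time; failures are left for a background poller."""
    source = detect_content_source(url)
    if source == "http":
        # Assumption: plain HTTP documents need no provider-level access check.
        return DocumentAccess(source, "accessible")
    if source not in configured:
        raise UnconfiguredProviderError(
            f"provider {source!r} is not configured; ask the operator to enable it"
        )
    status = "accessible" if check_access(url) else "pending_access"
    return DocumentAccess(source, status)
```

A background poller would then periodically re-run `check_access` for documents left in the pending state and update their status once the document has been shared with the bot account.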

Acceptance Criteria

  • Existing content providers (HTTP, PDF, direct text, JSON/DFD) work identically after refactor
  • Google Drive documents can be added and accessed via service account
  • Documents with pending access are tracked and polled
  • Unconfigured provider URLs return 422 with actionable message
  • Timmy sessions skip inaccessible documents and report what was skipped
  • OpenAPI spec updated with new schemas and endpoints
  • Unit tests for each new component
  • Integration tests for Google Drive access flow
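
The session criterion (skip inaccessible documents and report what was skipped) can be sketched as a simple partition step. The `Document` shape, the `accessible` status value, and the layout of each skipped_sources entry are assumptions here. The issue names only the access_status, content_source, and skipped_sources fields.

```python
from dataclasses import dataclass


@dataclass
class Document:
    id: str
    url: str
    access_status: str   # assumed values, e.g. "accessible" or "pending_access"
    content_source: str


def partition_for_session(documents: list) -> tuple:
    """Split documents into those a session can read and a skipped_sources report."""
    usable = []
    skipped_sources = []
    for doc in documents:
        if doc.access_status == "accessible":
            usable.append(doc)
        else:
            # Hypothetical entry shape for the session response.
            skipped_sources.append(
                {
                    "document_id": doc.id,
                    "content_source": doc.content_source,
                    "access_status": doc.access_status,
                }
            )
    return usable, skipped_sources
```

Returning the skipped entries alongside the usable documents lets the session response tell the user which sources were left out and why, so they can act on it through the request_access or refresh_sources endpoints.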

Metadata

Assignees

Labels

enhancement: New feature or request

Projects

Status

Done

Relationships

None yet

Development

No branches or pull requests
