Summary
Consolidate metadata from all connected sources into a searchable, browsable data catalog. Show basic lineage (which source feeds which agent/report/alert) as an interactive graph. The catalog is read-only with respect to source data: no data movement, just visibility.
Problem
- Users connect many sources but lose track of what data exists where
- No single view of all tables, columns, and their usage across the platform
- When a source goes down, there's no quick way to see what agents/reports are affected
- No search across all sources ("which table has a customer_email column?")
- No tagging or classification (PII, financial, internal, public)
Proposed Solution
Data Catalog
- Asset inventory: All sources, tables, and columns in one searchable view
- Search: Full-text search across table names, column names, and descriptions
- Filters: By source type, data type, tag, freshness, row count range
- Column explorer: Click any table → see all columns with types, stats, descriptions
- Tags & classification: User-defined tags (PII, financial, staging, production), plus optional tags auto-suggested by an LLM
- Freshness indicators: Last sync/update time per source, staleness warnings
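A minimal sketch of how the search and filter behavior above might work over a flattened asset list. The `ColumnAsset` class, its field names, and `search_columns` are hypothetical names chosen for illustration, and a plain case-insensitive substring match stands in for real full-text search.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnAsset:
    """One catalog row. Field names are illustrative, not the real schema."""
    source_name: str
    source_type: str
    table: str
    column: str
    data_type: str
    description: str = ""
    tags: set[str] = field(default_factory=set)


def search_columns(assets, query, source_type=None, tag=None):
    """Case-insensitive substring match on table, column, and description.

    Optional filters narrow the candidates before matching.
    """
    needle = query.lower()
    results = []
    for asset in assets:
        if source_type is not None and asset.source_type != source_type:
            continue
        if tag is not None and tag not in asset.tags:
            continue
        haystack = " ".join([asset.table, asset.column, asset.description]).lower()
        if needle in haystack:
            results.append(asset)
    return results


# Made-up sample data for the "which table has a customer_email column?" case.
assets = [
    ColumnAsset("crm", "sql_database", "customers", "customer_email", "varchar", tags={"PII"}),
    ColumnAsset("crm", "sql_database", "orders", "total", "numeric", tags={"financial"}),
    ColumnAsset("events", "csv", "signups", "customer_email", "string"),
]
hits = search_columns(assets, "customer_email", source_type="sql_database")
```

Here `hits` contains only the `customers` row, because the `signups` row is excluded by the source type filter.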
Lineage Viewer
- Source → Agent lineage: Which agents use which sources
- Source → Report lineage: Which report templates reference which sources
- Source → Alert lineage: Which alerts query which sources
- Interactive graph: Click any node to see details, highlight upstream/downstream
- Impact analysis: Select a source → see all dependent assets highlighted
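Impact analysis can be read as a forward traversal of the lineage graph. The sketch below assumes edges are `(upstream, downstream)` pairs of string node IDs. The `source:`/`agent:` prefixes and the `downstream` helper are illustrative, not an existing API.

```python
from collections import defaultdict, deque


def downstream(edges, start):
    """Return every node reachable from `start` by following edges forward."""
    children = defaultdict(list)
    for src, dst in edges:
        children[src].append(dst)

    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in children[node]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    # The start node itself is not a dependent asset.
    return seen - {start}


# Made-up lineage edges: two sources feeding agents, reports, and alerts.
edges = [
    ("source:crm", "agent:churn"),
    ("source:crm", "report:weekly"),
    ("source:billing", "alert:overdue"),
    ("source:billing", "report:weekly"),
]
affected = downstream(edges, "source:crm")
```

With today's one-hop lineage (source to agent/report/alert) a simple filter would do. The breadth-first walk only matters if deeper chains are added later.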
How It Works
- Catalog data: Already exists in `Source.metadata_` (table_infos, sample_profile) — just needs a unified presentation layer
- Lineage data: Already exists in `Agent.source_ids`, `ReportTemplate` queries, `Alert.source_id` — just needs graph construction
- No new data collection: Everything is read from existing models
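A rough sketch of the two derivations described above, using plain dicts in place of the ORM models. The internal shape of `table_infos` (a list of tables, each with `name` and `columns`) is an assumption, since the actual structure of `Source.metadata_` is not specified here. `ReportTemplate` edges are left out because how its queries reference sources is not specified either.

```python
def catalog_rows(source):
    """Flatten one source's metadata into (source_id, table, column, type) rows.

    Assumes metadata_["table_infos"] is a list of
    {"name": ..., "columns": [{"name": ..., "type": ...}]} dicts.
    """
    rows = []
    for table in source["metadata_"].get("table_infos", []):
        for col in table.get("columns", []):
            rows.append((source["id"], table["name"], col["name"], col.get("type")))
    return rows


def lineage_edges(agents, alerts):
    """Build (upstream, downstream) edges from Agent.source_ids and Alert.source_id."""
    edges = []
    for agent in agents:
        for source_id in agent["source_ids"]:
            edges.append((f"source:{source_id}", f"agent:{agent['id']}"))
    for alert in alerts:
        edges.append((f"source:{alert['source_id']}", f"alert:{alert['id']}"))
    return edges


# Made-up records standing in for rows loaded from the existing models.
source = {
    "id": "crm",
    "metadata_": {
        "table_infos": [
            {"name": "customers", "columns": [
                {"name": "id", "type": "integer"},
                {"name": "customer_email", "type": "varchar"},
            ]},
        ]
    },
}
agents = [{"id": "churn", "source_ids": ["crm", "billing"]}]
alerts = [{"id": "overdue", "source_id": "billing"}]

rows = catalog_rows(source)
edges = lineage_edges(agents, alerts)
```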
Technical Notes
- No new tables needed: Catalog is a read-only view over existing `Source`, `Agent`, `ReportTemplate`, `Alert` models
- Tags: Add optional `tags` JSON field to `Source` model (new Alembic migration)
- Search endpoint: `GET /api/catalog/search?q=customer_email&type=column&source_type=sql_database`
- Lineage endpoint: `GET /api/catalog/lineage?source_id=xxx` → returns graph edges
- Frontend: Two new pages:
  - `/catalog`: searchable asset inventory with filters
  - `/catalog/lineage`: interactive graph (React Flow or d3)
- LLM tagging: Optional auto-classification of columns (detect PII patterns like email, SSN, phone)
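As one possible complement to LLM tagging, a cheap name-based heuristic could pre-suggest PII tags before (or instead of) calling a model. This is a regex check on column names, not the LLM classification itself. The patterns and the `suggest_pii_tags` helper are illustrative assumptions and will miss many real-world naming schemes.

```python
import re

# Hypothetical name patterns; each matches a whole snake_case word in the column name.
PII_NAME_PATTERNS = {
    "email": re.compile(r"(^|_)e?mail($|_)"),
    "ssn": re.compile(r"(^|_)(ssn|social_security)($|_)"),
    "phone": re.compile(r"(^|_)(phone|mobile|tel)($|_)"),
}


def suggest_pii_tags(column_name):
    """Return the sorted PII kinds whose pattern matches the column name."""
    name = column_name.lower()
    return sorted(
        kind for kind, pattern in PII_NAME_PATTERNS.items() if pattern.search(name)
    )


suggestions = {
    name: suggest_pii_tags(name)
    for name in ["customer_email", "phone_number", "SSN", "order_total"]
}
```

Suggestions from a heuristic like this would still need user confirmation before being stored as tags, in the same way as LLM-suggested ones.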
Acceptance Criteria