Data Catalog & Lineage Viewer (Read-Only Asset Inventory) #117

@Empreiteiro

Summary

Consolidate metadata from all connected sources into a searchable, browsable data catalog. Show basic lineage (which source feeds which agent, report, or alert) as an interactive graph. The feature is read-only: no data movement, just visibility.

Problem

  • Users connect many sources but lose track of what data exists where
  • No single view of all tables, columns, and their usage across the platform
  • When a source goes down, there's no quick way to see what agents/reports are affected
  • No search across all sources ("which table has a customer_email column?")
  • No tagging or classification (PII, financial, internal, public)

Proposed Solution

Data Catalog

  1. Asset inventory: All sources, tables, and columns in one searchable view
  2. Search: Full-text search across table names, column names, and descriptions
  3. Filters: By source type, data type, tag, freshness, row count range
  4. Column explorer: Click any table → see all columns with types, stats, descriptions
  5. Tags & classification: User-defined tags (PII, financial, staging, production), plus tags auto-suggested by an LLM
  6. Freshness indicators: Last sync/update time per source, staleness warnings
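
A minimal sketch of the asset inventory and search described above, assuming a hypothetical metadata shape: each source is a dict whose `metadata["table_infos"]` is a list of tables with `name` and `columns` keys. The real layout of `Source.metadata_` may differ, and `ColumnAsset`, `flatten_sources`, and `search_columns` are illustrative names. The search here is a plain case-insensitive substring match, not true full-text search.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnAsset:
    """One row of the flattened catalog: a single column of a single table."""
    source_name: str
    source_type: str
    table: str
    column: str
    data_type: str
    tags: list[str] = field(default_factory=list)


def flatten_sources(sources: list[dict]) -> list[ColumnAsset]:
    """Turn per-source metadata into one flat list of column assets."""
    assets = []
    for src in sources:
        # Assumed shape (hypothetical): metadata["table_infos"] is a list of
        # {"name": ..., "columns": [{"name": ..., "type": ...}]}
        for table in src.get("metadata", {}).get("table_infos", []):
            for col in table.get("columns", []):
                assets.append(ColumnAsset(
                    source_name=src["name"],
                    source_type=src["type"],
                    table=table["name"],
                    column=col["name"],
                    data_type=col.get("type", "unknown"),
                    tags=list(src.get("tags", [])),
                ))
    return assets


def search_columns(assets, q, source_type=None, tag=None):
    """Substring match on table and column names, with optional filters."""
    needle = q.lower()
    hits = []
    for a in assets:
        if needle not in a.column.lower() and needle not in a.table.lower():
            continue
        if source_type is not None and a.source_type != source_type:
            continue
        if tag is not None and tag not in a.tags:
            continue
        hits.append(a)
    return hits


# Made-up example data for illustration only.
sources = [
    {"name": "crm", "type": "sql_database", "tags": ["production"],
     "metadata": {"table_infos": [
         {"name": "customers", "columns": [
             {"name": "id", "type": "integer"},
             {"name": "customer_email", "type": "text"}]}]}},
    {"name": "exports", "type": "csv", "tags": [],
     "metadata": {"table_infos": [
         {"name": "leads", "columns": [
             {"name": "customer_email", "type": "text"}]}]}},
]

assets = flatten_sources(sources)
for hit in search_columns(assets, "customer_email"):
    print(f"{hit.source_name}.{hit.table}.{hit.column}")
```

Flattening to one row per column keeps the search and filter logic trivial. A production version would more likely push this into a database query or a search index.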

Lineage Viewer

  1. Source → Agent lineage: Which agents use which sources
  2. Source → Report lineage: Which report templates reference which sources
  3. Source → Alert lineage: Which alerts query which sources
  4. Interactive graph: Click any node to see details, highlight upstream/downstream
  5. Impact analysis: Select a source → see all dependent assets highlighted
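
Impact analysis amounts to a forward traversal of the lineage graph. The sketch below assumes edges are `(upstream, downstream)` pairs of string node IDs such as `"source:1"`. That ID scheme and the `downstream` function name are illustrative, not part of the existing codebase.

```python
from collections import defaultdict, deque


def downstream(edges, start):
    """Return every node reachable from `start` by following edges forward."""
    children = defaultdict(list)
    for src, dst in edges:
        children[src].append(dst)

    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in children[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    seen.discard(start)  # only relevant if the graph contains a cycle
    return seen


# Made-up edges for illustration only.
edges = [
    ("source:1", "agent:7"),
    ("source:1", "report:3"),
    ("source:2", "alert:9"),
]

affected = downstream(edges, "source:1")
print(sorted(affected))
```

With only source → agent/report/alert edges the graph is one level deep, so a simple filter would do. The breadth-first version also covers deeper chains if more edge types are added later.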

How It Works

  • Catalog data: Already exists in `Source.metadata_` (table_infos, sample_profile) — just needs a unified presentation layer
  • Lineage data: Already exists in `Agent.source_ids`, `ReportTemplate` queries, `Alert.source_id` — just needs graph construction
  • No new data collection: Everything is read from existing models
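
As a rough sketch of the graph construction step, the code below derives edges from stand-in dataclasses that mirror the fields named above. It assumes each `ReportTemplate` query is a dict that may carry a `source_id` key. That is a guess at the structure, and `build_lineage_edges` is a hypothetical helper name.

```python
from dataclasses import dataclass


@dataclass
class Agent:
    id: int
    source_ids: list[int]


@dataclass
class ReportTemplate:
    id: int
    queries: list[dict]  # assumed: a query dict may carry a "source_id"


@dataclass
class Alert:
    id: int
    source_id: int


def build_lineage_edges(agents, reports, alerts, source_id=None):
    """Derive source -> consumer edges from existing model fields."""
    edges = set()
    for agent in agents:
        for sid in agent.source_ids:
            edges.add((f"source:{sid}", f"agent:{agent.id}"))
    for report in reports:
        for query in report.queries:
            sid = query.get("source_id")
            if sid is not None:  # skip queries with no source reference
                edges.add((f"source:{sid}", f"report:{report.id}"))
    for alert in alerts:
        edges.add((f"source:{alert.source_id}", f"alert:{alert.id}"))

    if source_id is not None:
        # Mirror the optional source filter of the lineage endpoint.
        edges = {e for e in edges if e[0] == f"source:{source_id}"}
    return [{"source": s, "target": t} for s, t in sorted(edges)]


# Made-up records for illustration only.
agents = [Agent(id=7, source_ids=[1, 2])]
reports = [ReportTemplate(id=3, queries=[{"source_id": 1}, {"sql": "select 1"}])]
alerts = [Alert(id=9, source_id=2)]

for edge in build_lineage_edges(agents, reports, alerts):
    print(edge["source"], "->", edge["target"])
```

Collecting edges in a set deduplicates the case where a report has several queries against the same source.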

Technical Notes

  • No new tables needed: Catalog is a read-only view over existing `Source`, `Agent`, `ReportTemplate`, `Alert` models
  • Tags: Add optional `tags` JSON field to `Source` model (new Alembic migration)
  • Search endpoint: `GET /api/catalog/search?q=customer_email&type=column&source_type=sql_database`
  • Lineage endpoint: `GET /api/catalog/lineage?source_id=xxx` → returns graph edges
  • Frontend: Two new pages:
    • `/catalog` — searchable asset inventory with filters
    • `/catalog/lineage` — interactive graph (React Flow or d3)
  • LLM tagging: Optional auto-classification of columns (detect PII patterns like email, SSN, phone)
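
For the PII patterns mentioned in the last bullet, a cheap name-based heuristic could run before (or alongside) any LLM call. The sketch below is not the LLM classification itself. The keyword lists, the `pii:` tag prefix, and the `suggest_tags` name are all assumptions for illustration, and matching on column names alone will miss PII stored under generic names.

```python
import re

# Hypothetical keyword sets; a real deployment would tune these.
PII_KEYWORDS = {
    "email": {"email"},
    "ssn": {"ssn"},
    "phone": {"phone", "mobile"},
}


def name_tokens(column_name: str) -> set[str]:
    """Split snake_case, kebab-case, and camelCase names into lowercase tokens."""
    # Insert a separator at lower-to-upper boundaries: customerEmail -> customer_Email
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", column_name)
    return {t for t in re.split(r"[^A-Za-z0-9]+", spaced.lower()) if t}


def suggest_tags(column_name: str) -> list[str]:
    """Suggest PII tags based purely on the column name."""
    tokens = name_tokens(column_name)
    return sorted(
        f"pii:{kind}"
        for kind, keywords in PII_KEYWORDS.items()
        if tokens & keywords
    )


for name in ["customer_email", "homePhone", "order_total"]:
    print(name, suggest_tags(name))
```

Matching whole tokens rather than substrings avoids false positives such as tagging a column named `telephoneless_flag` style names by accident, at the cost of missing abbreviations that are not in the keyword sets.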

Acceptance Criteria

  • Unified view of all sources, tables, and columns across the platform
  • Full-text search across table names, column names, and descriptions
  • Filter by source type, data type, tags, and freshness
  • Lineage graph showing source → agent/report/alert dependencies
  • Impact analysis: highlight all downstream assets when selecting a source
  • User-defined tags with LLM-suggested auto-classification (PII, financial, etc.)
  • Freshness indicators with staleness warnings
  • Entirely read-only — no data movement or modification

Labels: enhancement (New feature or request)
