Schema Documentation & Data Dictionary Generator #111

@Empreiteiro

Summary

Generate comprehensive data documentation from connected sources: a browsable data dictionary with table/column descriptions (LLM-generated), statistics, relationships, and exportable formats (Markdown, HTML, PDF).

Problem

  • The platform stores rich metadata (table_infos, sample_profile, relationships) but does not present it as documentation
  • Teams lack up-to-date data dictionaries; manually maintained ones go stale quickly
  • Writing column descriptions takes domain knowledge, much of which the LLM can infer from column names, types, and sample values
  • There is no way to export schema information for sharing with stakeholders or onboarding new team members

Proposed Solution

Documentation Layers

  1. Source Overview: Name, type, connection info (redacted), table count, total rows, last sync
  2. Table Documentation: Per-table description (LLM-generated), row count, column count, sample data
  3. Column Documentation: Name, type, description (LLM-generated), nullability, uniqueness, sample values, statistics
  4. Relationship Map: Detected FKs and joins between tables/sources
  5. Data Quality Summary: Completeness score, type consistency, outlier flags
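One possible in-memory shape for these layers is sketched below. The class and field names (`SourceDoc`, `TableDoc`, `ColumnDoc`, `null_count`) are illustrative assumptions, not existing platform types, and the completeness score is assumed here to be the share of non-null cells.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ColumnDoc:
    # Hypothetical column-level record (layer 3).
    name: str
    data_type: str
    nullable: bool
    unique: bool
    description: Optional[str] = None  # LLM-generated or user-edited
    sample_values: list = field(default_factory=list)
    null_count: int = 0


@dataclass
class TableDoc:
    # Hypothetical table-level record (layer 2).
    name: str
    row_count: int
    columns: list[ColumnDoc] = field(default_factory=list)
    description: Optional[str] = None

    def completeness(self) -> float:
        """Share of non-null cells; an empty table counts as complete."""
        cells = self.row_count * len(self.columns)
        if cells == 0:
            return 1.0
        nulls = sum(col.null_count for col in self.columns)
        return 1 - nulls / cells


@dataclass
class SourceDoc:
    # Hypothetical source-level record (layers 1 and 4).
    name: str
    source_type: str
    tables: list[TableDoc] = field(default_factory=list)
    relationships: list[dict] = field(default_factory=list)

    @property
    def table_count(self) -> int:
        return len(self.tables)

    @property
    def total_rows(self) -> int:
        return sum(table.row_count for table in self.tables)
```

Keeping the counts as derived properties means the source overview cannot drift from the table documentation it summarizes.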

LLM-Generated Descriptions

The LLM analyzes column names, data types, sample values, and table context to generate human-readable descriptions:

  • user_id (INTEGER, NOT NULL, UNIQUE) → "Unique identifier for each user account"
  • created_at (TIMESTAMP) → "Timestamp when the record was first created"
  • mrr (DECIMAL) → "Monthly Recurring Revenue in the account's billing currency"
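A minimal sketch of how such a request could be assembled and its reply parsed. The prompt wording, the column dict keys (`name`, `type`, `samples`), and the `complete` callable are assumptions for illustration; `complete` stands in for whatever wrapper the platform puts around `chat_completion()`, whose real signature is not shown in this issue.

```python
import json
from typing import Callable


def build_prompt(table: str, columns: list[dict]) -> str:
    """Build one prompt covering every column of a table."""
    lines = [
        f"Table: {table}",
        "Describe each column in one sentence.",
        "Reply with a JSON object mapping column name to description.",
        "Columns:",
    ]
    for col in columns:
        # Send at most three sample values per column to keep the prompt small.
        samples = ", ".join(str(v) for v in col.get("samples", [])[:3])
        lines.append(f"- {col['name']} ({col['type']}); samples: {samples or 'n/a'}")
    return "\n".join(lines)


def describe_columns(
    table: str, columns: list[dict], complete: Callable[[str], str]
) -> dict[str, str]:
    """Return {column: description}; an unparseable reply yields {}."""
    raw = complete(build_prompt(table, columns))
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return {}
    if not isinstance(parsed, dict):
        return {}
    # Drop any column names the model invented.
    known = {col["name"] for col in columns}
    return {k: str(v).strip() for k, v in parsed.items() if k in known}
```

Filtering the reply against the known column names keeps invented columns out of the dictionary.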

Export Formats

  • In-app browser: Searchable, filterable documentation page
  • Markdown: For GitHub/GitLab wikis
  • HTML: Standalone page for internal portals
  • PDF: For stakeholder distribution
  • dbt docs compatible: schema.yml with descriptions for dbt docs generate
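As an illustration of the Markdown export, the sketch below renders one table as a heading plus a column table. The input dict layout (`name`, `description`, `columns`) is a hypothetical stand-in for whatever the documentation endpoint would return.

```python
def _cell(value) -> str:
    """Make a value safe for a Markdown table cell."""
    text = "" if value is None else str(value)
    # Pipes would split the cell and newlines would break the row.
    return text.replace("|", "\\|").replace("\n", " ")


def render_table_markdown(table: dict) -> str:
    """Render one table's documentation as a Markdown section."""
    out = [f"## {table['name']}", ""]
    if table.get("description"):
        out += [table["description"], ""]
    out += [
        "| Column | Type | Nullable | Description |",
        "| --- | --- | --- | --- |",
    ]
    for col in table["columns"]:
        # Columns without an explicit flag are treated as nullable.
        nullable = "yes" if col.get("nullable", True) else "no"
        out.append(
            f"| {_cell(col['name'])} | {_cell(col['type'])} "
            f"| {nullable} | {_cell(col.get('description'))} |"
        )
    return "\n".join(out) + "\n"
```

Escaping matters here because LLM-generated descriptions are free text and may contain pipes or line breaks.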

Technical Notes

  • Input: Source.metadata_.table_infos, sample_profile, and source_relationships — all already available
  • LLM descriptions: Process columns in batches with chat_completion(), including table context in each prompt to improve accuracy
  • Caching: Store generated descriptions in Source.metadata_ to avoid re-generating on every view
  • New endpoints:
    • GET /api/sources/{id}/documentation — browsable JSON for frontend
    • GET /api/sources/{id}/documentation/export?format=md|html|pdf — downloadable file
  • Frontend: New DataDictionary page component with search, filter by table/type, and expandable column details
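The caching note could work roughly as follows. This is a sketch under assumptions: `metadata` is a plain dict standing in for `Source.metadata_`, and the `column_docs` key and the fingerprint scheme are hypothetical. A column is regenerated only when it has no cached entry or its name or type has changed.

```python
import hashlib
import json


def column_fingerprint(table: str, column: dict) -> str:
    """Short hash of the inputs that should invalidate a cached description."""
    payload = json.dumps([table, column["name"], column["type"]])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def columns_needing_generation(
    metadata: dict, table: str, columns: list[dict]
) -> list[dict]:
    """Return the columns with no cached description or a stale fingerprint."""
    cache = metadata.get("column_docs", {}).get(table, {})
    stale = []
    for col in columns:
        entry = cache.get(col["name"])
        if entry is None or entry.get("fingerprint") != column_fingerprint(table, col):
            stale.append(col)
    return stale


def store_descriptions(
    metadata: dict, table: str, columns: list[dict], descriptions: dict[str, str]
) -> None:
    """Write generated descriptions into the (hypothetical) cache key."""
    cache = metadata.setdefault("column_docs", {}).setdefault(table, {})
    for col in columns:
        if col["name"] in descriptions:
            cache[col["name"]] = {
                "description": descriptions[col["name"]],
                "fingerprint": column_fingerprint(table, col),
            }
```

With this shape, viewing the documentation endpoint triggers LLM calls only for new or changed columns.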

Acceptance Criteria

  • Auto-generate column descriptions using LLM for all connected sources
  • Browsable in-app data dictionary with search and filtering
  • Export as Markdown, HTML, and PDF
  • Include profiling statistics (nulls, unique counts, ranges) per column
  • Show detected relationships between tables
  • Descriptions are cached and editable by users (a user edit overrides the LLM suggestion)
  • Bulk generation: document all sources in one click
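Two of these criteria can be made concrete with a short sketch: the per-column profiling statistics and the rule that a user edit overrides the LLM suggestion. The function names and the `user_description` / `llm_description` keys are assumptions for illustration, and the sketch treats `None` as null and expects hashable scalar values.

```python
from typing import Optional


def profile_column(values: list) -> dict:
    """Compute null count, unique count, and range for one column's values."""
    present = [v for v in values if v is not None]
    low = high = None
    if present:
        try:
            low, high = min(present), max(present)
        except TypeError:
            # Mixed, non-comparable types: leave the range undefined.
            pass
    return {
        "rows": len(values),
        "nulls": len(values) - len(present),
        "unique": len(set(present)),
        "min": low,
        "max": high,
    }


def effective_description(entry: dict) -> Optional[str]:
    """Prefer the user's text; fall back to the LLM suggestion.

    An empty user description counts as "no override" in this sketch.
    """
    return entry.get("user_description") or entry.get("llm_description")
```

Storing the user text separately from the LLM text, rather than overwriting it, would let a regeneration refresh the suggestion without losing the edit.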

Labels: enhancement (New feature or request)