# Schema Documentation & Data Dictionary Generator #111

## Summary

Generate data documentation from connected sources: a browsable data dictionary with LLM-generated table and column descriptions, profiling statistics, and detected relationships, exportable as Markdown, HTML, or PDF.

## Problem

- The platform already stores rich metadata (`table_infos`, `sample_profile`, relationships) but does not present it as documentation
- Teams lack up-to-date data dictionaries; manually maintained ones go stale quickly
- Writing column descriptions by hand requires domain knowledge, much of which the LLM can infer from column names, types, and sample values
- There is no way to export schema information for sharing with stakeholders or onboarding new team members

## Proposed Solution

### Documentation Layers

1. **Source Overview**: Name, type, connection info (redacted), table count, total rows, last sync
2. **Table Documentation**: Per-table description (LLM-generated), row count, column count, sample data
3. **Column Documentation**: Name, type, description (LLM-generated), nullability, uniqueness, sample values, statistics
4. **Relationship Map**: Detected FKs and joins between tables/sources
5. **Data Quality Summary**: Completeness score, type consistency, outlier flags
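The column statistics and completeness score in layers 3 and 5 could be computed along these lines. This is a minimal sketch, not the platform's profiler: `ColumnProfile`, `profile_column`, and the definition of completeness as the fraction of non-null values are assumptions made for illustration, and values are assumed to be hashable and mutually comparable.

```python
from dataclasses import dataclass
from typing import Any, Optional, Sequence


@dataclass
class ColumnProfile:
    """Hypothetical per-column statistics record (not an existing platform type)."""
    name: str
    row_count: int
    null_count: int
    unique_count: int
    min_value: Optional[Any]
    max_value: Optional[Any]

    @property
    def completeness(self) -> float:
        # Assumed definition: fraction of non-null values; 1.0 for an empty column.
        if self.row_count == 0:
            return 1.0
        return 1 - self.null_count / self.row_count

    @property
    def is_unique(self) -> bool:
        # True when there is at least one non-null value and all are distinct.
        non_null = self.row_count - self.null_count
        return non_null > 0 and self.unique_count == non_null


def profile_column(name: str, values: Sequence[Any]) -> ColumnProfile:
    """Compute null count, distinct count, and range over sampled values."""
    non_null = [v for v in values if v is not None]
    return ColumnProfile(
        name=name,
        row_count=len(values),
        null_count=len(values) - len(non_null),
        unique_count=len(set(non_null)),
        min_value=min(non_null) if non_null else None,
        max_value=max(non_null) if non_null else None,
    )
```

In practice these numbers would more likely come from the existing `sample_profile` than be recomputed; the sketch only pins down what each statistic means.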

### LLM-Generated Descriptions

The LLM analyzes column names, data types, sample values, and table context to generate human-readable descriptions:
- `user_id (INTEGER, NOT NULL, UNIQUE)` → "Unique identifier for each user account"
- `created_at (TIMESTAMP)` → "Timestamp when the record was first created"
- `mrr (DECIMAL)` → "Monthly Recurring Revenue in the account's billing currency"
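One way to turn the inputs above into a single LLM request per table is sketched below. The prompt wording, the JSON reply format, and the `complete` callable are all assumptions: `complete` stands in for the platform's `chat_completion()`, whose real signature is not described in this issue.

```python
import json
from typing import Callable, Dict, List


def build_prompt(table: str, columns: List[dict]) -> str:
    """Build one prompt covering every column of a table (hypothetical wording)."""
    lines = [
        f"Table: {table}",
        "Write a one-sentence description for each column below.",
        "Reply with a JSON object mapping column name to description.",
        "",
    ]
    for col in columns:
        samples = ", ".join(str(s) for s in col.get("samples", []))
        lines.append(f"- {col['name']} ({col['type']}); samples: {samples}")
    return "\n".join(lines)


def describe_columns(
    table: str,
    columns: List[dict],
    complete: Callable[[str], str],
) -> Dict[str, str]:
    """Ask the model for descriptions and keep only well-formed answers.

    `complete` takes a prompt and returns the raw reply text; it is a
    placeholder for the real LLM call.
    """
    reply = complete(build_prompt(table, columns))
    try:
        parsed = json.loads(reply)
    except json.JSONDecodeError:
        return {}  # Unparseable reply: caller can retry or leave descriptions empty.
    if not isinstance(parsed, dict):
        return {}
    wanted = {c["name"] for c in columns}
    # Drop columns the model invented and any non-string values.
    return {
        name: text.strip()
        for name, text in parsed.items()
        if name in wanted and isinstance(text, str)
    }
```

Filtering the reply against the known column names keeps hallucinated columns out of the dictionary, and injecting `complete` makes the function testable without a live model.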

### Export Formats

- **In-app browser**: Searchable, filterable documentation page
- **Markdown**: For GitHub/GitLab wikis
- **HTML**: Standalone page for internal portals
- **PDF**: For stakeholder distribution
- **dbt-compatible**: `schema.yml` with descriptions, consumable by `dbt docs generate`
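As an illustration of the Markdown export, the sketch below renders one table as a heading plus a column table. The input dict shape (`name`, `description`, `columns`) and the function names are hypothetical; the real export would read from `table_infos`.

```python
from typing import List


def escape_cell(text: str) -> str:
    """Keep a value from breaking the Markdown table row."""
    return str(text).replace("|", "\\|").replace("\n", " ")


def render_table_markdown(table: dict) -> str:
    """Render one documented table as a Markdown section (assumed input shape)."""
    out: List[str] = [f"## {table['name']}", ""]
    if table.get("description"):
        out += [table["description"], ""]
    out += [
        "| Column | Type | Nullable | Description |",
        "| --- | --- | --- | --- |",
    ]
    for col in table["columns"]:
        # Columns without explicit nullability are treated as nullable (assumption).
        nullable = "yes" if col.get("nullable", True) else "no"
        out.append(
            f"| `{col['name']}` | {col['type']} | {nullable} | "
            f"{escape_cell(col.get('description', ''))} |"
        )
    return "\n".join(out) + "\n"
```

The HTML and PDF exports could be produced from the same intermediate structure, so that all formats stay consistent with the in-app view.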

## Technical Notes

- **Input**: `Source.metadata_.table_infos`, `sample_profile`, and `source_relationships`, all of which are already available
- **LLM descriptions**: Process columns in batches with `chat_completion()`, including table context in each prompt to improve accuracy
- **Caching**: Store generated descriptions in `Source.metadata_` to avoid re-generating on every view
- **New endpoints**:
  - `GET /api/sources/{id}/documentation` — browsable JSON for frontend
  - `GET /api/sources/{id}/documentation/export?format=md|html|pdf` — downloadable file
- **Frontend**: New `DataDictionary` page component with search, filter by table/type, and expandable column details
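The caching note above could work roughly as follows. This is a sketch under assumptions: the plain dict `cache` stands in for a section of `Source.metadata_`, and fingerprinting only the column name and type (so that changing sample values do not trigger regeneration) is a design choice made for illustration, not something the issue specifies.

```python
import hashlib
import json
from typing import Callable, Dict


def column_fingerprint(column: dict) -> str:
    """Hash the fields that should invalidate a cached description when they change."""
    payload = json.dumps(
        {"name": column["name"], "type": column["type"]},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def get_description(
    cache: Dict[str, dict],
    table: str,
    column: dict,
    generate: Callable[[dict], str],
) -> str:
    """Return the cached description, regenerating only if the column changed.

    `generate` is a placeholder for the LLM-backed generator.
    """
    key = f"{table}.{column['name']}"
    fingerprint = column_fingerprint(column)
    entry = cache.get(key)
    if entry and entry["fingerprint"] == fingerprint:
        return entry["text"]
    text = generate(column)
    cache[key] = {"fingerprint": fingerprint, "text": text}
    return text
```

With this shape, viewing the documentation page costs no LLM calls once descriptions exist, and a schema change such as a column type change regenerates only the affected entries.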

## Acceptance Criteria

- [ ] Auto-generate column descriptions using the LLM for all connected sources
- [ ] Browsable in-app data dictionary with search and filtering
- [ ] Export as Markdown, HTML, and PDF
- [ ] Include profiling statistics (nulls, unique counts, ranges) per column
- [ ] Show detected relationships between tables
- [ ] Descriptions are cached and editable by users (a user edit overrides the LLM suggestion)
- [ ] Bulk generation: document all sources in one click
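The "cached and editable" criterion implies a precedence rule between user edits and LLM output. A minimal sketch, assuming a hypothetical `ColumnDescription` record that stores both values side by side:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ColumnDescription:
    """Hypothetical stored record: the LLM suggestion plus an optional user edit."""
    generated: Optional[str] = None
    override: Optional[str] = None

    @property
    def effective(self) -> str:
        # A non-blank user edit always wins over the LLM suggestion.
        if self.override is not None and self.override.strip():
            return self.override.strip()
        return self.generated or ""

    @property
    def source(self) -> str:
        # Lets the UI label where the displayed text came from.
        if self.override is not None and self.override.strip():
            return "user"
        return "llm" if self.generated else "none"

    def regenerate(self, new_text: str) -> None:
        # Regeneration refreshes the suggestion but never touches the user edit.
        self.generated = new_text
```

Keeping the two values separate means bulk regeneration cannot silently discard manual edits, and clearing an override falls back to the latest suggestion.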
