Summary
Generate comprehensive data documentation from connected sources: a browsable data dictionary with LLM-generated table and column descriptions, statistics, and relationships, exportable as Markdown, HTML, or PDF.
Problem
- The platform stores rich metadata (table_infos, sample_profile, relationships) but does not present it as documentation
- Teams lack up-to-date data dictionaries — they go stale quickly when maintained manually
- Writing column descriptions by hand requires domain knowledge, much of which the LLM can infer from column names, types, and sample values
- No way to export schema information for sharing with stakeholders or onboarding new team members
Proposed Solution
Documentation Layers
- Source Overview: Name, type, connection info (redacted), table count, total rows, last sync
- Table Documentation: Per-table description (LLM-generated), row count, column count, sample data
- Column Documentation: Name, type, description (LLM-generated), nullability, uniqueness, sample values, statistics
- Relationship Map: Detected FKs and joins between tables/sources
- Data Quality Summary: Completeness score, type consistency, outlier flags
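As a rough illustration of how these layers could nest, here is a minimal Python sketch. The class and field names (SourceDoc, TableDoc, ColumnDoc) are hypothetical and not part of the existing codebase. The data quality layer is left out because its scoring is not defined here.

```python
from dataclasses import dataclass, field, asdict
from typing import Dict, List, Optional


@dataclass
class ColumnDoc:
    """One entry in the Column Documentation layer (hypothetical shape)."""
    name: str
    data_type: str
    description: Optional[str] = None  # LLM-generated, filled in later
    nullable: bool = True
    unique: bool = False
    sample_values: List[str] = field(default_factory=list)
    statistics: Dict[str, float] = field(default_factory=dict)


@dataclass
class TableDoc:
    """One entry in the Table Documentation layer (hypothetical shape)."""
    name: str
    row_count: int
    description: Optional[str] = None  # LLM-generated, filled in later
    columns: List[ColumnDoc] = field(default_factory=list)

    @property
    def column_count(self) -> int:
        return len(self.columns)


@dataclass
class SourceDoc:
    """Source Overview plus the nested tables and relationship map."""
    name: str
    source_type: str
    last_sync: Optional[str] = None
    tables: List[TableDoc] = field(default_factory=list)
    relationships: List[Dict[str, str]] = field(default_factory=list)

    @property
    def table_count(self) -> int:
        return len(self.tables)

    @property
    def total_rows(self) -> int:
        return sum(t.row_count for t in self.tables)
```

Calling asdict() on a SourceDoc yields a plain nested dict, which is one possible basis for the JSON served to the frontend.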
LLM-Generated Descriptions
The LLM analyzes column names, data types, sample values, and table context to generate human-readable descriptions:
user_id (INTEGER, NOT NULL, UNIQUE) → "Unique identifier for each user account"
created_at (TIMESTAMP) → "Timestamp when the record was first created"
mrr (DECIMAL) → "Monthly Recurring Revenue in the account's billing currency"
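One way the request could be assembled per table is sketched below. The helper names (build_description_prompt, parse_descriptions), the prompt wording, and the JSON reply format are all assumptions for illustration. The signature of chat_completion() is not specified here, so the sketch stops short of calling it.

```python
import json


def build_description_prompt(table_name, columns):
    """Build one prompt covering every column of a table.

    `columns` is a list of dicts with "name", "type", and optional
    "sample_values" keys (an assumed shape, not the platform's schema).
    """
    lines = [
        f"Table: {table_name}",
        "Describe each column below in one sentence.",
        "Reply with a JSON object mapping column name to description.",
        "",
    ]
    for col in columns:
        # Cap samples at three per column to keep the prompt short.
        samples = ", ".join(str(v) for v in col.get("sample_values", [])[:3])
        lines.append(f"- {col['name']} ({col['type']}); samples: {samples or 'n/a'}")
    return "\n".join(lines)


def parse_descriptions(reply, columns):
    """Keep descriptions for known columns only; ignore malformed replies."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return {}
    if not isinstance(data, dict):
        return {}
    known = {col["name"] for col in columns}
    return {
        name: text.strip()
        for name, text in data.items()
        if name in known and isinstance(text, str)
    }
```

The prompt text would be passed to chat_completion() and the reply to parse_descriptions(). Filtering against the known column names guards against the model inventing columns.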
Export Formats
- In-app browser: Searchable, filterable documentation page
- Markdown: For GitHub/GitLab wikis
- HTML: Standalone page for internal portals
- PDF: For stakeholder distribution
- dbt-compatible: schema.yml with descriptions, for use with dbt docs generate
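The Markdown export could be as simple as the following sketch. The input shape (a dict with "name" and "tables", each table holding "columns") and the function name render_markdown are assumptions. The HTML and PDF exports are not covered.

```python
def render_markdown(source):
    """Render a documentation dict as a Markdown data dictionary."""
    out = [f"# {source['name']}", ""]
    for table in source["tables"]:
        out.append(f"## {table['name']}")
        out.append("")
        if table.get("description"):
            out.append(table["description"])
            out.append("")
        out.append("| Column | Type | Description |")
        out.append("| --- | --- | --- |")
        for col in table["columns"]:
            # Escape pipes so a description cannot break the table row.
            desc = (col.get("description") or "").replace("|", "\\|")
            out.append(f"| {col['name']} | {col['type']} | {desc} |")
        out.append("")
    return "\n".join(out)
```

Each table becomes a second-level heading followed by a three-column table, which renders in both GitHub and GitLab wikis.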
Technical Notes
- Input: Source.metadata_.table_infos, sample_profile, and source_relationships (all already available)
- LLM descriptions: Batch process columns with chat_completion(), using table context for accuracy
- Caching: Store generated descriptions in Source.metadata_ to avoid regenerating on every view
- New endpoints:
  - GET /api/sources/{id}/documentation (browsable JSON for the frontend)
  - GET /api/sources/{id}/documentation/export?format=md|html|pdf (downloadable file)
- Frontend: New DataDictionary page component with search, filter by table/type, and expandable column details
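The caching note could be implemented along these lines. This is a sketch under assumptions: the "column_descriptions" key inside the metadata dict, the fingerprint scheme, and all function names are hypothetical. The idea is to regenerate a description only when a column is new or its name or type has changed.

```python
import hashlib
import json


def column_fingerprint(table_name, column):
    """Short hash of the inputs that should invalidate a cached description."""
    payload = json.dumps([table_name, column["name"], column["type"]])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def columns_needing_description(metadata, table_name, columns):
    """Return the columns with no cached description or a stale fingerprint."""
    cache = metadata.get("column_descriptions", {})
    stale = []
    for col in columns:
        entry = cache.get(f"{table_name}.{col['name']}")
        if entry is None or entry.get("fingerprint") != column_fingerprint(table_name, col):
            stale.append(col)
    return stale


def store_description(metadata, table_name, column, description):
    """Cache a generated description next to the fingerprint it was built from."""
    cache = metadata.setdefault("column_descriptions", {})
    cache[f"{table_name}.{column['name']}"] = {
        "description": description,
        "fingerprint": column_fingerprint(table_name, column),
    }
```

With this approach a page view only triggers LLM calls for the columns returned by columns_needing_description(). Everything else is served from the stored metadata.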
Acceptance Criteria