Skip to content

ETL Pipeline Documentation Reverse-Engineer (SQL → Natural Language Docs) #115

@Empreiteiro

Description

@Empreiteiro

Summary

Given existing SQL (stored procedures, views, dbt models, complex queries), the LLM reads the code and generates human-readable documentation: what each CTE does, which tables are read/written, implicit business rules, and data flow diagrams.

Problem

  • Teams inherit undocumented ETL pipelines with hundreds of lines of SQL
  • Understanding a complex stored procedure or dbt model requires deep reading
  • Business rules are buried in CASE WHEN logic that non-technical stakeholders can't parse
  • Onboarding new data engineers to existing pipelines is slow without docs
  • No tool combines SQL parsing with LLM comprehension for documentation generation

Proposed Solution

Documentation Generated

  1. Overview: One-paragraph summary of what the SQL does
  2. Step-by-step breakdown: Each CTE/subquery explained in plain language
  3. Input/Output tables: Which tables are read, which are written/created
  4. Business rules extracted: CASE WHEN logic, filter conditions, join logic documented as business rules
  5. Data flow diagram: Mermaid flowchart showing table → CTE → output flow
  6. Complexity assessment: Number of joins, CTEs, aggregations, nesting depth
  7. Refactoring suggestions: Identify opportunities to simplify or break apart complex queries

How It Works

  1. User pastes SQL or uploads .sql files
  2. LLM analyzes the code with optional schema context from connected sources
  3. Platform returns structured documentation with sections
  4. User can regenerate sections, edit descriptions, and export

Export Formats

  • Markdown: For GitHub/wiki documentation
  • Confluence-ready HTML: Formatted for Confluence pages
  • Inline comments: Annotated SQL with -- comments explaining each section
  • README section: Summary suitable for dbt model description or README

Technical Notes

  • No SQL parser needed: LLM handles SQL comprehension directly — no need for sqlglot or similar
  • Schema enrichment: If referenced tables match connected sources, add column descriptions and profiling context
  • Batch processing: Support uploading multiple .sql files and generating docs for each
  • New endpoint: `POST /api/tools/document-sql` with `sql`, `source_ids[]` (optional), `format`
  • Frontend: Split-view — SQL on the left, generated docs on the right

Acceptance Criteria

  • Generate overview + step-by-step breakdown for any SQL input
  • Extract and document business rules from CASE/WHERE/JOIN logic
  • Identify input and output tables automatically
  • Generate Mermaid data flow diagram
  • Export as Markdown, HTML (Confluence-ready), and inline SQL comments
  • Enriched documentation when tables match connected sources
  • Support SQL of any complexity (up to ~500 lines)
  • Batch mode: upload multiple .sql files

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions