dbt Model & Schema.yml Generator from Connected Sources #109

@Empreiteiro

Description

Summary

Generate dbt project artifacts (SQL models, schema.yml, sources.yml) from connected data sources using AI-assisted code generation. The platform already parses dbt manifests (ask_dbt.py) — this closes the loop by generating dbt code back.

Problem

  • Writing dbt staging models is repetitive: SELECT + CAST + rename for every source table
  • schema.yml with column descriptions and tests requires manual effort per column
  • sources.yml must be kept in sync with actual database schemas
  • The platform already has all the metadata needed (table_infos, profiling, relationships) but doesn't generate dbt artifacts
  • Teams spend hours on boilerplate that could be auto-generated from profiling

Proposed Solution

Generated Artifacts

  1. sources.yml: Source definitions with database/schema/table references, column lists, and freshness config
  2. Staging models (stg_*.sql): SELECT from source with column renaming, type casting, and basic cleaning
  3. Intermediate models (int_*.sql): Join/union staging models based on detected relationships
  4. Mart models (fct_*.sql, dim_*.sql): Fact and dimension tables from Gold layer suggestions
  5. schema.yml: Column descriptions (LLM-generated), tests (not_null, unique, accepted_values, relationships)
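
The sources.yml generation in item 1 could be sketched as a pure data transformation from the platform's table metadata. The function name and the shape of the `tables` argument below are hypothetical, assuming table metadata is available as a mapping of table names to column lists; serialization to YAML (e.g. via PyYAML) is left out:

```python
def build_sources_entry(source_name, database, schema, tables):
    # tables (hypothetical shape): {"orders": ["id", "total"], ...}
    # Returns a dict matching the dbt sources.yml schema (version: 2),
    # ready to be dumped to YAML.
    return {
        "version": 2,
        "sources": [
            {
                "name": source_name,
                "database": database,
                "schema": schema,
                "tables": [
                    {"name": table, "columns": [{"name": col} for col in cols]}
                    for table, cols in tables.items()
                ],
            }
        ],
    }
```

Freshness config and column descriptions would be merged into the same dict before dumping.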

Naming Conventions

Follow dbt best practices:

  • stg_{source_name}__{table_name}.sql
  • int_{domain}__{description}.sql
  • fct_{business_process}.sql / dim_{entity}.sql
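
The conventions above can be enforced with a small helper. This is a minimal sketch; the function names are hypothetical, and the slugification rule (lowercase, non-alphanumerics to underscores) is an assumption:

```python
import re

def _slug(text: str) -> str:
    # Lowercase and collapse runs of non-alphanumerics into underscores.
    return re.sub(r"[^a-z0-9]+", "_", text.lower()).strip("_")

def staging_name(source: str, table: str) -> str:
    # stg_{source_name}__{table_name}.sql (double underscore per dbt convention)
    return f"stg_{_slug(source)}__{_slug(table)}.sql"

def mart_name(kind: str, name: str) -> str:
    # fct_ for facts, dim_ for dimensions
    prefix = {"fact": "fct", "dimension": "dim"}[kind]
    return f"{prefix}_{_slug(name)}.sql"
```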

How It Works

  1. User selects source(s) and target dbt layer (staging, intermediate, marts)
  2. LLM receives schema + profiling + detected relationships
  3. Platform generates .sql and .yml files as downloadable text or ZIP
  4. User copies into their dbt project
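
Step 3 (bundling generated files for download) could be a straightforward in-memory ZIP, sketched here with the standard library. The function name and the `files` mapping of relative path to contents are assumptions about the generator's output:

```python
import io
import zipfile

def bundle_artifacts(files: dict) -> bytes:
    # files (hypothetical shape): {"models/staging/stg_raw__orders.sql": "...", ...}
    # Returns the ZIP archive as bytes, suitable for a download response.
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for path, content in sorted(files.items()):
            zf.writestr(path, content)
    return buf.getvalue()
```

Sorting the paths keeps the archive layout deterministic across downloads.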

Technical Notes

  • Existing infrastructure: ask_dbt.py already parses manifest.json with _extract_table_infos_from_manifest(); reuse the same table-info structures so the read path (manifest parsing) and the write path (code generation) share one representation
  • Profiling → tests: sample_profile columns with 0 nulls → not_null test; low cardinality → accepted_values; unique counts matching row count → unique test
  • Relationship → ref(): suggest_source_relationships() from sql_utils.py maps to dbt ref() and relationships tests
  • New endpoint: POST /api/sources/{id}/generate-dbt with params: layer, project_name, naming_convention
  • Output format: Individual files or bundled ZIP download
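
The profiling→tests heuristics above can be sketched per column. The function name and the shape of the `profile` dict are hypothetical, assuming sample_profile exposes null, distinct, and sample-value stats; the cardinality threshold for accepted_values is an assumption:

```python
def suggest_tests(profile: dict, row_count: int, enum_threshold: int = 10) -> list:
    # profile (hypothetical shape): one column's stats from sample_profile,
    # e.g. {"null_count": 0, "distinct_count": 42, "sample_values": [...]}
    tests = []
    if profile["null_count"] == 0:
        tests.append("not_null")           # 0 nulls observed -> not_null
    if profile["distinct_count"] == row_count:
        tests.append("unique")             # distinct count matches row count -> unique
    elif profile["distinct_count"] <= enum_threshold:
        # Low cardinality -> accepted_values over the observed values
        tests.append({"accepted_values": {"values": profile["sample_values"]}})
    return tests
```

Since these are inferred from a sample, the generated schema.yml should present them as suggestions for the user to review rather than guaranteed constraints.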

Acceptance Criteria

  • Generate sources.yml from any connected SQL source with accurate database/schema/table paths
  • Generate staging models with appropriate type casts, column renames, and basic cleaning informed by profiling
  • Generate schema.yml with LLM-written column descriptions
  • Auto-suggest dbt tests from profiling (not_null, unique, accepted_values, relationships)
  • Follow dbt naming conventions (stg_, int_, fct_, dim_)
  • Output downloadable as individual files or ZIP bundle
  • Support Jinja ref() and source() macros in generated SQL

Labels: enhancement (New feature or request)