Summary
Generate dbt project artifacts (SQL models, schema.yml, sources.yml) from connected data sources using AI-assisted code generation. The platform already parses dbt manifests (`ask_dbt.py`); this closes the loop by generating dbt code back.
Problem
- Writing dbt staging models is repetitive: SELECT + CAST + rename for every source table
- schema.yml with column descriptions and tests requires manual effort per column
- sources.yml must be kept in sync with actual database schemas
- The platform already has all the metadata needed (table_infos, profiling, relationships) but doesn't generate dbt artifacts
- Teams spend hours on boilerplate that could be auto-generated from profiling
Proposed Solution
Generated Artifacts
- `sources.yml`: Source definitions with database/schema/table references, column lists, and freshness config
- Staging models (`stg_*.sql`): SELECT from source with column renaming, type casting, and basic cleaning
- Intermediate models (`int_*.sql`): Join/union staging models based on detected relationships
- Mart models (`fct_*.sql`, `dim_*.sql`): Fact and dimension tables from Gold layer suggestions
- `schema.yml`: Column descriptions (LLM-generated) and tests (not_null, unique, accepted_values, relationships)
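As a sketch of what the `sources.yml` generator might emit, assuming a simplified `tables` mapping as a stand-in for the platform's actual table_infos structure (the real shape may differ):

```python
def render_sources_yml(source_name: str, database: str, schema: str,
                       tables: dict[str, list[str]]) -> str:
    """Render a minimal dbt sources.yml from table metadata.

    `tables` maps table name -> list of column names; this is a
    simplified stand-in for the platform's table_infos structure.
    Freshness config and descriptions are omitted for brevity.
    """
    lines = [
        "version: 2",
        "",
        "sources:",
        f"  - name: {source_name}",
        f"    database: {database}",
        f"    schema: {schema}",
        "    tables:",
    ]
    for table, columns in tables.items():
        lines.append(f"      - name: {table}")
        lines.append("        columns:")
        lines.extend(f"          - name: {col}" for col in columns)
    return "\n".join(lines) + "\n"
```

String templating keeps the example dependency-free; a real implementation would likely build a dict and serialize it with a YAML library instead.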
Naming Conventions
Follow dbt best practices:
- Staging: `stg_{source_name}__{table_name}.sql`
- Intermediate: `int_{domain}__{description}.sql`
- Marts: `fct_{business_process}.sql` / `dim_{entity}.sql`
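The conventions above can be sketched as a small naming helper; `model_filename` and `_slug` are hypothetical names, not existing platform functions:

```python
import re


def _slug(name: str) -> str:
    """Lowercase and replace anything outside [a-z0-9_] with underscores."""
    return re.sub(r"[^a-z0-9_]", "_", name.lower())


def model_filename(layer: str, *parts: str) -> str:
    """Build a dbt model filename following the naming conventions.

    Staging and intermediate layers join their parts with a double
    underscore; mart layers take a single business-process or entity name.
    """
    prefix = {"staging": "stg", "intermediate": "int",
              "fact": "fct", "dimension": "dim"}[layer]
    if layer in ("staging", "intermediate"):
        head = _slug(parts[0])
        tail = "_".join(_slug(p) for p in parts[1:])
        return f"{prefix}_{head}__{tail}.sql"
    return f"{prefix}_{_slug(parts[0])}.sql"
```

Slugging the inputs guards against source names like "Stripe" or table names containing hyphens producing invalid dbt model names.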
How It Works
- User selects source(s) and target dbt layer (staging, intermediate, marts)
- LLM receives schema + profiling + detected relationships
- Platform generates .sql and .yml files as downloadable text or ZIP
- User copies into their dbt project
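The final delivery step (bundling generated files into a downloadable ZIP) could look like this minimal sketch; `bundle_artifacts` is a hypothetical helper, not an existing platform function:

```python
import io
import zipfile


def bundle_artifacts(files: dict[str, str]) -> bytes:
    """Bundle generated .sql/.yml files into an in-memory ZIP.

    `files` maps a relative path inside the dbt project
    (e.g. "models/staging/stg_x.sql") to its text contents.
    The returned bytes can be streamed directly as a download.
    """
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for path, content in files.items():
            zf.writestr(path, content)
    return buf.getvalue()
```

Building the archive in memory avoids temp-file cleanup and fits a request/response flow where the ZIP is returned from the API endpoint.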
Technical Notes
- Existing infrastructure: `ask_dbt.py` already parses manifest.json with `_extract_table_infos_from_manifest()`; reuse it for bidirectional context
- Profiling → tests: `sample_profile` columns with 0 nulls → `not_null` test; low cardinality → `accepted_values`; unique counts matching row count → `unique` test
- Relationship → ref(): `suggest_source_relationships()` from `sql_utils.py` maps to dbt `ref()` calls and `relationships` tests
- New endpoint: `POST /api/sources/{id}/generate-dbt` with params: `layer`, `project_name`, `naming_convention`
- Output format: Individual files or bundled ZIP download
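The profiling → tests mapping above can be sketched as follows, assuming a simplified per-column profile dict (the platform's actual `sample_profile` output shape may differ; the function name and threshold parameter are hypothetical):

```python
def infer_column_tests(profile: dict, row_count: int,
                       accepted_values_max: int = 10) -> dict[str, list]:
    """Map simple profiling stats to dbt schema tests per column.

    `profile` maps column name -> {"null_count": int,
    "distinct_count": int, "values": list}; a simplified stand-in
    for the platform's sample_profile output.
    """
    tests: dict[str, list] = {}
    for col, stats in profile.items():
        col_tests: list = []
        # No observed nulls -> candidate for a not_null test.
        if stats["null_count"] == 0:
            col_tests.append("not_null")
        # Distinct count equals row count -> candidate for unique.
        if stats["distinct_count"] == row_count:
            col_tests.append("unique")
        # Low cardinality -> pin the observed values with accepted_values.
        elif stats["distinct_count"] <= accepted_values_max:
            col_tests.append(
                {"accepted_values": {"values": sorted(stats["values"])}})
        tests[col] = col_tests
    return tests
```

Because the inference runs on a sample, the generated tests are best treated as suggestions for the user to review, not guarantees about the full dataset.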
Acceptance Criteria