Skip to content

AI-Assisted Medallion Architecture: Auto-Generate Bronze/Silver/Gold SQL Layers #107

@Empreiteiro

Description

@Empreiteiro

Summary

Enable users to automatically generate a medallion architecture (Bronze → Silver → Gold) for their connected data sources using AI-assisted SQL generation. The platform already has schema introspection, LLM integration, and a template system — this feature combines them to let users build production-grade data layers without writing SQL by hand.

Context: What is Medallion Architecture?

The medallion pattern organizes data into three progressive quality layers:

  • Bronze (Raw Staging): 1:1 copy of source data with metadata columns (_loaded_at, _source_id). No transformations — preserves the raw truth.
  • Silver (Cleaned & Conformed): Data type casting, null handling, deduplication, column renaming to consistent conventions, basic validation. This is the "trusted" analytical layer.
  • Gold (Business Aggregates): Dimensional models, KPI metrics, pre-computed aggregates optimized for specific business questions.

This is the standard in modern data engineering (used in dbt, Databricks, Snowflake) and aligns well with Data Talks' mission of making data analysis accessible.

Problem

  • Users upload CSV/XLSX files that are loaded raw into in-memory SQLite at query time — no persistent staging layer
  • No mechanism to clean, type-cast, or deduplicate data before analysis
  • No way to pre-compute business aggregates for faster/consistent answers
  • The LLM re-discovers schema and data quality issues on every query instead of working with pre-cleaned data
  • Advanced users familiar with medallion patterns have no way to implement them in the platform

Proposed Solution

Phase 1: Bronze Layer Generation

Automatic on source creation/sync:

  1. When a source is created or synced, persist raw data into a Bronze table in the platform's database
  2. Add metadata columns: _loaded_at TIMESTAMP, _source_file TEXT, _row_hash TEXT
  3. Store all columns as TEXT (no type coercion — preserve raw values)

SQL generated:

CREATE TABLE bronze_{source_id}_raw (
  _loaded_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  _source_file TEXT,
  _row_hash TEXT,
  col1 TEXT,
  col2 TEXT,
  ...
);

Integration point: Extend crud.py upload flow and future sync mechanism.

Phase 2: AI-Assisted Silver Schema Generation

LLM analyzes Bronze data and suggests Silver schema:

  1. Feed sample_profile (column types, nulls, top values, numeric stats) to the LLM
  2. LLM returns structured schema suggestion:
    {
      "columns": [
        {
          "source_column": "created_at",
          "silver_name": "created_at",
          "target_type": "TIMESTAMP",
          "cast_expression": "CAST(created_at AS TIMESTAMP)",
          "null_strategy": "DROP_ROW",
          "validation": "NOT NULL"
        },
        {
          "source_column": "email ",
          "silver_name": "email",
          "target_type": "TEXT",
          "cast_expression": "LOWER(TRIM(email))",
          "null_strategy": "FILL_UNKNOWN",
          "validation": "LIKE '%@%.%'"
        },
        {
          "source_column": "revenue",
          "silver_name": "revenue_usd",
          "target_type": "DECIMAL(12,2)",
          "cast_expression": "CAST(REPLACE(revenue, ',', '') AS DECIMAL(12,2))",
          "null_strategy": "FILL_ZERO",
          "validation": ">= 0"
        }
      ],
      "deduplication": {
        "key_columns": ["id"],
        "strategy": "KEEP_LATEST",
        "order_by": "created_at DESC"
      }
    }
  3. User reviews and adjusts suggestions in the UI before applying
  4. Platform generates and executes the Silver CREATE TABLE + INSERT

SQL generated:

CREATE TABLE silver_{source_id}_cleaned AS
SELECT
  CAST(created_at AS TIMESTAMP) AS created_at,
  LOWER(TRIM(email)) AS email,
  CAST(REPLACE(revenue, ',', '') AS DECIMAL(12,2)) AS revenue_usd,
  -- ... more columns
FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY created_at DESC) AS _rn
  FROM bronze_{source_id}_raw
  WHERE created_at IS NOT NULL  -- null strategy
) sub
WHERE _rn = 1;  -- deduplication

Phase 3: AI-Assisted Gold Layer Generation

LLM suggests business aggregates based on Silver schema + user context:

  1. Analyze Silver columns to identify dimensions (categorical) and measures (numeric)
  2. Suggest common aggregation patterns:
    • Time-series aggregates (daily/weekly/monthly)
    • Dimensional rollups (by category, region, status)
    • KPI metrics (totals, averages, counts, ratios)
  3. User selects which Gold tables to create

Example Gold suggestions:

{
  "gold_tables": [
    {
      "name": "gold_revenue_by_month",
      "description": "Monthly revenue aggregates",
      "sql": "SELECT DATE_TRUNC('month', created_at) AS month, SUM(revenue_usd) AS total_revenue, COUNT(*) AS transaction_count FROM silver_{source_id}_cleaned GROUP BY 1",
      "dimensions": ["month"],
      "measures": ["total_revenue", "transaction_count"]
    },
    {
      "name": "gold_customer_segments",
      "description": "Customer segmentation by purchase behavior",
      "sql": "SELECT customer_id, COUNT(*) AS order_count, SUM(revenue_usd) AS lifetime_value, CASE WHEN SUM(revenue_usd) > 10000 THEN 'VIP' WHEN SUM(revenue_usd) > 1000 THEN 'Regular' ELSE 'Occasional' END AS segment FROM silver_{source_id}_cleaned GROUP BY 1",
      "dimensions": ["segment"],
      "measures": ["order_count", "lifetime_value"]
    }
  ]
}

Phase 4: Query Routing

Update ask_* scripts to prefer Silver/Gold layers when available:

  1. If Gold table matches the question topic → query Gold (fastest, pre-aggregated)
  2. If Silver table exists → query Silver (clean data, better LLM accuracy)
  3. Fallback to Bronze/raw → current behavior (in-memory SQLite)

This improves LLM accuracy since it works with typed, clean data instead of raw strings.

Technical Architecture

New Database Tables

class MedallionLayer(Base):
    id: UUID
    source_id: UUID (FKSource)
    layer: str  # "bronze" | "silver" | "gold"
    table_name: str  # "bronze_abc123_raw"
    schema_config: dict  # JSON - column mappings, transforms, dedup rules
    status: str  # "pending" | "building" | "ready" | "error"
    row_count: int
    last_built_at: datetime
    build_duration_ms: int
    error_message: str (nullable)
    created_at: datetime

class MedallionBuildLog(Base):
    id: UUID
    layer_id: UUID (FKMedallionLayer)
    action: str  # "create" | "refresh" | "rebuild"
    status: str
    rows_processed: int
    rows_rejected: int
    rejection_reasons: dict  # JSON - {"null_email": 15, "invalid_date": 3}
    duration_ms: int
    created_at: datetime

New Backend Service

backend/app/services/medallion_service.py:

  • generate_bronze_ddl(source) → str — CREATE TABLE SQL for bronze layer
  • suggest_silver_schema(source) → dict — LLM-powered schema suggestion
  • generate_silver_ddl(source, schema_config) → str — CREATE + INSERT SQL
  • suggest_gold_aggregates(source, silver_schema) → list — LLM-powered aggregation suggestions
  • generate_gold_ddl(source, gold_config) → str — CREATE TABLE AS SELECT
  • build_layer(source, layer, config) — Execute DDL/DML, track in MedallionBuildLog
  • refresh_layer(source, layer) — Rebuild layer from upstream (e.g., bronze → silver)

New API Endpoints

backend/app/routers/medallion_router.py:

POST   /api/sources/{id}/medallion/bronze          — Create bronze layer
POST   /api/sources/{id}/medallion/silver/suggest   — Get AI silver schema suggestion
POST   /api/sources/{id}/medallion/silver           — Create silver layer with config
POST   /api/sources/{id}/medallion/gold/suggest     — Get AI gold aggregate suggestions
POST   /api/sources/{id}/medallion/gold             — Create gold layer(s) with config
GET    /api/sources/{id}/medallion                  — Get medallion status for source
POST   /api/sources/{id}/medallion/{layer}/refresh  — Rebuild a specific layer
DELETE /api/sources/{id}/medallion/{layer}           — Drop a layer
GET    /api/sources/{id}/medallion/{layer}/logs     — Build history and rejection stats

Frontend Components

  • MedallionPanel: Source detail tab showing layer status (Bronze ✅ → Silver ✅ → Gold ⏳)
  • SilverSchemaEditor: Review/edit AI-suggested column mappings, types, null strategies
  • GoldAggregateSelector: Choose which suggested aggregates to materialize
  • MedallionBuildLog: Show build history with row counts and rejection details
  • LayerPreview: Before/after data preview for each layer transition

LLM Prompt Templates

Silver schema suggestion prompt:

You are a data engineer. Analyze this data profile and suggest a Silver layer schema.

Source profile:
{sample_profile}

For each column, suggest:
- Appropriate SQL data type
- CAST/transformation expression
- Null handling strategy (DROP_ROW, FILL_DEFAULT, FILL_ZERO, KEEP_NULL)
- Naming convention (snake_case)
- Validation rule if applicable

Also suggest deduplication strategy if you detect a primary key candidate.

Return JSON format: { "columns": [...], "deduplication": {...} }

Gold aggregate suggestion prompt:

You are a data analyst. Given this Silver layer schema, suggest useful business aggregate tables.

Silver schema:
{silver_columns_with_types}

Sample data (5 rows):
{preview_rows}

Suggest 3-5 Gold tables with:
- Meaningful name and description
- SQL query (using the silver table)
- Identified dimensions and measures
- Business question each table answers

Return JSON format: { "gold_tables": [...] }

Integration with Existing Systems

Existing Component Medallion Integration
Source.metadata_ Add medallion_config with layer references and settings
sample_profile Input for LLM silver schema suggestion
template_executor.py Reuse query execution and caching patterns
alert_scheduler.py Schedule periodic medallion layer refreshes
ask_csv.py / ask_sql.py Route queries to Silver/Gold when available
LLM chat_completion() Power schema and aggregate suggestions
Template system Add medallion-specific templates for common source types

Acceptance Criteria

  • Bronze layer auto-created on source upload with metadata columns (_loaded_at, _row_hash)
  • AI suggests Silver schema with type casting, null handling, and dedup from sample_profile
  • User can review and edit Silver schema suggestions before applying
  • Silver layer created with proper CAST expressions, deduplication, and validation
  • AI suggests 3-5 Gold aggregate tables based on Silver schema
  • User selects which Gold tables to materialize
  • All layers refresh automatically when source data is synced
  • Build logs show rows processed, rows rejected, and rejection reasons
  • Q&A queries automatically route to Silver/Gold layers when available
  • Layer status visible in source detail page (Bronze ✅ Silver ✅ Gold ⏳)
  • Works with CSV, XLSX, SQL database, and BigQuery source types

Future Extensions

  • Cross-source Gold layers: Combine Silver tables from multiple sources into unified Gold models
  • dbt compatibility: Export medallion definitions as dbt models for users who want to manage layers externally
  • Version control: Track schema evolution across Silver/Gold rebuilds
  • Cost estimation: Estimate storage and compute cost before building layers
  • Natural language layer creation: "Create a monthly revenue summary grouped by product category"

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions