AI-Assisted Medallion Architecture: Auto-Generate Bronze/Silver/Gold SQL Layers

## Summary

Enable users to automatically generate a **medallion architecture** (Bronze → Silver → Gold) for their connected data sources using AI-assisted SQL generation. The platform already has schema introspection, LLM integration, and a template system — this feature combines them to let users build production-grade data layers without writing SQL by hand.

## Context: What is Medallion Architecture?

The medallion pattern organizes data into three progressive quality layers:

- **Bronze (Raw Staging)**: 1:1 copy of source data with metadata columns (`_loaded_at`, `_source_id`). No transformations — preserves the raw truth.
- **Silver (Cleaned & Conformed)**: Data type casting, null handling, deduplication, column renaming to consistent conventions, basic validation. This is the "trusted" analytical layer.
- **Gold (Business Aggregates)**: Dimensional models, KPI metrics, pre-computed aggregates optimized for specific business questions.

This is the standard in modern data engineering (used in dbt, Databricks, Snowflake) and aligns well with Data Talks' mission of making data analysis accessible.

## Problem

- Users upload CSV/XLSX files that are loaded raw into in-memory SQLite at query time — no persistent staging layer
- No mechanism to clean, type-cast, or deduplicate data before analysis
- No way to pre-compute business aggregates for faster/consistent answers
- The LLM re-discovers schema and data quality issues on every query instead of working with pre-cleaned data
- Advanced users familiar with medallion patterns have no way to implement them in the platform

## Proposed Solution

### Phase 1: Bronze Layer Generation

**Automatic on source creation/sync:**
1. When a source is created or synced, persist raw data into a Bronze table in the platform's database
2. Add metadata columns: `_loaded_at TIMESTAMP`, `_source_file TEXT`, `_row_hash TEXT`
3. Store all columns as TEXT (no type coercion — preserve raw values)

**SQL generated:**
```sql
CREATE TABLE bronze_{source_id}_raw (
  _loaded_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  _source_file TEXT,
  _row_hash TEXT,
  col1 TEXT,
  col2 TEXT,
  ...
);
```

**Integration point:** Extend `crud.py` upload flow and future sync mechanism.

### Phase 2: AI-Assisted Silver Schema Generation

**LLM analyzes Bronze data and suggests Silver schema:**
1. Feed `sample_profile` (column types, nulls, top values, numeric stats) to the LLM
2. LLM returns structured schema suggestion:
   ```json
   {
     "columns": [
       {
         "source_column": "created_at",
         "silver_name": "created_at",
         "target_type": "TIMESTAMP",
         "cast_expression": "CAST(created_at AS TIMESTAMP)",
         "null_strategy": "DROP_ROW",
         "validation": "NOT NULL"
       },
       {
         "source_column": "email ",
         "silver_name": "email",
         "target_type": "TEXT",
         "cast_expression": "LOWER(TRIM(email))",
         "null_strategy": "FILL_UNKNOWN",
         "validation": "LIKE '%@%.%'"
       },
       {
         "source_column": "revenue",
         "silver_name": "revenue_usd",
         "target_type": "DECIMAL(12,2)",
         "cast_expression": "CAST(REPLACE(revenue, ',', '') AS DECIMAL(12,2))",
         "null_strategy": "FILL_ZERO",
         "validation": ">= 0"
       }
     ],
     "deduplication": {
       "key_columns": ["id"],
       "strategy": "KEEP_LATEST",
       "order_by": "created_at DESC"
     }
   }
   ```
3. User reviews and adjusts suggestions in the UI before applying
4. Platform generates and executes the Silver CREATE TABLE + INSERT

**SQL generated:**
```sql
CREATE TABLE silver_{source_id}_cleaned AS
SELECT
  CAST(created_at AS TIMESTAMP) AS created_at,
  LOWER(TRIM(email)) AS email,
  CAST(REPLACE(revenue, ',', '') AS DECIMAL(12,2)) AS revenue_usd,
  -- ... more columns
FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY created_at DESC) AS _rn
  FROM bronze_{source_id}_raw
  WHERE created_at IS NOT NULL  -- null strategy
) sub
WHERE _rn = 1;  -- deduplication
```

### Phase 3: AI-Assisted Gold Layer Generation

**LLM suggests business aggregates based on Silver schema + user context:**
1. Analyze Silver columns to identify dimensions (categorical) and measures (numeric)
2. Suggest common aggregation patterns:
   - Time-series aggregates (daily/weekly/monthly)
   - Dimensional rollups (by category, region, status)
   - KPI metrics (totals, averages, counts, ratios)
3. User selects which Gold tables to create

**Example Gold suggestions:**
```json
{
  "gold_tables": [
    {
      "name": "gold_revenue_by_month",
      "description": "Monthly revenue aggregates",
      "sql": "SELECT DATE_TRUNC('month', created_at) AS month, SUM(revenue_usd) AS total_revenue, COUNT(*) AS transaction_count FROM silver_{source_id}_cleaned GROUP BY 1",
      "dimensions": ["month"],
      "measures": ["total_revenue", "transaction_count"]
    },
    {
      "name": "gold_customer_segments",
      "description": "Customer segmentation by purchase behavior",
      "sql": "SELECT customer_id, COUNT(*) AS order_count, SUM(revenue_usd) AS lifetime_value, CASE WHEN SUM(revenue_usd) > 10000 THEN 'VIP' WHEN SUM(revenue_usd) > 1000 THEN 'Regular' ELSE 'Occasional' END AS segment FROM silver_{source_id}_cleaned GROUP BY 1",
      "dimensions": ["segment"],
      "measures": ["order_count", "lifetime_value"]
    }
  ]
}
```

### Phase 4: Query Routing

Update `ask_*` scripts to prefer Silver/Gold layers when available:
1. If Gold table matches the question topic → query Gold (fastest, pre-aggregated)
2. If Silver table exists → query Silver (clean data, better LLM accuracy)
3. Fallback to Bronze/raw → current behavior (in-memory SQLite)

This improves LLM accuracy since it works with typed, clean data instead of raw strings.

## Technical Architecture

### New Database Tables

```python
class MedallionLayer(Base):
    id: UUID
    source_id: UUID (FK → Source)
    layer: str  # "bronze" | "silver" | "gold"
    table_name: str  # "bronze_abc123_raw"
    schema_config: dict  # JSON - column mappings, transforms, dedup rules
    status: str  # "pending" | "building" | "ready" | "error"
    row_count: int
    last_built_at: datetime
    build_duration_ms: int
    error_message: str (nullable)
    created_at: datetime

class MedallionBuildLog(Base):
    id: UUID
    layer_id: UUID (FK → MedallionLayer)
    action: str  # "create" | "refresh" | "rebuild"
    status: str
    rows_processed: int
    rows_rejected: int
    rejection_reasons: dict  # JSON - {"null_email": 15, "invalid_date": 3}
    duration_ms: int
    created_at: datetime
```

### New Backend Service

`backend/app/services/medallion_service.py`:
- `generate_bronze_ddl(source) → str` — CREATE TABLE SQL for bronze layer
- `suggest_silver_schema(source) → dict` — LLM-powered schema suggestion
- `generate_silver_ddl(source, schema_config) → str` — CREATE + INSERT SQL
- `suggest_gold_aggregates(source, silver_schema) → list` — LLM-powered aggregation suggestions
- `generate_gold_ddl(source, gold_config) → str` — CREATE TABLE AS SELECT
- `build_layer(source, layer, config)` — Execute DDL/DML, track in MedallionBuildLog
- `refresh_layer(source, layer)` — Rebuild layer from upstream (e.g., bronze → silver)

### New API Endpoints

`backend/app/routers/medallion_router.py`:
```
POST   /api/sources/{id}/medallion/bronze          — Create bronze layer
POST   /api/sources/{id}/medallion/silver/suggest   — Get AI silver schema suggestion
POST   /api/sources/{id}/medallion/silver           — Create silver layer with config
POST   /api/sources/{id}/medallion/gold/suggest     — Get AI gold aggregate suggestions
POST   /api/sources/{id}/medallion/gold             — Create gold layer(s) with config
GET    /api/sources/{id}/medallion                  — Get medallion status for source
POST   /api/sources/{id}/medallion/{layer}/refresh  — Rebuild a specific layer
DELETE /api/sources/{id}/medallion/{layer}           — Drop a layer
GET    /api/sources/{id}/medallion/{layer}/logs     — Build history and rejection stats
```

### Frontend Components

- **MedallionPanel**: Source detail tab showing layer status (Bronze ✅ → Silver ✅ → Gold ⏳)
- **SilverSchemaEditor**: Review/edit AI-suggested column mappings, types, null strategies
- **GoldAggregateSelector**: Choose which suggested aggregates to materialize
- **MedallionBuildLog**: Show build history with row counts and rejection details
- **LayerPreview**: Before/after data preview for each layer transition

### LLM Prompt Templates

**Silver schema suggestion prompt:**
```
You are a data engineer. Analyze this data profile and suggest a Silver layer schema.

Source profile:
{sample_profile}

For each column, suggest:
- Appropriate SQL data type
- CAST/transformation expression
- Null handling strategy (DROP_ROW, FILL_DEFAULT, FILL_ZERO, KEEP_NULL)
- Naming convention (snake_case)
- Validation rule if applicable

Also suggest deduplication strategy if you detect a primary key candidate.

Return JSON format: { "columns": [...], "deduplication": {...} }
```

**Gold aggregate suggestion prompt:**
```
You are a data analyst. Given this Silver layer schema, suggest useful business aggregate tables.

Silver schema:
{silver_columns_with_types}

Sample data (5 rows):
{preview_rows}

Suggest 3-5 Gold tables with:
- Meaningful name and description
- SQL query (using the silver table)
- Identified dimensions and measures
- Business question each table answers

Return JSON format: { "gold_tables": [...] }
```

## Integration with Existing Systems

| Existing Component | Medallion Integration |
|---|---|
| `Source.metadata_` | Add `medallion_config` with layer references and settings |
| `sample_profile` | Input for LLM silver schema suggestion |
| `template_executor.py` | Reuse query execution and caching patterns |
| `alert_scheduler.py` | Schedule periodic medallion layer refreshes |
| `ask_csv.py` / `ask_sql.py` | Route queries to Silver/Gold when available |
| LLM `chat_completion()` | Power schema and aggregate suggestions |
| Template system | Add medallion-specific templates for common source types |

## Acceptance Criteria

- [ ] Bronze layer auto-created on source upload with metadata columns (`_loaded_at`, `_row_hash`)
- [ ] AI suggests Silver schema with type casting, null handling, and dedup from `sample_profile`
- [ ] User can review and edit Silver schema suggestions before applying
- [ ] Silver layer created with proper CAST expressions, deduplication, and validation
- [ ] AI suggests 3-5 Gold aggregate tables based on Silver schema
- [ ] User selects which Gold tables to materialize
- [ ] All layers refresh automatically when source data is synced
- [ ] Build logs show rows processed, rows rejected, and rejection reasons
- [ ] Q&A queries automatically route to Silver/Gold layers when available
- [ ] Layer status visible in source detail page (Bronze ✅ Silver ✅ Gold ⏳)
- [ ] Works with CSV, XLSX, SQL database, and BigQuery source types

## Future Extensions

- **Cross-source Gold layers**: Combine Silver tables from multiple sources into unified Gold models
- **dbt compatibility**: Export medallion definitions as dbt models for users who want to manage layers externally
- **Version control**: Track schema evolution across Silver/Gold rebuilds
- **Cost estimation**: Estimate storage and compute cost before building layers
- **Natural language layer creation**: "Create a monthly revenue summary grouped by product category"

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

AI-Assisted Medallion Architecture: Auto-Generate Bronze/Silver/Gold SQL Layers #107

Summary

Context: What is Medallion Architecture?

Problem

Proposed Solution

Phase 1: Bronze Layer Generation

Phase 2: AI-Assisted Silver Schema Generation

Phase 3: AI-Assisted Gold Layer Generation

Phase 4: Query Routing

Technical Architecture

New Database Tables

New Backend Service

New API Endpoints

Frontend Components

LLM Prompt Templates

Integration with Existing Systems

Acceptance Criteria

Future Extensions

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Existing Component	Medallion Integration
`Source.metadata_`	Add `medallion_config` with layer references and settings
`sample_profile`	Input for LLM silver schema suggestion
`template_executor.py`	Reuse query execution and caching patterns
`alert_scheduler.py`	Schedule periodic medallion layer refreshes
`ask_csv.py` / `ask_sql.py`	Route queries to Silver/Gold when available
LLM `chat_completion()`	Power schema and aggregate suggestions
Template system	Add medallion-specific templates for common source types

AI-Assisted Medallion Architecture: Auto-Generate Bronze/Silver/Gold SQL Layers #107

Description

Summary

Context: What is Medallion Architecture?

Problem

Proposed Solution

Phase 1: Bronze Layer Generation

Phase 2: AI-Assisted Silver Schema Generation

Phase 3: AI-Assisted Gold Layer Generation

Phase 4: Query Routing

Technical Architecture

New Database Tables

New Backend Service

New API Endpoints

Frontend Components

LLM Prompt Templates

Integration with Existing Systems

Acceptance Criteria

Future Extensions

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions