You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Enable users to automatically generate a medallion architecture (Bronze → Silver → Gold) for their connected data sources using AI-assisted SQL generation. The platform already has schema introspection, LLM integration, and a template system — this feature combines them to let users build production-grade data layers without writing SQL by hand.
Context: What is Medallion Architecture?
The medallion pattern organizes data into three progressive quality layers:
Bronze (Raw Staging): 1:1 copy of source data with metadata columns (_loaded_at, _source_id). No transformations — preserves the raw truth.
Silver (Cleaned & Conformed): Data type casting, null handling, deduplication, column renaming to consistent conventions, basic validation. This is the "trusted" analytical layer.
Gold (Business Aggregates): Dimensional models, KPI metrics, pre-computed aggregates optimized for specific business questions.
This is the standard in modern data engineering (used in dbt, Databricks, Snowflake) and aligns well with Data Talks' mission of making data analysis accessible.
Problem
Users upload CSV/XLSX files that are loaded raw into in-memory SQLite at query time — no persistent staging layer
No mechanism to clean, type-cast, or deduplicate data before analysis
No way to pre-compute business aggregates for faster/consistent answers
The LLM re-discovers schema and data quality issues on every query instead of working with pre-cleaned data
Advanced users familiar with medallion patterns have no way to implement them in the platform
Proposed Solution
Phase 1: Bronze Layer Generation
Automatic on source creation/sync:
When a source is created or synced, persist raw data into a Bronze table in the platform's database
Add metadata columns: _loaded_at TIMESTAMP, _source_file TEXT, _row_hash TEXT
Store all columns as TEXT (no type coercion — preserve raw values)
User reviews and adjusts suggestions in the UI before applying
Platform generates and executes the Silver CREATE TABLE + INSERT
SQL generated:
CREATETABLEsilver_{source_id}_cleaned ASSELECT
CAST(created_at ASTIMESTAMP) AS created_at,
LOWER(TRIM(email)) AS email,
CAST(REPLACE(revenue, ',', '') ASDECIMAL(12,2)) AS revenue_usd,
-- ... more columnsFROM (
SELECT*, ROW_NUMBER() OVER (PARTITION BY id ORDER BY created_at DESC) AS _rn
FROM bronze_{source_id}_raw
WHERE created_at IS NOT NULL-- null strategy
) sub
WHERE _rn =1; -- deduplication
Phase 3: AI-Assisted Gold Layer Generation
LLM suggests business aggregates based on Silver schema + user context:
Analyze Silver columns to identify dimensions (categorical) and measures (numeric)
Suggest common aggregation patterns:
Time-series aggregates (daily/weekly/monthly)
Dimensional rollups (by category, region, status)
KPI metrics (totals, averages, counts, ratios)
User selects which Gold tables to create
Example Gold suggestions:
{
"gold_tables": [
{
"name": "gold_revenue_by_month",
"description": "Monthly revenue aggregates",
"sql": "SELECT DATE_TRUNC('month', created_at) AS month, SUM(revenue_usd) AS total_revenue, COUNT(*) AS transaction_count FROM silver_{source_id}_cleaned GROUP BY 1",
"dimensions": ["month"],
"measures": ["total_revenue", "transaction_count"]
},
{
"name": "gold_customer_segments",
"description": "Customer segmentation by purchase behavior",
"sql": "SELECT customer_id, COUNT(*) AS order_count, SUM(revenue_usd) AS lifetime_value, CASE WHEN SUM(revenue_usd) > 10000 THEN 'VIP' WHEN SUM(revenue_usd) > 1000 THEN 'Regular' ELSE 'Occasional' END AS segment FROM silver_{source_id}_cleaned GROUP BY 1",
"dimensions": ["segment"],
"measures": ["order_count", "lifetime_value"]
}
]
}
Phase 4: Query Routing
Update ask_* scripts to prefer Silver/Gold layers when available:
If Gold table matches the question topic → query Gold (fastest, pre-aggregated)
POST /api/sources/{id}/medallion/bronze — Create bronze layer
POST /api/sources/{id}/medallion/silver/suggest — Get AI silver schema suggestion
POST /api/sources/{id}/medallion/silver — Create silver layer with config
POST /api/sources/{id}/medallion/gold/suggest — Get AI gold aggregate suggestions
POST /api/sources/{id}/medallion/gold — Create gold layer(s) with config
GET /api/sources/{id}/medallion — Get medallion status for source
POST /api/sources/{id}/medallion/{layer}/refresh — Rebuild a specific layer
DELETE /api/sources/{id}/medallion/{layer} — Drop a layer
GET /api/sources/{id}/medallion/{layer}/logs — Build history and rejection stats
GoldAggregateSelector: Choose which suggested aggregates to materialize
MedallionBuildLog: Show build history with row counts and rejection details
LayerPreview: Before/after data preview for each layer transition
LLM Prompt Templates
Silver schema suggestion prompt:
You are a data engineer. Analyze this data profile and suggest a Silver layer schema.
Source profile:
{sample_profile}
For each column, suggest:
- Appropriate SQL data type
- CAST/transformation expression
- Null handling strategy (DROP_ROW, FILL_DEFAULT, FILL_ZERO, KEEP_NULL)
- Naming convention (snake_case)
- Validation rule if applicable
Also suggest deduplication strategy if you detect a primary key candidate.
Return JSON format: { "columns": [...], "deduplication": {...} }
Gold aggregate suggestion prompt:
You are a data analyst. Given this Silver layer schema, suggest useful business aggregate tables.
Silver schema:
{silver_columns_with_types}
Sample data (5 rows):
{preview_rows}
Suggest 3-5 Gold tables with:
- Meaningful name and description
- SQL query (using the silver table)
- Identified dimensions and measures
- Business question each table answers
Return JSON format: { "gold_tables": [...] }
Integration with Existing Systems
Existing Component
Medallion Integration
Source.metadata_
Add medallion_config with layer references and settings
sample_profile
Input for LLM silver schema suggestion
template_executor.py
Reuse query execution and caching patterns
alert_scheduler.py
Schedule periodic medallion layer refreshes
ask_csv.py / ask_sql.py
Route queries to Silver/Gold when available
LLM chat_completion()
Power schema and aggregate suggestions
Template system
Add medallion-specific templates for common source types
Acceptance Criteria
Bronze layer auto-created on source upload with metadata columns (_loaded_at, _row_hash)
AI suggests Silver schema with type casting, null handling, and dedup from sample_profile
User can review and edit Silver schema suggestions before applying
Silver layer created with proper CAST expressions, deduplication, and validation
AI suggests 3-5 Gold aggregate tables based on Silver schema
User selects which Gold tables to materialize
All layers refresh automatically when source data is synced
Build logs show rows processed, rows rejected, and rejection reasons
Q&A queries automatically route to Silver/Gold layers when available
Layer status visible in source detail page (Bronze ✅ Silver ✅ Gold ⏳)
Works with CSV, XLSX, SQL database, and BigQuery source types
Future Extensions
Cross-source Gold layers: Combine Silver tables from multiple sources into unified Gold models
dbt compatibility: Export medallion definitions as dbt models for users who want to manage layers externally
Version control: Track schema evolution across Silver/Gold rebuilds
Cost estimation: Estimate storage and compute cost before building layers
Natural language layer creation: "Create a monthly revenue summary grouped by product category"
Summary
Enable users to automatically generate a medallion architecture (Bronze → Silver → Gold) for their connected data sources using AI-assisted SQL generation. The platform already has schema introspection, LLM integration, and a template system — this feature combines them to let users build production-grade data layers without writing SQL by hand.
Context: What is Medallion Architecture?
The medallion pattern organizes data into three progressive quality layers:
_loaded_at,_source_id). No transformations — preserves the raw truth.This is the standard in modern data engineering (used in dbt, Databricks, Snowflake) and aligns well with Data Talks' mission of making data analysis accessible.
Problem
Proposed Solution
Phase 1: Bronze Layer Generation
Automatic on source creation/sync:
_loaded_at TIMESTAMP,_source_file TEXT,_row_hash TEXTSQL generated:
Integration point: Extend
crud.pyupload flow and future sync mechanism.Phase 2: AI-Assisted Silver Schema Generation
LLM analyzes Bronze data and suggests Silver schema:
sample_profile(column types, nulls, top values, numeric stats) to the LLM{ "columns": [ { "source_column": "created_at", "silver_name": "created_at", "target_type": "TIMESTAMP", "cast_expression": "CAST(created_at AS TIMESTAMP)", "null_strategy": "DROP_ROW", "validation": "NOT NULL" }, { "source_column": "email ", "silver_name": "email", "target_type": "TEXT", "cast_expression": "LOWER(TRIM(email))", "null_strategy": "FILL_UNKNOWN", "validation": "LIKE '%@%.%'" }, { "source_column": "revenue", "silver_name": "revenue_usd", "target_type": "DECIMAL(12,2)", "cast_expression": "CAST(REPLACE(revenue, ',', '') AS DECIMAL(12,2))", "null_strategy": "FILL_ZERO", "validation": ">= 0" } ], "deduplication": { "key_columns": ["id"], "strategy": "KEEP_LATEST", "order_by": "created_at DESC" } }SQL generated:
Phase 3: AI-Assisted Gold Layer Generation
LLM suggests business aggregates based on Silver schema + user context:
Example Gold suggestions:
{ "gold_tables": [ { "name": "gold_revenue_by_month", "description": "Monthly revenue aggregates", "sql": "SELECT DATE_TRUNC('month', created_at) AS month, SUM(revenue_usd) AS total_revenue, COUNT(*) AS transaction_count FROM silver_{source_id}_cleaned GROUP BY 1", "dimensions": ["month"], "measures": ["total_revenue", "transaction_count"] }, { "name": "gold_customer_segments", "description": "Customer segmentation by purchase behavior", "sql": "SELECT customer_id, COUNT(*) AS order_count, SUM(revenue_usd) AS lifetime_value, CASE WHEN SUM(revenue_usd) > 10000 THEN 'VIP' WHEN SUM(revenue_usd) > 1000 THEN 'Regular' ELSE 'Occasional' END AS segment FROM silver_{source_id}_cleaned GROUP BY 1", "dimensions": ["segment"], "measures": ["order_count", "lifetime_value"] } ] }Phase 4: Query Routing
Update
ask_*scripts to prefer Silver/Gold layers when available:This improves LLM accuracy since it works with typed, clean data instead of raw strings.
Technical Architecture
New Database Tables
New Backend Service
backend/app/services/medallion_service.py:generate_bronze_ddl(source) → str— CREATE TABLE SQL for bronze layersuggest_silver_schema(source) → dict— LLM-powered schema suggestiongenerate_silver_ddl(source, schema_config) → str— CREATE + INSERT SQLsuggest_gold_aggregates(source, silver_schema) → list— LLM-powered aggregation suggestionsgenerate_gold_ddl(source, gold_config) → str— CREATE TABLE AS SELECTbuild_layer(source, layer, config)— Execute DDL/DML, track in MedallionBuildLogrefresh_layer(source, layer)— Rebuild layer from upstream (e.g., bronze → silver)New API Endpoints
backend/app/routers/medallion_router.py:Frontend Components
LLM Prompt Templates
Silver schema suggestion prompt:
Gold aggregate suggestion prompt:
Integration with Existing Systems
Source.metadata_medallion_configwith layer references and settingssample_profiletemplate_executor.pyalert_scheduler.pyask_csv.py/ask_sql.pychat_completion()Acceptance Criteria
_loaded_at,_row_hash)sample_profileFuture Extensions