Summary
AI-assisted column mapping between a source schema and a target schema. Given two connected sources (or a source + a manually defined target), the LLM suggests column-by-column mappings with transformation expressions (CAST, TRIM, COALESCE, lookups). Essential for data migrations and integration projects.
Problem
- Source-to-target mapping is one of the most time-consuming ETL tasks
- Manual mapping in spreadsheets is error-prone and doesn't leverage data profiling
- Column names often differ between source and target (e.g., cust_nm → customer_name)
- Type mismatches require transformation logic that users must figure out manually
- No tool combines schema metadata + data profiling + LLM reasoning for mapping suggestions
Proposed Solution
Mapping Flow
- User selects source (connected data source) and target (another source, or manual schema definition)
- Platform profiles both sides and feeds metadata to the LLM
- LLM suggests mappings with confidence scores:
  {
    "mappings": [
      {
        "source_column": "cust_nm",
        "target_column": "customer_name",
        "confidence": 0.95,
        "transform": "TRIM(cust_nm)",
        "reasoning": "Name similarity + same data type (VARCHAR)"
      },
      {
        "source_column": "created_dt",
        "target_column": "created_at",
        "confidence": 0.90,
        "transform": "CAST(created_dt AS TIMESTAMP)",
        "reasoning": "Date column with different naming convention"
      },
      {
        "source_column": null,
        "target_column": "updated_at",
        "confidence": 0,
        "transform": "CURRENT_TIMESTAMP",
        "reasoning": "No source match — suggest default value"
      }
    ]
  }
- User reviews, adjusts, and exports as SQL, mapping spreadsheet, or dbt model
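The "profile both sides and feed metadata to the LLM" step can be sketched as below. This is a minimal illustration only: the profile dict shape (`columns`, `name`, `type`, `null_pct`, `samples`) and the function name `build_mapping_context` are hypothetical, not the platform's actual data model.

```python
# Hypothetical sketch: assemble the metadata payload sent to the LLM.
# The profile field names here are illustrative assumptions, not the
# actual output shape of the platform's profiler.

def build_mapping_context(source_profile: dict, target_profile: dict,
                          sample_rows: int = 5) -> dict:
    """Combine both schemas plus trimmed sample values into one LLM context."""
    def summarize(profile: dict) -> list:
        return [
            {
                "name": col["name"],
                "type": col["type"],
                "null_pct": col.get("null_pct"),       # profiling stat, if present
                "samples": col.get("samples", [])[:sample_rows],
            }
            for col in profile["columns"]
        ]

    return {
        "source_columns": summarize(source_profile),
        "target_columns": summarize(target_profile),
        "instruction": "Suggest column-by-column mappings with confidence "
                       "scores and SQL transform expressions.",
    }
```

Trimming samples to a few rows per column keeps the context small enough to send both schemas in a single LLM request.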
Matching Strategies
- Name similarity: Fuzzy match on column names (Levenshtein, token overlap)
- Type compatibility: Match compatible data types
- Value overlap: Compare sample values between source and target columns
- LLM semantic matching: Understand that amt maps to amount, qty to quantity
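The name-similarity strategy above can be sketched with the standard library alone, blending a character-level ratio (a stand-in for Levenshtein) with token overlap. The scoring weights and the `name_similarity` helper are illustrative assumptions, not the product's actual algorithm.

```python
from difflib import SequenceMatcher

def name_similarity(source: str, target: str) -> float:
    """Blend character-level similarity with token overlap on '_'-split names.
    Weights (0.5/0.5) are an arbitrary illustrative choice."""
    char_score = SequenceMatcher(None, source.lower(), target.lower()).ratio()
    src_tokens = set(source.lower().split("_"))
    tgt_tokens = set(target.lower().split("_"))
    token_score = len(src_tokens & tgt_tokens) / max(len(src_tokens | tgt_tokens), 1)
    return 0.5 * char_score + 0.5 * token_score
```

Pure string metrics score abbreviation pairs like cust_nm → customer_name lower than exact matches, which is exactly the gap the LLM semantic-matching strategy is meant to close.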
Export Formats
- SQL (SELECT with transforms)
- CSV/XLSX mapping spreadsheet (for documentation)
- dbt model (SELECT with ref())
- Informatica/Talend mapping format
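The SQL export format can be sketched as a renderer over the reviewed mappings JSON. The function name and the INSERT ... SELECT shape are one plausible rendering, not a committed output format.

```python
# Hypothetical sketch: render reviewed mappings as INSERT ... SELECT.
# Assumes each mapping dict carries target_column, transform, source_column,
# as in the suggestion payload above.

def mappings_to_sql(mappings: list, source_table: str, target_table: str) -> str:
    """Emit an INSERT ... SELECT applying each mapping's transform."""
    cols = [m for m in mappings if m["target_column"]]
    target_cols = ", ".join(m["target_column"] for m in cols)
    select_exprs = ",\n  ".join(
        # Fall back to the raw source column when no transform was suggested.
        f'{m["transform"] or m["source_column"]} AS {m["target_column"]}'
        for m in cols
    )
    return (f"INSERT INTO {target_table} ({target_cols})\n"
            f"SELECT\n  {select_exprs}\nFROM {source_table};")
```

The dbt export would differ mainly in wrapping the same SELECT body and swapping the source table for a ref() call.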
Technical Notes
- Profiling both sides: Use existing _build_sample_profile() from crud.py for both source and target
- LLM context: Send both schemas + sample values + profiling stats for informed matching
- Manual target schema: Allow users to paste DDL or upload CSV header to define target
- New endpoint: POST /api/mapping/suggest with source_id, target_id (or target_schema JSON)
- Frontend: Interactive mapping table with drag-drop reassignment and transform editing
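Since the endpoint accepts either target_id or target_schema, its request validation can be sketched framework-free; the helper name and error messages are hypothetical.

```python
# Hypothetical sketch: validate the POST /api/mapping/suggest request body.
# Exactly one of target_id / target_schema must be supplied alongside source_id.

def validate_suggest_request(body: dict) -> dict:
    """Raise ValueError on malformed bodies; return the body unchanged if valid."""
    if "source_id" not in body:
        raise ValueError("source_id is required")
    has_target_id = "target_id" in body
    has_target_schema = "target_schema" in body
    if has_target_id == has_target_schema:
        raise ValueError("provide exactly one of target_id or target_schema")
    return body
```

Enforcing the either/or up front keeps the mapping logic itself free of branching on half-specified targets.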
Acceptance Criteria