Skip to content

Transformation Mapping Assistant (Source-to-Target Column Mapping) #112

@Empreiteiro

Description

@Empreiteiro

Summary

AI-assisted column mapping between a source schema and a target schema. Given two connected sources (or a source + a manually defined target), the LLM suggests column-by-column mappings with transformation expressions (CAST, TRIM, COALESCE, lookups). Essential for data migrations and integration projects.

Problem

  • Source-to-target mapping is one of the most time-consuming ETL tasks
  • Manual mapping in spreadsheets is error-prone and doesn't leverage data profiling
  • Column names often differ between source and target (e.g., cust_nmcustomer_name)
  • Type mismatches require transformation logic that users must figure out manually
  • No tool combines schema metadata + data profiling + LLM reasoning for mapping suggestions

Proposed Solution

Mapping Flow

  1. User selects source (connected data source) and target (another source, or manual schema definition)
  2. Platform profiles both sides and feeds metadata to the LLM
  3. LLM suggests mappings with confidence scores:
{
  "mappings": [
    {
      "source_column": "cust_nm",
      "target_column": "customer_name",
      "confidence": 0.95,
      "transform": "TRIM(cust_nm)",
      "reasoning": "Name similarity + same data type (VARCHAR)"
    },
    {
      "source_column": "created_dt",
      "target_column": "created_at",
      "confidence": 0.90,
      "transform": "CAST(created_dt AS TIMESTAMP)",
      "reasoning": "Date column with different naming convention"
    },
    {
      "source_column": null,
      "target_column": "updated_at",
      "confidence": 0,
      "transform": "CURRENT_TIMESTAMP",
      "reasoning": "No source match — suggest default value"
    }
  ]
}
  1. User reviews, adjusts, and exports as SQL, mapping spreadsheet, or dbt model

Matching Strategies

  • Name similarity: Fuzzy match on column names (Levenshtein, token overlap)
  • Type compatibility: Match compatible data types
  • Value overlap: Compare sample values between source and target columns
  • LLM semantic matching: Understand that amt maps to amount, qty to quantity

Export Formats

  • SQL (SELECT with transforms)
  • CSV/XLSX mapping spreadsheet (for documentation)
  • dbt model (SELECT with ref())
  • Informatica/Talend mapping format

Technical Notes

  • Profiling both sides: Use existing _build_sample_profile() from crud.py for both source and target
  • LLM context: Send both schemas + sample values + profiling stats for informed matching
  • Manual target schema: Allow users to paste DDL or upload CSV header to define target
  • New endpoint: POST /api/mapping/suggest with source_id, target_id (or target_schema JSON)
  • Frontend: Interactive mapping table with drag-drop reassignment and transform editing

Acceptance Criteria

  • Suggest column mappings between two connected sources with confidence scores
  • Support manual target schema definition (paste DDL or define columns)
  • Include transformation expressions (CAST, TRIM, COALESCE) in suggestions
  • Confidence score per mapping (high/medium/low/unmapped)
  • User can accept, reject, or modify each mapping
  • Export as SQL SELECT, mapping spreadsheet (CSV/XLSX), or dbt model
  • Handle unmapped columns with suggested defaults or flags

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions