Skip to content

Incremental Strategy Advisor (Append vs Merge vs SCD vs Full Refresh) #116

@Empreiteiro

Description

@Empreiteiro

Summary

Analyze a source table's profiling data and recommend the optimal incremental loading strategy: append-only, merge/upsert, SCD Type 1 or 2, snapshot, or full refresh. Generate the corresponding SQL/dbt configuration for the recommended strategy.

Problem

  • Choosing the wrong incremental strategy causes data duplication, data loss, or unnecessary compute costs
  • The decision requires understanding: presence of update timestamps, primary key behavior, change frequency, table size
  • Junior data engineers often default to full refresh when incremental would save 90% of compute
  • The platform already profiles columns that reveal incremental strategy signals (timestamps, unique IDs, monotonic sequences) but doesn't surface this insight

Proposed Solution

Analysis Signals

Signal What to check Strategy implication
Monotonic ID column id or auto_increment always increasing Append with cursor on ID
Update timestamp updated_at column present and populated Merge/upsert with timestamp cursor
No update tracking No timestamp or version column Full refresh or snapshot
Soft deletes deleted_at or is_active column SCD Type 1 or soft-delete aware merge
Historical tracking needed Business requires point-in-time queries SCD Type 2
Small table (< 10K rows) Row count from profiling Full refresh (simpler, fast enough)
Large table (> 1M rows) Row count from profiling Incremental strongly recommended

Output

  1. Recommended strategy with confidence and reasoning
  2. Cursor column identified (timestamp, ID, or composite)
  3. SQL implementation: MERGE/INSERT/snapshot query for the recommended approach
  4. dbt config: `incremental_strategy`, `unique_key`, `updated_at` config block
  5. Alternative strategies: Ranked list with trade-offs explained
  6. Risk assessment: What happens if the strategy is wrong (duplicates, missed updates, etc.)

Example Output

Recommended Strategy: MERGE/UPSERT
Confidence: HIGH

Reasoning:
- Table has 2.3M rows (too large for full refresh)
- Column 'updated_at' is a TIMESTAMP, populated for 99.8% of rows
- Column 'id' is unique (2,300,000 distinct values = row count)
- No soft delete column detected

Cursor: updated_at
Merge Key: id

dbt config:
  {{config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='id',
    updated_at='updated_at'
  )}}

Risks:
- If 'updated_at' is not reliably updated on changes, some rows may be missed
- Consider adding a fallback full refresh on a weekly schedule

Technical Notes

  • Input: `Source.metadata_.sample_profile` — use column types, null counts, unique counts, min/max dates, row count
  • Heuristics first: Rule-based analysis for common patterns, then LLM for nuanced recommendations
  • New endpoint: `POST /api/sources/{id}/suggest-incremental-strategy` with optional `table_name` param
  • Multi-table: Run analysis across all tables in a source and return per-table recommendations

Acceptance Criteria

  • Detect timestamp cursor columns automatically from profiling
  • Detect primary key / unique key columns from uniqueness stats
  • Recommend one of: append, merge, SCD Type 1, SCD Type 2, snapshot, full refresh
  • Confidence score with reasoning for the recommendation
  • Generate SQL implementation for the recommended strategy
  • Generate dbt incremental config block
  • Show alternative strategies with trade-offs
  • Support batch analysis across all tables in a source

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions