Labels: enhancement (New feature or request)
## Summary
Analyze a source table's profiling data and recommend the optimal incremental loading strategy: append-only, merge/upsert, SCD Type 1 or 2, snapshot, or full refresh. Generate the corresponding SQL/dbt configuration for the recommended strategy.
## Problem
- Choosing the wrong incremental strategy causes data duplication, data loss, or unnecessary compute costs
- The decision requires understanding: presence of update timestamps, primary key behavior, change frequency, table size
- Junior data engineers often default to full refresh when incremental would save 90% of compute
- The platform already profiles columns that reveal incremental strategy signals (timestamps, unique IDs, monotonic sequences) but doesn't surface this insight
## Proposed Solution

### Analysis Signals
| Signal | What to check | Strategy implication |
|---|---|---|
| Monotonic ID column | `id` or auto-increment column, always increasing | Append with cursor on ID |
| Update timestamp | `updated_at` column present and populated | Merge/upsert with timestamp cursor |
| No update tracking | No timestamp or version column | Full refresh or snapshot |
| Soft deletes | `deleted_at` or `is_active` column | SCD Type 1 or soft-delete-aware merge |
| Historical tracking needed | Business requires point-in-time queries | SCD Type 2 |
| Small table (< 10K rows) | Row count from profiling | Full refresh (simpler, fast enough) |
| Large table (> 1M rows) | Row count from profiling | Incremental strongly recommended |
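The rule-based first pass over these signals could be sketched roughly as follows. `ColumnProfile` and the thresholds here are illustrative assumptions, not the platform's actual profiling schema:

```python
from dataclasses import dataclass

@dataclass
class ColumnProfile:
    name: str
    dtype: str               # e.g. "timestamp", "integer", "boolean"
    null_fraction: float     # nulls / row_count
    unique_fraction: float   # distinct_count / row_count
    monotonic: bool = False  # always-increasing (from min/max sampling)

def suggest_strategy(columns: list[ColumnProfile], row_count: int) -> dict:
    """Rule-based pass over profiling stats; returns a strategy suggestion."""
    # Well-populated update timestamp -> candidate merge cursor
    ts = next((c for c in columns
               if c.dtype == "timestamp" and "updated" in c.name
               and c.null_fraction < 0.05), None)
    # Near-perfect uniqueness -> candidate merge key
    key = next((c for c in columns if c.unique_fraction >= 0.999), None)
    soft_delete = any(c.name in ("deleted_at", "is_active") for c in columns)
    mono = next((c for c in columns if c.monotonic), None)

    if row_count < 10_000:
        return {"strategy": "full_refresh", "cursor": None, "confidence": "high"}
    if ts and key:
        strategy = "scd1_merge" if soft_delete else "merge"
        return {"strategy": strategy, "cursor": ts.name,
                "unique_key": key.name, "confidence": "high"}
    if mono and key:
        return {"strategy": "append", "cursor": mono.name,
                "unique_key": key.name, "confidence": "medium"}
    # No usable change signal: defer to LLM / human review
    return {"strategy": "snapshot_or_full_refresh", "cursor": None,
            "confidence": "low"}
```

Ambiguous cases (the final branch) would then fall through to the LLM pass described in Technical Notes.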
### Output
- Recommended strategy with confidence and reasoning
- Cursor column identified (timestamp, ID, or composite)
- SQL implementation: MERGE/INSERT/snapshot query for the recommended approach
- dbt config: `incremental_strategy`, `unique_key`, `updated_at` config block
- Alternative strategies: Ranked list with trade-offs explained
- Risk assessment: What happens if the strategy is wrong (duplicates, missed updates, etc.)
## Example Output
**Recommended Strategy:** MERGE/UPSERT
**Confidence:** HIGH

Reasoning:
- Table has 2.3M rows (too large for full refresh)
- Column `updated_at` is a TIMESTAMP, populated for 99.8% of rows
- Column `id` is unique (2,300,000 distinct values = row count)
- No soft-delete column detected

**Cursor:** `updated_at`
**Merge key:** `id`
dbt config:

```sql
{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='id',
    updated_at='updated_at'
) }}
```
Risks:
- If 'updated_at' is not reliably updated on changes, some rows may be missed
- Consider adding a fallback full refresh on a weekly schedule
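The generated SQL implementation for this recommendation could look like the following dbt incremental model. The model and source names are illustrative, taken from nothing in the platform; only the cursor and key come from the example above:

```sql
-- models/orders_incremental.sql (illustrative name)
{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='id'
) }}

select *
from {{ source('raw', 'orders') }}

{% if is_incremental() %}
  -- Only pull rows changed since the last successful run of this model
  where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
```

On warehouses that support it, dbt compiles this into a MERGE keyed on `id`; the `is_incremental()` guard is what keeps the scan bounded by the cursor.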
## Technical Notes
- Input: `Source.metadata_.sample_profile` — use column types, null counts, unique counts, min/max dates, row count
- Heuristics first: Rule-based analysis for common patterns, then LLM for nuanced recommendations
- New endpoint: `POST /api/sources/{id}/suggest-incremental-strategy` with optional `table_name` param
- Multi-table: Run analysis across all tables in a source and return per-table recommendations
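For reference, a per-table response from the proposed endpoint could be shaped like this. Field names are hypothetical, not a committed contract:

```python
import json

# Illustrative payload for POST /api/sources/{id}/suggest-incremental-strategy
response = {
    "table_name": "orders",
    "recommended": {
        "strategy": "merge",
        "confidence": "high",
        "cursor_column": "updated_at",
        "unique_key": "id",
    },
    "alternatives": [
        {"strategy": "append", "tradeoff": "misses in-place updates"},
        {"strategy": "full_refresh", "tradeoff": "highest compute at 2.3M rows"},
    ],
    "risks": ["updated_at not reliably set on change -> missed rows"],
}
print(json.dumps(response, indent=2))
```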
## Acceptance Criteria
- Detect timestamp cursor columns automatically from profiling
- Detect primary key / unique key columns from uniqueness stats
- Recommend one of: append, merge, SCD Type 1, SCD Type 2, snapshot, full refresh
- Confidence score with reasoning for the recommendation
- Generate SQL implementation for the recommended strategy
- Generate dbt incremental config block
- Show alternative strategies with trade-offs
- Support batch analysis across all tables in a source
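The batch mode is a thin wrapper over the per-table analysis: run the analyzer for each table in the source and key results by table name. A minimal sketch, where `analyze_table` is a stub standing in for the real heuristic-plus-LLM pipeline:

```python
def analyze_table(profile: dict) -> dict:
    # Stub analyzer: small tables -> full refresh; otherwise merge if an
    # update-timestamp column was profiled, else append.
    if profile["row_count"] < 10_000:
        return {"strategy": "full_refresh"}
    if profile.get("updated_at_column"):
        return {"strategy": "merge", "cursor": profile["updated_at_column"]}
    return {"strategy": "append"}

def analyze_source(profiles: dict[str, dict]) -> dict[str, dict]:
    """Per-table recommendations for every table in a source."""
    return {table: analyze_table(p) for table, p in profiles.items()}
```

Tables whose profiles fail to load could be reported with an error entry rather than aborting the whole batch.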