-
Notifications
You must be signed in to change notification settings - Fork 15
Description
The current implementation of the Dataform SCD package only captures new or updated records based on last_updated_time. However, it does not handle deleted records, causing previously existing rows to remain open-ended (with scd_valid_to = NULL) even when the source data no longer contains them.
Steps to Reproduce
Load source data into a staging table (e.g. stg_table_1).
Run the SCD update process — records are inserted correctly.
Delete one of the records in the source table.
Re-run the SCD process.
Expected Behavior
Deleted records should be marked as inactive in the target SCD table by setting scd_valid_to (or similar) to the current timestamp.
Actual Behavior
Deleted records remain active indefinitely in the SCD table.
No change is detected since the deletion does not produce a last_updated_time update.
Technical Details
The generated SQL includes logic like:
select *
from ...stg_table_1
where last_updated_time > (
select max(last_updated_time)
from ...dim_table_1_scd_updates
)
This query filters only by updated timestamps, which excludes deleted records that no longer exist in the staging dataset.
Suggested Improvement
Add an optional configuration (e.g. track_deletions: true) that performs a LEFT JOIN between the existing SCD table and the source table to identify and close out deleted records:
select
scd.uniq_key,
current_timestamp() as scd_valid_to
from scd
left join source on scd.uniq_key = source.uniq_key
where source.uniq_key is null
and scd.scd_valid_to is null;