Skip to content

SCD package does not handle deleted records (missing SCD close-out for deletions) #16

@traychomilevdriven

Description

@traychomilevdriven

The current implementation of the Dataform SCD package only captures new or updated records based on last_updated_time. However, it does not handle deleted records, causing previously existing rows to remain open-ended (with scd_valid_to = NULL) even when the source data no longer contains them.

Steps to Reproduce

Load source data into a staging table (e.g. stg_table_1).

Run the SCD update process — records are inserted correctly.

Delete one of the records in the source table.

Re-run the SCD process.

Expected Behavior

Deleted records should be marked as inactive in the target SCD table by setting scd_valid_to (or similar) to the current timestamp.

Actual Behavior

Deleted records remain active indefinitely in the SCD table.
No change is detected since the deletion does not produce a last_updated_time update.

Technical Details

The generated SQL includes logic like:

select *
from ...stg_table_1
where last_updated_time > (
select max(last_updated_time)
from ...dim_table_1_scd_updates
)

This query filters only by updated timestamps, which excludes deleted records that no longer exist in the staging dataset.

Suggested Improvement

Add an optional configuration (e.g. track_deletions: true) that performs a LEFT JOIN between the existing SCD table and the source table to identify and close out deleted records:

select
scd.uniq_key,
current_timestamp() as scd_valid_to
from scd
left join source on scd.uniq_key = source.uniq_key
where source.uniq_key is null
and scd.scd_valid_to is null;

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions