-
Notifications
You must be signed in to change notification settings - Fork 15
Description
This is an issue occurs both when backfilling records (when deploying the SCD script for the first time) and when adding new records with no entries on the updates table which have different creation and update times.
My use case is we have a users table and users take actions and the users table has some slowly changing fields like their preferred language and their country of residence. So we create the updates table, we create the view with scd_valid_to and scd_valid_from and then we can join any action to the state of the user that existed at the time of the action (in theory). e.g.
SELECT u.language, u.country, a.*
FROM ${ref("action_tool", "action")} action
JOIN ${ref("scd", "user")} user
ON (user.id = action.user_id)
WHERE action.created_at BETWEEN user.scd_valid_from AND user.scd_valid_toThis is very convenient and fast except it doesn't work in two specific situations.
-
Back-filling old data: Consider a user, who joined in 2020, took dozens of actions over the last 4 years, has a recent updated_at of some time in late 2023. When I run the SCD script for the first time I get a single history record in 2024 with an updated_at value of 2023, and the view with scd_valid_from shows the same date in 2023. This means that ALL the action records from 2020 to late 2023 will fail to join in the above query. (This is what I mean by orphan records.)
-
Changes before the first update record: Similarly, if the SCD script is active and running daily, and a user signs up by taking an action at 12:00, not specifying a language so our system defaults them to English, and then at 12:05 they go into their profile and set their language to Spanish, and then at 2:00 the SCD job runs... it will insert a record with created_at: 12:00, updated_at: 12:05, and the view will show scd_valid_from 12:05. So the very first action they ever took (a very important one, from a business perspective) will be orphaned in the same way scenario 1.
It would be possible to complexify the views I'm using to join actions to histories, but this defeats the purpose of the convenience view. It might also be possible to add a configuration option for a created_at field so that the view detects when it's dealing with the first ever record update and set scd_valid_from to the creation date (figuring maybe there were changes in between there but we don't know so we have to be okay with some loss of specificity in exchange for not breaking all our other queries).
I think a better solution would be to add a config option for a created_at field, and have the insert script:
- In the query that identifies IDs to insert, also identify whether the record is completely new to the updates table or whether it's simply being updated.
- When the record is being updated only, use the present behavior.
- When the record has being inserted for the first time, if the created_at and updated_at values are different, insert TWO updates, one normal one (like the current logic) and one simulated original record, e.g.
select * EXCEPT(updated_at), created_at AS updated_at
With this logic the Scenario 2 situation would create two records, and the scd_valid_* view would look like this
| created_at | updated_at | user_id | scd_valid_from | scd_valid_to |
|---|---|---|---|---|
| 12:00 | 12:00 | 613 | 12:00 | 12:05 |
| 12:00 | 12:05 | 613 | 12:05 | NULL |
This way, trying to join the original action from 12:00 and the second action from 12:05 will both work, whereas under the current logic, the first one will not join.