Postgres Recipe Caching #2122
Conversation
This is awesome
```python
recipe, recipe_lock_path.parent, load_result=load_result
name=recipe.name, build_name=target_schema, datasets=imported_datasets
)
if _write_metadata_file:  # mostly an override for test cases, so we don't have build files lying around afterward
```
Thank you
```python
return load_result
```

```python
def dataset_exists_in_schema(
```
nit - function name seems a bit inaccurate
yeah... I agree. dataset_listed_in_source_versions?
So it seems like the cache_schema in general is not explicitly loaded to during routine operations, so we'd need to run a build (or just a load step) at some frequency that loads to the cache schema? It would also make sense to have a flag for the behavior when a cached version is not found.
Also, a note on the flag for running with the cache now... I take it we're happy just using the cache by default? I'm happy to just run with this, personally.
@fvankrieken all good thoughts! Are you thinking we should just add a build step to each product to incrementally update the cache, then change the load CLI to default to using the cache? That would make sense to me. Happy to do that here (and I'd definitely test it with a nightly-build run from this branch).
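For concreteness, the flag being discussed might look something like the sketch below. Every name here (`use_cache`, `on_cache_miss`, the helper stubs) is hypothetical, not taken from this PR.

```python
from enum import Enum


class OnCacheMiss(Enum):
    LOAD = "load"    # fall back to loading from source
    ERROR = "error"  # fail fast so a stale cache gets noticed


def load_recipe_dataset(
    dataset: str,
    version: str,
    *,
    use_cache: bool = True,  # cache on by default, per the discussion above
    on_cache_miss: OnCacheMiss = OnCacheMiss.LOAD,
):
    if use_cache and _is_cached(dataset, version):
        return _load_from_cache(dataset, version)
    if use_cache and on_cache_miss is OnCacheMiss.ERROR:
        raise RuntimeError(f"{dataset} {version} not found in cache schema")
    return _load_from_source(dataset, version)


# Placeholder stubs so the sketch runs; real versions would query Postgres.
def _is_cached(dataset, version): return False
def _load_from_cache(dataset, version): return f"cache:{dataset}/{version}"
def _load_from_source(dataset, version): return f"source:{dataset}/{version}"
```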
We've discussed this before when looking at our networking bill: most of our dataloading in builds could be eliminated. We could also save some database space by not duplicating data. And there's the build-time consideration: dataloading in PLUTO currently takes 6-7 minutes, and we can eliminate most of that.
There are two approaches here:
For products where immutability is assumed (i.e. our dbt projects)
We can specify a schema to act as a cache; if that schema contains a recipe input (matched on dataset name, version, and file type), we create a view in our build schema targeting that table. There should be no query performance penalty for this. "Dataloading" would take a second or two when all tables are cached.
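As a rough illustration of the view path (not this PR's actual code; the psycopg2 usage and the `recipe_cache` schema name are assumptions):

```python
from psycopg2 import sql  # safe SQL composition for identifiers


def create_cached_view(conn, build_schema, dataset, cache_table,
                       cache_schema="recipe_cache"):
    """Point a view in the build schema at the cached table.

    `cache_table` is assumed to already be matched on dataset name,
    version, and file type. No data is copied, so this is near-instant.
    """
    stmt = sql.SQL("CREATE VIEW {view} AS SELECT * FROM {source}").format(
        view=sql.Identifier(build_schema, dataset),
        source=sql.Identifier(cache_schema, cache_table),
    )
    with conn.cursor() as cur:
        cur.execute(stmt)
    conn.commit()
```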
Otherwise (no immutability)
Unfortunately, in products like PLUTO the first thing we do is modify tables from the recipe. For these cases we need a cache with unmodified tables, and sadly we need to copy them into the target schema. There's a performance penalty for this, but it still reduces PLUTO dataloading from 6 minutes to 1.
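The copy path is the same idea with `CREATE TABLE ... AS TABLE ...` in place of a view; again a sketch under the same assumptions:

```python
from psycopg2 import sql


def copy_cached_table(conn, build_schema, dataset, cache_table,
                      cache_schema="recipe_cache"):
    """Materialize a private copy the build is free to mutate.

    Slower than a view (full table copy), but the cached original
    stays untouched for the next build.
    """
    stmt = sql.SQL("CREATE TABLE {target} AS TABLE {source}").format(
        target=sql.Identifier(build_schema, dataset),
        source=sql.Identifier(cache_schema, cache_table),
    )
    with conn.cursor() as cur:
        cur.execute(stmt)
    conn.commit()
```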
Maintaining the Cache
We'd also need to maintain a cache. Maybe something like this for PLUTO:

Maybe we just hook into the nightly QA builds: perhaps we add a step to modify the cache. So the cache would always represent what's pulled in by main.

What we'd need to do

4. Additional code here: integration tests for postgres utils and cache interactions.

In the Wild
Here's a PLUTO build with the reduced dataloading time - it's still a whole minute because of the full table copying from the cache.