Conversation

@alexrichey (Contributor) commented Dec 19, 2025

We've discussed this before when looking at our networking bill, but most of our dataloading in builds could be eliminated. Additionally, we could save some database space by not duplicating data. There's also the build-time consideration: dataloading in PLUTO currently takes 6-7 minutes, and we can eliminate most of that.

There are two approaches here:

For products where immutability is assumed (i.e. our DBT projects),

we can specify a schema to act as a cache, and if that schema contains a recipe input (matched on dataset name, version, and file-type), create a view in our build schema targeting that table. There should be no query performance penalty for this. "Dataloading" would take a second or two when all tables are cached.
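A minimal sketch of what that cache-hit path could look like. None of these names come from the PR: `DatasetKey`, `view_ddl`, `load_statement`, and the `recipes_cache` schema name are all illustrative assumptions.

```python
# Illustrative sketch of the immutable-product path: on a cache hit, expose the
# cached table in the build schema as a view instead of reloading it.
from typing import NamedTuple, Optional


class DatasetKey(NamedTuple):
    """Hypothetical cache match key: dataset name, version, and file type."""
    name: str
    version: str
    file_type: str


def view_ddl(build_schema: str, cache_schema: str, table: str) -> str:
    # A plain view is nearly free to create, and Postgres inlines simple views
    # at plan time, so queries against it pay no performance penalty.
    return (
        f'CREATE VIEW "{build_schema}"."{table}" AS '
        f'SELECT * FROM "{cache_schema}"."{table}";'
    )


def load_statement(
    ds: DatasetKey,
    cached: set,
    build_schema: str,
    cache_schema: str = "recipes_cache",
) -> Optional[str]:
    """Return view DDL on a cache hit; None means fall back to a full load."""
    return view_ddl(build_schema, cache_schema, ds.name) if ds in cached else None
```

Matching on the full (name, version, file-type) tuple, rather than name alone, is what makes it safe to skip the download.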

Otherwise (no immutability)

Unfortunately, in products like PLUTO, the first thing we do is modify tables from the recipe. So for these cases we need a cache with unmodified tables, and, sadly, we need to copy them into the target schema. There's a performance penalty for this, but it still reduces PLUTO dataloading from 6 minutes to 1.
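Copying within the same database is still far cheaper than re-downloading over the network. A sketch of the DDL that copy step might issue, with illustrative schema/table names (not the PR's actual code):

```python
def copy_table_ddl(build_schema: str, cache_schema: str, table: str) -> str:
    # CREATE TABLE ... AS TABLE materializes a real, independently mutable
    # copy, which is what mutating products like PLUTO need. Note that this
    # copies rows only: indexes and constraints would have to be recreated
    # separately if the build relies on them.
    return (
        f'CREATE TABLE "{build_schema}"."{table}" AS '
        f'TABLE "{cache_schema}"."{table}";'
    )
```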

Maintaining the Cache

We'd also need to maintain a cache. Maybe something like this for PLUTO:

Maybe we just hook into the nightly QA builds and add a step there to update the cache. That way, the cache would always represent what's pulled in by main.

What we'd need to do

  1. dcpy recipes: add a load command to refresh the cache. It should upload only the datasets that need to be updated, rather than pushing everything.
  2. Add that step to nightly builds.
  3. Modify database cleanup scripts to not delete the recipes-cache schema.
  4. Additional code here: integration tests for postgres utils and cache interactions.
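The "only what needs updating" part of step 1 reduces to a set difference on the match keys. A sketch under assumed names (`DatasetKey`, `datasets_to_refresh` are hypothetical, not the PR's API):

```python
from typing import NamedTuple


class DatasetKey(NamedTuple):
    """Hypothetical match key: the same (name, version, file-type) tuple used for cache hits."""
    name: str
    version: str
    file_type: str


def datasets_to_refresh(recipe_inputs: set, cached: set) -> set:
    # Upload only what the recipe needs and the cache lacks; everything
    # already cached is left untouched, keeping nightly refreshes cheap.
    return recipe_inputs - cached
```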

In the Wild

Here's a PLUTO build with the reduced dataloading time - it's still a whole minute because of the full table copying from the cache.

@codecov (bot) commented Dec 19, 2025

Codecov Report

❌ Patch coverage is 87.50000% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.81%. Comparing base (84a3020) to head (584c8e6).
⚠️ Report is 2 commits behind head on main.

Files with missing lines        Patch %   Lines
dcpy/lifecycle/builds/load.py   86.95%    4 Missing and 2 partials ⚠️
dcpy/utils/postgres.py          88.23%    2 Missing and 2 partials ⚠️

Additional details and impacted files

Files with missing lines        Coverage Δ
dcpy/utils/postgres.py          75.26% <88.23%> (+2.59%) ⬆️
dcpy/lifecycle/builds/load.py   85.48% <86.95%> (-0.24%) ⬇️

... and 1 file with indirect coverage changes


@alexrichey alexrichey force-pushed the ar-build-caching-poc branch 3 times, most recently from 16006aa to 51898b9 on December 19, 2025 03:45
@alexrichey alexrichey changed the title from "Ar build caching poc" to "Postgres Recipe Caching" on Dec 19, 2025
@fvankrieken (Contributor)

This is awesome

@alexrichey alexrichey force-pushed the ar-build-caching-poc branch 12 times, most recently from 7bd0e01 to 584c8e6 on December 23, 2025 18:40
@alexrichey alexrichey marked this pull request as ready for review December 23, 2025 18:53
recipe, recipe_lock_path.parent, load_result=load_result
name=recipe.name, build_name=target_schema, datasets=imported_datasets
)
if _write_metadata_file: # mostly an override for test cases, so we don't have build files lying around afterward
Contributor:
Thank you

return load_result


def dataset_exists_in_schema(
Contributor:

nit - function name seems a bit inaccurate

Contributor Author:

yeah... I agree. dataset_listed_in_source_versions?

@fvankrieken (Contributor)

So it seems like the cache_schema in general isn't explicitly loaded to during routine operations, so we'd need to run a build (or just a load step) at some frequency that loads to the cache schema?

It would also make sense to have a flag for the behavior when a cached version is not found:

  • load source data directly to the target schema. This makes sense for "rebuilds": we're rebuilding from months ago, the cached schema is more recent, so for this one build we load the older source data directly into the target schema.
  • update the cache when versions don't match (or, ideally, only when the "latest" version specifically differs from the cache). This would be nice because any usual build could then update the cache incrementally. CAMA updates, we run a PLUTO build, all other data sources are pulled from the cache, and meanwhile CAMA is loaded to the cache so that future builds are already good to go.

Also, a note on the flag for running with the cache now... I take it we're happy using the cache by default? I'm happy to just run with this, personally.
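The two cache-miss fallback modes above could be expressed as a flag. A purely illustrative sketch (the PR does not define this enum or these step names):

```python
from enum import Enum


class OnCacheMiss(Enum):
    # Hypothetical flag values for a dataset missing from the cache.
    LOAD_DIRECT = "load_direct"    # rebuilds: load source straight into the target schema
    UPDATE_CACHE = "update_cache"  # routine builds: refresh the cache, then load from it


def plan_on_miss(behavior: OnCacheMiss, dataset: str) -> list:
    """Steps a loader might take for a dataset missing from the cache."""
    if behavior is OnCacheMiss.LOAD_DIRECT:
        return [f"load {dataset} -> target schema"]
    return [
        f"load {dataset} -> cache schema",
        f"copy {dataset}: cache -> target schema",
    ]
```

The UPDATE_CACHE mode is what makes the cache self-maintaining: every routine build leaves the cache warm for the next one.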

@alexrichey (Contributor Author)


@fvankrieken all good thoughts! Are you thinking we should just add a build step to each product to incrementally update the cache, then change the load CLI to default to use the cache? Would make sense to me. Happy to do that here (and would definitely test it with a nightly-build run from this branch)
