Conversation

@alexrichey (Contributor) commented Dec 19, 2025

We've discussed this before when looking at our networking bill, but most of our dataloading in builds could be eliminated. Additionally, we could save some database space by not duplicating data. There's also the build-time consideration: dataloading in PLUTO currently takes 6-7 minutes, and we can eliminate most of that.

There are two approaches here:

For products where immutability is assumed (i.e. our DBT projects),

we can specify a schema to act as a cache, and if that schema contains a recipe input (matched on dataset name, version, and file-type), create a view in our build schema targeting that table. There should be no query performance penalty for this. "Dataloading" would take a second or two when all tables are cached.
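A minimal sketch of what that cache-hit path could look like. None of these names come from the PR: `DatasetKey`, `view_ddl`, `load_statement`, and the `recipes_cache` schema name are all illustrative assumptions.

```python
# Illustrative sketch of the immutable-product path: on a cache hit, expose the
# cached table in the build schema as a view instead of reloading it.
from typing import NamedTuple, Optional


class DatasetKey(NamedTuple):
    """Hypothetical cache match key: dataset name, version, and file type."""
    name: str
    version: str
    file_type: str


def view_ddl(build_schema: str, cache_schema: str, table: str) -> str:
    # A plain view is nearly free to create, and Postgres inlines simple views
    # at plan time, so queries against it pay no performance penalty.
    return (
        f'CREATE VIEW "{build_schema}"."{table}" AS '
        f'SELECT * FROM "{cache_schema}"."{table}";'
    )


def load_statement(
    ds: DatasetKey,
    cached: set,
    build_schema: str,
    cache_schema: str = "recipes_cache",
) -> Optional[str]:
    """Return view DDL on a cache hit; None means fall back to a full load."""
    return view_ddl(build_schema, cache_schema, ds.name) if ds in cached else None
```

Matching on the full (name, version, file-type) tuple, rather than name alone, is what makes it safe to skip the download.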

Otherwise (no immutability)

Unfortunately, in products like PLUTO, the first thing we do is modify tables from the recipe. So for these cases we need a cache with unmodified tables, and, sadly, we need to copy them into the target schema. There's a performance penalty for this, but it still reduces PLUTO dataloading from 6 minutes to 1.
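Copying within the same database is still far cheaper than re-downloading over the network. A sketch of the DDL that copy step might issue, with illustrative schema/table names (not the PR's actual code):

```python
def copy_table_ddl(build_schema: str, cache_schema: str, table: str) -> str:
    # CREATE TABLE ... AS TABLE materializes a real, independently mutable
    # copy, which is what mutating products like PLUTO need. Note that this
    # copies rows only: indexes and constraints would have to be recreated
    # separately if the build relies on them.
    return (
        f'CREATE TABLE "{build_schema}"."{table}" AS '
        f'TABLE "{cache_schema}"."{table}";'
    )
```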

Maintaining the Cache

We'd also need to maintain a cache. Maybe something like this for PLUTO:

Maybe we just hook into the nightly QA builds and add a step there to update the cache. That way, the cache would always represent what's pulled in by main.

What we'd need to do

  1. dcpy recipes: add a load command to refresh the cache. It should upload only the datasets that need to be updated, rather than pushing everything.
  2. Add that step to nightly builds.
  3. Modify database cleanup scripts to not delete the recipes-cache schema.
  4. Additional code here: integration tests for postgres utils and cache interactions.
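The "only what needs updating" part of step 1 reduces to a set difference on the match keys. A sketch under assumed names (`DatasetKey`, `datasets_to_refresh` are hypothetical, not the PR's API):

```python
from typing import NamedTuple


class DatasetKey(NamedTuple):
    """Hypothetical match key: the same (name, version, file-type) tuple used for cache hits."""
    name: str
    version: str
    file_type: str


def datasets_to_refresh(recipe_inputs: set, cached: set) -> set:
    # Upload only what the recipe needs and the cache lacks; everything
    # already cached is left untouched, keeping nightly refreshes cheap.
    return recipe_inputs - cached
```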

In the Wild

Here's a PLUTO build with the reduced dataloading time - it's still a whole minute because of the full table copying from the cache.

@codecov (bot) commented Dec 19, 2025

Codecov Report

❌ Patch coverage is 87.50000% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.81%. Comparing base (84a3020) to head (584c8e6).
⚠️ Report is 2 commits behind head on main.

Files with missing lines        Patch %   Lines
dcpy/lifecycle/builds/load.py   86.95%    4 Missing and 2 partials ⚠️
dcpy/utils/postgres.py          88.23%    2 Missing and 2 partials ⚠️

Additional details and impacted files

Files with missing lines        Coverage Δ
dcpy/utils/postgres.py          75.26% <88.23%> (+2.59%) ⬆️
dcpy/lifecycle/builds/load.py   85.48% <86.95%> (-0.24%) ⬇️

... and 1 file with indirect coverage changes


@alexrichey alexrichey force-pushed the ar-build-caching-poc branch 3 times, most recently from 16006aa to 51898b9 on December 19, 2025 03:45
@alexrichey alexrichey changed the title from "Ar build caching poc" to "Postgres Recipe Caching" on Dec 19, 2025
@fvankrieken (Contributor)

This is awesome

@alexrichey alexrichey force-pushed the ar-build-caching-poc branch 12 times, most recently from 7bd0e01 to 584c8e6 on December 23, 2025 18:40
@alexrichey alexrichey marked this pull request as ready for review December 23, 2025 18:53
recipe, recipe_lock_path.parent, load_result=load_result
name=recipe.name, build_name=target_schema, datasets=imported_datasets
)
if _write_metadata_file: # mostly an override for test cases, so we don't have build files lying around afterward
Contributor:
Thank you

return load_result


def dataset_exists_in_schema(
Contributor:

nit - function name seems a bit inaccurate

Contributor Author:

yeah... I agree. dataset_listed_in_source_versions?

@fvankrieken (Contributor)

So it seems like the cache_schema in general isn't explicitly loaded to during routine operations, so we'd need to run a build (or just a load step) at some frequency that loads to the cache schema?

It would also make sense to have a flag for the behavior when a cached version is not found:

  • load source data directly to the target schema. This makes sense for "rebuilds": we're rebuilding from months ago, the cached schema is more recent, so for this one build we load the older source data directly into the target schema.
  • update the cache when versions don't match (or, ideally, only when the "latest" version specifically differs from the cache). This would be nice because any usual build could then update the cache incrementally. CAMA updates, we run a PLUTO build, all other data sources are pulled from the cache, and meanwhile CAMA is loaded to the cache so that future builds are already good to go.

Also, a note on the flag for running with the cache now... I take it we're happy using the cache by default? I'm happy to just run with this, personally.
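The two cache-miss fallback modes above could be expressed as a flag. A purely illustrative sketch (the PR does not define this enum or these step names):

```python
from enum import Enum


class OnCacheMiss(Enum):
    # Hypothetical flag values for a dataset missing from the cache.
    LOAD_DIRECT = "load_direct"    # rebuilds: load source straight into the target schema
    UPDATE_CACHE = "update_cache"  # routine builds: refresh the cache, then load from it


def plan_on_miss(behavior: OnCacheMiss, dataset: str) -> list:
    """Steps a loader might take for a dataset missing from the cache."""
    if behavior is OnCacheMiss.LOAD_DIRECT:
        return [f"load {dataset} -> target schema"]
    return [
        f"load {dataset} -> cache schema",
        f"copy {dataset}: cache -> target schema",
    ]
```

The UPDATE_CACHE mode is what makes the cache self-maintaining: every routine build leaves the cache warm for the next one.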

@alexrichey (Contributor Author)


@fvankrieken all good thoughts! Are you thinking we should just add a build step to each product to incrementally update the cache, then change the load CLI to default to use the cache? Would make sense to me. Happy to do that here (and would definitely test it with a nightly-build run from this branch)
