Fix simulate_data.py reproducibility + remove synthetic CSVs #794
louismagowan wants to merge 7 commits into pymc-labs:main
Conversation
- Add `seed` parameter to all generator functions that lacked it
- Replace scipy.stats RNG calls (`norm().rvs`, `uniform.rvs`, `dirichlet().rvs`) with numpy Generator methods (`rng.normal`, `rng.uniform`, `rng.dirichlet`)
- Replace `np.random.*` global-state calls with local `rng` instances
- Fix `create_series()` bug: hardcoded `length_scale=2` now uses the parameter
- Rename `create_series`, `generate_seasonality`, `periodic_kernel` to private (`_create_series`, `_generate_seasonality`, `_periodic_kernel`), since they are internal helpers that require an `rng` instance from their caller
- Fix deprecated pandas `freq="M"` → `freq="ME"`
- Remove unused scipy.stats imports (`norm`, `uniform`, `dirichlet`)
- Remove the module-level global `rng`; keep the `RANDOM_SEED` constant for use by `load_data()` and test fixtures
- Standardize `generate_staggered_did_data` to use the same `rng` pattern

Closes pymc-labs#545

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
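The seeded-rng pattern described in this commit can be sketched as follows. The function name and distributions here are illustrative stand-ins, not the module's actual signatures:

```python
import numpy as np

def generate_example_data(n: int = 100, seed: int = 42):
    """Illustrative generator following the PR's rng pattern.

    A local numpy Generator replaces both global np.random.* state and
    scipy.stats .rvs() calls, so a fixed seed gives identical draws.
    """
    rng = np.random.default_rng(seed)        # local, seedable RNG
    noise = rng.normal(0.0, 1.0, size=n)     # was: norm(0, 1).rvs(n)
    x = rng.uniform(0.0, 1.0, size=n)        # was: uniform.rvs(size=n)
    weights = rng.dirichlet(np.ones(3))      # was: dirichlet(np.ones(3)).rvs()
    return noise, x, weights
```

Because the Generator is constructed locally from the seed, two calls with the same seed produce identical arrays, and no call can perturb global RNG state used elsewhere.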
Split the `DATASETS` dict into `SYNTHETIC_DATASETS` (generated via seeded functions) and `REAL_WORLD_DATASETS` (loaded from CSV). This removes the dependency on synthetic CSV files while keeping real-world data as shipped CSVs.

- Replace `_get_data_home()` with a simple `_DATA_DIR = Path(__file__).parent`
- Remove the circular `import causalpy as cp` dependency
- All 8 synthetic datasets now call their generator with `RANDOM_SEED`

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
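A minimal sketch of the dispatch this split enables. The dataset keys, constant value, and generator body are stand-ins, not the real module's contents:

```python
import numpy as np

RANDOM_SEED = 1234  # stand-in; the real constant lives in causalpy's module

def generate_did(seed):
    # stand-in for a real generator in simulate_data.py
    rng = np.random.default_rng(seed)
    return rng.normal(size=5)

SYNTHETIC_DATASETS = {"did": generate_did}    # name -> seeded generator
REAL_WORLD_DATASETS = {"banks": "banks.csv"}  # name -> shipped CSV filename

def load_data(name):
    if name in SYNTHETIC_DATASETS:
        # synthetic data is generated on the fly, deterministically
        return SYNTHETIC_DATASETS[name](seed=RANDOM_SEED)
    if name in REAL_WORLD_DATASETS:
        # the real module would read pd.read_csv(_DATA_DIR / filename) here
        raise NotImplementedError("CSV loading elided in this sketch")
    raise ValueError(f"Unknown dataset: {name!r}")
```

Keeping the two dicts separate makes the synthetic/real distinction explicit and lets an unknown key fail fast with a clear error.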
Eight fixtures (`did_data`, `its_data`, `its_simple_data`, `rd_data`, `sc_data`, `anova1_data`, `geolift1_data`, `geolift_multi_cell_data`) generate data once per test session using `RANDOM_SEED`, avoiding redundant calls to `load_data()` or the generators in individual tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
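One such session-scoped fixture might look like this; the generator and constant are stand-ins, since the real fixtures call the actual `simulate_data` functions:

```python
import numpy as np
import pytest

RANDOM_SEED = 1234  # stand-in for the module-level constant

def generate_did_data(seed):
    # stand-in for the real generator
    rng = np.random.default_rng(seed)
    return rng.normal(size=10)

@pytest.fixture(scope="session")
def did_data():
    # generated once per test session, shared by every test that requests it
    return generate_did_data(seed=RANDOM_SEED)

def test_did_data_is_deterministic(did_data):
    # regenerating with the same seed must match the fixture's data
    np.testing.assert_array_equal(did_data, generate_did_data(seed=RANDOM_SEED))
```

With `scope="session"`, pytest caches the return value for the whole run, so expensive generation happens once rather than per test.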
Update 8 test files to use session-scoped fixtures (`did_data`, `its_data`, `rd_data`, `sc_data`, `anova1_data`, `geolift1_data`) instead of calling `cp.load_data()` for synthetic datasets. Real-world dataset loading (banks, brexit, drinking, risk, nhefs) remains unchanged.

Also rewrite `test_data_loading.py` to parametrize over all datasets (synthetic + real-world) and add reproducibility and unknown-key tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
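The parametrized reproducibility test could follow this shape; the `load_data` stub and dataset keys are illustrative, not the real suite's code:

```python
import numpy as np
import pytest

def load_data(name, seed=0):
    # stub standing in for cp.load_data; the real one dispatches per dataset
    rng = np.random.default_rng(seed)
    return rng.normal(size=5)

ALL_DATASETS = ["did", "sc", "banks"]  # illustrative subset of keys

@pytest.mark.parametrize("name", ALL_DATASETS)
def test_load_data_is_reproducible(name):
    # two calls with the same (default) seed must return identical data
    np.testing.assert_array_equal(load_data(name), load_data(name))
```

Parametrizing over the dataset keys means every entry in both dicts gets the same reproducibility check without duplicating test code.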
Remove 8 synthetic CSV files (~172K) now generated programmatically: `did.csv`, `regression_discontinuity.csv`, `synthetic_control.csv`, `its.csv`, `its_simple.csv`, `ancova_generated.csv`, `geolift1.csv`, `geolift_multi_cell.csv`.

Also remove `gt_social_media_data.csv`, which was never referenced in the `DATASETS` dict or any code. Real-world CSVs (12 files, ~3.2MB) remain in the repo.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Codecov Report ✅ All modified and coverable lines are covered by tests.

@@            Coverage Diff             @@
##             main     #794      +/-   ##
==========================================
+ Coverage   92.42%   92.97%   +0.55%
==========================================
  Files          52       52
  Lines        9230     9227       -3
  Branches      562      561       -1
==========================================
+ Hits         8531     8579      +48
+ Misses        527      477      -50
+ Partials      172      171       -1
@drbenvincent wdyt? Also:
The trade-off is that git-filter-repo rewrites all commit hashes, so:
Is this worth doing, or is the size saving too marginal to justify the disruption?
…tion

Superseded by `generate_time_series_data_seasonal` and `generate_time_series_data_simple`. Not imported or called anywhere.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
drbenvincent left a comment
Thanks for this contribution, @louismagowan — nice work. The commit structure is clean, the approach is well thought-out, and it's great to see all CI checks green with coverage actually improving (+0.55%).
A few notes:
API breakages to be aware of
This PR removes or renames several functions that were previously accessible (and documented in the API docs):
- `generate_seasonality`, `periodic_kernel`, `create_series` are renamed to private (`_` prefix)
- `generate_time_series_data` is deleted entirely
None of these are imported via `__init__.py`, so they're not top-level public API, but anyone doing `from causalpy.data.simulate_data import generate_seasonality` (etc.) would break. The risk is low, but we should mention these in the release notes.
`load_data()` is now dynamic, with no caching
Previously `load_data("sc")` read a small CSV. Now it runs full generation (LOWESS smoothing, GP kernels, etc.) on every call. The test suite handles this well with session-scoped fixtures, but users calling `load_data()` in a loop will see a performance hit. Not a blocker, just something to be aware of. We could add `@functools.lru_cache` or similar in a follow-up if it becomes an issue.
Data values will change
This is inherent to the fix (and expected), but worth calling out: `load_data("did")` and friends now return completely different numeric values than the old CSVs. Any downstream user code that depended on specific values will see different results.
git-filter-repo suggestion
Appreciate the thought here, but I'd recommend against it. The savings (~172KB) are negligible and the disruption is severe — all commit hashes rewritten, all open PRs need rebasing, all contributors need to re-clone, external SHA links break. Please do raise this as a separate issue though so we have it tracked; it just won't be high priority given the trade-offs.
PR description
Could you flesh out the PR body with a summary of what changed and why? For a 20-file refactor like this, a brief bullet list (API changes, deleted files, new test fixtures, bug fixes) makes it much easier for maintainers and users browsing the changelog to understand the scope. The commit messages are well-structured — it's really just about surfacing that same information in the PR description so it's visible at a glance.
@drbenvincent apologies for the delay - work has been really busy these past months.
The only other part of your feedback that needs to be addressed, I believe, was around the release notes. Would you be able to take that part when you write the next changelog? Is there anything else I need to do to get this one merged?
Closes #545
Summary
Refactors synthetic dataset generation to be reproducible and fully programmatic, removing the dependency on committed CSV files.
What changed
Bug fixes
- All generator functions now take a `seed` parameter and use a local `numpy.Generator` instance instead of global `np.random.*` state; results are now reproducible
- Fixed the `create_series()` bug where the `length_scale` parameter was ignored (hardcoded to `2`)
- Fixed deprecated pandas `freq="M"` → `freq="ME"`

Refactor
- `load_data()` now generates synthetic datasets on-the-fly using a seeded generator rather than reading from CSV; real-world datasets (banks, brexit, etc.) still load from shipped CSVs
- Split the `DATASETS` dict into `SYNTHETIC_DATASETS` and `REAL_WORLD_DATASETS`
- Removed the circular `import causalpy as cp` dependency in `datasets.py`

Deleted files
- `did.csv`, `regression_discontinuity.csv`, `synthetic_control.csv`, `its.csv`, `its_simple.csv`, `ancova_generated.csv`, `geolift1.csv`, `geolift_multi_cell.csv`
- `gt_social_media_data.csv` removed (was never referenced in code)

API changes (low risk: not imported via `__init__.py`)

- `generate_seasonality`, `periodic_kernel`, `create_series` renamed to `_generate_seasonality`, `_periodic_kernel`, `_create_series` (internal helpers)
- `generate_time_series_data` deleted (superseded by `generate_time_series_data_seasonal` and `generate_time_series_data_simple`)

Tests
- Test files now use session-scoped fixtures instead of calling `cp.load_data()` for synthetic datasets
- Rewrote `test_data_loading.py` to parametrize over all datasets with reproducibility and unknown-key tests

Notes for users
Calls to `load_data()` for synthetic datasets will now return different numeric values than before (expected; this is the fix). Users who pinned to specific values from the old CSVs will need to update their code.