Fix simulate_data.py reproducibility + remove synthetic CSVs #794
louismagowan wants to merge 7 commits into pymc-labs:main
Conversation
- Add `seed` parameter to all generator functions that lacked it
- Replace scipy.stats RNG calls (`norm().rvs`, `uniform.rvs`, `dirichlet().rvs`) with numpy Generator methods (`rng.normal`, `rng.uniform`, `rng.dirichlet`)
- Replace `np.random.*` global-state calls with local `rng` instances
- Fix `create_series()` bug: hardcoded `length_scale=2` now uses the parameter
- Rename `create_series`, `generate_seasonality`, `periodic_kernel` to private (`_create_series`, `_generate_seasonality`, `_periodic_kernel`), since they are internal helpers that require an `rng` instance from their caller
- Fix deprecated pandas `freq="M"` → `freq="ME"`
- Remove unused scipy.stats imports (`norm`, `uniform`, `dirichlet`)
- Remove the module-level global `rng`; keep the `RANDOM_SEED` constant for use by `load_data()` and test fixtures
- Standardize `generate_staggered_did_data` to use the same `rng` pattern

Closes pymc-labs#545

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
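The seeded-rng pattern described in this commit can be sketched as follows. The function name and distributions here are illustrative stand-ins, not the module's actual signatures:

```python
import numpy as np

def generate_example_data(n: int = 100, seed: int = 42):
    """Illustrative generator following the PR's rng pattern.

    A local numpy Generator replaces both global np.random.* state and
    scipy.stats .rvs() calls, so a fixed seed gives identical draws.
    """
    rng = np.random.default_rng(seed)        # local, seedable RNG
    noise = rng.normal(0.0, 1.0, size=n)     # was: norm(0, 1).rvs(n)
    x = rng.uniform(0.0, 1.0, size=n)        # was: uniform.rvs(size=n)
    weights = rng.dirichlet(np.ones(3))      # was: dirichlet(np.ones(3)).rvs()
    return noise, x, weights
```

Because the Generator is constructed locally from the seed, two calls with the same seed produce identical arrays, and no call can perturb global RNG state used elsewhere.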
Split the `DATASETS` dict into `SYNTHETIC_DATASETS` (generated via seeded functions) and `REAL_WORLD_DATASETS` (loaded from CSV). This removes the dependency on synthetic CSV files while keeping real-world data as shipped CSVs.

- Replace `_get_data_home()` with a simple `_DATA_DIR = Path(__file__).parent`
- Remove the circular `import causalpy as cp` dependency
- All 8 synthetic datasets now call their generator with `RANDOM_SEED`

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
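A minimal sketch of the dispatch this split enables. The dataset keys, constant value, and generator body are stand-ins, not the real module's contents:

```python
import numpy as np

RANDOM_SEED = 1234  # stand-in; the real constant lives in causalpy's module

def generate_did(seed):
    # stand-in for a real generator in simulate_data.py
    rng = np.random.default_rng(seed)
    return rng.normal(size=5)

SYNTHETIC_DATASETS = {"did": generate_did}    # name -> seeded generator
REAL_WORLD_DATASETS = {"banks": "banks.csv"}  # name -> shipped CSV filename

def load_data(name):
    if name in SYNTHETIC_DATASETS:
        # synthetic data is generated on the fly, deterministically
        return SYNTHETIC_DATASETS[name](seed=RANDOM_SEED)
    if name in REAL_WORLD_DATASETS:
        # the real module would read pd.read_csv(_DATA_DIR / filename) here
        raise NotImplementedError("CSV loading elided in this sketch")
    raise ValueError(f"Unknown dataset: {name!r}")
```

Keeping the two dicts separate makes the synthetic/real distinction explicit and lets an unknown key fail fast with a clear error.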
Eight fixtures (`did_data`, `its_data`, `its_simple_data`, `rd_data`, `sc_data`, `anova1_data`, `geolift1_data`, `geolift_multi_cell_data`) generate data once per test session using `RANDOM_SEED`, avoiding redundant calls to `load_data()` or the generators in individual tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
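One such session-scoped fixture might look like this; the generator and constant are stand-ins, since the real fixtures call the actual `simulate_data` functions:

```python
import numpy as np
import pytest

RANDOM_SEED = 1234  # stand-in for the module-level constant

def generate_did_data(seed):
    # stand-in for the real generator
    rng = np.random.default_rng(seed)
    return rng.normal(size=10)

@pytest.fixture(scope="session")
def did_data():
    # generated once per test session, shared by every test that requests it
    return generate_did_data(seed=RANDOM_SEED)

def test_did_data_is_deterministic(did_data):
    # regenerating with the same seed must match the fixture's data
    np.testing.assert_array_equal(did_data, generate_did_data(seed=RANDOM_SEED))
```

With `scope="session"`, pytest caches the return value for the whole run, so expensive generation happens once rather than per test.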
Update 8 test files to use session-scoped fixtures (`did_data`, `its_data`, `rd_data`, `sc_data`, `anova1_data`, `geolift1_data`) instead of calling `cp.load_data()` for synthetic datasets. Real-world dataset loading (banks, brexit, drinking, risk, nhefs) remains unchanged.

Also rewrite `test_data_loading.py` to parametrize over all datasets (synthetic + real-world) and add reproducibility and unknown-key tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
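The parametrized reproducibility test could follow this shape; the `load_data` stub and dataset keys are illustrative, not the real suite's code:

```python
import numpy as np
import pytest

def load_data(name, seed=0):
    # stub standing in for cp.load_data; the real one dispatches per dataset
    rng = np.random.default_rng(seed)
    return rng.normal(size=5)

ALL_DATASETS = ["did", "sc", "banks"]  # illustrative subset of keys

@pytest.mark.parametrize("name", ALL_DATASETS)
def test_load_data_is_reproducible(name):
    # two calls with the same (default) seed must return identical data
    np.testing.assert_array_equal(load_data(name), load_data(name))
```

Parametrizing over the dataset keys means every entry in both dicts gets the same reproducibility check without duplicating test code.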
Remove 8 synthetic CSV files (~172K) now generated programmatically: `did.csv`, `regression_discontinuity.csv`, `synthetic_control.csv`, `its.csv`, `its_simple.csv`, `ancova_generated.csv`, `geolift1.csv`, `geolift_multi_cell.csv`.

Also remove `gt_social_media_data.csv`, which was never referenced in the `DATASETS` dict or any code. Real-world CSVs (12 files, ~3.2MB) remain in the repo.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Codecov Report ✅ All modified and coverable lines are covered by tests.

@@            Coverage Diff             @@
##             main     #794      +/-   ##
==========================================
+ Coverage   92.42%   92.97%   +0.55%
==========================================
  Files          52       52
  Lines        9230     9227       -3
  Branches      562      561       -1
==========================================
+ Hits         8531     8579      +48
+ Misses        527      477      -50
+ Partials      172      171       -1
@drbenvincent wdyt? Also:
The trade-off is that git-filter-repo rewrites all commit hashes, so:
Is this worth doing, or is the size saving too marginal to justify the disruption?
…tion

Superseded by `generate_time_series_data_seasonal` and `generate_time_series_data_simple`. Not imported or called anywhere.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
drbenvincent left a comment
Thanks for this contribution, @louismagowan — nice work. The commit structure is clean, the approach is well thought-out, and it's great to see all CI checks green with coverage actually improving (+0.55%).
A few notes:
API breakages to be aware of
This PR removes or renames several functions that were previously accessible (and documented in the API docs):
- `generate_seasonality`, `periodic_kernel`, `create_series` are renamed to private (`_` prefix)
- `generate_time_series_data` is deleted entirely
None of these are imported via `__init__.py`, so they're not top-level public API, but anyone doing `from causalpy.data.simulate_data import generate_seasonality` (etc.) would break. The risk is low, but we should mention these in the release notes.
`load_data()` is now dynamic, with no caching
Previously `load_data("sc")` read a small CSV. Now it runs full generation (LOWESS smoothing, GP kernels, etc.) on every call. The test suite handles this well with session-scoped fixtures, but users calling `load_data()` in a loop will see a performance hit. Not a blocker, just something to be aware of. We could add `@functools.lru_cache` or similar in a follow-up if it becomes an issue.
Data values will change
This is inherent to the fix (and expected), but worth calling out: `load_data("did")` and friends now return completely different numeric values than the old CSVs. Any downstream user code that depended on specific values will see different results.
git-filter-repo suggestion
Appreciate the thought here, but I'd recommend against it. The savings (~172KB) are negligible and the disruption is severe — all commit hashes rewritten, all open PRs need rebasing, all contributors need to re-clone, external SHA links break. Please do raise this as a separate issue though so we have it tracked; it just won't be high priority given the trade-offs.
PR description
Could you flesh out the PR body with a summary of what changed and why? For a 20-file refactor like this, a brief bullet list (API changes, deleted files, new test fixtures, bug fixes) makes it much easier for maintainers and users browsing the changelog to understand the scope. The commit messages are well-structured — it's really just about surfacing that same information in the PR description so it's visible at a glance.
@drbenvincent apologies for the delay - work has been really busy these past months.
The only other part of your feedback that needs to be addressed, I believe, was around the release notes. Would you be able to take that part when you write the next changelog? Is there anything else I need to do to get this one merged?
Closes #545
Summary
Refactors synthetic dataset generation to be reproducible and fully programmatic, removing the dependency on committed CSV files.
What changed
Bug fixes
- All generator functions now take a `seed` parameter and use a local `numpy.Generator` instance instead of global `np.random.*` state; results are now reproducible
- Fixed the `create_series()` bug where the `length_scale` parameter was ignored (hardcoded to `2`)
- Fixed deprecated pandas `freq="M"` → `freq="ME"`

Refactor
- `load_data()` now generates synthetic datasets on-the-fly using a seeded generator rather than reading from CSV; real-world datasets (banks, brexit, etc.) still load from shipped CSVs
- Split the `DATASETS` dict into `SYNTHETIC_DATASETS` and `REAL_WORLD_DATASETS`
- Removed the circular `import causalpy as cp` dependency in `datasets.py`

Deleted files
- `did.csv`, `regression_discontinuity.csv`, `synthetic_control.csv`, `its.csv`, `its_simple.csv`, `ancova_generated.csv`, `geolift1.csv`, `geolift_multi_cell.csv`
- `gt_social_media_data.csv` removed (was never referenced in code)

API changes (low risk: not imported via `__init__.py`)

- `generate_seasonality`, `periodic_kernel`, `create_series` renamed to `_generate_seasonality`, `_periodic_kernel`, `_create_series` (internal helpers)
- `generate_time_series_data` deleted (superseded by `generate_time_series_data_seasonal` and `generate_time_series_data_simple`)

Tests
- Test files now use session-scoped fixtures instead of calling `cp.load_data()` for synthetic datasets
- Rewrote `test_data_loading.py` to parametrize over all datasets with reproducibility and unknown-key tests

Notes for users
Calls to `load_data()` for synthetic datasets will now return different numeric values than before (expected; this is the fix). Users who pinned to specific values from the old CSVs will need to update their code.