
Fix simulate_data.py reproducibility + remove synthetic CSVs #794

Open
louismagowan wants to merge 7 commits into pymc-labs:main from louismagowan:refactor/lm-simulate-data

Conversation

Contributor

louismagowan commented Mar 7, 2026

Closes #545

Summary

Refactors synthetic dataset generation to be reproducible and fully programmatic, removing the dependency on committed CSV files.

What changed

Bug fixes

  • All synthetic data generators now accept a seed parameter and use a local numpy.Generator instance instead of global np.random.* state — results are now reproducible
  • Fixed create_series() bug where length_scale parameter was ignored (hardcoded to 2)
  • Fixed the deprecated pandas frequency alias: freq="M" → freq="ME"
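The seeded-generator pattern described above can be sketched as follows. `generate_did_data` here is a hypothetical stand-in (the real generators in simulate_data.py have different signatures and logic); the point is the local `np.random.default_rng(seed)` replacing global `np.random.*` state:

```python
import numpy as np

def generate_did_data(n_units=10, seed=None):
    """Hypothetical sketch: a local Generator replaces global
    np.random.* state, so the output depends only on `seed`."""
    rng = np.random.default_rng(seed)  # local, isolated RNG
    return rng.normal(loc=0.0, scale=1.0, size=n_units)

# Same seed, same data -- the reproducibility this PR restores.
assert np.array_equal(generate_did_data(seed=42), generate_did_data(seed=42))
```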

Refactor

  • load_data() now generates synthetic datasets on-the-fly using a seeded generator rather than reading from CSV; real-world datasets (banks, brexit, etc.) still load from shipped CSVs
  • Split DATASETS dict into SYNTHETIC_DATASETS and REAL_WORLD_DATASETS
  • Removed the circular import causalpy as cp dependency in datasets.py
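A minimal sketch of how this dispatch could look (names like `_generate_did` and the `"banks"` entry are illustrative stand-ins, not the actual datasets.py contents): synthetic keys map to seeded generator callables, real-world keys still map to shipped CSVs, and unknown keys raise.

```python
from pathlib import Path

import numpy as np
import pandas as pd

RANDOM_SEED = 42

def _generate_did(seed=None):
    # Stand-in for a real seeded generator in simulate_data.py.
    rng = np.random.default_rng(seed)
    return pd.DataFrame({"y": rng.normal(size=5)})

# Synthetic datasets map to seeded generator callables...
SYNTHETIC_DATASETS = {"did": _generate_did}
# ...while real-world datasets still map to shipped CSV filenames.
REAL_WORLD_DATASETS = {"banks": "banks.csv"}

def load_data(dataset):
    if dataset in SYNTHETIC_DATASETS:
        return SYNTHETIC_DATASETS[dataset](seed=RANDOM_SEED)
    if dataset in REAL_WORLD_DATASETS:
        return pd.read_csv(Path(__file__).parent / REAL_WORLD_DATASETS[dataset])
    raise ValueError(f"Unknown dataset: {dataset}")

# Repeated calls return identical frames because the seed is fixed.
assert load_data("did").equals(load_data("did"))
```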

Deleted files

  • 8 synthetic CSVs removed (~172KB): did.csv, regression_discontinuity.csv, synthetic_control.csv, its.csv, its_simple.csv, ancova_generated.csv, geolift1.csv, geolift_multi_cell.csv
  • gt_social_media_data.csv removed (was never referenced in code)

API changes (low risk — not imported via __init__.py)

  • generate_seasonality, periodic_kernel, create_series renamed to _generate_seasonality, _periodic_kernel, _create_series (internal helpers)
  • generate_time_series_data deleted (superseded by generate_time_series_data_seasonal and generate_time_series_data_simple)

Tests

  • Added session-scoped fixtures for all 8 synthetic datasets — generated once per test session, avoiding redundant calls
  • Updated 8 test files to use fixtures instead of cp.load_data()
  • Rewrote test_data_loading.py to parametrize over all datasets with reproducibility and unknown-key tests
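The session-scoped fixture approach can be sketched like this (a simplified stand-in; the real fixtures call the causalpy generators and live in conftest.py):

```python
import numpy as np
import pytest

RANDOM_SEED = 42

def generate_did_data(seed=None):
    # Stand-in for a causalpy generator function.
    rng = np.random.default_rng(seed)
    return rng.normal(size=10)

@pytest.fixture(scope="session")
def did_data():
    # Built once per test session; every test requesting it shares the result.
    return generate_did_data(seed=RANDOM_SEED)

def test_did_data_shape(did_data):
    assert did_data.shape == (10,)
```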

Notes for users

Calls to load_data() for synthetic datasets will now return different numeric values than before (expected — this is the fix). Users who pinned to specific values from the old CSVs will need to update their code.

louismagowan and others added 5 commits March 7, 2026 13:04
- Add `seed` parameter to all generator functions that lacked it
- Replace scipy.stats RNG calls (norm().rvs, uniform.rvs, dirichlet().rvs)
  with numpy Generator methods (rng.normal, rng.uniform, rng.dirichlet)
- Replace np.random.* global state calls with local rng instances
- Fix create_series() bug: hardcoded length_scale=2 now uses parameter
- Rename create_series, generate_seasonality, periodic_kernel to private
  (_create_series, _generate_seasonality, _periodic_kernel) since they
  are internal helpers that require an rng instance from their caller
- Fix deprecated pandas freq="M" → freq="ME"
- Remove unused scipy.stats imports (norm, uniform, dirichlet)
- Remove module-level global rng; keep RANDOM_SEED constant for use
  by load_data() and test fixtures
- Standardize generate_staggered_did_data to use same rng pattern

Closes pymc-labs#545

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Split DATASETS dict into SYNTHETIC_DATASETS (generated via seeded
functions) and REAL_WORLD_DATASETS (loaded from CSV). This removes
the dependency on synthetic CSV files while keeping real-world data
as shipped CSVs.

- Replace _get_data_home() with simple _DATA_DIR = Path(__file__).parent
- Remove circular `import causalpy as cp` dependency
- All 8 synthetic datasets now call their generator with RANDOM_SEED

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Eight fixtures (did_data, its_data, its_simple_data, rd_data, sc_data,
anova1_data, geolift1_data, geolift_multi_cell_data) generate data once
per test session using RANDOM_SEED, avoiding redundant calls to
load_data() or generators in individual tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Update 8 test files to use session-scoped fixtures (did_data, its_data,
rd_data, sc_data, anova1_data, geolift1_data) instead of calling
cp.load_data() for synthetic datasets. Real-world dataset loading
(banks, brexit, drinking, risk, nhefs) remains unchanged.

Also rewrite test_data_loading.py to parametrize over all datasets
(synthetic + real-world) and add reproducibility + unknown-key tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove 8 synthetic CSV files (~172K) now generated programmatically:
  did.csv, regression_discontinuity.csv, synthetic_control.csv,
  its.csv, its_simple.csv, ancova_generated.csv, geolift1.csv,
  geolift_multi_cell.csv

Also remove gt_social_media_data.csv which was never referenced
in the DATASETS dict or any code.

Real-world CSVs (12 files, ~3.2MB) remain in the repo.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
read-the-docs-community bot commented Mar 7, 2026


codecov bot commented Mar 7, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 92.97%. Comparing base (5e8db5c) to head (e229425).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #794      +/-   ##
==========================================
+ Coverage   92.42%   92.97%   +0.55%     
==========================================
  Files          52       52              
  Lines        9230     9227       -3     
  Branches      562      561       -1     
==========================================
+ Hits         8531     8579      +48     
+ Misses        527      477      -50     
+ Partials      172      171       -1     


Contributor Author

louismagowan commented Mar 7, 2026

@drbenvincent wdyt?

Also:
Once this is merged, we could consider a follow-up PR to purge the 9 deleted CSV files from git history using git-filter-repo. These files are still baked into every historical commit, which means:

  • Repo clone size stays inflated — the ~172KB of synthetic CSVs is replicated across hundreds of commits in .git/objects
  • Slower clones and fetches on the repo (repo is currently pretty massive to clone, not sure it needs to be)

The trade-off is that git-filter-repo rewrites all commit hashes, so:

  • All open PRs/branches would need rebasing
  • All contributors would need to re-clone or git fetch --all && git reset --hard origin/main
  • Any external links to specific commit SHAs would break

Is this worth doing, or is the size saving too marginal to justify the disruption?

…tion

Superseded by generate_time_series_data_seasonal and
generate_time_series_data_simple. Not imported or called anywhere.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@juanitorduz juanitorduz requested a review from drbenvincent March 8, 2026 10:48
Collaborator

bugbot review


cursor bot commented Mar 11, 2026

Bugbot couldn't run

Bugbot is not enabled for your user on this team.

Ask your team administrator to increase your team's hard limit for Bugbot seats or add you to the allowlist in the Cursor dashboard.

Collaborator

bugbot run


cursor bot commented Mar 11, 2026

Bugbot couldn't run

Bugbot is not enabled for your user on this team.

Ask your team administrator to increase your team's hard limit for Bugbot seats or add you to the allowlist in the Cursor dashboard.

Collaborator

drbenvincent left a comment


Thanks for this contribution, @louismagowan — nice work. The commit structure is clean, the approach is well thought-out, and it's great to see all CI checks green with coverage actually improving (+0.55%).

A few notes:

API breakages to be aware of

This PR removes or renames several functions that were previously accessible (and documented in the API docs):

  • generate_seasonality, periodic_kernel, create_series are renamed to private (_ prefix)
  • generate_time_series_data is deleted entirely

None of these are imported via __init__.py, so they're not top-level public API, but anyone doing from causalpy.data.simulate_data import generate_seasonality (etc.) would break. The risk is low, but we should mention these in the release notes.

load_data() is now dynamic — no caching

Previously load_data("sc") read a small CSV. Now it runs full generation (LOWESS smoothing, GP kernels, etc.) on every call. The test suite handles this well with session-scoped fixtures, but users calling load_data() in a loop will see a performance hit. Not a blocker — just something to be aware of. We could add @functools.lru_cache or similar in a follow-up if it becomes an issue.

Data values will change

This is inherent to the fix (and expected), but worth calling out: load_data("did") and friends now return completely different numeric values than the old CSVs. Any downstream user code that depended on specific values will see different results.

git-filter-repo suggestion

Appreciate the thought here, but I'd recommend against it. The savings (~172KB) are negligible and the disruption is severe — all commit hashes rewritten, all open PRs need rebasing, all contributors need to re-clone, external SHA links break. Please do raise this as a separate issue though so we have it tracked; it just won't be high priority given the trade-offs.

PR description

Could you flesh out the PR body with a summary of what changed and why? For a 20-file refactor like this, a brief bullet list (API changes, deleted files, new test fixtures, bug fixes) makes it much easier for maintainers and users browsing the changelog to understand the scope. The commit messages are well-structured — it's really just about surfacing that same information in the PR description so it's visible at a glance.

Contributor Author

louismagowan commented Apr 5, 2026

@drbenvincent apologies for the delay; work has been really busy these past months.

The only other part of your feedback that still needs addressing, I believe, is the release notes. Would you be able to cover that when you write the next changelog?

Is there anything else I need to do to get this one merged? ☺️

Development

Successfully merging this pull request may close these issues.

Fix reproducibility, refactor simulate_data.py, use functions in tests
