Conversation
|
Cool. Thanks @ksolarski, just a quick reply from my phone... Don't do this for the synthetic control because I have an in progress PR that will change it. It won't have a formula input. But can I just get some clarification... does this change the API? Can we get the exact same functionality? If not, let's think again. Will try to look at the code properly when I can 👍🏻 |
|
I can't find where I saw it in the I'm not 100% sure that this is a problem, and apologies I can't find the relevant part in the docs. But does my concern make sense? |
|
You're right, Patsy has the power of preserving the transformation / encoding of variables through However, Patsy repo suggests migration to https://github.com/matthewwardrop/formulaic instead, which is capable of "reusing the encoding choices made during conversion of one data-set on other datasets." (see https://matthewwardrop.github.io/formulaic/latest/). There's also a migration guide from Patsy to Formulaic to switch would be easy. It also supports many operators: https://matthewwardrop.github.io/formulaic/latest/guides/grammar/ Did you check out this library before? What do you think about using this instead of formulae? |
|
@drbenvincent any strong opinions about using |
|
Sorry for the delayed response @ksolarski. So as far as I understand, Right now there are no use-cases for hierarchical modelling. That might change in the future, though I don't have any specific use cases in mind. So I guess the only choice at the moment is |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #463 +/- ##
==========================================
- Coverage 94.38% 94.35% -0.04%
==========================================
Files 44 44
Lines 7517 7510 -7
Branches 456 456
==========================================
- Hits 7095 7086 -9
- Misses 261 262 +1
- Partials 161 162 +1 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
@drbenvincent Yes, from the docs it seems that no hierarchical models are allowed in the import pandas as pd
from formulaic import model_matrix
import formulaic
# Create a training dataset
train_data = pd.DataFrame(
{
"feature1": ["A", "B", "C", "D"],
"target": [0, 1, 0, 1],
}
)
# Create a test dataset
test_data = pd.DataFrame(
{
"feature1": [
"A", # In training
"D", # In training
"E", # Not in training
],
"target": [0, 1, 0],
}
)
# Generate the model matrix for the training data
train_matrix = model_matrix("target ~ 0 + feature1", train_data)
# Print the training matrix and spec
print("Training Matrix:")
print(train_matrix)
# Use the same spec to transform the test data
test_matrix = model_matrix(spec=train_matrix.model_spec, data=test_data)
# Print the test matrix - see that columns are properly aligned from the training data transformation
print("\nTest Matrix:")
print(test_matrix)Is that the problem you had in mind or something else? |
@ksolarski Yes that's pretty much it. Turns out the phrase I was looking for was "stateful transforms" which you pretty much said here. I'm actually wondering - if we don't get the additional functionality of hierarchical modeling, is there much benefit from moving from patsy to formulaic? Not saying it's not worth it, but we should be clear about the rationale. Is it because patsy is no longer maintained and formulaic might see more features, or does it have a richer formula API? Sorry for the extended conversation on this by the way - but given the formula aspect is a core part of the API it's worth thinking it through :) |
|
Hi @ksolarski and @drbenvincent, sorry for the delay in my response. I can add a few pieces of information.
With that said, unless you need hierarchical models supported via the |
23221c2 to
0734c8b
Compare
|
@drbenvincent I went with |
|
Sorry for being slow on this @ksolarski ! After #641 is merged, we should revisit this. That will see increased modularity, so all the action can happen in the new |
|
This comment was generated with LLM assistance and may have been edited by the commenter. Issue EvaluationSummaryThis PR aims to replace
Current RelevanceThis issue is still relevant and valid. The codebase continues to use
Code references checked:
Related PRs:
Discussion Summary
Recommendation
Proposed next steps:
Migration Scope (for reference)Once #641 is merged, the migration will involve:
The migration pattern (from patsy to formulaic) would be: # Before (patsy)
from patsy import dmatrices, build_design_matrices
y, X = dmatrices(formula, data)
# After (formulaic)
from formulaic import model_matrix
mm = model_matrix(formula, data)
# Then use mm.model_spec for new datacc @ksolarski - Heads up that this is blocked on #641. Once that merges, this PR should be straightforward to complete! |
|
Sounds good, I see it's merged already so I'll continue here |
PR SummaryMedium Risk Overview Updates the PyMC models layer to use Written by Cursor Bugbot for commit 2f7e57e. This will update automatically on new commits. Configure here. |
|
@drbenvincent now I believe this is ready to be looked at. I removed Patsy from all the places except for notebooks. Let me know if you think I should also regenerate those then. |
Documentation build overview
Show files changed (11 files in total): 📝 11 modified | ➕ 0 added | ➖ 0 deleted
|
|
Thanks @ksolarski - I'm hoping to take a few looks at this because this is a major change :) At the moment I'll just flag up a few issues:
|
|
PS. I'm fine if you don't re-run or update notebooks in this PR. But given that it's a big change I think I would want to do that and push changes all in this one PR, so that this is done in one decisive PR. Reason is because I'd like to be able to be able to create a new release at any point without having a PR which breaks things until some future hypothetical PR fixes the notebooks :) PPS. I updated this branch from main by the way, so you'll need to pull changes to your local machine @ksolarski |
| "X": X.data, | ||
| "y": np.zeros( | ||
| (new_no_of_observations, n_treated_units), dtype=np.int32 | ||
| ), |
There was a problem hiding this comment.
Prediction data converted to buffer object
High Severity
_data_setter() now passes X.data into pm.set_data. When predict() is called with a NumPy array (common in several experiments), X.data is a raw buffer, not the feature matrix. This can corrupt X shape/content during posterior prediction and break out-of-sample predictions.




Solving issue #386
Starting with DiD, will continue with other methods if you with general design @drbenvincent
Seems like the key practical difference between
formulaeandpatsyis lack ofbuild_design_matricesmethod informulae. User has to then provide formula again.Edit: After discussion, it was decided that
formulaicsuits us best.📚 Documentation preview 📚: https://causalpy--463.org.readthedocs.build/en/463/