get_model_matrix for MULTISTAGE formula #264
Hi @matthewwardrop, we're refactoring pyfixest's formula parser and are considering moving its instrumental variable functionality to formulaic's multi-stage syntax. I realise that it's still labelled experimental in the documentation, so the first question is whether you would advise in principle against relying on the syntax in production for now?
My second question concerns the current implementation, specifically how the formula syntax is expected to interact with the construction of model matrices. For example, after setting the appropriate FeatureFlags, parsing the multi-stage syntax works as expected. However, naively calling get_model_matrix on this formula throws an error (because the second stage relies on X2_hat, the first-stage estimate, which is not in the data). In #24, you wrote:
> On a multipart formula like this one, calls to get_model_matrix will need to specify the part and stage for which the model matrix should be generated.
Unfortunately, I wasn't able to infer from the documentation or the code whether this is still accurate (I think the quote dates from a few years before the feature was implemented).
I also had a quick look at linearmodels because it already accepts formulaic's multi-stage syntax for IV estimation. However, my understanding after a preliminary glance at its codebase is that the multi-stage formula is split internally into single-stage formulas before being passed to get_model_matrix (see here), so that formulaic's own multi-stage handling is not actually used there.
```python
import formulaic
from formulaic.parser import DefaultFormulaParser
import pyfixest as pf

data = pf.get_data(N=1_000, seed=0, model="Feols")
fml = "Y ~ X1 + [X2 ~ Z1]"

# parsing works as expected
formula = formulaic.Formula(
    fml,
    _parser=DefaultFormulaParser(feature_flags=DefaultFormulaParser.FeatureFlags.ALL),
)

# model matrix construction throws error:
# formulaic.errors.FactorEvaluationError: Unable to evaluate factor `X2_hat`. [NameError: `X2_hat` is not present in the dataset or evaluation context.]
formula(data=data)
```
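For reference, the kind of manual split into single-stage formulas mentioned above could look roughly like the following. This is only a minimal sketch and not linearmodels' actual code: `split_iv_formula` is a hypothetical helper, it assumes exactly one `[endog ~ instruments]` block with plain `+`-separated terms, and the `_hat` suffix simply mirrors the factor name from the error message.

```python
import re


def split_iv_formula(fml):
    """Split 'Y ~ X1 + [X2 ~ Z1]' into single-stage formula strings.

    Hypothetical helper for illustration; handles one bracketed block only.
    """
    match = re.search(r"\[([^\[\]~]+)~([^\[\]~]+)\]", fml)
    if match is None:
        raise ValueError("no '[endog ~ instruments]' block found")

    # Terms inside the brackets: endogenous variables and their instruments
    endog = [t.strip() for t in match.group(1).split("+") if t.strip()]
    instruments = [t.strip() for t in match.group(2).split("+") if t.strip()]

    # Everything outside the brackets: dependent variable and exogenous terms
    outer = fml[: match.start()] + fml[match.end():]
    dependent, _, rhs = outer.partition("~")
    exog = [t.strip() for t in rhs.split("+") if t.strip()]

    # One first-stage formula per endogenous variable
    first_stage = {v: f"{v} ~ " + " + ".join(exog + instruments) for v in endog}
    # The second stage refers to fitted values that the caller must add to the data
    hats = [f"{v}_hat" for v in endog]
    second_stage = f"{dependent.strip()} ~ " + " + ".join(exog + hats)
    return first_stage, second_stage


first_stage, second_stage = split_iv_formula("Y ~ X1 + [X2 ~ Z1]")
print(first_stage)
print(second_stage)
```

Each resulting string is an ordinary single-stage formula, so the second-stage matrix can only be built once an `X2_hat` column has been added to the data from the first-stage fit. That is the bookkeeping I was hoping the multi-stage support might take care of.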
(cc @s3alfisc)