Proposal: support columns representing multiple features #149
Description
First of all, thanks for the amazing package! I am working on extending a GLM package (glum) and the matrix library it uses as a backend (tabmat) with a formula interface, and formulaic is a great fit so far.
The only pain point we have is related to the handling of categorical columns, which can represent multiple features during model estimation*. Ideally, for such a column, we would like to have column names in ModelSpec.structure as if those categoricals were one-hot encoded, even though they are not. We did find a way to make it work (draft PR), but the current solution feels somewhat hacky, and we are overriding long methods just so we can change a couple of lines.
Would it make sense for formulaic to add more support for these kinds of columns? I'm thinking of some kind of interface through which a column can tell formulaic how many features it represents and what their names are, and then _build_model_matrix and _enforce_structure could take those into account when constructing and checking the EncodedTermStructure. I think such a solution would be useful not only for our packages, but potentially for many others, too (e.g. GLMs in H2O.ai and certain scikit-learn estimators have native categorical support). Might a potential solution also be related to issue #46?
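To make the idea more concrete, here is a rough sketch of what such an interface might look like. Everything in it is hypothetical: `MultiFeatureColumn`, `feature_names`, `CategoricalColumn`, `structure_columns`, and the `term[level]` naming scheme are placeholders for illustration, not existing formulaic, glum, or tabmat API.

```python
from dataclasses import dataclass
from typing import Protocol, Sequence, runtime_checkable


@runtime_checkable
class MultiFeatureColumn(Protocol):
    """Hypothetical interface: a single column that stands for several features."""

    def feature_names(self, term: str) -> Sequence[str]:
        """Return the names of the features this column represents."""
        ...


@dataclass
class CategoricalColumn:
    """Toy categorical column stored as integer codes plus level labels."""

    codes: Sequence[int]
    levels: Sequence[str]
    drop_first: bool = True  # assumed reduced-rank coding for the example

    def feature_names(self, term: str) -> Sequence[str]:
        # Report names as if the column had been one-hot encoded.
        levels = self.levels[1:] if self.drop_first else self.levels
        return [f"{term}[{level}]" for level in levels]


def structure_columns(term: str, column: object) -> list[str]:
    """Column names a model spec could record for one encoded term."""
    if isinstance(column, MultiFeatureColumn):
        return list(column.feature_names(term))
    # An ordinary column represents exactly one feature.
    return [term]


cat = CategoricalColumn(codes=[0, 2, 1], levels=["a", "b", "c"])
print(structure_columns("x", cat))         # names for the categorical term
print(structure_columns("y", [1.0, 2.0]))  # a plain numeric column
```

With something along these lines, the structure-building and structure-checking code would only need to ask each column for its names instead of assuming one name per materialized column.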
If you think such a feature would be in-scope for formulaic, I would be more than happy to help implement it.
* glum and tabmat allow matrix operations and model estimation without one-hot-encoding categorical columns. This has performance benefits over having potentially thousands of indicator variables.
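For illustration, the footnote's point can be shown with plain NumPy. This is only a sketch of the general idea, not tabmat's actual implementation; the variable names and data are made up for the example. Multiplying a one-hot matrix by a coefficient vector is the same as indexing the coefficients by the category codes, and the transposed product is a grouped sum.

```python
import numpy as np

# Integer category codes for four rows, with three levels in total.
codes = np.array([0, 2, 1, 2])
n_levels = 3
beta = np.array([0.5, -1.0, 2.0])      # one coefficient per level
v = np.array([1.0, 2.0, 3.0, 4.0])     # one value per row

# Explicit one-hot encoding: a (rows x levels) indicator matrix.
one_hot = np.eye(n_levels)[codes]

# X @ beta: the dense product versus a simple lookup by code.
dense_matvec = one_hot @ beta
lookup_matvec = beta[codes]

# X.T @ v: the dense product versus a sum of v grouped by code.
dense_rmatvec = one_hot.T @ v
grouped_rmatvec = np.bincount(codes, weights=v, minlength=n_levels)

assert np.allclose(dense_matvec, lookup_matvec)
assert np.allclose(dense_rmatvec, grouped_rmatvec)
```

The code-based versions never materialize the indicator matrix, which is where the savings come from when a column has thousands of levels.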