Proposal: support columns representing multiple features #149

@ghost

Description

First of all, thanks for the amazing package! I am working on extending a GLM package (glum) and the matrix library it uses as a backend (tabmat) with a formula interface, and formulaic is a great fit so far.

The only pain point we have is related to the handling of categorical columns, which can represent multiple features during model estimation*. Ideally, for such a column, we would like to have column names in ModelSpec.structure as if those categoricals were one-hot encoded, even though they are not. We did find a way to make it work (draft PR), but the current solution feels somewhat hacky, and we are overriding long methods just so we can change a couple of lines.

Would it make sense for formulaic to add more support for these kinds of columns? I'm thinking of some kind of interface through which a column can tell formulaic how many features it represents and what their names are; _build_model_matrix and _enforce_structure could then take those into account when constructing and checking the EncodedTermStructure. I think such a solution would be useful not only for our packages but potentially for many others, too (e.g. GLMs in H2O.ai and certain scikit-learn estimators also have native categorical support). Also, might a potential solution be related to issue #46?
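To make the idea concrete, here is a minimal sketch of what such an interface could look like. Everything in it is hypothetical: `MultiFeatureColumn`, `feature_names`, `CategoricalColumn`, and `expand_column_names` are illustrative names, not part of formulaic, glum, or tabmat, and the `name[level]` naming scheme is only an assumption about how the expanded names might be rendered.

```python
from dataclasses import dataclass
from typing import Protocol, Sequence, runtime_checkable


@runtime_checkable
class MultiFeatureColumn(Protocol):
    """Hypothetical protocol: a column that stands for several model features."""

    def feature_names(self, prefix: str) -> list[str]:
        ...


@dataclass
class CategoricalColumn:
    """Toy stand-in for a categorical column that is NOT one-hot encoded."""

    codes: Sequence[int]
    categories: Sequence[str]
    drop_first: bool = False

    def feature_names(self, prefix: str) -> list[str]:
        # Report one name per level, as if the column had been one-hot encoded.
        levels = self.categories[1:] if self.drop_first else self.categories
        return [f"{prefix}[{level}]" for level in levels]


def expand_column_names(columns: dict[str, object]) -> list[str]:
    """Build the flat list of feature names a model-spec structure would record."""
    names: list[str] = []
    for name, col in columns.items():
        if isinstance(col, MultiFeatureColumn):
            # The column announces its own features.
            names.extend(col.feature_names(name))
        else:
            # Ordinary columns represent exactly one feature.
            names.append(name)
    return names


columns = {
    "x": [0.5, 1.2, 3.4],
    "color": CategoricalColumn(codes=[0, 2, 1], categories=["red", "green", "blue"]),
}
print(expand_column_names(columns))
# ['x', 'color[red]', 'color[green]', 'color[blue]']
```

With something along these lines, the structure-building and structure-checking steps would only need to ask each column for its names instead of assuming one name per materialized column.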

If you think such a feature would be in-scope for formulaic, I would be more than happy to help implement it.

* glum and tabmat allow matrix operations and model estimation without one-hot-encoding categorical columns. This has performance benefits over having potentially thousands of indicator variables.
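As a rough illustration of why skipping one-hot encoding helps (this is a pure-Python sketch with made-up helper names, not how tabmat actually implements it): multiplying a one-hot encoded matrix by a coefficient vector gives the same result as simply looking up each row's coefficient by its category code.

```python
def one_hot(codes, n_categories):
    """Materialize the indicator matrix: one row per observation, one column per level."""
    return [[1.0 if code == j else 0.0 for j in range(n_categories)] for code in codes]


def matvec_dense(matrix, vector):
    """Plain dense matrix-vector product."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def matvec_categorical(codes, vector):
    """Same product computed from the integer codes alone, with no indicator matrix."""
    return [vector[code] for code in codes]


codes = [0, 2, 1, 2]      # category code for each of four rows
coefs = [0.5, -1.0, 2.0]  # one coefficient per category level

dense_result = matvec_dense(one_hot(codes, 3), coefs)
compact_result = matvec_categorical(codes, coefs)
print(dense_result == compact_result)  # True
```

The dense version does work proportional to rows × levels, while the code-based version touches each row once, which is where the benefit comes from when a categorical has thousands of levels.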

Metadata

Labels

enhancement (New feature or request), question (Further information is requested)
