Proposal: support columns representing multiple features #149

First of all, thanks for the amazing package! I am working on extending a GLM package ([glum](https://github.com/Quantco/glum)) and the matrix library it uses as a backend ([tabmat](https://github.com/Quantco/tabmat)) with a formula interface, and `formulaic` has been a great fit so far.

The only pain point we have is related to the handling of categorical columns, which can represent multiple features during model estimation\*. Ideally, for such a column, we would like to have column names in `ModelSpec.structure` *as if* those categoricals were one-hot encoded, even though they are not. We did find a way to make it work ([draft PR](https://github.com/Quantco/tabmat/pull/267)), but the current solution feels somewhat hacky, and we are [overriding long methods](https://github.com/Quantco/tabmat/blob/formula/src/tabmat/formula.py#L120-L123) just so we can change a [couple of lines](https://github.com/Quantco/tabmat/blob/formula/src/tabmat/formula.py#L180-L185).

Would it make sense for `formulaic` to add more support for these kinds of columns? I'm thinking of some kind of interface through which a column can tell `formulaic` how many features it represents and what their names are, so that `_build_model_matrix` and `_enforce_structure` could take those into account when constructing and checking the `EncodedTermStructure`. I think such a solution would be useful not only for our packages, but potentially for many others, too (e.g. [GLMs in H2O.ai](https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/glm.html#handling-of-categorical-variables) and [certain scikit-learn estimators](https://scikit-learn.org/stable/modules/ensemble.html#categorical-features-support) have native categorical support). Also, might a potential solution be related to issue #46?
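To make the idea a bit more concrete, here is a minimal sketch of what such an interface could look like. Everything in it is hypothetical: `MultiFeatureColumn`, `CategoricalColumn`, `structure_names`, and the `term[level]` naming scheme are made up for illustration and are not part of `formulaic`, `glum`, or `tabmat`. It does not touch `_build_model_matrix` or `_enforce_structure`; it only shows how a column could report the features it stands for.

```python
from dataclasses import dataclass
from typing import List, Protocol, Sequence, runtime_checkable


@runtime_checkable
class MultiFeatureColumn(Protocol):
    """Hypothetical protocol: one stored column that stands for several features."""

    def feature_names(self, prefix: str) -> List[str]:
        ...


@dataclass
class CategoricalColumn:
    """Toy stand-in for a categorical column kept as integer codes."""

    codes: Sequence[int]
    categories: Sequence[str]
    drop_first: bool = False

    def feature_names(self, prefix: str) -> List[str]:
        # One name per level, as if the column had been one-hot encoded.
        # The "prefix[level]" format is an illustrative choice only.
        levels = self.categories[1:] if self.drop_first else self.categories
        return [f"{prefix}[{level}]" for level in levels]


def structure_names(term: str, column: object) -> List[str]:
    """Hypothetical helper: names that a structure entry could list for `term`."""
    if isinstance(column, MultiFeatureColumn):
        return column.feature_names(term)
    # Ordinary columns represent exactly one feature.
    return [term]


cat = CategoricalColumn(codes=[0, 2, 1], categories=["a", "b", "c"])
names_cat = structure_names("x", cat)
names_num = structure_names("y", [1.0, 2.0, 3.0])
```

With something along these lines, the structure-building code would only need to ask each column for its names instead of assuming one name per physical column.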

If you think such a feature would be in-scope for `formulaic`, I would be more than happy to help implement it.

\* `glum` and `tabmat` allow matrix operations and model estimation without one-hot encoding categorical columns. This avoids materializing potentially thousands of indicator variables, which has performance benefits.
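As a toy illustration of why one-hot encoding is not needed (this is plain Python with made-up values, not how `tabmat` is implemented): multiplying a one-hot matrix by a coefficient vector gives the same result as looking up one coefficient per row by its category code.

```python
# Toy data: one categorical column with three levels, stored as integer codes.
codes = [0, 2, 1, 2]
beta = [0.5, -1.0, 2.0]  # one coefficient per level (illustrative values)

# Dense route: build the one-hot matrix explicitly, then multiply by beta.
n_levels = len(beta)
one_hot = [[1.0 if code == j else 0.0 for j in range(n_levels)] for code in codes]
dense_result = [sum(x * b for x, b in zip(row, beta)) for row in one_hot]

# Code-based route: each row selects exactly one coefficient, so index instead.
indexed_result = [beta[code] for code in codes]
```

The dense route stores and multiplies one value per row and level, while the code-based route does a single lookup per row.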
