Handling individual columns that can expand into multiple columns

hi, thanks for the library!

I was wondering how best to handle a particular use-case, where a single column can "expand" into multiple categorical factors. The specific context is analyzing datasets containing information about the effect of genetic mutations on the function of a protein (or other genetic targets). When there's a single starting sequence (sometimes called a "parent" or "wild-type"), it makes sense to represent these mutations as being relative to that parent sequence. So for a single mutation you might have a string like "A11K", which means changing the parent sequence's alanine residue ("A") at site 11 to a lysine ("K"). Then if you have multiple mutations for a single protein, it might look something like "A11K;S28T" for that version of the protein.

So an example, slightly modified from the published data of [this paper](https://www.nature.com/articles/nature17995), would look like this (subset of the full table):

| mutations | activity |
| -- | -- |
| A108T:I150V | 2.63384474 |
| A108T:K129E:S145C:E170G:T184S | 1.31862971 |
| A108T:K156R:F163L:V174A:Q202L:K207E | 1.30103208 |
| A108T:Q202R:M216L:V222A | 1.30103059 |


The most typical way to use this information in machine learning is a one-hot encoding for each possible mutation. But I'd like to be able to leverage `formulaic` to:
* have more control over these encodings beyond a simple one-hot encoding
* more easily integrate mutational encoding in larger formulae with other terms
* take advantage of `model_spec` reuse on new datasets to have a consistent encoding
* generate higher-order interactions between different factors, as this is often a goal of statistical analyses of these data

So I'm trying to figure out how best to fit this into a pipeline relying on `formulaic`. There are essentially two steps that have to be done in sequence:
1. Expand the condensed representation of individual mutations into a per-position representation, essentially a different categorical column for each possible mutation position
2. Build a formula that categorically encodes each individual site. 
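
To make step 1 concrete, here is a minimal sketch of the expansion in plain Python. The names `parse_variant` and `expand_variants`, the `site<N>` column naming, and the choice to fill unmutated sites with the parent residue are all assumptions for illustration, not part of `formulaic`.

```python
import re

# One mutation looks like "A108T": parent residue, site number, mutant residue.
MUTATION_RE = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def parse_variant(variant, sep=":"):
    """Split a variant such as "A108T:I150V" into (parent, site, mutant) tuples."""
    parsed = []
    for token in variant.split(sep):
        token = token.strip()  # tolerate stray whitespace around a mutation
        if not token:
            continue
        match = MUTATION_RE.match(token)
        if match is None:
            raise ValueError(f"unrecognised mutation: {token!r}")
        parent, site, mutant = match.groups()
        parsed.append((parent, int(site), mutant))
    return parsed


def expand_variants(variants, sep=":"):
    """Return {"site<N>": [residue per variant]} for every site seen in the data."""
    parsed = [parse_variant(variant, sep) for variant in variants]

    # The parent residue at each site is read off the mutation strings themselves.
    parent_at = {}
    for mutations in parsed:
        for parent, site, _ in mutations:
            parent_at.setdefault(site, parent)

    # Start every row at the parent residue, then overwrite the mutated sites.
    columns = {
        f"site{site}": [parent] * len(variants)
        for site, parent in sorted(parent_at.items())
    }
    for row, mutations in enumerate(parsed):
        for _, site, mutant in mutations:
            columns[f"site{site}"][row] = mutant
    return columns
```

The returned dictionary can be passed straight to `pandas.DataFrame(...)`, giving one categorical column per mutated site.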

I could already do this "by hand" by transforming the input data for step 1 and then feeding this new matrix into a `model_matrix` call. E.g. something like
```python
model_matrix("C(site108) + C(site129) + ... ", transformed(data))
```
but the challenge here is already somewhat apparent: a large number of categorical factors (one for each site) will be generated, and it's not uncommon to have hundreds or thousands of sites mutated in a given dataset. So the bookkeeping on these sites, and generating a formula that represents each one, can get unruly. I had two potential solutions in mind, and wanted to know which makes the most sense.

## Option 1 (preferred): Make it possible to implement a stateful transform that expands into multiple factors

My ideal API here would be to define a stateful transform that looks something like this:
```python
model_matrix("M(mutations)", data)
```
where `M` would both do the per-site encoding as well as generate a set of individual categorical factors. I looked into the internals of how `C` works and it doesn't seem like stateful transforms can operate in this way? Or more generally is it not possible for an individual formula factor to expand into multiple factors?

I'd prefer this option because it would simplify usage in more complicated analyses where other terms might be included in the formula. It could also make reuse of `model_spec` on a new dataset more convenient (for example when a new dataset is missing mutations at a particular site).

## Option 2: Build a formula programmatically for a transformed version of the dataset

I'm still digging into how you can accomplish this, but if I understand correctly `formulaic` already supports programmatic generation of formulas, so I could do the site expansion upstream of processing with `formulaic` and then generate the categorical terms for each individual site. This would be a reasonable solution, but I just wanted to double check something like option 1 couldn't work as I think it would simplify use for end-users.
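
For reference, a rough sketch of what option 2 might look like, assuming the per-site columns already exist under names like `site108`. The helpers `site_formula` and `encode_sites` are hypothetical; the only `formulaic` call used is `model_matrix(formula, data)`, and the formula is built by plain string joining rather than any dedicated formula-construction API.

```python
def site_formula(site_columns, extra_terms=()):
    """Join one C(...) term per site column, followed by any other formula terms."""
    terms = [f"C({name})" for name in site_columns]
    terms.extend(extra_terms)
    return " + ".join(terms)


def encode_sites(columns, extra_terms=()):
    """Build a model matrix from per-site columns (needs pandas and formulaic installed)."""
    import pandas as pd
    from formulaic import model_matrix

    data = pd.DataFrame(columns)
    # Assumption: site columns are exactly those whose name starts with "site".
    site_names = [name for name in data.columns if name.startswith("site")]
    return model_matrix(site_formula(site_names, extra_terms), data)
```

If I understand the API correctly, the `model_spec` attribute of the returned matrix can then be reused on a new dataset via `model_spec.get_model_matrix(new_data)`, although a site column missing from the new data would presumably have to be filled in with the parent residue first.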
