Handling individual columns that can expand into multiple columns #163
Description
hi, thanks for the library!
I was wondering how best to handle a particular use case, where a single column can "expand" into multiple categorical factors. The specific context is analyzing datasets containing information about the effect of genetic mutations on the function of a protein (or other genetic targets). When there's a single starting sequence (sometimes called a "parent" or "wild-type"), it makes sense to represent these mutations relative to that parent sequence. So for a single mutation you might have a string like "A11K", which means changing the parent sequence's alanine residue ("A") at site 11 to a lysine ("K"). Then if you have multiple mutations for a single protein, it might look something like "A11K;S28T" for that version of the protein.
So an example, slightly modified from the published data of this paper, would look like this (subset of the full table):
| mutations | activity |
|---|---|
| A108T:I150V | 2.63384474 |
| A108T:K129E:S145C:E170G:T184S | 1.31862971 |
| A108T:K156R:F163L:V174A:Q202L:K207E | 1.30103208 |
| A108T:Q202R:M216L:V222A | 1.30103059 |
The most typical way to use this information in machine learning is a one-hot encoding for each possible mutation. But I'd like to be able to leverage formulaic to:
- have more control on these encodings beyond a simple one-hot encoding
- more easily integrate mutational encoding in larger formulae with other terms
- take advantage of `model_spec` reuse on new datasets to have a consistent encoding
- leverage the ability of `formulaic` to generate higher-order interactions between different factors, as this is often a goal of statistical analyses of these data
So I'm trying to figure out how best to fit this into a pipeline relying on formulaic. There are essentially two steps that have to be done in sequence:
- Expand the condensed representation of individual mutations into a per-position representation, essentially a different categorical column for each possible mutation position
- Build a formula that categorically encodes each individual site.
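To make step 1 concrete, here is a minimal sketch in plain Python (it is not part of formulaic). The helper names `parse_mutation`, `split_mutations` and `expand_sites`, the `site<N>` column naming, and the choice to fill unmutated sites with the parent residue are all assumptions for illustration. It accepts either `:` or `;` as the separator between mutations.

```python
import re

# Hypothetical pattern: one-letter parent residue, site number, one-letter new residue.
MUTATION_RE = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def parse_mutation(token):
    """Turn a string such as "A108T" into ("A", 108, "T")."""
    match = MUTATION_RE.match(token.strip())
    if match is None:
        raise ValueError(f"unrecognised mutation: {token!r}")
    parent, site, mutant = match.groups()
    return parent, int(site), mutant


def split_mutations(variant):
    """Split a condensed variant string into parsed mutations."""
    return [parse_mutation(t) for t in re.split(r"[:;]", variant) if t.strip()]


def expand_sites(variants):
    """Return one dict per variant, with one categorical value per mutated site."""
    parsed = [split_mutations(v) for v in variants]
    # Record the parent residue for every site seen anywhere in the data.
    parents = {}
    for mutations in parsed:
        for parent, site, _ in mutations:
            parents.setdefault(site, parent)
    rows = []
    for mutations in parsed:
        # Assumption: a site that is not mutated in this variant keeps the parent residue.
        row = {f"site{site}": parents[site] for site in sorted(parents)}
        for _, site, mutant in mutations:
            row[f"site{site}"] = mutant
        rows.append(row)
    return rows


variants = ["A108T:I150V", "A108T:K129E:S145C:E170G:T184S"]
rows = expand_sites(variants)
```

Each dict in `rows` could then become one row of the transformed dataset, with one categorical column per site.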
I could already do this "by hand" by transforming the input data for step 1 and then feeding this new matrix into a `model_matrix` call, e.g. something like
`model_matrix("C(site108) + C(site129) + ... ", transformed(data))`
but the challenge here is already somewhat apparent: a large number of categorical factors (one for each site) will be generated (it's not uncommon to have hundreds or thousands of sites mutated in a given dataset). So the bookkeeping on these sites, and generating a formula term for each one, can become unwieldy. There are two potential solutions I had in mind, and I wanted to know which makes the most sense.
Option 1 (preferred): Make it possible to implement a stateful transform that expands into multiple factors
My ideal API here would be to define a stateful transform that looks something like this:
`model_matrix("M(mutations)", data)`
where `M` would both do the per-site encoding and generate a set of individual categorical factors. I looked into the internals of how `C` works, and it doesn't seem like stateful transforms can operate in this way? Or, more generally, is it not possible for an individual formula factor to expand into multiple factors?
I'd prefer this option because it would simplify usage in more complicated analyses where other terms might be included in the formula. It could also make reuse of `model_spec` on a new dataset more convenient (for example, when a new dataset is missing mutations at a particular site).
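To illustrate the reuse concern, this is roughly the bookkeeping that would otherwise have to be done by hand so that a new dataset ends up with the same site columns as the original one. It is a sketch in plain Python, not formulaic behaviour: `align_to_sites` and `training_parents` are hypothetical names, and filling absent sites with the parent residue and rejecting unseen sites are assumptions.

```python
def align_to_sites(variant_mutations, parent_by_site):
    """Build one row with exactly the site columns known from the original data.

    variant_mutations: list of (site, new_residue) pairs for a single variant.
    parent_by_site: mapping of site number to parent residue, remembered
        from the dataset the encoding was first built on.
    """
    # Start from the parent residue at every known site, in site order.
    row = {f"site{site}": parent for site, parent in sorted(parent_by_site.items())}
    for site, mutant in variant_mutations:
        key = f"site{site}"
        if key not in row:
            # Assumption: a site never seen before is treated as an error.
            raise KeyError(f"site {site} was not present in the original data")
        row[key] = mutant
    return row


# Sites remembered from the original dataset (values taken from the table above).
training_parents = {108: "A", 150: "I", 202: "Q"}

# A new variant that only carries a mutation at site 108.
new_row = align_to_sites([(108, "T")], training_parents)
```

Here `new_row` still has columns for sites 150 and 202, filled with the parent residue, which is the kind of state I'd hope a transform like `M` could remember for me.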
Option 2: Build a formula programmatically for a transformed version of the dataset
I'm still digging into how you can accomplish this, but if I understand correctly, formulaic already supports programmatic generation of formulas, so I could do the site expansion upstream of formulaic and then generate the categorical terms for each individual site. This would be a reasonable solution, but I just wanted to double-check that something like option 1 couldn't work, as I think it would simplify use for end users.
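For Option 2, the simplest version I can think of is plain string building, sketched below. `build_site_formula` is a hypothetical helper, the `site` column prefix is an assumed naming convention, and `temperature` is a made-up extra column; nothing here relies on formulaic's own API.

```python
def build_site_formula(columns, extra_terms=()):
    """Join one C(...) term per site column into a single formula string."""
    # Assumption: expanded site columns are the ones named "site<N>".
    site_terms = [f"C({name})" for name in columns if name.startswith("site")]
    return " + ".join([*extra_terms, *site_terms])


# Column names of an already expanded dataset (illustrative).
columns = ["site108", "site129", "site150", "activity"]

formula = build_site_formula(columns)

# Other terms can be placed in front of the generated site terms.
formula_with_extra = build_site_formula(columns, extra_terms=["temperature"])
```

The resulting string would then be passed to `model_matrix` together with the transformed data, in the same way as the hand-written `C(site108) + C(site129) + ...` example above.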