Handling individual columns that can expand into multiple columns #163

@ptonner

hi, thanks for the library!

I was wondering how best to handle a particular use-case, where a single column can "expand" into multiple categorical factors. The specific context is when analyzing datasets containing information about the effect of genetic mutations on the function of a protein (or other genetic targets). When there's a single starting sequence (sometimes called a "parent" or "wild-type"), it makes sense to represent these mutations as being relative to that parent sequence. So for a single mutation you might have a string like "A11K", which means changing the parent sequence's alanine residue ("A") at site 11 to a lysine ("K"). Then if you have multiple mutations for a single protein, it might look something like "A11K;S28T" for that version of the protein.
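To make the notation concrete, here is a minimal sketch of a parser for these condensed strings. The names `Mutation` and `parse_mutations` are hypothetical, not part of formulaic, and the sketch assumes single-letter residue codes and accepts both ";" and ":" as separators, since both appear in this issue.

```python
import re
from typing import NamedTuple


class Mutation(NamedTuple):
    parent: str  # residue in the parent (wild-type) sequence
    site: int    # position in the parent sequence
    mutant: str  # residue after the mutation


# One uppercase letter, an integer position, one uppercase letter, e.g. "A11K".
_MUTATION_RE = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def parse_mutations(text: str) -> list[Mutation]:
    """Split a condensed string such as "A11K;S28T" into Mutation records."""
    mutations = []
    for token in re.split(r"[;:]", text):
        token = token.strip()  # tolerate stray whitespace around a token
        if not token:
            continue  # an empty string is treated as "no mutations"
        match = _MUTATION_RE.match(token)
        if match is None:
            raise ValueError(f"unrecognised mutation: {token!r}")
        parent, site, mutant = match.groups()
        mutations.append(Mutation(parent, int(site), mutant))
    return mutations


print(parse_mutations("A11K;S28T"))
```

Anything outside this pattern (stop codons, insertions, deletions) raises a `ValueError` here; a real dataset may need a wider pattern.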

So an example, slightly modified from the published data of this paper, would look like this (subset of the full table):

mutations activity
A108T:I150V 2.63384474
A108T:K129E:S145C:E170G:T184S 1.31862971
A108T:K156R:F163L:V174A:Q202L:K207E 1.30103208
A108T:Q202R:M216L:V222A 1.30103059

The most typical way to use this information in machine learning is a one-hot encoding for each possible mutation. But I'd like to be able to leverage formulaic to:

  • have more control over these encodings beyond a simple one-hot encoding
  • more easily integrate mutational encoding in larger formulae with other terms
  • take advantage of model_spec reuse on new datasets to have a consistent encoding
  • leverage the ability of formulaic to generate higher-order interactions between different factors, as this is often a goal of statistical analyses of these data

So I'm trying to figure out how best to fit this into a pipeline relying on formulaic. There are essentially two steps that have to be done in sequence:

  1. Expand the condensed representation of individual mutations into a per-position representation, essentially a different categorical column for each possible mutation position
  2. Build a formula that categorically encodes each individual site.
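Step 1 could be sketched roughly as follows, using only the standard library. The function name `expand_sites` and the `site<N>` column naming are assumptions made for illustration; rows that carry no mutation at a site are filled with the parent residue seen for that site elsewhere in the data.

```python
import re


def expand_sites(mutation_strings):
    """Return one dict per row mapping "site<N>" to the residue at that site."""
    parsed_rows = []
    parent_residue = {}  # site -> parent residue, collected across all rows
    for text in mutation_strings:
        row = {}
        for token in re.split(r"[;:]", text):
            token = token.strip()
            if not token:
                continue
            # "A108T" -> parent "A", site 108, mutant "T"
            parent, site, mutant = token[0], int(token[1:-1]), token[-1]
            parent_residue.setdefault(site, parent)
            row[site] = mutant
        parsed_rows.append(row)
    sites = sorted(parent_residue)
    # Sites not mutated in a given row fall back to the parent residue.
    return [
        {f"site{site}": row.get(site, parent_residue[site]) for site in sites}
        for row in parsed_rows
    ]


rows = expand_sites(["A108T:I150V", "A108T:K129E"])
print(rows)
```

A list of dicts like this can be turned into a table (for example with `pandas.DataFrame(rows)`) before it is handed to formulaic. Only sites that are mutated somewhere in the input get a column, which is where the consistency problem on new datasets comes from.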

I could already do this "by hand" by transforming the input data for step 1 and then feeding this new matrix into a model_matrix call. E.g. something like

model_matrix("C(site108) + C(site129) + ... ", transformed(data))

but the challenge here is somewhat already apparent: there's a large number of categorical factors (one for each site) that will be generated (it's not uncommon to have hundreds or thousands of sites mutated in a given dataset). So the bookkeeping on these sites, and generating a formula that represents each one, can be somewhat unruly. There are two potential solutions I had in mind, and I wanted to know which makes the most sense.

Option 1 (preferred): Make it possible to implement a stateful transform that expands into multiple factors

My ideal API here would be to define a stateful transform that looks something like this:

model_matrix("M(mutations)", data)

where M would both do the per-site encoding as well as generate a set of individual categorical factors. I looked into the internals of how C works and it doesn't seem like stateful transforms can operate in this way? Or more generally is it not possible for an individual formula factor to expand into multiple factors?

I'd prefer this option because it would simplify usage in more complicated analyses where other terms might be included in the formula. It also could potentially make re-use of model_spec on a new dataset more convenient (for example when a new dataset is missing mutations at a particular site).

Option 2: Build a formula programmatically for a transformed version of the dataset

I'm still digging into how you can accomplish this, but if I understand correctly formulaic already supports programmatic generation of formulas, so I could do the site expansion upstream of processing with formulaic and then generate the categorical terms for each individual site. This would be a reasonable solution, but I just wanted to double check something like option 1 couldn't work as I think it would simplify use for end-users.
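As a rough sketch of what Option 2 might look like, the formula string can be assembled from the per-site column names with plain string operations. `build_site_formula` and its `interaction_order` parameter are hypothetical names; the only formula syntax relied on is `C(...)` for categorical encoding, `+` to add terms, and `:` for interactions.

```python
from itertools import combinations


def build_site_formula(site_columns, interaction_order=1):
    """Build a formula string with one categorical term per site column.

    With interaction_order >= 2, interaction terms up to that order are
    added by joining the categorical terms with ":".
    """
    terms = [f"C({name})" for name in site_columns]
    formula_terms = list(terms)
    for order in range(2, interaction_order + 1):
        for combo in combinations(terms, order):
            formula_terms.append(":".join(combo))
    return " + ".join(formula_terms)


main_effects = build_site_formula(["site108", "site129", "site150"])
print(main_effects)
# C(site108) + C(site129) + C(site150)

pairwise = build_site_formula(["site108", "site129"], interaction_order=2)
print(pairwise)
# C(site108) + C(site129) + C(site108):C(site129)
```

The resulting string would then be passed to model_matrix together with the expanded table, as in the "by hand" call above. The number of interaction terms grows combinatorially with the number of sites, so with hundreds of sites anything beyond main effects would likely need to be restricted to a chosen subset.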

Labels: enhancement