ModelSpec.variables_by_source['data'] fails to capture variables wrapped in the quote operator Q() #266

@136s

Description

I have encountered an inconsistency in how ModelSpec.variables_by_source identifies data columns. When a variable is wrapped in the quote operator Q()—which is necessary to handle Python reserved words (e.g., class) or columns with special characters—it is no longer categorized under 'data' in the variables_by_source dictionary.

Minimal Reproducible Example

import pandas as pd
from formulaic_contrasts import FormulaicContrasts

# Sample data
data = pd.DataFrame({"condition": ["A", "A", "B", "B"], "value": [10, 12, 15, 18]})

# Case 1: Standard variable (works as expected)
# Using 'condition' as a proxy for a non-reserved name
FormulaicContrasts(data, "C(condition, contr.treatment('B'))").design_matrix.model_spec.variables_by_source
# Output: {'transforms': {'C', 'contr.treatment'}, 'data': {'condition'}}

# Case 2: Using Q()
# This is the recommended way to handle such columns, but 'condition' disappears from 'data'
FormulaicContrasts(data, "C(Q('condition'), contr.treatment('B'))").design_matrix.model_spec.variables_by_source
# Output: {'transforms': {'C', 'Q', 'contr.treatment'}}

Actual Behavior

When using Q('condition'), the 'data' key is missing from the dictionary (or doesn't contain 'condition'), and 'Q' is added to 'transforms'.

Expected Behavior

The variable inside Q() should still be identified as a data source: {'transforms': {'C', 'Q', 'contr.treatment'}, 'data': {'condition'}}

Context: Why this matters

I cannot simply avoid using Q(). If we use a reserved word like class without Q(), formulaic (rightfully) raises a SyntaxError because it's interpreted as a Python keyword.
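To illustrate why the bare name cannot be used, here is a minimal standard-library sketch. It assumes that a term is parsed as a Python expression, which is what the SyntaxError suggests; `parses_as_python` is a hypothetical helper for this illustration and not part of formulaic.

```python
import ast
import keyword


def parses_as_python(expression):
    """Return True if the text is a syntactically valid Python expression."""
    try:
        ast.parse(expression, mode="eval")
        return True
    except SyntaxError:
        return False


# 'class' is a reserved word, so it cannot appear as a bare name in an expression.
print(keyword.iskeyword("class"))         # True
print(parses_as_python("C(class)"))       # False
# Wrapped in Q(), the column name is an ordinary string literal and parses fine.
print(parses_as_python("C(Q('class'))"))  # True
```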

However, because the resulting ModelSpec does not list the column in variables_by_source['data'], downstream tools like PyDESeq2 fail when they try to verify the existence of the factor during contrast analysis:

# Example of downstream failure in PyDESeq2
from pydeseq2.dds import DeseqDataSet
from pydeseq2.ds import DeseqStats
from pydeseq2.utils import load_example_data

dds = DeseqDataSet(
    counts=load_example_data(modality="raw_counts", dataset="synthetic"),
    metadata=load_example_data(modality="metadata", dataset="synthetic").rename(columns={"condition": "class"}),
    design="C(Q('class'), contr.treatment('B'))",
)
dds.deseq2()
DeseqStats(dds, contrast=["class", "A", "B"])
# This fails later with the following traceback:

File ~/.cache/pypoetry/virtualenvs/pyrite-env-Lnz-QZIB-py3.11/lib/python3.11/site-packages/formulaic_contrasts/_contrasts.py:30, in FormulaicContrasts.variables(self)
     27 @property
     28 def variables(self):
     29     """Get the names of the variables used in the model definition."""
---> 30     return self.design_matrix.model_spec.variables_by_source["data"]

KeyError: 'data'
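Until this is addressed, a downstream tool could guard against the missing key. The following is only a sketch of one possible workaround, not a fix in formulaic itself: `data_variables` is a hypothetical helper, and the regular expression is a simplification that assumes Q() is called with a single plain string literal without escaped quotes.

```python
import re

# Matches Q('name') or Q("name"); escaped quotes inside the name are not handled.
_Q_CALL = re.compile(r"""\bQ\(\s*(['"])(.+?)\1\s*\)""")


def data_variables(variables_by_source, formula):
    """Return data column names, including names wrapped in Q() (hypothetical helper)."""
    # Tolerate a missing 'data' key instead of raising KeyError.
    found = set(variables_by_source.get("data", set()))
    # Add every column name that appears as a quoted argument to Q().
    found.update(match.group(2) for match in _Q_CALL.finditer(formula))
    return found


# Dictionary as reported above for the Q() case
reported = {"transforms": {"C", "Q", "contr.treatment"}}
print(data_variables(reported, "C(Q('class'), contr.treatment('B'))"))  # {'class'}
```

This only papers over the symptom on the caller's side; the expected behavior described above would make it unnecessary.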

Environment info

Formulaic version: 1.2.1
Python version: 3.11.13

Metadata

Labels

bug: Something isn't working
