Description
I have encountered an inconsistency in how ModelSpec.variables_by_source identifies data columns. When a variable is wrapped in the quote operator Q(), which is necessary to handle Python reserved words (e.g., class) or columns with special characters, it is no longer categorized under 'data' in the variables_by_source dictionary.
Minimal Reproducible Example
import pandas as pd
from formulaic_contrasts import FormulaicContrasts
# Sample data
data = pd.DataFrame({"condition": ["A", "A", "B", "B"], "value": [10, 12, 15, 18]})
# Case 1: Standard variable (works as expected)
# Using 'condition' as a proxy for a non-reserved name
FormulaicContrasts(data, "C(condition, contr.treatment('B'))").design_matrix.model_spec.variables_by_source
# Output: {'transforms': {'C', 'contr.treatment'}, 'data': {'condition'}}
# Case 2: Using Q()
# Q() is the recommended way to handle reserved or unusual column names, but 'condition' disappears from 'data'
FormulaicContrasts(data, "C(Q('condition'), contr.treatment('B'))").design_matrix.model_spec.variables_by_source
# Output: {'transforms': {'C', 'Q', 'contr.treatment'}}
Actual Behavior
When using Q('condition'), the 'data' key is missing from the dictionary (or doesn't contain 'condition'), and 'Q' is added to 'transforms'.
Expected Behavior
The variable inside Q() should still be identified as a data source: {'transforms': {'C', 'Q', 'contr.treatment'}, 'data': {'condition'}}
Context: Why this matters
I cannot simply avoid using Q(). If we use a reserved word like class without Q(), formulaic (rightfully) raises a SyntaxError because it's interpreted as a Python keyword.
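To make the point concrete, here is a rough check for which column names cannot appear bare in a formula. It only tests whether a name is a valid, non-keyword Python identifier; formulaic's actual parsing rules may differ, and needs_quoting is a hypothetical helper of mine, not part of any library.

```python
import keyword


def needs_quoting(column: str) -> bool:
    """True if the name is a Python keyword or not a valid identifier.

    Hypothetical helper: an approximation of "must be wrapped in Q()",
    not formulaic's own rule.
    """
    return keyword.iskeyword(column) or not column.isidentifier()


for name in ["condition", "class", "my col"]:
    print(f"{name!r}: needs quoting = {needs_quoting(name)}")
```

Under this check, class and names containing spaces need quoting, while condition does not.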
However, because the resulting ModelSpec does not list the column in variables_by_source['data'], downstream tools like PyDESeq2 fail when they try to verify the existence of the factor during contrast analysis:
# Example of downstream failure in PyDESeq2
from pydeseq2.dds import DeseqDataSet
from pydeseq2.ds import DeseqStats
from pydeseq2.utils import load_example_data
dds = DeseqDataSet(
counts=load_example_data(modality="raw_counts", dataset="synthetic"),
metadata=load_example_data(modality="metadata", dataset="synthetic").rename(columns={"condition": "class"}),
design="C(Q('class'), contr.treatment('B'))",
)
dds.deseq2()
DeseqStats(dds, contrast=["class", "A", "B"])
# This will fail later
File ~/.cache/pypoetry/virtualenvs/pyrite-env-Lnz-QZIB-py3.11/lib/python3.11/site-packages/formulaic_contrasts/_contrasts.py:30, in FormulaicContrasts.variables(self)
27 @property
28 def variables(self):
29 """Get the names of the variables used in the model definition."""
---> 30 return self.design_matrix.model_spec.variables_by_source["data"]
KeyError: 'data'
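As a stopgap on my side, a sketch along these lines could recover the quoted column names from the formula string and merge them with whatever variables_by_source reports. This is only a workaround sketch, not a proposed fix: quoted_columns and data_variables are hypothetical helpers, the regex assumes Q() is called with a plain string literal (no escaped quotes), and the two dictionaries are simply the outputs reported above.

```python
import re

# Matches Q('name') or Q("name"); the closing quote must match the opening one.
# Assumption: names contain no escaped quotes.
_Q_CALL = re.compile(r"""\bQ\(\s*(['"])(.+?)\1\s*\)""")


def quoted_columns(formula: str) -> set[str]:
    """Return the column names wrapped in Q(...) within a formula string."""
    return {match.group(2) for match in _Q_CALL.finditer(formula)}


def data_variables(variables_by_source: dict, formula: str) -> set[str]:
    """Union of the reported 'data' variables and any Q()-quoted names.

    Hypothetical helper; uses .get() so a missing 'data' key does not raise.
    """
    reported = set(variables_by_source.get("data", set()))
    return reported | quoted_columns(formula)


# The two dictionaries reported in the reproducible example.
plain = {"transforms": {"C", "contr.treatment"}, "data": {"condition"}}
quoted = {"transforms": {"C", "Q", "contr.treatment"}}

print(data_variables(plain, "C(condition, contr.treatment('B'))"))
print(data_variables(quoted, "C(Q('condition'), contr.treatment('B'))"))
```

Both calls yield a set containing only 'condition', which is what I would expect variables_by_source['data'] to contain in the Q() case.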
Environment info
Formulaic version: 1.2.1
Python version: 3.11.13