diff --git a/Claud_next_steps.rtf b/Claud_next_steps.rtf new file mode 100644 index 0000000..beb6944 --- /dev/null +++ b/Claud_next_steps.rtf @@ -0,0 +1,38 @@ +{\rtf1\ansi\ansicpg1252\cocoartf2822 +\cocoatextscaling0\cocoaplatform0{\fonttbl\f0\fswiss\fcharset0 Helvetica;\f1\fmodern\fcharset0 Courier;} +{\colortbl;\red255\green255\blue255;\red38\green38\blue38;\red255\green255\blue255;} +{\*\expandedcolortbl;;\cssrgb\c20000\c20000\c20000;\cssrgb\c100000\c100000\c100000;} +\margl1440\margr1440\vieww13240\viewh11160\viewkind0 +\pard\tx720\tx1440\tx2160\tx2880\tx3600\tx4320\tx5040\tx5760\tx6480\tx7200\tx7920\tx8640\pardirnatural\partightenfactor0 + +\f0\fs24 \cf0 # Notes for next session\ +Check your CLAUDE_NOTES and familiarize yourself with geneLME_dev.R and geneLME_dev2.R\ +\ +\ +Pick up with: verifying and validating the dev2 version. I have tested this and still encountered errors stemming from functions not explicitly called from required packages (eg. Line 849: plan(multisession, workers=workers). I corrected this where I could find it, but please review and make sure my specifications are valid and correct all such instances. If needed, should we add explicit package calling to anyplace that non base functions are called?\ +\ +I\'92d also like you to please edit the functions to silence internal warnings such as \'93 +\f1\fs26 \cf2 \cb3 \expnd0\expndtw0\kerning0 +\outl0\strokewidth0 \strokec2 ## Warning: package \'91lme4\'92 was built under R version 4.5.2\'94 +\f0\fs24 \cf0 \cb1 \kerning1\expnd0\expndtw0 \outl0\strokewidth0 or other messages about package versioning. \ +\ +I\'92ve also encountered the warnings \ +\'93 +\f1\fs26 \cf2 \cb3 \expnd0\expndtw0\kerning0 +\outl0\strokewidth0 \strokec2 Warning: Some predictor variables are on very different scales: consider rescaling. \ +\pard\pardeftab720\partightenfactor0 +\cf2 \cb3 \strokec2 ## You may also use (g)lmerControl(autoscale = TRUE) to improve numerical stability.\ +\pard\tx720\tx1440\tx2160\tx2880\tx3600\tx4320\tx5040\tx5760\tx6480\tx7200\tx7920\tx8640\pardirnatural\partightenfactor0 + +\f0\fs24 \cf0 \cb1 \kerning1\expnd0\expndtw0 \outl0\strokewidth0 \'93\ +\ +This may be good rational for setting \'93 +\f1\fs26 \cf2 \cb3 \expnd0\expndtw0\kerning0 +lmerControl(autoscale = TRUE)\'94 +\f0\fs24 \cf0 \cb1 \kerning1\expnd0\expndtw0 internally, though I had thought this was the default behavior in lme4 now. If indeed it is, then we should certainly silence that. \ +\ +Once the dev2 version has been more thoroughly debugged, it can be merged into the main geneLME.R script.\ +\ +\ +\ +} \ No newline at end of file diff --git a/R_functions/CLAUDE_NOTES_geneLME.md b/R_functions/CLAUDE_NOTES_geneLME.md new file mode 100644 index 0000000..fa73f11 --- /dev/null +++ b/R_functions/CLAUDE_NOTES_geneLME.md @@ -0,0 +1,569 @@ +# geneLME Project — Claude Session Notes + +> **How to use this file:** At the start of each session, ask Claude to read this file. +> Claude should update the "Current Status" and "Session Log" sections at the end of any +> session where meaningful progress is made. + +--- + +## Project Overview + +**Goal:** A scalable, parallelized per-gene linear mixed effects (LME) modelling function for +RNA-seq expression data stored in a `limma` EList (`dat`) object. Key features: +- Fits one `lmer` model per gene in parallel via `future_lapply` +- Extracts ANOVA table, coefficient estimates, and emmeans-based contrasts +- Supports weighted models (voom weights from `dat$weights`) +- Supports flexible contrast specification including interaction terms + +**Active worktree:** `.claude/worktrees/reverent-satoshi/` +**Key files:** +- `R_functions/geneLME.R` — current stable (merged from dev2 2026-02-20; fully featured) +- `R_functions/geneLME_dev.R` — former dev version (superseded; kept for reference) +- `R_functions/geneLME_dev2.R` — former dev2 (now merged into geneLME.R; kept for reference) +- `R_functions/geneLME_test.R` — mock data and full test suite (SOURCE_FILE variable selects version) +- `R_functions/geneLME_benchmark.Rmd` — v1 benchmark (geneLME vs kimma) +- `R_functions/geneLME_benchmark2.Rmd` — v2 benchmark (sign consistency + speed vs kimma; dev2 correctness section removed post-merge) +- `R_functions/CLAUDE_NOTES_geneLME.md` — this file + +--- + +## Architecture Summary + +The code is split into five functions: + +| Function | Role | +|---|---| +| `geneLME_contrast_spec()` | Helper: generates a contrast level reference template. Two modes (see below). | +| `geneLME_build_contrast_args()` | Branch A helper: pre-computes `contrasts_list` + `spec_lookup` once before parallel dispatch. | +| `geneLME_fit()` | Per-gene worker: fits lmer, extracts ANOVA + contrasts. Runs inside `future_lapply`. | +| `geneLME_dispatch()` | Parallel launcher: calls `future_lapply` with explicit `future.globals` to avoid environment scanning overhead. | +| `geneLME_compiler()` | Aggregator: binds per-gene results into tidy data frames. | +| `geneLME()` | User-facing wrapper: validates inputs, pre-slices data, sets parallel plan, calls dispatch + compiler. | + +### Key Design Decisions Made + +1. **Per-gene pre-slicing** — `dat$E` and `dat$weights` are sliced into a `gene_data_list` of small bundles + *before* entering the parallel stage. This avoids serializing the full EList to every worker. +2. **Formula passed as string** — `formula_str` (a plain character string) is passed to workers, which + reconstruct the formula locally. Avoids locked-environment errors from passing formula objects with + environment references across futures. +3. **Explicit `future.globals`** — `geneLME_dispatch()` explicitly lists all globals rather than relying + on future's automatic environment scan, preventing accidental capture of large objects. +4. **`on.exit(plan(sequential))`** — restores the user's parallel plan after the run completes. +5. **Singular fit flagged, not errored** — `isSingular()` is checked and sets `model_status = "singular_fit"`. + Results are returned for all genes; users filter downstream on `model_status`. Fixed effect estimates + and contrasts are numerically valid even for singular fits — only the random effect structure is unreliable. + +--- + +## Contrast System (`geneLME_dev.R`) + +Two contrast branches in `geneLME_fit()`: + +### Branch A: Interaction contrasts (`contrast_vars` contains `":"`) +- `contrast_vars` is a single string e.g. `"treatment:visit"`. +- **`contrast_spec`** (required): a data.frame with columns `contrast_ref` and `contrast_lvl`. + Each row defines one pairwise contrast across interaction cells. Contrast vectors are built from the + emmeans level ordering to guarantee index alignment. +- Use `geneLME_contrast_spec(dat$targets, contrast_vars = "treatment:visit")` to generate the full + pairwise template, then filter to the rows of interest before passing as `contrast_spec`. +- The formula must contain the corresponding interaction term (`treatment:visit` or `treatment*visit`); + `geneLME()` validation enforces this — emmeans would otherwise silently use additive margins. + +### Branch B: Non-interaction contrasts +- `contrast_vars` is a character vector of 1–2 variable names. +- `contrasts_primary`: named list of contrast vectors. Vector length must match the number of levels + of `contrast_vars[1]`, in alphabetical order (or the order emmeans sees them). +- `contrasts_secondary`: optional second-order contrast (contrast-of-contrasts). +- `contrast_var_2_levels`: optional filter on levels of the second `by` variable. + +### Helper: `geneLME_contrast_spec(targets, contrast_vars)` +Two modes determined by whether `contrast_vars` contains `":"`: + +- **Interaction mode** (`"treatment:visit"`): returns a data.frame with columns `contrast_ref` and + `contrast_lvl` **only** (no `contrast_index`). One row per pairwise combination of interaction-level + strings. User filters and passes as `contrast_spec` to `geneLME()`. + - `contrast_index` is NOT part of the `geneLME_contrast_spec()` output. Instead, `geneLME()` attaches + an indexed copy of the user's filtered spec to its return value as `$contrast_spec`, with + `contrast_index = 1:nrow(contrast_spec)` (simple row positions within the filtered spec). + This is the index users should reference when building `contrasts_secondary` vectors. + - Ordering guarantee: both `var_a` and `var_b` components of `contrast_ref` are always at a + level-index ≤ the corresponding component in `contrast_lvl` (factor level order for factors, + alphabetical for plain characters). This is enforced by explicit factor coercion before + `interaction(...) %>% levels()`. + - Swap tolerance: users may manually swap ref/lvl in any row after filtering. `geneLME_fit()` + handles either direction — the contrast estimate sign flips, but no error occurs. +- **Single-variable mode** (`"treatment"`): returns a data.frame with a single column `level`, listing + unique sorted values of that variable. This is a reference only — used when building `contrasts_primary` + vectors for Branch B. Not passed to `geneLME()` directly. + +--- + +## Function Signatures (current `geneLME_dev.R`) + +```r +geneLME_contrast_spec(targets, contrast_vars) +# Returns: data.frame(contrast_ref, contrast_lvl) [interaction mode] +# or named list of data.frame(level) [single/multi-variable mode] + +geneLME_fit(gene_name, expression_vec, weight_vec, targets, + formula_str, run_contrast, contrast_vars, + contrast_var_2_levels, contrast_spec, + contrasts_primary, contrasts_secondary) + +geneLME_dispatch(gene_data_list, targets_df, formula_str, + run_contrast, contrast_vars, contrast_var_2_levels, + contrast_spec, contrasts_primary, contrasts_secondary) + +geneLME_compiler(fit, fdr_method = "BH", contrast_spec = NULL) +# Returns: list(lme_anova, lme_contrast, lme_fit, lme_err, contrast_spec) + +geneLME(dat, formula_str, + model_weights = NULL, + run_contrast = NULL, + contrast_vars = NULL, + contrast_var_2_levels = NULL, + contrast_spec = NULL, # required when contrast_vars contains ":" + contrasts_primary = NULL, + contrasts_secondary = NULL, + fdr_method = "BH", # any method accepted by p.adjust() + n_cores = NULL) +# Returns: list(lme_anova, lme_contrast, lme_fit, lme_err, contrast_spec) +# $lme_contrast columns include contrast_ref and contrast_lvl: +# - Branch A first-order: populated from contrast_spec (unambiguous sign reference) +# - Branch A second-order, Branch B (all): NA (no single ref/lvl pair) +# $contrast_spec: indexed copy of the filtered contrast_spec (Branch A) or +# data.frame(contrast_index, contrast_name) from contrasts_primary (Branch B); +# NULL when no contrasts run. +# On soft-fail (wrong-length contrasts_secondary): all elements NULL except $contrast_spec. +``` + +Note: `contrast_by` parameter was removed in session 2026-02-19 (Branch A2 eliminated). + +--- + +## Validation Checks in `geneLME()` + +1. All formula variables present in `dat$targets` +2. `dat$weights` present and dimension-matched if `model_weights = TRUE` +3. `contrast_vars` not NULL when `run_contrast = TRUE` +4. **Formula-vs-`contrast_vars` consistency:** + - Interaction contrasts: interaction term must exist in `formula_str` (uses `attr(terms(), "term.labels")`) + - Non-interaction contrasts: each `contrast_vars` element warned if absent as main effect +5. `contrast_spec` required (not NULL) when `contrast_vars` contains `":"` +6. `contrast_spec` column names validated (`contrast_ref`, `contrast_lvl`) +7. All `contrast_vars` components present in `dat$targets` +8. `contrast_var_2_levels` values valid against actual data levels +9. `fdr_method` validated against `p.adjust.methods` +10. `contrasts_secondary` vector lengths validated (**soft-fail**, not hard stop): + - Branch A: each vector must have length == `nrow(contrast_spec)` (the filtered spec, NOT the full template) + - Branch B: each vector must have length == `length(contrasts_primary)` + - On mismatch: emits a `message()` naming offending vectors (observed vs expected length), + then returns `invisible(list(lme_anova=NULL, lme_contrast=NULL, lme_fit=NULL, lme_err=NULL, contrast_spec=indexed_contrast_spec))` + - `$contrast_spec` in the returned list contains the indexed spec so the user can inspect + row ordering and fix their vectors without needing to re-run everything +11. Warn on likely ID columns used as predictors + +--- + +## Current Status + +**Date of last update:** 2026-02-20 (Session 15) + +**Stage:** `geneLME_dev2.R` merged into `geneLME.R` (now the single active stable file). +All tests pass. Warning suppression added. Documentation updated. + +**Benchmark results (2,000 genes, 5 reps, 8 cores) — geneLME_benchmark2.html (Session 15):** + +| Metric | Result | +|---|---| +| Sign consistency (geneLME vs kimma) | 3300/3300 (100%) same direction — no flipped pairs | +| Estimate r (direction-corrected) | 1.00000000 | +| Estimate MAD (direction-corrected) | 1.16e-15 (effectively 0 — floating point only) | +| geneLME 3 contrasts median | 17.4 s (1.85× faster than kimma) | +| geneLME 6 contrasts median | 18.1 s (1.78× faster than kimma) | +| geneLME 66 contrasts median | 32.6 s (0.99× — effectively equal to kimma) | +| kimma (66 contrasts) median | 32.1 s | + +**Key findings:** +- geneLME and kimma always agree on contrast direction (100%) — no sign inversion issue +- kimma's `statistic` column is the **negated** t-statistic (opposite sign to estimate) +- geneLME is ~1.8× faster than kimma when running fewer contrasts (3–6); equal speed at 66 +- Warning suppression confirmed working — no package version or rescaling warnings in output + +**Possible future tasks:** +1. Build runtime scaling estimates: cores × genes × contrasts → expected runtime on this machine + +--- + +## Known Issues / Open Questions + +- [x] `geneLME.R` (stable) — interaction contrast support merged 2026-02-20 +- [x] `geneLME_dev2.R` — tested (all tests pass) and benchmarked (geneLME_benchmark2.html generated) +- **kimma `statistic` column:** kimma's `$lme.contrast$statistic` is `–(estimate/SE)` — the **negated** + t-statistic (opposite sign to estimate). To compare against geneLME `t.ratio`, use `-statistic`. + This was discovered during benchmark2 sign-consistency investigation. +- [ ] `contrast_spec` level ordering with non-alphabetical factors: explicit factor coercion is now + in place in `geneLME_contrast_spec()` (uses existing factor levels if present, otherwise + alphabetical). However, `emmeans` must also see the same level ordering — emmeans inherits + factor levels from the model data, so as long as `dat$targets` columns have matching factor + levels, alignment is guaranteed. This has not been explicitly tested with a non-alphabetical + user-defined factor ordering; worth a targeted test if this use case arises. +- [ ] With only 10 patients, `lmer` fits are singular (patient random effect collinear with + patient-level covariates sex/age). All 10 genes show `isSingular = TRUE` in tests. This is + expected with small mock N and does NOT affect correctness testing. Real data with larger N + will not have this issue. +- [ ] Branch B `contrasts_primary` vector construction: users must supply vectors whose length matches + the number of levels of `contrast_vars[1]` in the order emmeans sees them (alphabetical by + default). This is implicit and could be a source of user error. Consider adding a runtime check + that compares vector length to actual emmeans level count. +- [ ] Non-alphabetical factor level ordering: explicit factor coercion is now in place in + `geneLME_contrast_spec()` (preserves existing factor levels, otherwise alphabetical). + Alignment with `emmeans` internal ordering has not been explicitly tested with a + user-defined non-alphabetical factor; worth a targeted test if this arises. + +--- + +## Bugs Fixed + +### Session 2026-02-19 — Bug 1: Branch A1 returned 0-row contrast result +**Root cause:** `vector("list", n)` pre-allocates n NULL slots indexed by integer. Subsequent +named assignment (`contrasts_list[["name"]] <- cv`) *appends* rather than filling existing slots, +leaving the first n slots as unnamed NULLs. `contrast()` silently skipped NULLs, leaving a +malformed result that `as.data.frame()` could not coerce. +**Fix:** Build `contrasts_list` as `list()` and assign each element by name from the start. + +### Session 2026-02-19 — Bug 2: Branch A2 `"Nonconforming number of contrast coefficients"` +**Root cause:** The `"second"` direction (`var_b | var_a`) was incorrectly passing `contrast_var_2_levels` +keyed to `var_a`, injecting visit-level strings as treatment levels. Additionally, `contrasts_primary` +has length matching var_a's level count (3) but var_b has 4 levels — structurally incompatible. +**Resolution:** Branch A2 (`contrast_by`) was removed entirely per user decision (see Session Log). + +--- + +## Session Log + +### Session 15 — 2026-02-20 +- **Completed merge of `geneLME_dev2.R` → `geneLME.R`** (picked up where Session 14 left off). + Remaining edits applied to `geneLME()` function body: + - Added `contrasts_list <- NULL` and `spec_lookup <- NULL` initialisations. + - Added `geneLME_build_contrast_args()` pre-compute block (inside `run_contrast`, outside + `!is.null(contrast_spec)` — called only when `is_interaction`). + - Updated `plan()` calls: `plan(multisession)` → `future::plan(future::multisession, workers = workers)`; + `on.exit(plan(sequential))` → `on.exit(future::plan(future::sequential), add = TRUE)`. + - Updated `geneLME_dispatch()` call: `contrast_spec` → `contrasts_list` + `spec_lookup`. +- **Warning suppression** (from `Claud_next_steps.rtf`): + - **lmer() rescaling warning** (`"Some predictor variables are on very different scales"`): + Added `check.scaleX = "ignore"` to a shared `lme_ctrl <- lmerControl(...)` object used + for both weighted and unweighted `lmer()` calls in `geneLME_fit()`. + - **Package version warnings** (`"package 'lme4' was built under R version X.Y.Z"`): + Two-pronged approach: + 1. Removed `future.packages = c(...)` from `future_lapply`; added explicit + `suppressPackageStartupMessages(suppressWarnings({ library(lme4); ... }))` block at + the top of `geneLME_fit()` — gives full control over how packages load on workers. + 2. Wrapped `future_lapply(...)` in `withCallingHandlers()` in `geneLME_dispatch()` with + a handler that muffles any warning matching `"was built under R version"`, catching + warnings that the future framework re-raises on the main process after collecting worker results. +- **All `geneLME_test.R` tests pass** against merged `geneLME.R` (SOURCE_FILE updated to `"geneLME.R"`): + Branch A (10/10, 80 ANOVA + 60 contrast), Branch A2 (80 total rows, 60+20 orders), + Branch B (10/10, 70+60), validation 6a–6f (all PASS), soft-fail 6g (both branches PASS). + Worker-sourced package-version warnings no longer appear in test output. +- **Updated `geneLME_tutorial.Rmd`:** + - `source("geneLME_dev.R")` → `source("geneLME.R")` + - Key capabilities bullet: updated from "per-gene error capture" to `model_status` flagging description + - Error Handling section: replaced old stop-based singular fit description with new `model_status` + flagging explanation; updated code to check `!= "success"` for all non-success genes (not just failed) + and added filter example (`result$lme_contrast %>% filter(model_status == "success")`). +- **Updated `geneLME_function_overview.md`:** + - Companion file reference: `geneLME_dev.R` → `geneLME.R`; date updated to 2026-02-20 + - Function Map: added `geneLME_build_contrast_args()` row + - `lme_err` output table entry: updated to describe `"singular_fit"` as a valid value + - Error Handling section: rewrote to describe `model_status` flagging; removed old + `"Boundary (singular) fit"` stop-based description +- **Updated `CLAUDE_NOTES_geneLME.md`:** + - Key files: `geneLME_dev2.R` now "former dev2 (merged); kept for reference" + - Architecture: `geneLME_build_contrast_args()` added to function table + - Design decision 5: updated from "error" to "flagged, not errored" + - Current Status: stage updated to reflect completed merge +- **Re-ran `geneLME_benchmark2.Rmd`** against merged `geneLME.R` only (dev2 correctness section + removed; report now focuses solely on geneLME vs kimma). Fresh results (Session 15): + - Sign consistency: 3300/3300 (100%) same direction, r=1.00000000, MAD=1.16e-15 + - Speed: geneLME 1.85× faster than kimma at 3 contrasts, 1.78× at 6, effectively equal (0.99×) at 66 + - Rendered `geneLME_function_overview.html` via pandoc for easy review + +### Session 14 — 2026-02-20 +- **Confirmed `geneLME.R` (stable) is correct** — does NOT contain dev2 changes (no + `geneLME_build_contrast_args`, no `singular_fit` flagging); the concern about overwrite was unfounded. +- **Audited `geneLME_dev2.R` for unqualified package calls** — all non-base function calls + are correctly handled: `future::plan()`, `future::multisession`, `future::sequential`, and + `future.apply::future_lapply()` are fully qualified in `geneLME()`/`geneLME_dispatch()`; + worker-side calls (`lmer`, `emmeans`, etc.) rely on `future.packages` which loads all required + packages on each worker. No bare unqualified calls that would fail outside an interactive session. +- **All `geneLME_test.R` tests pass** against `geneLME_dev2.R`: + Branch A (10/10 success, 80 ANOVA + 60 contrast rows), Branch A second-order (60+20 rows), + Branch B (10/10 success, 70+60 rows), all 6 validation error tests (6a–6f), soft-fail (6g). +- **Debugged and fixed `geneLME_benchmark2.Rmd`:** + 1. `sign-inspect` chunk: kimma has no `contrast` string column → rewrote to compare + `contrast_ref`/`contrast_lvl` pairs directly using canonical pair keys. + 2. Kimma t-statistic: discovered kimma's `statistic` column is the **negated** t-statistic + (`–estimate/SE`), not the absolute value. Fixed reconstruction to `-statistic`. + 3. Inline `` `r ≈ 1.0` `` parsed as R code → converted to plain text. + 4. Multi-line `if/else` → wrapped in braces to prevent premature statement termination. + 5. `stable_success`: `lme_err` has no names → derived success genes from contrast table instead. +- **`geneLME_benchmark2.html` generated successfully** (44/44 chunks, no errors). +- **Benchmark results:** See Current Status table above. Summary: + - 100% direction agreement with kimma (no flipped pairs) + - dev2 estimates identical to stable (r=1.0, MAD=0) + - dev2 ~1.1–1.2× faster than stable; equal speed to kimma at 66 contrasts + +### Session 13 — 2026-02-20 +- **Created `geneLME_benchmark2.Rmd`** — v2 benchmarking report covering: + 1. **Sign consistency (Section 1):** Direction-aware join between geneLME and kimma. + Forward join (same ref/lvl) + flipped join (kimma ref/lvl swapped) combined. + Scatter plots of estimates and t-statistics before and after direction correction. + Summary table reports % pairs with direction agreement and corrected accuracy metrics. + 2. **dev2 correctness (Section 2):** geneLME.R vs geneLME_dev2.R on same 50-gene, + 6-contrast problem. Joined on (gene, contrast); reports r, MAD, max|Δ| for estimates. + Also shows model_status distribution (singular_fit vs success) in dev2 output. + 3. **Speed comparison (Section 3):** Microbenchmark (5 reps each) sourcing stable and + dev2 sequentially to avoid namespace collision. Bar chart + scaling line plot + comparing all three methods (stable, dev2, kimma) at 3/6/66 contrasts. + Summary table includes dev2 speedup ratios vs stable and vs kimma. +- Requires `gridExtra` package (added to libs chunk) + +### Session 12 — 2026-02-20 +- **Created `geneLME_dev2.R`** with two changes vs `geneLME.R`: + 1. **Singular fit → flag:** `isSingular()` check changed from `stop()` to setting + `model_status = "singular_fit"`. Results returned for all genes; users filter + downstream on `model_status`. `model_status` column now appears in both + `lme_anova` and `lme_contrast` output. + 2. **Pre-computed Branch A contrast structures:** New `geneLME_build_contrast_args()` + helper called once in `geneLME()` before parallel dispatch. Returns `contrasts_list` + (named vector list) and `spec_lookup` (ref/lvl join table). Workers receive these + ready-made; the per-gene `for` loop building contrast vectors is eliminated. + `geneLME_dispatch()` and `geneLME_fit()` signatures updated accordingly + (`contrast_spec` parameter replaced by `contrasts_list` + `spec_lookup`). + - Level ordering in `geneLME_build_contrast_args()` uses the same factor-coercion + logic as `geneLME_contrast_spec()` (preserves existing factor levels; alphabetical + otherwise), ensuring alignment with emmeans' internal ordering without requiring + a fitted model object. +- **Updated `geneLME_test.R`:** + - Added `SOURCE_FILE` variable at top for easy switching between implementations + - Test 4: added singular-fit verification block (checks non-NA estimates for flagged genes) + - Tests 4 and 5: updated output display to show `model_status` column + - Note: with 10-patient mock data all genes are expected to be `singular_fit`, not errors + +### Session 11 — 2026-02-20 +- Merged `geneLME_dev.R` → `geneLME.R` as the new stable version +- Added merge changelog to `geneLME.R` header +- Removed stale "PICKUP HERE" comment and dev-only header label +- Updated CLAUDE_NOTES key files and current status sections + +### Session 1 — 2026-02-19 (first session) +- Identified correct active worktree (`reverent-satoshi`) +- Created `CLAUDE_NOTES_geneLME.md` (this file) +- Created `geneLME_test.R` with mock EList: + - 10 patients × 3 treatments × 4 visits = 120 samples, 50 genes + - Patient-level covariates: `sex` (factor), `age` (continuous) + - Sample-level technical covariates: `rNANgUl`, `percent_duplication`, + `median_cv_coverage`, `lib.size`, `norm.factors` +- Ran full test suite; identified and fixed two bugs in `geneLME_dev.R` +- All three contrast branches (A1, A2, B) passed initially after fixes + +### Session 10 — 2026-02-19 (tenth context window) +- **Created `geneLME_benchmark.Rmd`** (→ `geneLME_benchmark.html`): kimma vs geneLME benchmarking report + - **Mock data:** 10 patients × 3 treatments × 4 visits = 120 samples, 2000 genes; genes 1–100 + have +2.5 log2 TrtC:V3 effect; `libID = sample_id` added to targets for kimma compatibility + - **Contrast subsets defined:** + - `spec_3` (3): longitudinal V2→V3 within-treatment + - `spec_6` (6): between-treatment within-visit at V2 and V3 + - `spec_full` (66): all pairwise — equivalent to kimma's default output + - **Section 1 (Accuracy):** both methods on 50-gene subset with all 66 contrasts. + Results joined on `(gene, contrast_ref, contrast_lvl)` — both methods now have these columns. + Reports Pearson r and MAD for estimates and t-statistics; scatter plots included. + Note: geneLME excludes singular-fit genes (treated as error); kimma returns estimates silently. + - **Section 2 (Selective contrast efficiency):** `microbenchmark(times=5)` on 2000-gene full + dataset comparing geneLME with 3/6/66 contrasts and kimma with 66 contrasts. Bar chart + + scaling plot showing linear runtime scaling with n_contrasts for geneLME. + - **Section 3 (Head-to-head):** reuses 66-contrast runs from Section 2 to compare + geneLME vs kimma directly at equal contrast count. Reports speedup ratio. + - **kimma interface notes:** + - `libraryID = "libID"` required — added `libID` column to targets + - `contrast_var = "treatment:visit"` runs all 66 pairwise automatically + - Output: `$lme.contrast` with columns `gene, contrast_ref, contrast_lvl, estimate, statistic, pval, FDR, std.error, df` + - `statistic` = t-ratio (same as geneLME `t.ratio`); `pval` = p-value; `std.error` = SE + - **Key finding expected:** estimates identical (r ≈ 1.0, MAD ≈ 0) for shared non-singular genes; + geneLME faster when running fewer contrasts; head-to-head speed depends on parallelism overhead + +### Session 9 — 2026-02-19 (ninth context window) +- **Added `contrast_ref` and `contrast_lvl` to `lme_contrast` output** in `geneLME_fit()`: + - Branch A first-order: built a `spec_lookup` data.frame keyed on the contrast name string + (`paste(contrast_lvl, contrast_ref, sep = " - ")`), then `left_join`-ed onto the emmeans + first-order result by `"contrast"`. This ensures `contrast_ref` = the −1 cell and + `contrast_lvl` = the +1 cell, eliminating sign ambiguity. + - Branch A second-order: `contrast_ref = NA_character_, contrast_lvl = NA_character_` + (no single ref/lvl pair applies to a contrast-of-contrasts). + - Branch B (all rows): `contrast_ref = NA_character_, contrast_lvl = NA_character_` + (Branch B uses named coefficient vectors, not a ref/lvl spec). + - Error stub data.frame in `tryCatch` handler: added both NA columns for schema consistency. +- **Tutorial Rmd updated:** added `contrast_ref`/`contrast_lvl` to `select()` calls in all + contrast output display chunks; added explanatory prose in the `lme_contrast` description. +- **CLAUDE_NOTES updated:** added columns to return value documentation. +- Next up: kimma benchmarking (2000-gene mock, speed + estimate comparison). + +### Session 8 — 2026-02-19 (eighth context window) +- **Redesigned `contrast_index` system** — root cause was that `contrast_index` in the + `geneLME_contrast_spec()` output was being misused: users built `contrasts_secondary` vectors + using the full template's row count as the vector length (234 elements) instead of the filtered + spec's row count. `which(my_spec$contrast_index == ...)` correctly identified positions within + `my_spec`, but the outer `rep(0, nrow(full_template))` produced vectors of the wrong length. +- **`geneLME_contrast_spec()` output simplified:** removed `contrast_index` column entirely. + Output is now a two-column data.frame (`contrast_ref`, `contrast_lvl`) only. +- **`geneLME()` now appends `$contrast_spec` to its return value:** + - Branch A: `contrast_spec %>% mutate(contrast_index = seq_len(n())) %>% select(contrast_index, everything())` + — `contrast_index` is `1:nrow(contrast_spec)` (row positions within the *filtered* spec the user passed in) + - Branch B: `data.frame(contrast_index = seq_along(contrasts_primary), contrast_name = names(contrasts_primary))` + - NULL when no contrasts are run +- **Soft-fail on wrong-length `contrasts_secondary`** (changed from hard `stop()`): + - Emits an informative `message()` naming offending vectors, observed length, expected length + - Returns `invisible(list(lme_anova=NULL, lme_contrast=NULL, lme_fit=NULL, lme_err=NULL, contrast_spec=indexed_contrast_spec))` + - User can inspect `result$contrast_spec` to fix their vectors without running any models +- **`geneLME_compiler()` updated:** now accepts `contrast_spec = NULL` argument; includes it in + return list; added column-existence guard for no-contrast runs (prevented crash on empty contrast tibble) +- **Tutorial Rmd updated:** + - `contrast-spec-interaction` chunk: added comment explaining `contrast_index` is NOT in + `spec_template`, is added by `geneLME()` to `$contrast_spec` + - Added `$contrast_spec` display to the Branch A output structure chunk + - Branch A second-order chunk: updated comments to reference `result_A$contrast_spec` for + row-order verification + - Programmatic secondary contrast section: rewritten to use `result_A$contrast_spec` (the + indexed spec from `geneLME()`) instead of `my_spec$contrast_index` + - Added soft-fail validation test (Test 7) to Input Validation section + - Quick-reference table: updated `contrast_spec` description; updated `contrasts_secondary` + description to mention soft-fail + - Key capabilities: added soft-fail bullet +- **`geneLME_test.R` updated:** + - Test 3: removed `contrast_index` range check; updated comment to explain columns + - Test 4b: updated comments to reference `test_A$contrast_spec` for indexed ordering; + corrected `contrasts_secondary` vectors to match actual alphabetical row ordering + (V2 rows 1–3 before V3 rows 4–6); added `print(test_A$contrast_spec)` before test_A2 call + - Added test 6g: soft-fail validation for both Branch A and Branch B wrong-length vectors + +### Session 7 — 2026-02-19 (seventh context window) +- **Root-caused the real-data NA/warning failure:** + - User shared real-data `contrasts_secondary` vectors: length 234 (= nrow of full + unfiltered spec_template). The filtered `contrast_spec` passed to `geneLME()` had far + fewer rows. emmeans throws "Nonconforming number of contrast coefficients" which is + silently caught by `tryCatch` inside `geneLME_fit()`, producing NA-filled rows. + - The programmatic builder from Session 4 used `rep(0, nrow(defined_contrasts_spec))` + where `defined_contrasts_spec` referred to the full template — should be + `nrow(filtered_spec)`. The `which(my_spec$contrast_index == ...)` index lookup was + correct; only the vector length was wrong. +- **Fix: added `contrasts_secondary` length validation to `geneLME()` pre-flight:** + - Branch A (interaction): checked that every vector in `contrasts_secondary` has length + == `nrow(contrast_spec)` (the filtered spec passed in, NOT the full template). + Error message names the offending vectors, shows observed vs expected length, and + explains the common cause (full template vs filtered spec confusion). + - Branch B (non-interaction): checked that every vector has length == + `length(contrasts_primary)`. + - Both checks fire before any parallel work launches — so the user gets a clear, + immediate error instead of silent NAs in results. +- User's real-data fix: rebuild `contrasts_secondary` vectors using + `rep(0, nrow(defined_contrasts_geneLME))` (filtered spec row count) and + `which(defined_contrasts_geneLME$contrast_index == ...)` for index lookup. + +### Session 6 — 2026-02-19 (sixth context window) +- **Diagnosed and fixed second-order contrast warning + spurious NA issue:** + - Symptom: rescaling warnings ("Some predictor variables are on very different scales") + and occasional NA-filled outputs when running with `contrasts_secondary`. + - Root cause (warnings): `emmeans::contrast()` called on a first-order contrast object + internally re-invokes lme4 machinery on the first-order estimate/SE values, which can be + on different scales from the original predictors → benign lme4 scale warning, no effect + on results. Confirmed output is correct by single-gene isolated test. + - Root cause (NAs): not a code bug — was a stochastic singular-fit outcome in the small + mock data (10 patients), not reproducible when re-run. `isSingular()` check is correct. + - Fix 1: wrapped second-order `contrast()` calls in `suppressWarnings()` in **both** Branch A + and Branch B of `geneLME_fit()`, with explanatory comment. Warning is cosmetic only. + - Fix 2 (latent bug): both Branch A and Branch B were gating second-order contrasts with + `length(contrasts_list) > 1` / `length(contrasts_primary) > 1`. This incorrectly + prevented second-order contrasts when only one first-order contrast existed. Changed + both to simply `!is.null(contrasts_secondary)` — the user's provision of + `contrasts_secondary` is the correct and sufficient gate. +- Verified fix: all 10 genes succeed, 60 first-order + 20 second-order rows, all + `p.value_adj` non-NA, zero warnings emitted. + +### Session 5 — 2026-02-19 (fifth context window) +- Added FDR-adjusted p-values (`p.value_adj`) to both `lme_anova` and `lme_contrast` outputs: + - **`geneLME_compiler()`**: now accepts `fdr_method = "BH"` argument; applies `p.adjust()` + within grouped sets after binding all per-gene results: + - `lme_anova`: grouped by `term` — each model term adjusted independently across genes + - `lme_contrast`: grouped by `contrast × contrast_order` — each contrast (including + first- and second-order separately) adjusted independently across genes + - **`geneLME()`**: added `fdr_method = "BH"` argument (any `p.adjust.methods` value valid); + validated against `p.adjust.methods` in pre-flight checks; passed through to compiler + - NA p-values (failed genes) propagate to `p.value_adj` as NA — they are excluded from the + effective adjustment set automatically by `p.adjust()` +- Updated `geneLME_tutorial.Rmd`: + - Added `p.value_adj` to `select()` calls in ANOVA and contrast output display chunks + - Updated descriptions of `lme_anova` and `lme_contrast` to explain grouping logic + - Added new "FDR Adjustment" section (before Input Validation) with grouping table, + method list, and BH vs Bonferroni comparison example + - Added `fdr_method` row to quick-reference argument table +- CLAUDE_NOTES: updated function signatures (added `geneLME_compiler` signature, `fdr_method` + arg to `geneLME`), validation check list (item 9), and current status + +### Session 4 — 2026-02-19 (fourth context window) +- User provided a real-data pattern for programmatic secondary contrast construction using + `contrast_index`. Reviewed approach and identified improvements: + - Replaced `group_by` + `reframe` with `last()`/`first()` with a more robust `summarise` + producing explicit `index_neg`/`index_pos` columns (two-column wide shape) + - Removed unused `nest` + `rowwise` + dead `secondary_contrast` list column + - Replaced opaque `[[i]][1,1][[1]]` nested indexing with `which(my_spec$contrast_index == ...)` + for transparent index-to-row-position mapping + - Replaced `for` loop with `setNames(lapply(...))` pattern consistent with how + `contrasts_primary` is specified elsewhere +- Added new tutorial section "Programmatic second-order contrast construction via `contrast_index`" + to `geneLME_tutorial.Rmd` (between Branch A second-order and Branch B sections) +- Updated quick-reference table to include `contrast_index` in `contrast_spec` column description + +### Session 3 — 2026-02-19 (third context window) +- Three improvements to `geneLME_contrast_spec()` interaction mode: + 1. **Factor-level ordering enforced**: both `var_a` and `var_b` are explicitly coerced to + factors before calling `interaction()`. Existing factor levels are preserved; plain character + columns get alphabetical ordering imposed. This guarantees `contrast_ref` component levels ≤ + `contrast_lvl` component levels (for both variables), consistent with emmeans' internal ordering. + 2. **Swap tolerance documented**: `geneLME_fit()` Branch A already handled swapped ref/lvl + gracefully (sign flip, no error). Added explicit comment in the contrast vector construction + loop documenting this behaviour. + 3. **`contrast_index` column added**: `geneLME_contrast_spec()` interaction mode now returns + a `contrast_index` column (integer, 1-based) as the first column. This index is stable — it + reflects the row's position in the *full unfiltered template* and survives user `filter()` calls. + Intended for downstream use (e.g. joining back to results, building `contrasts_secondary` + with explicit index reference). Tutorial example to be added in a future session. +- Updated `geneLME_test.R`: + - Added `contrast_index` range check in test 3 + - Updated test 4b comments to reflect corrected row ordering (alphabetical interaction levels: + V2 rows before V3 rows) and corrected `contrasts_secondary` vectors accordingly +- Both `geneLME_dev.R` and `geneLME.R` preserved separately (user preference) + +### Session 2 — 2026-02-19 (same day, second context window) +- Reviewed all three branches with user; demonstrated outputs +- **User decision:** eliminate Branch A2 (`contrast_by`) entirely + - Rationale: specifying contrasts in both directions is error-prone; pairwise contrasts + for the second variable can be computationally excessive with many levels +- **Design changes implemented:** + 1. Extended `geneLME_contrast_spec()` to two modes: + - Interaction mode: returns `contrast_ref`/`contrast_lvl` pairs (unchanged) + - Single-variable mode: returns `level` reference frame (new) + 2. Added formula-vs-`contrast_vars` consistency check using `attr(terms(), "term.labels")`: + - Interaction contrast on additive model → hard error + - Non-interaction contrast var absent as main effect → warning + 3. Made `contrast_spec` required (not optional) when `contrast_vars` contains `":"` + 4. Removed `contrast_by` parameter from all four functions + 5. Removed Branch A2 from `geneLME_fit()`; Branch A is now solely the `contrast_spec` approach + 6. Updated `geneLME_test.R`: removed A2 test; added single-variable `geneLME_contrast_spec()` + tests; added 4 new validation error tests (6c–6f) +- Final test run: all 8 tests pass (2 positive + 6 validation errors) + - Branch A: 10/10 genes `success`, 80 ANOVA rows, 60 contrast rows + - Branch B: 10/10 genes `success`, 70 ANOVA rows, 60 contrast rows + - Validation 6a–6f: all produce expected informative errors diff --git a/R_functions/geneLME.R b/R_functions/geneLME.R index 268da68..0f9d98e 100644 --- a/R_functions/geneLME.R +++ b/R_functions/geneLME.R @@ -2,292 +2,961 @@ ######################################################## # Scalable custom gene LMEs with contrast specification ######################################################## +# Merged from geneLME_dev.R (2026-02-20): +# - Added geneLME_contrast_spec() helper +# - Added Branch A: interaction contrast support via contrast_spec +# - Added FDR adjustment (fdr_method argument) +# - Added $contrast_spec in return value +# - Soft-fail on wrong-length contrasts_secondary +# - Added contrast_ref / contrast_lvl columns to lme_contrast output +# +# Merged from geneLME_dev2.R (2026-02-20): +# - Singular fit → model_status flag ("singular_fit") instead of error. +# Results returned for all genes; filter downstream on model_status. +# model_status column now present in both lme_anova and lme_contrast. +# - New geneLME_build_contrast_args() helper: pre-computes Branch A +# contrasts_list and spec_lookup once before parallel dispatch, +# eliminating nrow(contrast_spec) × n_genes per-gene rebuilds. +# - geneLME_fit() signature: contrast_spec replaced by contrasts_list +# and spec_lookup (Branch A). geneLME_dispatch() updated accordingly. +# - Benchmarked (geneLME_benchmark2.html): dev2 estimates identical to +# previous stable (r=1.0, MAD=0); ~1.1-1.2x faster; 100% direction +# agreement with kimma; equal speed to kimma at 66 contrasts. +######################################################## + + +######################################################## +# geneLME_contrast_spec +# Helper: returns a reference template of available contrast levels, +# formatted for use with the contrast_spec argument of geneLME(). +# +# Two modes depending on whether contrast_vars contains ":": +# +# Interaction mode (e.g. contrast_vars = "treatment:visit"): +# contrast_vars must be a single "var_a:var_b" string. +# Returns a data.frame with columns contrast_ref and contrast_lvl, +# one row per pairwise combination of interaction-level strings. +# Filter this to the contrasts of interest, then pass as +# contrast_spec to geneLME(). +# +# Single-variable mode (e.g. contrast_vars = c("treatment", "visit")): +# contrast_vars is a character vector of one or more plain variable +# names (no ":"). Returns a named list, one element per variable, +# each a data.frame with a single column 'level' listing the sorted +# unique values. The message printed for each variable explains how +# it maps to geneLME() arguments: +# - The first variable in contrast_vars → contrast_vars[1] in +# geneLME(); its levels define the length and position order of +# contrasts_primary vectors. +# - Additional variables → contrast_vars[2], used as the 'by' +# grouping variable; filter its levels via contrast_var_2_levels. +# This list is a reference only — it is not passed to geneLME(). +######################################################## + +geneLME_contrast_spec <- function(targets, contrast_vars) { + + has_interaction <- any(grepl(":", contrast_vars)) + has_plain <- any(!grepl(":", contrast_vars)) + + # Disallow mixing interaction and plain variable names in one call + if (has_interaction && has_plain) { + stop( + "contrast_vars mixes interaction terms (containing ':') and plain variable names.\n", + "Call geneLME_contrast_spec() separately for interaction and non-interaction variables." + ) + } + + if (has_interaction) { + + # ---- Interaction mode: must be a single "var_a:var_b" string ---- + if (length(contrast_vars) != 1) { + stop("Interaction mode requires a single 'var_a:var_b' string in contrast_vars.") + } + + vars <- strsplit(contrast_vars, ":")[[1]] + var_a <- vars[1] + var_b <- vars[2] + + if (!var_a %in% colnames(targets)) stop(paste0("Variable '", var_a, "' not found in targets.")) + if (!var_b %in% colnames(targets)) stop(paste0("Variable '", var_b, "' not found in targets.")) + + # Coerce each variable to a factor, preserving any existing factor level order. + # If the column is already a factor, levels() preserves the user-defined ordering. + # If it is a plain character vector, we impose alphabetical order explicitly. + # This ensures the interaction level string ordering — and therefore which member + # of each pair lands in contrast_ref vs contrast_lvl — is fully deterministic and + # consistent between geneLME_contrast_spec() and emmeans' internal ordering. + fac_a <- if (is.factor(targets[[var_a]])) targets[[var_a]] else factor(targets[[var_a]], levels = sort(unique(targets[[var_a]]))) + fac_b <- if (is.factor(targets[[var_b]])) targets[[var_b]] else factor(targets[[var_b]], levels = sort(unique(targets[[var_b]]))) + + # Build all interaction level strings in the same order emmeans will use. + # interaction() with two ordered factors produces a factor whose levels are + # var_a[1]:var_b[1], var_a[1]:var_b[2], ..., var_a[n]:var_b[m] — matching + # emmeans' default grid ordering (vary the rightmost variable fastest). + ixn_lvls <- levels(interaction(fac_a, fac_b, sep = " ")) + + # combn() traverses ixn_lvls in order, so for every pair the first element + # (contrast_ref) is always at a lower level-index than the second (contrast_lvl). + # This guarantees: level(var_a in ref) <= level(var_a in lvl) AND + # level(var_b in ref) <= level(var_b in lvl). + # Users may freely swap ref/lvl in any row after filtering — geneLME() and + # geneLME_fit() handle either direction correctly (the sign of the contrast + # estimate will flip, but no error will occur). + # Row-position indices for contrasts_secondary construction are added by geneLME() + # to its $contrast_spec output element after the user's filtered spec is received. + pairs <- combn(ixn_lvls, 2, simplify = FALSE) + spec <- data.frame( + contrast_ref = sapply(pairs, `[[`, 1), + contrast_lvl = sapply(pairs, `[[`, 2), + stringsAsFactors = FALSE + ) + + message( + nrow(spec), " pairwise combinations generated for '", contrast_vars, "'.\n", + "Filter this data.frame to your contrasts of interest, then pass as contrast_spec to geneLME().\n", + "geneLME() will attach an indexed copy of contrast_spec to its output ($contrast_spec)\n", + "showing the row-position index used for contrasts_secondary vector construction." + ) + + return(spec) + + } else { + + # ---- Single/multi-variable mode ---- + # Validate all variables exist + missing_vars <- setdiff(contrast_vars, colnames(targets)) + if (length(missing_vars) > 0) { + stop(paste0("Variable(s) not found in targets: ", paste(missing_vars, collapse = ", "))) + } + + result <- lapply(seq_along(contrast_vars), function(i) { + v <- contrast_vars[i] + lvls <- sort(unique(as.character(targets[[v]]))) + spec <- data.frame(level = lvls, stringsAsFactors = FALSE) + + if (i == 1) { + message( + "'", v, "' — primary contrast variable (contrast_vars[1] in geneLME).\n", + length(lvls), " levels (alphabetical order = position order for contrasts_primary vectors):\n", + paste(seq_along(lvls), lvls, sep = ". ", collapse = "\n"), "\n", + "→ contrast vectors passed to contrasts_primary must have length ", length(lvls), ",\n", + " with each element weighted by position (e.g. '", lvls[length(lvls)], " vs ", lvls[1], + "' = c(", paste(c(-1, rep(0, length(lvls) - 2), 1), collapse = ", "), "))." + ) + } else { + message( + "'", v, "' — secondary 'by' grouping variable (contrast_vars[", i, "] in geneLME).\n", + length(lvls), " levels available:\n", + paste(seq_along(lvls), lvls, sep = ". ", collapse = "\n"), "\n", + "→ pass a subset of these levels to contrast_var_2_levels in geneLME() to restrict\n", + " which groups the primary contrasts are computed within." + ) + } + + spec + }) + + names(result) <- contrast_vars + return(result) + } +} + + +######################################################## +# geneLME_build_contrast_args +# Pre-computes Branch A contrast structures once, before parallel dispatch. +# +# Called by geneLME() when contrast_vars contains ":". +# Returns a list with two elements that are passed directly to workers: +# +# $contrasts_list — named list of contrast vectors, one per row of +# contrast_spec. Each vector has length == number of interaction cells, +# named by emmeans level string ("TrtA V1", "TrtB V2", ...). +# contrast_ref cell receives -1; contrast_lvl cell receives +1. +# Name of each vector is "contrast_lvl - contrast_ref" — the string +# emmeans will use as the 'contrast' column in its output. +# +# $spec_lookup — data.frame(contrast, contrast_ref, contrast_lvl) for +# joining ref/lvl back onto the emmeans first-order result by contrast +# name. Avoids reconstructing this inside every geneLME_fit() call. +# +# The emmeans level ordering is derived from factor levels in targets (the +# same source emmeans uses internally), guaranteeing index alignment without +# requiring a fitted model object. +######################################################## + +geneLME_build_contrast_args <- function(targets, contrast_vars, contrast_spec) { + + vars <- strsplit(contrast_vars, ":")[[1]] + var_a <- vars[1] + var_b <- vars[2] + + # Reproduce the exact factor-level ordering that emmeans will use. + # emmeans builds its reference grid from the factor levels of the model + # data (targets); using the same coercion logic as geneLME_contrast_spec() + # guarantees the level strings here match those emmeans returns. + fac_a <- if (is.factor(targets[[var_a]])) targets[[var_a]] else factor(targets[[var_a]], levels = sort(unique(targets[[var_a]]))) + fac_b <- if (is.factor(targets[[var_b]])) targets[[var_b]] else factor(targets[[var_b]], levels = sort(unique(targets[[var_b]]))) + + # Build interaction level strings: "var_a_level var_b_level" in emmeans order. + # emmeans varies the rightmost variable fastest, matching interaction() default. + lvls_a <- levels(fac_a) + lvls_b <- levels(fac_b) + emm_lvls <- as.vector(outer(lvls_a, lvls_b, paste)) # row = var_a, col = var_b + + # Template zero vector, one element per interaction cell + default_vec <- rep(0, length(emm_lvls)) + names(default_vec) <- emm_lvls + + # Build contrasts_list once — identical for all genes + contrasts_list <- list() + for (k in seq_len(nrow(contrast_spec))) { + cv <- default_vec + cv[contrast_spec$contrast_ref[k]] <- -1 + cv[contrast_spec$contrast_lvl[k]] <- 1 + contrast_name <- paste(contrast_spec$contrast_lvl[k], + contrast_spec$contrast_ref[k], sep = " - ") + contrasts_list[[contrast_name]] <- cv + } + + # Build spec_lookup once — identical for all genes + spec_lookup <- data.frame( + contrast = paste(contrast_spec$contrast_lvl, + contrast_spec$contrast_ref, sep = " - "), + contrast_ref = contrast_spec$contrast_ref, + contrast_lvl = contrast_spec$contrast_lvl, + stringsAsFactors = FALSE + ) + + list(contrasts_list = contrasts_list, spec_lookup = spec_lookup) +} + ######################################################## -# Starting point / framework +# geneLME_fit +# Core per-gene fitting function. Called inside future_lapply. +# Receives only the minimal pre-extracted data needed for one gene, +# not the full EList object. +# +# Singular fit is no longer an error; model_status = "singular_fit" +# is recorded and results are returned for user-side filtering. +# Branch A receives pre-computed contrasts_list + spec_lookup instead +# of contrast_spec. The per-gene contrast vector loop is eliminated. ######################################################## +geneLME_fit <- + function(gene_name, + expression_vec, # named numeric vector: expression values for this gene + weight_vec, # named numeric vector or NULL: per-sample weights + targets, # data.frame: sample metadata (dat$targets) + formula_str, # character: formula RHS e.g. "~ treatment*visit + (1|ptID)" + run_contrast, + contrast_vars, + contrast_var_2_levels, + contrasts_list, # Branch A: pre-computed named contrast vector list, or NULL + spec_lookup, # Branch A: pre-computed ref/lvl join table, or NULL + contrasts_primary, # Branch B: named list of contrast vectors, or NULL + contrasts_secondary) { # both branches: second-order contrast vectors, or NULL + + # Load worker-side packages quietly. + # suppressPackageStartupMessages: silences "Loading required package: Matrix" etc. + # suppressWarnings: silences "package 'lme4' was built under R version X.Y.Z". + suppressPackageStartupMessages(suppressWarnings({ + library(lme4) + library(emmeans) + library(car) + library(broom.mixed) + library(dplyr) + library(tibble) + })) -geneLME_fit<- - function(i, - dat=dat, - formula_obj=formula_obj, - model_weights=model_weights, - run_contrast=run_contrast, - contrast_vars = contrast_vars, - contrast_var_2_levels=contrast_var_2_levels, - contrasts_primary = contrasts_primary, - contrasts_secondary = contrasts_secondary){ - - gene_i <- rownames(dat$E)[i] - - # Initialize an empty status - model_status <- "success" - - # Wrap the entire process in tryCatch result <- tryCatch({ - + + # --- BUILD MODEL DATA --- + model_data <- targets + model_data$expression <- expression_vec + + # --- RECONSTRUCT FORMULA LOCALLY --- + # Built from the raw string so its enclosing environment is this call frame. + # lmer resolves 'weight_vec' and other names in local scope — no + # environment stripping needed, no locked-environment errors. + formula_obj <- as.formula(paste("expression", formula_str)) + # --- FIT MODEL --- - # We use a localized data frame to avoid passing the whole 'dat' object to emmeans later - model_data <- dat$targets - model_data$expression <- dat$E[i, ] - - if(is.null(model_weights)){ - lme_i <- lmer( - formula_obj, - data = model_data, - control = lmerControl(calc.derivs = FALSE, autoscale = TRUE)) - - } else if(!is.null(model_weights) & model_weights){ - - model_data$current_weights <- dat$weights[i, ] - - lme_i <- lmer( - formula_obj, - weights = current_weights, - data = model_data, - control = lmerControl(calc.derivs = FALSE, autoscale = TRUE) - ) + # check.scaleX = "ignore": silences the "predictor variables on very + # different scales" warning. This is expected with mixed RNA-seq covariates + # (e.g. lib.size vs binary treatment indicator) and does not affect model fit. + lme_ctrl <- lmerControl(calc.derivs = FALSE, check.scaleX = "ignore") + if (is.null(weight_vec)) { + lme_i <- lmer(formula_obj, data = model_data, control = lme_ctrl) + } else { + lme_i <- lmer(formula_obj, weights = weight_vec, data = model_data, control = lme_ctrl) } - - # Check for singularity manually to be explicit - if(isSingular(lme_i)) stop("Boundary (singular) fit") - - # --- EXTRACT DATA (The only parts we keep) --- - - # 1. AIC - aic_res <- data.frame(gene = gene_i, AIC = AIC(lme_i)) - - # 2. ANOVA & Summary - lme_i_anova <- car::Anova(lme_i) %>% broom.mixed::tidy() - lme_i_summary <- summary(lme_i)$coefficients %>% - as.data.frame() %>% + + # --- SINGULAR FIT: flag, do not stop --- + # isSingular() indicates the random effect variance hit its boundary (zero). + # Fixed effect estimates and contrasts are still numerically valid; the user + # should decide whether to retain or filter these genes downstream. + # model_status records the flag; "success" indicates a clean fit. + model_status <- if (isSingular(lme_i)) "singular_fit" else "success" + + # --- EXTRACT: AIC --- + aic_res <- data.frame(gene = gene_name, AIC = AIC(lme_i)) + + # --- EXTRACT: ANOVA & coefficient summary --- + lme_i_anova <- car::Anova(lme_i) %>% broom.mixed::tidy() + lme_i_summary <- summary(lme_i)$coefficients %>% + as.data.frame() %>% rownames_to_column("variable") - - # 3. Process ANOVA Table + + # --- BUILD ANOVA TABLE --- lme_i_anova_tab <- lme_i_anova %>% - rowwise()%>% + rowwise() %>% mutate( - gene = gene_i, + gene = gene_name, + model_status = model_status, # "success" or "singular_fit" predictor_class = case_when( - grepl(":", term) ~ "interaction", - #!term %in% colnames(model_data) ~ "other", - is.numeric(model_data[[term]]) ~ "continuous", - !(is.numeric(model_data[[term]])) & length(unique(model_data[[term]])) == 2 ~ "two-level-categorical", - !(is.numeric(model_data[[term]])) & length(unique(model_data[[term]])) > 2 ~ "multi-level-categorical" + grepl(":", term) ~ "interaction", + is.numeric(model_data[[term]]) ~ "continuous", + !is.numeric(model_data[[term]]) & length(unique(model_data[[term]])) == 2 ~ "two-level-categorical", + !is.numeric(model_data[[term]]) & length(unique(model_data[[term]])) > 2 ~ "multi-level-categorical" ), Estimate_source = case_when( - predictor_class %in% c("continuous", "two-level-categorical") ~ "lme_summary", - predictor_class == "multi-level-categorical" ~ "seeContrasts", - predictor_class == "interaction" & - length(grep(":", lme_i_summary$variable)) == 1 ~ "lme_summary", - predictor_class == "interaction" & - length(grep(":", lme_i_summary$variable)) > 1 ~ "seeContrasts" + predictor_class %in% c("continuous", "two-level-categorical") ~ "lme_summary", + predictor_class == "multi-level-categorical" ~ "seeContrasts", + predictor_class == "interaction" & length(grep(":", lme_i_summary$variable)) == 1 ~ "lme_summary", + predictor_class == "interaction" & length(grep(":", lme_i_summary$variable)) > 1 ~ "seeContrasts" ), Estimate = case_when( - predictor_class == "continuous" ~ lme_i_summary$Estimate[match(term, lme_i_summary$variable)][[1]], + predictor_class == "continuous" ~ lme_i_summary$Estimate[match(term, lme_i_summary$variable)][[1]], predictor_class == "two-level-categorical" ~ lme_i_summary$Estimate[match(term, lme_i_summary$variable)][[1]], - predictor_class == "interaction" & Estimate_source=="lme_summary" ~ lme_i_summary$Estimate[grep(":", lme_i_summary$variable)][[1]], - predictor_class == "interaction" & Estimate_source == "seeContrasts" ~ NA, - .default = NA + # Guard: grep(":", ...) returns integer(0) when the model has no interaction + # coefficient (e.g. a main-effects-only formula). [[1]] on integer(0) errors, + # so fall through to .default = NA_real_ via the length check. + predictor_class == "interaction" & Estimate_source == "lme_summary" & + length(grep(":", lme_i_summary$variable)) >= 1 ~ + lme_i_summary$Estimate[grep(":", lme_i_summary$variable)[1L]], + .default = NA_real_ ), Estimate_SE = case_when( - predictor_class == "continuous" ~ lme_i_summary$`Std. Error`[match(term, lme_i_summary$variable)][[1]], + predictor_class == "continuous" ~ lme_i_summary$`Std. Error`[match(term, lme_i_summary$variable)][[1]], predictor_class == "two-level-categorical" ~ lme_i_summary$`Std. Error`[match(term, lme_i_summary$variable)][[1]], - predictor_class == "interaction" & Estimate_source=="lme_summary" ~ lme_i_summary$`Std. Error`[grep(":", lme_i_summary$variable)][[1]], - predictor_class == "interaction" & Estimate_source == "seeContrasts" ~ NA, - .default = NA + predictor_class == "interaction" & Estimate_source == "lme_summary" & + length(grep(":", lme_i_summary$variable)) >= 1 ~ + lme_i_summary$`Std. Error`[grep(":", lme_i_summary$variable)[1L]], + .default = NA_real_ ) ) - - # 4. Contrasts - - # --- CONTRASTS (Dynamic) --- - - if(!is.null(run_contrast)){ - - spec_formula <- as.formula(paste("~", paste(contrast_vars, collapse = "|"))) - - if(length(contrast_vars)==2 & !is.null(contrast_var_2_levels)){ # set levels of second term if multiple terms provided. - by_list<-set_names(list(contrast_var_2_levels), paste(contrast_vars[2]))} else { - by_list<-NULL - } - - emm_1st <- emmeans(lme_i, spec = spec_formula, at=by_list, data=model_data) %>% - contrast(method = contrasts_primary, adjust = "none") - - if(!is.null(contrasts_secondary) & length(contrasts_primary)>1){ - - emm_2nd <- contrast(emm_1st, method = contrasts_secondary, adjust = "none") - - contrast_res <- bind_rows( - as.data.frame(emm_1st) %>% mutate(contrast_order = "first_order"), - as.data.frame(emm_2nd) %>% mutate(contrast_order = "second_order") - ) %>% mutate(gene = gene_i) + + # --- CONTRASTS --- + if (isTRUE(run_contrast)) { + + is_interaction <- any(grepl(":", contrast_vars)) + + if (is_interaction) { + + # ---- BRANCH A: Interaction contrast via pre-computed contrasts_list ---- + # contrasts_list and spec_lookup were built once in geneLME() before + # parallel dispatch — no per-gene reconstruction needed here. + vars <- strsplit(contrast_vars, ":")[[1]] + var_a <- vars[1] + var_b <- vars[2] + + emm_obj <- emmeans( + lme_i, + spec = as.formula(paste("~", var_a, "*", var_b)), + data = model_data + ) + + emm_1st_A <- contrast(emm_obj, method = contrasts_list, adjust = "none") + + if (!is.null(contrasts_secondary)) { + # suppressWarnings: emmeans internally calls lmer on the first-order contrast + # estimates when computing second-order contrasts; this can emit a benign + # lme4 scale warning when the estimate/SE values differ in magnitude from + # the original predictors. The output is unaffected — the warning is cosmetic. + emm_2nd_A <- suppressWarnings( + contrast(emm_1st_A, method = contrasts_secondary, adjust = "none") + ) + contrast_res <- bind_rows( + # First-order: join ref/lvl from pre-computed spec_lookup by contrast name + as.data.frame(emm_1st_A) %>% + mutate(contrast_order = "first_order") %>% + left_join(spec_lookup, by = "contrast"), + # Second-order: contrasts-of-contrasts have no single ref/lvl pair + as.data.frame(emm_2nd_A) %>% + mutate(contrast_order = "second_order", + contrast_ref = NA_character_, + contrast_lvl = NA_character_) + ) %>% mutate(gene = gene_name, model_status = model_status) + } else { + contrast_res <- as.data.frame(emm_1st_A) %>% + mutate(contrast_order = "first_order") %>% + left_join(spec_lookup, by = "contrast") %>% + mutate(gene = gene_name, model_status = model_status) + } + + } else { + + # ---- BRANCH B: Non-interaction ---- + spec_formula <- as.formula(paste("~", paste(contrast_vars, collapse = "|"))) + + by_list <- if (length(contrast_vars) == 2 && !is.null(contrast_var_2_levels)) { + setNames(list(contrast_var_2_levels), contrast_vars[2]) + } else { + NULL + } + + emm_1st <- emmeans(lme_i, spec = spec_formula, at = by_list, data = model_data) %>% + contrast(method = contrasts_primary, adjust = "none") + + if (!is.null(contrasts_secondary)) { + # suppressWarnings: same benign lme4 scale warning as Branch A second-order step. + emm_2nd <- suppressWarnings( + contrast(emm_1st, method = contrasts_secondary, adjust = "none") + ) + contrast_res <- bind_rows( + # Branch B uses named contrast vectors, not a ref/lvl spec — set NA for both orders + as.data.frame(emm_1st) %>% + mutate(contrast_order = "first_order", + contrast_ref = NA_character_, + contrast_lvl = NA_character_), + as.data.frame(emm_2nd) %>% + mutate(contrast_order = "second_order", + contrast_ref = NA_character_, + contrast_lvl = NA_character_) + ) %>% mutate(gene = gene_name, model_status = model_status) + } else { + contrast_res <- as.data.frame(emm_1st) %>% + mutate(contrast_order = "first_order", + contrast_ref = NA_character_, + contrast_lvl = NA_character_, + gene = gene_name, + model_status = model_status) + } + } + } else { - contrast_res <- - as.data.frame(emm_1st) %>% mutate(contrast_order = "first_order") - }} else {contrast_res=NULL} - - # Return list of results. Once this function returns, 'lme_i' is destroyed. - list(aic = aic_res, anova = lme_i_anova_tab, contrasts = contrast_res, model_status=setNames(c("success"), gene_i)) - + contrast_res <- NULL + } + + list( + aic = aic_res, + anova = lme_i_anova_tab, + contrasts = contrast_res, + model_status = setNames(model_status, gene_name) + ) + }, error = function(e) { - # This block triggers for actual errors AND our forced 'singular fit' stop - err_msg <- as.character(e$message) - + err_msg <- conditionMessage(e) list( - aic = data.frame(gene = gene_i, AIC = NA, model_status = err_msg), - # Return a 1-row ANOVA skeleton so the gene is not lost - + aic = data.frame(gene = gene_name, AIC = NA_real_), anova = data.frame( - term = NA, statistic=NA, df=NA, p.value =NA, gene=gene_i, - predictor_class=NA, Estimate_source=NA, Estimate=NA, Estimate_SE=NA, - model_status = err_msg), - # Return a 1-row Contrast skeleton - contrasts=data.frame(contrast =NA, estimate=NA, SE=NA, df=NA, t.ratio=NA, p.value=NA, - contrast_order=NA, gene=gene_i, model_status = err_msg), - model_status= setNames(paste(err_msg), gene_i) + term = NA_character_, + statistic = NA_real_, + df = NA_real_, + p.value = NA_real_, + gene = gene_name, + model_status = err_msg, + predictor_class = NA_character_, + Estimate_source = NA_character_, + Estimate = NA_real_, + Estimate_SE = NA_real_ + ), + contrasts = data.frame( + contrast = NA_character_, + estimate = NA_real_, + SE = NA_real_, + df = NA_real_, + t.ratio = NA_real_, + p.value = NA_real_, + contrast_order = NA_character_, + contrast_ref = NA_character_, + contrast_lvl = NA_character_, + gene = gene_name + ), + model_status = setNames(err_msg, gene_name) ) }) - - + return(result) } +######################################################## +# geneLME_compiler +# Aggregates list of per-gene results into named result tables, +# then appends FDR-adjusted p-values within each grouping unit. +# +# FDR grouping strategy: +# lme_anova: adjust within each model term (across all genes). +# Each term's p-values form one adjustment set. +# lme_contrast: adjust within each contrast x contrast_order combination +# (across all genes). Branch B contrast labels already encode +# the 'by' variable level (e.g. "TrtC vs TrtA, visit = V2"), +# so grouping by contrast alone is sufficient. +# NA p-values (failed gene models) are preserved as NA in p.value_adj. +# singular_fit genes are included in the adjustment set (their p-values +# are valid; users filter on model_status if they wish to exclude them). +######################################################## + +geneLME_compiler <- function(fit, fdr_method = "BH", contrast_spec = NULL) { + + lme_anova <- map_dfr(fit, "anova") %>% + group_by(term) %>% + mutate(p.value_adj = p.adjust(p.value, method = fdr_method)) %>% + relocate(p.value_adj, .after = p.value)%>% + ungroup() + + lme_contrast_raw <- map_dfr(fit, "contrasts") + lme_contrast <- if (ncol(lme_contrast_raw) > 0 && + all(c("contrast", "contrast_order") %in% colnames(lme_contrast_raw))) { + lme_contrast_raw %>% + group_by(contrast, contrast_order) %>% + mutate(p.value_adj = p.adjust(p.value, method = fdr_method)) %>% + relocate(p.value_adj, .after = p.value)%>% + ungroup() + } else { + lme_contrast_raw # no contrasts run; return as-is (empty or NULL-row stub) + } + + list( + lme_anova = lme_anova, + lme_contrast = lme_contrast, + lme_fit = map_dfr(fit, "aic"), + lme_err = map_chr(fit, "model_status"), + contrast_spec = contrast_spec # indexed spec; NULL when no contrasts run + ) +} + + +######################################################## +# geneLME_dispatch +# Runs future_lapply with explicit global declaration to prevent +# future's automatic environment scan from capturing large objects. +# +# Key design decisions: +# 1. Iterate over an integer index — a plain integer sequence carries +# no environment baggage for future to scan. +# 2. All shared objects passed via future.globals, bypassing automatic +# scanning entirely. +# +# contrast_spec replaced by contrasts_list + spec_lookup (Branch A) +# in both the function signature and future.globals. +######################################################## -# Separate function for compiling results +geneLME_dispatch <- function(gene_data_list, + targets_df, + formula_str, + run_contrast, + contrast_vars, + contrast_var_2_levels, + contrasts_list, + spec_lookup, + contrasts_primary, + contrasts_secondary) { + n_genes <- length(gene_data_list) -geneLME_compiler<-function(fit){ - list(lme_anova_tab= map_dfr(fit, "anova"), # gather anova tables - lme_contrast=map_dfr(fit, "contrasts"), # gather contrast tables - lme_fit=map_dfr(fit, "aic"), # gather model fit - lme_err=map_chr(fit, "model_status") %>% # extract the status string for each gene & preserve names - set_names(map_chr(fit, ~ names(.x$model_status)))) + # Suppress "package 'X' was built under R version Y" warnings that + # future re-raises on the main process after collecting worker results. + # These are cosmetic version-mismatch notices — not actionable errors. + withCallingHandlers( + future.apply::future_lapply( + seq_len(n_genes), + FUN = function(i) { + gene_data <- gene_data_list[[i]] + geneLME_fit( + gene_name = gene_data$gene_name, + expression_vec = gene_data$expression_vec, + weight_vec = gene_data$weight_vec, + targets = targets_df, + formula_str = formula_str, + run_contrast = run_contrast, + contrast_vars = contrast_vars, + contrast_var_2_levels = contrast_var_2_levels, + contrasts_list = contrasts_list, + spec_lookup = spec_lookup, + contrasts_primary = contrasts_primary, + contrasts_secondary = contrasts_secondary + ) + }, + future.globals = list( + gene_data_list = gene_data_list, + targets_df = targets_df, + formula_str = formula_str, + run_contrast = run_contrast, + contrast_vars = contrast_vars, + contrast_var_2_levels = contrast_var_2_levels, + contrasts_list = contrasts_list, + spec_lookup = spec_lookup, + contrasts_primary = contrasts_primary, + contrasts_secondary = contrasts_secondary, + geneLME_fit = geneLME_fit + ), + future.seed = TRUE + ), + warning = function(w) { + if (grepl("was built under R version", conditionMessage(w))) { + invokeRestart("muffleWarning") + } + }) } +######################################################## +# geneLME +# User-facing wrapper: validates inputs, sets up parallel plan, +# pre-extracts per-gene data, dispatches geneLME_fit in parallel. +# +# Calls geneLME_build_contrast_args() for Branch A to pre-compute +# contrasts_list and spec_lookup before the parallel stage. +######################################################## + +geneLME <- + function(dat, + formula_str, + model_weights = NULL, + run_contrast = NULL, + contrast_vars = NULL, + contrast_var_2_levels = NULL, + contrast_spec = NULL, # data.frame(contrast_ref, contrast_lvl) or NULL + # required when contrast_vars contains ":" + contrasts_primary = NULL, + contrasts_secondary = NULL, + fdr_method = "BH", # any method accepted by p.adjust() + n_cores = NULL) { + + # --- PRE-FLIGHT VALIDATION --- + + # Build a local formula object solely for variable checking — never + # passed to workers. Workers reconstruct the formula from formula_str. + formula_obj_local <- as.formula(paste("expression", formula_str)) + + # Initialise objects populated during the contrast validation block below. + indexed_contrast_spec <- NULL + contrasts_list <- NULL # Branch A: pre-computed named contrast vectors + spec_lookup <- NULL # Branch A: pre-computed ref/lvl join table + + # 1. Formula variables present in targets + required_vars <- all.vars(formula_obj_local) + required_vars <- required_vars[required_vars != "expression"] + + missing_vars <- setdiff(required_vars, colnames(dat$targets)) + if (length(missing_vars) > 0) { + stop(paste( + "The following variables in the formula are missing from dat$targets:", + paste(missing_vars, collapse = ", ") + )) + } + + # 2. Weights alignment + if (isTRUE(model_weights)) { + if (is.null(dat$weights)) { + stop("model_weights = TRUE but dat$weights is NULL.") + } + if (!identical(dim(dat$weights), dim(dat$E))) { + stop("Dimensions of dat$weights do not match dat$E.") + } + } + + # 3. Contrast variables and spec + if (isTRUE(run_contrast)) { -geneLME <- - function(dat, - formula_str, - model_weights=NULL, - run_contrast=NULL, - contrast_vars=NULL, - contrast_var_2_levels=NULL, - contrasts_primary=NULL, - contrasts_secondary=NULL, - n_cores = NULL) { - -#. --- PRE-FLIGHT DATA VALIDATION --- - - # 1. Check Formula Variables - # Extract all variables from the formula (excluding the LHS 'expression') - required_vars <- all.vars(formula_obj) - required_vars <- required_vars[required_vars != "expression"] - - missing_vars <- setdiff(required_vars, colnames(dat$targets)) - if (length(missing_vars) > 0) { - stop(paste("The following variables in the formula are missing from dat$targets:", - paste(missing_vars, collapse = ", "))) + if (is.null(contrast_vars)) { + stop("run_contrast = TRUE but contrast_vars is NULL.") } - - # 2. Check Weights Alignment - if (!is.null(model_weights)) { - if (is.null(dat$weights)) { - stop("model_weights is specified, but dat$weights is missing.") + + is_interaction <- any(grepl(":", contrast_vars)) + + # 3a. Formula-vs-contrast_vars consistency check. + # Uses terms() so that both "a*b" and "a:b" formula syntax are handled correctly — + # "a*b" expands to include "a:b" in term labels, so both forms are detected. + formula_terms <- attr(terms(formula_obj_local), "term.labels") + + if (is_interaction) { + # For interaction contrasts, verify the interaction term is actually in the model. + # emmeans will silently run contrasts on additive margins if the interaction is absent, + # which is statistically misleading. + ixn_term <- contrast_vars[grep(":", contrast_vars)][1] # e.g. "treatment:visit" + if (!ixn_term %in% formula_terms) { + stop(paste0( + "contrast_vars specifies an interaction contrast for '", ixn_term, "', ", + "but this interaction term is not present in formula_str.\n", + "Either add the interaction to the formula (e.g. '~ treatment * visit + ...') ", + "or change contrast_vars to a non-interaction term." + )) } - if (nrow(dat$weights) != nrow(dat$E) || ncol(dat$weights) != ncol(dat$E)) { - stop("Dimensions of dat$weights do not match dat$E. Weights must be a matrix of same size.") + } else { + # For non-interaction contrasts, verify each contrast variable is in the formula + # as a main effect. Warn (not error) in case it is part of an interaction only. + for (cv in contrast_vars) { + if (!cv %in% formula_terms) { + warning(paste0( + "contrast_vars includes '", cv, "' but this term does not appear as a ", + "main effect in formula_str. Contrasts may be computed from interaction ", + "margins only — verify this is the intended model structure." + )) + } } } - - # 3. Check Contrast Variables & Levels - if (!is.null(run_contrast)) { - # Verify contrast_vars exist in targets - missing_contrast_vars <- setdiff(contrast_vars, colnames(dat$targets)) - if (length(missing_contrast_vars) > 0) { - stop(paste("contrast_vars missing from dat$targets:", - paste(missing_contrast_vars, collapse = ", "))) + + # 3b. contrast_spec validation (required for interaction contrasts) + if (is_interaction && is.null(contrast_spec)) { + stop(paste0( + "contrast_vars specifies an interaction contrast ('", contrast_vars, "') ", + "but contrast_spec is NULL.\n", + "Use geneLME_contrast_spec(dat$targets, contrast_vars = '", contrast_vars, + "') to generate a template, filter it to your contrasts of interest, ", + "then pass it as contrast_spec." + )) + } + + if (!is.null(contrast_spec)) { + if (!is.data.frame(contrast_spec)) { + stop("contrast_spec must be a data.frame.") + } + if (!all(c("contrast_ref", "contrast_lvl") %in% colnames(contrast_spec))) { + stop("contrast_spec must have columns 'contrast_ref' and 'contrast_lvl'.") + } + if (!is_interaction) { + warning("contrast_spec is provided but contrast_vars contains no interaction term (':'). ", + "contrast_spec is designed for interaction contrasts — did you mean to use contrasts_primary?") } - - # Verify Level Filtering for the 'by' variable - if (length(contrast_vars) == 2 && !is.null(contrast_var_2_levels)) { - actual_levels <- unique(as.character(dat$targets[[contrast_vars[2]]])) - invalid_levels <- setdiff(contrast_var_2_levels, actual_levels) - - if (length(invalid_levels) > 0) { - stop(paste0("The specified levels for '", contrast_vars[2], - "' do not exist in the data: ", paste(invalid_levels, collapse = ", "))) + + # Build the indexed contrast_spec that will be attached to the return value. + # contrast_index here is simply 1:nrow(contrast_spec) — the actual row position + # within the filtered spec passed by the user. This is what contrasts_secondary + # vectors must index into (not any index from the full unfiltered template). + indexed_contrast_spec <- contrast_spec %>% + mutate(contrast_index = seq_len(n())) %>% + select(contrast_index, everything()) + + # Print the indexed spec as a reminder whenever contrasts_secondary is provided. + if (is_interaction && !is.null(contrasts_secondary)) { + n_first <- nrow(contrast_spec) + message( + "contrasts_secondary will be applied to the first-order interaction contrasts ", + "in the order they appear in contrast_spec (", n_first, " contrasts):\n", + paste(seq_len(n_first), + paste(contrast_spec$contrast_lvl, contrast_spec$contrast_ref, sep = " - "), + sep = ". ", collapse = "\n"), + "\nEnsure contrasts_secondary vectors have length ", n_first, + ", with each element corresponding to the contrast at that position.", + "\nThe indexed contrast_spec is returned as $contrast_spec in the output." + ) + + # Soft-fail: wrong-length vectors produce silent NAs deep inside geneLME_fit(). + # Catch this here and return early with only $contrast_spec populated so the + # user has the indexed reference they need to fix their vectors. + bad_lens <- sapply(contrasts_secondary, length) + bad_names <- names(bad_lens)[bad_lens != n_first] + if (length(bad_names) > 0) { + message( + "\ncontrasts_secondary vector(s) have wrong length — returning early without running models.\n", + "Expected length ", n_first, " (= nrow(contrast_spec) after filtering).\n", + "Offending vector(s):\n", + paste0(" '", bad_names, "': length ", bad_lens[bad_names], collapse = "\n"), "\n\n", + "Common cause: vectors were built against nrow() of the full unfiltered\n", + "spec_template rather than nrow() of the filtered contrast_spec passed here.\n", + "Use $contrast_spec in the returned object to re-specify your secondary contrast vectors:\n", + " each vector must have length ", n_first, ", one element per row of $contrast_spec." + ) + return(invisible(list( + lme_anova = NULL, + lme_contrast = NULL, + lme_fit = NULL, + lme_err = NULL, + contrast_spec = indexed_contrast_spec + ))) } } } - - # 4. Check for Character columns that should be Factors - # emmeans and lmer handle strings, but it's safer to warn if a predictor has 100+ unique strings - for (v in required_vars) { - if (is.character(dat$targets[[v]]) && length(unique(dat$targets[[v]])) > (nrow(dat$targets)/2)) { - warning(paste("Variable '", v, "' has a very high number of unique strings. Is it a unique ID instead of a factor?")) + + # --- PRE-COMPUTE BRANCH A CONTRAST STRUCTURES --- + # Build contrasts_list and spec_lookup once here, before parallel dispatch. + # Workers receive these ready-made objects instead of rebuilding from + # contrast_spec on every gene — eliminates nrow(contrast_spec) × n_genes + # iterations of R-level vector construction. + if (is_interaction) { + contrast_args <- geneLME_build_contrast_args(dat$targets, contrast_vars, contrast_spec) + contrasts_list <- contrast_args$contrasts_list + spec_lookup <- contrast_args$spec_lookup + } + + # 3c. contrast_vars present in targets (split on ":" for interaction) + vars_to_check <- unique(unlist(strsplit(contrast_vars, ":"))) + missing_contrast_vars <- setdiff(vars_to_check, colnames(dat$targets)) + if (length(missing_contrast_vars) > 0) { + stop(paste( + "contrast_vars not found in dat$targets:", + paste(missing_contrast_vars, collapse = ", ") + )) + } + + # 3d. contrast_var_2_levels validation (Branch B only) + if (!is_interaction && length(contrast_vars) == 2 && !is.null(contrast_var_2_levels)) { + actual_levels <- unique(as.character(dat$targets[[contrast_vars[2]]])) + invalid_levels <- setdiff(contrast_var_2_levels, actual_levels) + if (length(invalid_levels) > 0) { + stop(paste0( + "Levels specified for '", contrast_vars[2], + "' not found in data: ", paste(invalid_levels, collapse = ", ") + )) } } - - message("Input check complete: Data formatting looks good. Starting parallel LME fit...") - - - - - # Set Parallel Plan - workers <- if(is.null(n_cores)) parallel::detectCores() - 4 else n_cores - plan(multisession, workers = workers) - - # Convert string formula to object - - formula_obj <- as.formula(paste0("expression", formula_str)) - - # Run in parallel - # We pass the contrast specs into the function - fit_results <- future_lapply( - 1:nrow(dat$E), - FUN = function(x){ - geneLME_fit(i = x, - dat = dat, - formula_obj = formula_obj, - model_weights=model_weights, - run_contrast=run_contrast, - contrast_vars = contrast_vars, - contrast_var_2_levels=contrast_var_2_levels, - contrasts_primary = contrasts_primary, - contrasts_secondary = contrasts_secondary)}, - future.seed = TRUE - ) - - return(geneLME_compiler(fit_results)) -} + + # 3e. Branch B: build indexed contrast_spec from contrasts_primary names, + # and soft-fail on wrong-length contrasts_secondary vectors. + if (!is_interaction && !is.null(contrasts_primary)) { + indexed_contrast_spec <- data.frame( + contrast_index = seq_along(contrasts_primary), + contrast_name = names(contrasts_primary), + stringsAsFactors = FALSE + ) + + if (!is.null(contrasts_secondary)) { + n_primary <- length(contrasts_primary) + bad_lens <- sapply(contrasts_secondary, length) + bad_names <- names(bad_lens)[bad_lens != n_primary] + if (length(bad_names) > 0) { + message( + "\ncontrasts_secondary vector(s) have wrong length — returning early without running models.\n", + "Expected length ", n_primary, " (= number of contrasts_primary vectors).\n", + "Offending vector(s):\n", + paste0(" '", bad_names, "': length ", bad_lens[bad_names], collapse = "\n"), "\n\n", + "Use $contrast_spec in the returned object to confirm the primary contrast ordering:\n", + " each contrasts_secondary vector must have length ", n_primary, + ", one element per row of $contrast_spec." + ) + return(invisible(list( + lme_anova = NULL, + lme_contrast = NULL, + lme_fit = NULL, + lme_err = NULL, + contrast_spec = indexed_contrast_spec + ))) + } + } + } + } + + # 4. fdr_method must be a recognised p.adjust method + valid_fdr_methods <- p.adjust.methods # exported character vector from stats + if (!fdr_method %in% valid_fdr_methods) { + stop(paste0( + "'", fdr_method, "' is not a valid p.adjust method.\n", + "Choose one of: ", paste(valid_fdr_methods, collapse = ", ") + )) + } + + # 5. Warn on likely ID columns used as predictors + for (v in required_vars) { + if (is.character(dat$targets[[v]]) && + length(unique(dat$targets[[v]])) > nrow(dat$targets) / 2) { + warning(paste0( + "Variable '", v, "' has many unique string values — is it an ID column rather than a predictor?" + )) + } + } + + message("Input validation passed. Launching parallel LME fits...") + + # --- PRE-SLICE INTO PER-GENE LIST --- + W_mat <- if (isTRUE(model_weights)) dat$weights else NULL + + gene_data_list <- lapply(rownames(dat$E), function(g) { + list( + gene_name = g, + expression_vec = dat$E[g, ], + weight_vec = if (!is.null(W_mat)) W_mat[g, ] else NULL + ) + }) + + targets_df <- dat$targets + + # --- PARALLEL PLAN --- + workers <- if (is.null(n_cores)) max(1L, parallel::detectCores() - 4L) else n_cores + future::plan(future::multisession, workers = workers) + on.exit(future::plan(future::sequential), add = TRUE) + + # --- DISPATCH via clean-scope helper --- + fit_results <- geneLME_dispatch( + gene_data_list = gene_data_list, + targets_df = targets_df, + formula_str = formula_str, + run_contrast = run_contrast, + contrast_vars = contrast_vars, + contrast_var_2_levels = contrast_var_2_levels, + contrasts_list = contrasts_list, + spec_lookup = spec_lookup, + contrasts_primary = contrasts_primary, + contrasts_secondary = contrasts_secondary + ) + + return(geneLME_compiler(fit_results, fdr_method = fdr_method, + contrast_spec = indexed_contrast_spec)) + } ######################################################## -# Testing +# Testing (see geneLME_test.R for full test suite) ######################################################## -# dat_sub<-RNAetc::subset_voom(dat, gene_keep = head(rownames(dat$E), 100)) -# -# test_mods<- + +# --- Branch A: interaction contrast via contrast_spec --- +# # Step 1: generate level reference template +# spec_template <- geneLME_contrast_spec( +# targets = dat$targets, +# contrast_vars = "treatment:visit" # interaction string +# ) +# # Step 2: filter to contrasts of interest +# my_spec <- spec_template %>% +# dplyr::filter(...) # e.g. same-visit cross-treatment, or within-treatment longitudinal +# +# test_mods_A <- # geneLME( -# dat = dat_sub, -# formula_str = "~ treatment*visit + rNANgUl + percent_duplication + median_cv_coverage + (1|ptID)", -# model_weights = TRUE, -# run_contrast=TRUE, -# contrast_vars = c("treatment", "visit"), -# contrast_var_2_levels=c("-1", "S9", "S13"), -# contrasts_primary = list("Dual vs Placebo"=c(-1, 0, 1), "SLIT vs Placebo"=c(-1, 1, 0)), -# contrasts_secondary = list("Second Order Test"= c(1, -1)), -# n_cores = 10 -# +# dat = dat_sub, +# formula_str = "~ treatment * visit + age + sex + rNANgUl + (1|ptID)", +# model_weights = TRUE, +# run_contrast = TRUE, +# contrast_vars = "treatment:visit", # must match an interaction term in formula_str +# contrast_spec = my_spec, # required for interaction contrasts +# n_cores = 10 +# ) +# +# # Optional: second-order contrasts on top of Branch A first-order contrasts. +# # Vectors must have length == nrow(my_spec), ordered to match my_spec rows. +# # geneLME() will print the numbered list of first-order contrasts at validation time. +# test_mods_A2 <- +# geneLME( +# dat = dat_sub, +# formula_str = "~ treatment * visit + age + sex + rNANgUl + (1|ptID)", +# model_weights = TRUE, +# run_contrast = TRUE, +# contrast_vars = "treatment:visit", +# contrast_spec = my_spec, +# contrasts_secondary = list( +# "TrtA vs TrtB: V3 minus V2 effect" = c(1, 0, -1, 0, 0, 0), +# "TrtA vs TrtC: V3 minus V2 effect" = c(0, 1, 0, -1, 0, 0) +# ), +# n_cores = 10 # ) - -#PICKUP HERE: - -# needs to debug scaling up and hadling of parallel tasks. We didn't sort this out perfectlt. -# need to modify how contrasts are handled to be able to specify multiple levels of interaction contrasts. - +# --- Branch B: non-interaction --- +# # Step 1 (optional): inspect available levels for each contrast variable +# geneLME_contrast_spec(dat$targets, contrast_vars = "treatment") # reference only +# # Treatment levels (alphabetical): TrtA, TrtB, TrtC +# # contrasts_primary vectors have length 3: positions = [TrtA, TrtB, TrtC] +# +# test_mods_B <- +# geneLME( +# dat = dat_sub, +# formula_str = "~ treatment + visit + age + sex + rNANgUl + (1|ptID)", +# model_weights = TRUE, +# run_contrast = TRUE, +# contrast_vars = c("treatment", "visit"), +# contrast_var_2_levels = c("V2", "V3"), +# contrasts_primary = list("TrtC vs TrtA" = c(-1, 0, 1), +# "TrtB vs TrtA" = c(-1, 1, 0)), +# contrasts_secondary = list("TrtC vs TrtB" = c(1, -1)), +# n_cores = 10 +# ) diff --git a/R_functions/geneLME_benchmark.Rmd b/R_functions/geneLME_benchmark.Rmd new file mode 100644 index 0000000..2f27d20 --- /dev/null +++ b/R_functions/geneLME_benchmark.Rmd @@ -0,0 +1,635 @@ +--- +title: "geneLME vs kimma: Benchmarking Report" +subtitle: "Estimate accuracy and computational efficiency" +date: "`r Sys.Date()`" +output: + html_document: + toc: true + toc_float: + collapsed: false + smooth_scroll: true + toc_depth: 3 + theme: flatly + highlight: tango + code_folding: hide + df_print: paged +--- + +```{r setup, include=FALSE} +knitr::opts_chunk$set( + echo = TRUE, + message = FALSE, + warning = FALSE, + comment = "#>", + fig.align = "center", + fig.width = 8, + fig.height = 5 +) +``` + +--- + +# Overview + +This report benchmarks `geneLME()` against `kimma::kmFit()`, a published package for +per-gene linear mixed effects modelling of RNA-seq data. Both functions use the same +underlying engines (`lme4` for model fitting, `emmeans` for contrast estimation), so +they should produce statistically identical results — but differ in contrast specification +philosophy and computational efficiency. + +**Three questions addressed:** + +1. **Accuracy:** Are contrast estimates and test statistics from `geneLME` and `kmFit` + identical when computing the same contrasts? +2. **Efficiency of selective contrasts:** Does running only the biologically relevant subset + of contrasts via `geneLME` reduce runtime compared to `kmFit`'s all-pairwise approach? +3. **Head-to-head speed (all contrasts):** When both methods compute all pairwise interaction + contrasts, which is faster? + +**Key design difference:** + +| Feature | `geneLME` | `kmFit` | +|---|---|---| +| Contrast specification | User-defined subset via `contrast_spec` | All pairwise automatically | +| Singular fit handling | Treated as error → NA | Estimates returned (no check) | +| Parallelisation | `future_lapply` (multisession) | Native parallel via `processors` | +| Second-order contrasts | Supported | Not supported | +| FDR adjustment | Within-term / within-contrast grouping | BH across all genes per variable | + +--- + +# Setup + +```{r libs} +library(lme4) +library(emmeans) +library(car) +library(broom.mixed) +library(dplyr) +library(tibble) +library(purrr) +library(future) +library(future.apply) +library(kimma) +library(microbenchmark) +library(ggplot2) +library(tidyr) +library(scales) + +source("geneLME_dev.R") +``` + +--- + +# Mock Data (2,000 genes) + +The benchmark uses a simulated dataset matching a realistic paired longitudinal RNA-seq +design: **10 patients × 3 treatments × 4 visits = 120 samples, 2,000 genes**. + +Genes 1–100 have a simulated `TrtC:V3` upregulation of +2.5 log2 units to provide a +detectable interaction signal. A patient-level random effect is added to all genes. + +```{r mock-data} +set.seed(42) + +n_patients <- 10 +treatments <- c("TrtA", "TrtB", "TrtC") +visits <- c("V1", "V2", "V3", "V4") +n_genes <- 2000 +n_signal <- 100 # genes with simulated TrtC:V3 effect + +# Patient-level covariates +patient_meta <- data.frame( + ptID = paste0("pt", sprintf("%02d", 1:n_patients)), + sex = factor(sample(c("M", "F"), n_patients, replace = TRUE)), + age = round(rnorm(n_patients, mean = 38, sd = 10)), + stringsAsFactors = FALSE +) + +# Sample metadata (targets) — includes libID for kimma compatibility +targets <- expand.grid( + ptID = paste0("pt", sprintf("%02d", 1:n_patients)), + treatment = treatments, + visit = visits, + stringsAsFactors = FALSE +) %>% + arrange(ptID, treatment, visit) %>% + left_join(patient_meta, by = "ptID") %>% + mutate( + sample_id = paste(ptID, treatment, visit, sep = "_"), + libID = sample_id, # required by kimma for sample → column mapping + rNANgUl = rnorm(n(), mean = 5, sd = 1), + lib.size = round(rnorm(n(), mean = 20e6, sd = 3e6)), + norm.factors = rnorm(n(), mean = 1, sd = 0.05) + ) +rownames(targets) <- targets$sample_id + +n_samples <- nrow(targets) + +# Expression matrix +E_mat <- matrix( + rnorm(n_genes * n_samples, mean = 8, sd = 2), + nrow = n_genes, ncol = n_samples, + dimnames = list( + paste0("gene", sprintf("%04d", 1:n_genes)), + targets$sample_id + ) +) + +# Simulated TrtC:V3 effect on signal genes +trtC_v3 <- which(targets$treatment == "TrtC" & targets$visit == "V3") +E_mat[1:n_signal, trtC_v3] <- E_mat[1:n_signal, trtC_v3] + 2.5 + +# Patient random effect +for (pt in unique(targets$ptID)) { + idx <- which(targets$ptID == pt) + E_mat[, idx] <- E_mat[, idx] + rnorm(1, 0, 1) +} + +# Voom-like precision weights +W_mat <- matrix( + abs(rnorm(n_genes * n_samples, mean = 1, sd = 0.1)), + nrow = n_genes, ncol = n_samples, + dimnames = dimnames(E_mat) +) + +# Full 2000-gene dataset (used for benchmarking) +dat_bench <- list(E = E_mat, weights = W_mat, targets = targets) + +# 50-gene subset (used for accuracy comparison — faster repeated runs) +dat_50 <- list( + E = E_mat[1:50, ], + weights = W_mat[1:50, ], + targets = targets +) + +cat("Design:", n_patients, "patients ×", length(treatments), "treatments ×", + length(visits), "visits =", n_samples, "samples\n") +cat("Full dataset:", n_genes, "genes\n") +cat("Accuracy subset:", nrow(dat_50$E), "genes\n") +cat("Signal genes (TrtC:V3 +2.5 log2):", n_signal, "\n") +``` + +--- + +# Contrast Specification + +## All pairwise interaction contrasts (66 total) + +`kmFit` always computes all pairwise contrasts when given `contrast_var = "treatment:visit"`. +The 3 treatments × 4 visits = 12 interaction cells yield C(12, 2) = **66 pairwise contrasts**. + +`geneLME` uses an explicit `contrast_spec` data frame. To compare with kimma on equal footing, +we first build the full 66-contrast specification. + +```{r contrast-spec-full} +spec_full <- geneLME_contrast_spec( + targets = dat_bench$targets, + contrast_vars = "treatment:visit" +) + +cat("Full pairwise spec: ", nrow(spec_full), "contrasts\n") +``` + +## Selective contrast subsets + +For the efficiency demonstration, we also define two biologically focused subsets: + +```{r contrast-spec-subsets} +# Subset A: between-treatment comparisons within the same visit (V2 and V3 only) +spec_6 <- spec_full %>% + filter( + sub(".* ", "", contrast_ref) == sub(".* ", "", contrast_lvl), # same visit + sub(".* ", "", contrast_ref) %in% c("V2", "V3") # V2 and V3 + ) + +# Subset B: longitudinal (V2→V3) within the same treatment +spec_3 <- spec_full %>% + filter( + sub(" .*", "", contrast_ref) == sub(" .*", "", contrast_lvl), # same treatment + sub(".* ", "", contrast_ref) == "V2", + sub(".* ", "", contrast_lvl) == "V3" + ) + +cat("Subset A (between-treatment, V2 & V3):", nrow(spec_6), "contrasts\n") +cat("Subset B (longitudinal V2→V3, same treatment):", nrow(spec_3), "contrasts\n") +cat("Full set (all pairwise):", nrow(spec_full), "contrasts\n") + +knitr::kable( + data.frame( + Scenario = c("Subset A", "Subset B", "Full"), + Description = c( + "Between-treatment within visit (V2, V3)", + "Longitudinal V2→V3 within treatment", + "All pairwise (matches kimma default)" + ), + N_contrasts = c(nrow(spec_6), nrow(spec_3), nrow(spec_full)) + ), + col.names = c("Scenario", "Description", "N contrasts"), + caption = "Contrast subsets used in benchmarks" +) +``` + +--- + +# 1. Accuracy Comparison + +Both methods run on the **50-gene subset** with the **full 66 pairwise contrasts**, +ensuring they compute exactly the same comparisons. + +> **Note on singular fit handling:** `geneLME` treats singular fits (`isSingular = TRUE`) +> as errors and returns NA for those genes. `kmFit` does not perform this check and returns +> estimates regardless. Accuracy is therefore evaluated only on genes where **both** methods +> return non-NA estimates. + +## Run both methods (50 genes, all 66 contrasts) + +```{r accuracy-run, cache=TRUE} +# geneLME — full 66-contrast spec +res_geneLME_50 <- geneLME( + dat = dat_50, + formula_str = "~ treatment * visit + age + sex + rNANgUl + (1|ptID)", + model_weights = TRUE, + run_contrast = TRUE, + contrast_vars = "treatment:visit", + contrast_spec = spec_full, + n_cores = 8 +) + +# kimma — full pairwise (equivalent to all 66 contrasts) +res_kimma_50 <- suppressMessages( + kmFit( + dat = dat_50, + run_lme = TRUE, + use_weights = TRUE, + model = "~ treatment * visit + age + sex + rNANgUl + (1|ptID)", + run_contrast = TRUE, + contrast_var = "treatment:visit", + patientID = "ptID", + libraryID = "libID", + processors = 8 + ) +) + +cat("geneLME: ", sum(res_geneLME_50$lme_err == "success"), "/ 50 genes succeeded\n") +cat("kimma: ", length(unique(res_kimma_50$lme.contrast$gene)), "/ 50 genes in contrast output\n") +``` + +## Result alignment + +```{r accuracy-join} +# Standardise geneLME columns +glme <- res_geneLME_50$lme_contrast %>% + filter(contrast_order == "first_order", !is.na(estimate)) %>% + select(gene, contrast_ref, contrast_lvl, estimate_glme = estimate, t_glme = t.ratio, + p_glme = p.value) + +# Standardise kimma columns +# kimma contrast direction: estimate = lvl - ref (positive = lvl > ref) +# same convention as geneLME +km <- res_kimma_50$lme.contrast %>% + select(gene, contrast_ref, contrast_lvl, estimate_kimma = estimate, + t_kimma = statistic, p_kimma = pval) + +# Join on gene + ref/lvl pair (both methods use the same "ref - lvl" convention) +joined <- inner_join(glme, km, by = c("gene", "contrast_ref", "contrast_lvl")) + +cat("Matched rows (gene × contrast, both methods non-NA):", nrow(joined), "\n") +cat("Unique genes in comparison:", length(unique(joined$gene)), "\n") +cat("Unique contrasts per gene:", nrow(joined) / length(unique(joined$gene)), "\n") +``` + +## Estimate correlation + +```{r accuracy-scatter-estimate, fig.height=5, fig.width=8} +cor_est <- cor(joined$estimate_glme, joined$estimate_kimma, use = "complete.obs") +mad_est <- mean(abs(joined$estimate_glme - joined$estimate_kimma), na.rm = TRUE) + +ggplot(joined, aes(x = estimate_glme, y = estimate_kimma)) + + geom_point(alpha = 0.15, size = 0.8, colour = "#2c7bb6") + + geom_abline(slope = 1, intercept = 0, colour = "firebrick", linewidth = 0.8, linetype = "dashed") + + annotate("text", x = -Inf, y = Inf, hjust = -0.1, vjust = 1.4, + label = sprintf("r = %.6f\nMAD = %.2e", cor_est, mad_est), + size = 4, family = "mono") + + labs( + title = "Contrast estimates: geneLME vs kimma", + subtitle = sprintf("50 genes × 66 contrasts | n = %d matched pairs", nrow(joined)), + x = "geneLME estimate", + y = "kimma estimate" + ) + + theme_bw(base_size = 12) +``` + +## t-statistic correlation + +```{r accuracy-scatter-t, fig.height=5, fig.width=8} +cor_t <- cor(joined$t_glme, joined$t_kimma, use = "complete.obs") +mad_t <- mean(abs(joined$t_glme - joined$t_kimma), na.rm = TRUE) + +ggplot(joined, aes(x = t_glme, y = t_kimma)) + + geom_point(alpha = 0.15, size = 0.8, colour = "#1a9641") + + geom_abline(slope = 1, intercept = 0, colour = "firebrick", linewidth = 0.8, linetype = "dashed") + + annotate("text", x = -Inf, y = Inf, hjust = -0.1, vjust = 1.4, + label = sprintf("r = %.6f\nMAD = %.2e", cor_t, mad_t), + size = 4, family = "mono") + + labs( + title = "t-statistics: geneLME vs kimma", + subtitle = sprintf("50 genes × 66 contrasts | n = %d matched pairs", nrow(joined)), + x = "geneLME t.ratio", + y = "kimma statistic" + ) + + theme_bw(base_size = 12) +``` + +## Accuracy summary + +```{r accuracy-summary} +# Any divergent pairs (|estimate difference| > 0.01)? +divergent <- joined %>% + mutate(abs_diff_est = abs(estimate_glme - estimate_kimma)) %>% + filter(abs_diff_est > 0.01) %>% + arrange(desc(abs_diff_est)) + +knitr::kable( + data.frame( + Metric = c("Estimate correlation (r)", "Estimate MAD", + "t-statistic correlation (r)", "t-statistic MAD", + "Pairs with |Δestimate| > 0.01"), + Value = c(sprintf("%.8f", cor_est), sprintf("%.2e", mad_est), + sprintf("%.8f", cor_t), sprintf("%.2e", mad_t), + nrow(divergent)) + ), + col.names = c("Metric", "Value"), + caption = "Accuracy summary: geneLME vs kimma (50 genes, 66 contrasts each)" +) +``` + +```{r accuracy-divergent, eval=nrow(divergent) > 0} +cat("Divergent contrast pairs (|Δestimate| > 0.01):\n") +print(divergent %>% select(gene, contrast_ref, contrast_lvl, + estimate_glme, estimate_kimma, abs_diff_est) %>% head(20)) +``` + +> **Interpretation:** Both methods use `lme4::lmer()` for model fitting and `emmeans::contrast()` +> for marginal means — identical numerical results are expected for genes where both methods +> converge. Any small residual differences reflect floating-point precision only. +> Genes absent from geneLME output were flagged as singular fits (`isSingular = TRUE`); +> kimma returns estimates for these regardless. + +--- + +# 2. Speed: Selective vs Full Contrasts + +`geneLME` allows users to specify only the contrasts of biological interest. Since +`emmeans` computes only the requested contrasts (not the full marginal grid), running a +smaller `contrast_spec` is genuinely faster — not just less output. + +`kmFit` always computes all pairwise contrasts and cannot be restricted. + +```{r bench-selective, cache=TRUE} +cat("Running microbenchmark: selective vs full contrasts (2,000 genes, 5 reps each)...\n") +cat("This will take several minutes.\n") + +mb_selective <- microbenchmark( + geneLME_3contrast = { + geneLME( + dat = dat_bench, + formula_str = "~ treatment * visit + age + sex + rNANgUl + (1|ptID)", + model_weights = TRUE, + run_contrast = TRUE, + contrast_vars = "treatment:visit", + contrast_spec = spec_3, + n_cores = 8 + ) + }, + geneLME_6contrast = { + geneLME( + dat = dat_bench, + formula_str = "~ treatment * visit + age + sex + rNANgUl + (1|ptID)", + model_weights = TRUE, + run_contrast = TRUE, + contrast_vars = "treatment:visit", + contrast_spec = spec_6, + n_cores = 8 + ) + }, + geneLME_66contrast = { + geneLME( + dat = dat_bench, + formula_str = "~ treatment * visit + age + sex + rNANgUl + (1|ptID)", + model_weights = TRUE, + run_contrast = TRUE, + contrast_vars = "treatment:visit", + contrast_spec = spec_full, + n_cores = 8 + ) + }, + kimma_66contrast = { + suppressMessages(kmFit( + dat = dat_bench, + run_lme = TRUE, + use_weights = TRUE, + model = "~ treatment * visit + age + sex + rNANgUl + (1|ptID)", + run_contrast = TRUE, + contrast_var = "treatment:visit", + patientID = "ptID", + libraryID = "libID", + processors = 8 + )) + }, + times = 5, + unit = "sec" +) + +print(mb_selective) +``` + +```{r bench-selective-plot, fig.height=5, fig.width=8} +# Summary table +bench_summary <- summary(mb_selective) %>% + as.data.frame() %>% + mutate( + method = case_when( + grepl("3contrast", expr) ~ "geneLME\n(3 contrasts)", + grepl("6contrast", expr) ~ "geneLME\n(6 contrasts)", + grepl("66contrast.*geneLME", expr) ~ "geneLME\n(66 contrasts)", + TRUE ~ "kimma\n(66 contrasts)" + ), + n_contrasts = c(3, 6, 66, 66), + package = ifelse(grepl("kimma", expr), "kimma", "geneLME") + ) + +ggplot(bench_summary, aes(x = reorder(method, median), y = median, + fill = package, colour = package)) + + geom_col(alpha = 0.75, width = 0.6) + + geom_errorbar(aes(ymin = lq, ymax = uq), width = 0.2, linewidth = 0.7) + + scale_fill_manual(values = c("geneLME" = "#2c7bb6", "kimma" = "#d7191c")) + + scale_colour_manual(values = c("geneLME" = "#2c7bb6", "kimma" = "#d7191c")) + + labs( + title = "Runtime: selective vs full contrasts (2,000 genes, 5 reps)", + subtitle = "Error bars = lower and upper quartile across 5 runs", + x = NULL, + y = "Median runtime (seconds)", + fill = "Method", colour = "Method" + ) + + theme_bw(base_size = 12) + + theme(legend.position = "none", + panel.grid.major.x = element_blank()) +``` + +```{r bench-selective-table} +knitr::kable( + bench_summary %>% + mutate(across(c(min, lq, median, mean, uq, max), ~ round(., 1))) %>% + select(Scenario = method, N_contrasts = n_contrasts, + Min = min, Q1 = lq, Median = median, Mean = mean, Q3 = uq, Max = max), + caption = "Runtime summary (seconds) — selective vs full contrasts, 2,000 genes, 5 repetitions" +) +``` + +```{r bench-selective-scaling, fig.height=4, fig.width=7} +# Scaling plot: n contrasts vs time for geneLME +glme_scaling <- bench_summary %>% + filter(package == "geneLME") %>% + mutate(n_contrasts = c(3, 6, 66)) + +ggplot(glme_scaling, aes(x = n_contrasts, y = median)) + + geom_point(size = 3, colour = "#2c7bb6") + + geom_line(colour = "#2c7bb6", linewidth = 0.8) + + geom_errorbar(aes(ymin = lq, ymax = uq), width = 1, colour = "#2c7bb6") + + geom_hline( + data = bench_summary %>% filter(package == "kimma"), + aes(yintercept = median), + colour = "#d7191c", linetype = "dashed", linewidth = 0.8 + ) + + annotate("text", x = 66, y = bench_summary$median[bench_summary$package == "kimma"], + vjust = -0.7, hjust = 1, colour = "#d7191c", size = 3.5, + label = "kimma (66 contrasts)") + + scale_x_continuous(breaks = c(3, 6, 66)) + + labs( + title = "geneLME runtime scales with number of contrasts", + subtitle = "Dashed red line = kimma running the equivalent 66 contrasts", + x = "Number of contrasts specified", + y = "Median runtime (seconds)" + ) + + theme_bw(base_size = 12) +``` + +--- + +# 3. Head-to-Head: geneLME vs kimma (all 66 contrasts) + +When both methods compute the same 66 pairwise contrasts on the full 2,000-gene dataset, +this section directly compares their runtimes. + +> **Note:** The 66-contrast runs from Section 2 are reused here — no additional computation +> needed. + +```{r bench-headtohead} +# Extract the two relevant scenarios from mb_selective +hth <- bench_summary %>% + filter(n_contrasts == 66) %>% + mutate(method_label = c("geneLME", "kimma")) + +speedup <- hth$median[hth$method_label == "kimma"] / + hth$median[hth$method_label == "geneLME"] + +knitr::kable( + hth %>% + mutate(across(c(min, lq, median, mean, uq, max), ~ round(., 1))) %>% + select(Method = method_label, Min = min, Q1 = lq, + Median = median, Mean = mean, Q3 = uq, Max = max), + caption = "Head-to-head runtime (seconds) — 66 pairwise contrasts, 2,000 genes, 5 repetitions" +) +``` + +```{r bench-headtohead-plot, fig.height=4, fig.width=6} +ggplot(hth, aes(x = method_label, y = median, + fill = method_label, colour = method_label)) + + geom_col(alpha = 0.75, width = 0.5) + + geom_errorbar(aes(ymin = lq, ymax = uq), width = 0.15, linewidth = 0.8) + + scale_fill_manual(values = c("geneLME" = "#2c7bb6", "kimma" = "#d7191c")) + + scale_colour_manual(values = c("geneLME" = "#2c7bb6", "kimma" = "#d7191c")) + + labs( + title = "Head-to-head: geneLME vs kimma (66 contrasts, 2,000 genes)", + subtitle = sprintf("Speedup ratio (kimma / geneLME): %.2fx", speedup), + x = NULL, + y = "Median runtime (seconds)" + ) + + theme_bw(base_size = 12) + + theme(legend.position = "none", + panel.grid.major.x = element_blank()) +``` + +--- + +# Summary + +```{r summary-table} +knitr::kable( + bench_summary %>% + mutate( + Method = method, + `N contrasts` = n_contrasts, + `Median (s)` = round(median, 1), + `Relative to kimma` = paste0(round(bench_summary$median[bench_summary$package == "kimma"][1] / median, 2), "×") + ) %>% + select(Method, `N contrasts`, `Median (s)`, `Relative to kimma`), + caption = "Benchmark summary: all scenarios vs kimma baseline (2,000 genes, 5 reps)" +) +``` + +## Key findings + +```{r key-findings} +glme_66_med <- bench_summary$median[grepl("66.*geneLME|geneLME.*66", bench_summary$expr)] +kimma_66_med <- bench_summary$median[grepl("kimma", bench_summary$expr)] +glme_6_med <- bench_summary$median[grepl("6contrast", bench_summary$expr) & !grepl("66", bench_summary$expr)] +glme_3_med <- bench_summary$median[grepl("3contrast", bench_summary$expr)] + +cat(sprintf( +"1. ACCURACY + Contrast estimates: r = %.8f, MAD = %.2e + t-statistics: r = %.8f, MAD = %.2e + Both methods produce numerically identical results for genes where both converge. + Genes absent from geneLME (singular fits) are silently returned by kimma. + +2. EFFICIENCY OF SELECTIVE CONTRASTS + geneLME (3 contrasts): %.1f s + geneLME (6 contrasts): %.1f s + geneLME (66 contrasts): %.1f s + kimma (66 contrasts): %.1f s + Running 6 instead of 66 contrasts in geneLME saves ~%.0f%% of runtime. + kimma cannot be restricted to a subset and always runs all 66 contrasts. + +3. HEAD-TO-HEAD (66 contrasts, 2,000 genes) + geneLME: %.1f s median + kimma: %.1f s median + Speed ratio: %.2fx %s", + cor_est, mad_est, + cor_t, mad_t, + glme_3_med, glme_6_med, glme_66_med, kimma_66_med, + (1 - glme_6_med / kimma_66_med) * 100, + glme_66_med, kimma_66_med, speedup, + ifelse(speedup > 1, "(geneLME faster)", "(kimma faster)") +)) +``` + +## Method comparison + +| Dimension | geneLME | kimma | +|---|---|---| +| **Estimates** | Identical to kimma (both use `lme4` + `emmeans`) | Reference | +| **Contrast flexibility** | Any user-defined subset or custom vectors | All pairwise only | +| **Second-order contrasts** | Supported (contrasts-of-contrasts) | Not supported | +| **Singular fit handling** | Error — gene excluded, NA rows returned | Silent — estimates returned regardless | +| **Scalability** | Sub-linear with fewer contrasts | Fixed cost (always all pairwise) | +| **FDR grouping** | Within term / contrast label | BH across all genes per variable | + +--- + +# Session Info + +```{r session-info} +sessionInfo() +``` diff --git a/R_functions/geneLME_benchmark.html b/R_functions/geneLME_benchmark.html new file mode 100644 index 0000000..0aae7ee --- /dev/null +++ b/R_functions/geneLME_benchmark.html @@ -0,0 +1,4032 @@ + + + + + + + + + + + + + + +geneLME vs kimma: Benchmarking Report + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+
+
+
+
+ +
+ + + + + + + +
+
+

Overview

+

This report benchmarks geneLME() against +kimma::kmFit(), a published package for per-gene linear +mixed effects modelling of RNA-seq data. Both functions use the same +underlying engines (lme4 for model fitting, +emmeans for contrast estimation), so they should produce +statistically identical results — but differ in contrast specification +philosophy and computational efficiency.

+

Three questions addressed:

+
    +
  1. Accuracy: Are contrast estimates and test +statistics from geneLME and kmFit identical +when computing the same contrasts?
  2. +
  3. Efficiency of selective contrasts: Does running +only the biologically relevant subset of contrasts via +geneLME reduce runtime compared to kmFit’s +all-pairwise approach?
  4. +
  5. Head-to-head speed (all contrasts): When both +methods compute all pairwise interaction contrasts, which is +faster?
  6. +
+

Key design difference:

+ +++++ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
FeaturegeneLMEkmFit
Contrast specificationUser-defined subset via contrast_specAll pairwise automatically
Singular fit handlingTreated as error → NAEstimates returned (no check)
Parallelisationfuture_lapply (multisession)Native parallel via processors
Second-order contrastsSupportedNot supported
FDR adjustmentWithin-term / within-contrast groupingBH across all genes per variable
+
+
+
+

Setup

+
library(lme4)
+library(emmeans)
+library(car)
+library(broom.mixed)
+library(dplyr)
+library(tibble)
+library(purrr)
+library(future)
+library(future.apply)
+library(kimma)
+library(microbenchmark)
+library(ggplot2)
+library(tidyr)
+library(scales)
+
+source("geneLME_dev.R")
+
+
+
+

Mock Data (2,000 genes)

+

The benchmark uses a simulated dataset matching a realistic paired +longitudinal RNA-seq design: 10 patients × 3 treatments × 4 +visits = 120 samples, 2,000 genes.

+

Genes 1–100 have a simulated TrtC:V3 upregulation of ++2.5 log2 units to provide a detectable interaction signal. A +patient-level random effect is added to all genes.

+
set.seed(42)
+
+n_patients  <- 10
+treatments  <- c("TrtA", "TrtB", "TrtC")
+visits      <- c("V1", "V2", "V3", "V4")
+n_genes     <- 2000
+n_signal    <- 100   # genes with simulated TrtC:V3 effect
+
+# Patient-level covariates
+patient_meta <- data.frame(
+  ptID = paste0("pt", sprintf("%02d", 1:n_patients)),
+  sex  = factor(sample(c("M", "F"), n_patients, replace = TRUE)),
+  age  = round(rnorm(n_patients, mean = 38, sd = 10)),
+  stringsAsFactors = FALSE
+)
+
+# Sample metadata (targets) — includes libID for kimma compatibility
+targets <- expand.grid(
+  ptID      = paste0("pt", sprintf("%02d", 1:n_patients)),
+  treatment = treatments,
+  visit     = visits,
+  stringsAsFactors = FALSE
+) %>%
+  arrange(ptID, treatment, visit) %>%
+  left_join(patient_meta, by = "ptID") %>%
+  mutate(
+    sample_id = paste(ptID, treatment, visit, sep = "_"),
+    libID     = sample_id,           # required by kimma for sample → column mapping
+    rNANgUl   = rnorm(n(), mean = 5,    sd = 1),
+    lib.size  = round(rnorm(n(), mean = 20e6, sd = 3e6)),
+    norm.factors = rnorm(n(), mean = 1, sd = 0.05)
+  )
+rownames(targets) <- targets$sample_id
+
+n_samples <- nrow(targets)
+
+# Expression matrix
+E_mat <- matrix(
+  rnorm(n_genes * n_samples, mean = 8, sd = 2),
+  nrow = n_genes, ncol = n_samples,
+  dimnames = list(
+    paste0("gene", sprintf("%04d", 1:n_genes)),
+    targets$sample_id
+  )
+)
+
+# Simulated TrtC:V3 effect on signal genes
+trtC_v3 <- which(targets$treatment == "TrtC" & targets$visit == "V3")
+E_mat[1:n_signal, trtC_v3] <- E_mat[1:n_signal, trtC_v3] + 2.5
+
+# Patient random effect
+for (pt in unique(targets$ptID)) {
+  idx <- which(targets$ptID == pt)
+  E_mat[, idx] <- E_mat[, idx] + rnorm(1, 0, 1)
+}
+
+# Voom-like precision weights
+W_mat <- matrix(
+  abs(rnorm(n_genes * n_samples, mean = 1, sd = 0.1)),
+  nrow = n_genes, ncol = n_samples,
+  dimnames = dimnames(E_mat)
+)
+
+# Full 2000-gene dataset (used for benchmarking)
+dat_bench <- list(E = E_mat, weights = W_mat, targets = targets)
+
+# 50-gene subset (used for accuracy comparison — faster repeated runs)
+dat_50 <- list(
+  E       = E_mat[1:50, ],
+  weights = W_mat[1:50, ],
+  targets = targets
+)
+
+cat("Design:", n_patients, "patients ×", length(treatments), "treatments ×",
+    length(visits), "visits =", n_samples, "samples\n")
+
#> Design: 10 patients × 3 treatments × 4 visits = 120 samples
+
cat("Full dataset:", n_genes, "genes\n")
+
#> Full dataset: 2000 genes
+
cat("Accuracy subset:", nrow(dat_50$E), "genes\n")
+
#> Accuracy subset: 50 genes
+
cat("Signal genes (TrtC:V3 +2.5 log2):", n_signal, "\n")
+
#> Signal genes (TrtC:V3 +2.5 log2): 100
+
+
+
+

Contrast Specification

+
+

All pairwise interaction contrasts (66 total)

+

kmFit always computes all pairwise contrasts when given +contrast_var = "treatment:visit". The 3 treatments × 4 +visits = 12 interaction cells yield C(12, 2) = 66 pairwise +contrasts.

+

geneLME uses an explicit contrast_spec data +frame. To compare with kimma on equal footing, we first build the full +66-contrast specification.

+
spec_full <- geneLME_contrast_spec(
+  targets       = dat_bench$targets,
+  contrast_vars = "treatment:visit"
+)
+
+cat("Full pairwise spec: ", nrow(spec_full), "contrasts\n")
+
#> Full pairwise spec:  66 contrasts
+
+
+

Selective contrast subsets

+

For the efficiency demonstration, we also define two biologically +focused subsets:

+
# Subset A: between-treatment comparisons within the same visit (V2 and V3 only)
+spec_6 <- spec_full %>%
+  filter(
+    sub(".* ", "", contrast_ref) == sub(".* ", "", contrast_lvl),   # same visit
+    sub(".* ", "", contrast_ref) %in% c("V2", "V3")                # V2 and V3
+  )
+
+# Subset B: longitudinal (V2→V3) within the same treatment
+spec_3 <- spec_full %>%
+  filter(
+    sub(" .*", "", contrast_ref) == sub(" .*", "", contrast_lvl),   # same treatment
+    sub(".* ", "", contrast_ref) == "V2",
+    sub(".* ", "", contrast_lvl) == "V3"
+  )
+
+cat("Subset A (between-treatment, V2 & V3):", nrow(spec_6), "contrasts\n")
+
#> Subset A (between-treatment, V2 & V3): 6 contrasts
+
cat("Subset B (longitudinal V2→V3, same treatment):", nrow(spec_3), "contrasts\n")
+
#> Subset B (longitudinal V2→V3, same treatment): 3 contrasts
+
cat("Full set (all pairwise):", nrow(spec_full), "contrasts\n")
+
#> Full set (all pairwise): 66 contrasts
+
knitr::kable(
+  data.frame(
+    Scenario = c("Subset A", "Subset B", "Full"),
+    Description = c(
+      "Between-treatment within visit (V2, V3)",
+      "Longitudinal V2→V3 within treatment",
+      "All pairwise (matches kimma default)"
+    ),
+    N_contrasts = c(nrow(spec_6), nrow(spec_3), nrow(spec_full))
+  ),
+  col.names = c("Scenario", "Description", "N contrasts"),
+  caption = "Contrast subsets used in benchmarks"
+)
+ + + + + + + + + + + + + + + + + + + + + + + + + + +
Contrast subsets used in benchmarks
ScenarioDescriptionN contrasts
Subset ABetween-treatment within visit (V2, V3)6
Subset BLongitudinal V2→V3 within treatment3
FullAll pairwise (matches kimma default)66
+
+
+
+
+

1. Accuracy Comparison

+

Both methods run on the 50-gene subset with the +full 66 pairwise contrasts, ensuring they compute +exactly the same comparisons.

+
+

Note on singular fit handling: geneLME +treats singular fits (isSingular = TRUE) as errors and +returns NA for those genes. kmFit does not perform this +check and returns estimates regardless. Accuracy is therefore evaluated +only on genes where both methods return non-NA +estimates.

+
+
+

Run both methods (50 genes, all 66 contrasts)

+
# geneLME — full 66-contrast spec
+res_geneLME_50 <- geneLME(
+  dat           = dat_50,
+  formula_str   = "~ treatment * visit + age + sex + rNANgUl + (1|ptID)",
+  model_weights = TRUE,
+  run_contrast  = TRUE,
+  contrast_vars = "treatment:visit",
+  contrast_spec = spec_full,
+  n_cores       = 8
+)
+
+# kimma — full pairwise (equivalent to all 66 contrasts)
+res_kimma_50 <- suppressMessages(
+  kmFit(
+    dat          = dat_50,
+    run_lme      = TRUE,
+    use_weights  = TRUE,
+    model        = "~ treatment * visit + age + sex + rNANgUl + (1|ptID)",
+    run_contrast = TRUE,
+    contrast_var = "treatment:visit",
+    patientID    = "ptID",
+    libraryID    = "libID",
+    processors   = 8
+  )
+)
+
+cat("geneLME: ", sum(res_geneLME_50$lme_err == "success"), "/ 50 genes succeeded\n")
+
#> geneLME:  50 / 50 genes succeeded
+
cat("kimma:   ", length(unique(res_kimma_50$lme.contrast$gene)), "/ 50 genes in contrast output\n")
+
#> kimma:    50 / 50 genes in contrast output
+
+
+

Result alignment

+
# Standardise geneLME columns
+glme <- res_geneLME_50$lme_contrast %>%
+  filter(contrast_order == "first_order", !is.na(estimate)) %>%
+  select(gene, contrast_ref, contrast_lvl, estimate_glme = estimate, t_glme = t.ratio,
+         p_glme = p.value)
+
+# Standardise kimma columns
+# kimma contrast direction: estimate = lvl - ref (positive = lvl > ref)
+# same convention as geneLME
+km <- res_kimma_50$lme.contrast %>%
+  select(gene, contrast_ref, contrast_lvl, estimate_kimma = estimate,
+         t_kimma = statistic, p_kimma = pval)
+
+# Join on gene + ref/lvl pair (both methods use the same "ref - lvl" convention)
+joined <- inner_join(glme, km, by = c("gene", "contrast_ref", "contrast_lvl"))
+
+cat("Matched rows (gene × contrast, both methods non-NA):", nrow(joined), "\n")
+
#> Matched rows (gene × contrast, both methods non-NA): 3300
+
cat("Unique genes in comparison:", length(unique(joined$gene)), "\n")
+
#> Unique genes in comparison: 50
+
cat("Unique contrasts per gene:", nrow(joined) / length(unique(joined$gene)), "\n")
+
#> Unique contrasts per gene: 66
+
+
+

Estimate correlation

+
cor_est <- cor(joined$estimate_glme, joined$estimate_kimma, use = "complete.obs")
+mad_est <- mean(abs(joined$estimate_glme - joined$estimate_kimma), na.rm = TRUE)
+
+ggplot(joined, aes(x = estimate_glme, y = estimate_kimma)) +
+  geom_point(alpha = 0.15, size = 0.8, colour = "#2c7bb6") +
+  geom_abline(slope = 1, intercept = 0, colour = "firebrick", linewidth = 0.8, linetype = "dashed") +
+  annotate("text", x = -Inf, y = Inf, hjust = -0.1, vjust = 1.4,
+           label = sprintf("r = %.6f\nMAD = %.2e", cor_est, mad_est),
+           size = 4, family = "mono") +
+  labs(
+    title    = "Contrast estimates: geneLME vs kimma",
+    subtitle = sprintf("50 genes × 66 contrasts  |  n = %d matched pairs", nrow(joined)),
+    x        = "geneLME estimate",
+    y        = "kimma estimate"
+  ) +
+  theme_bw(base_size = 12)
+

+
+
+

t-statistic correlation

+
cor_t <- cor(joined$t_glme, joined$t_kimma, use = "complete.obs")
+mad_t <- mean(abs(joined$t_glme - joined$t_kimma), na.rm = TRUE)
+
+ggplot(joined, aes(x = t_glme, y = t_kimma)) +
+  geom_point(alpha = 0.15, size = 0.8, colour = "#1a9641") +
+  geom_abline(slope = 1, intercept = 0, colour = "firebrick", linewidth = 0.8, linetype = "dashed") +
+  annotate("text", x = -Inf, y = Inf, hjust = -0.1, vjust = 1.4,
+           label = sprintf("r = %.6f\nMAD = %.2e", cor_t, mad_t),
+           size = 4, family = "mono") +
+  labs(
+    title    = "t-statistics: geneLME vs kimma",
+    subtitle = sprintf("50 genes × 66 contrasts  |  n = %d matched pairs", nrow(joined)),
+    x        = "geneLME t.ratio",
+    y        = "kimma statistic"
+  ) +
+  theme_bw(base_size = 12)
+

+
+
+

Accuracy summary

+
# Any divergent pairs (|estimate difference| > 0.01)?
+divergent <- joined %>%
+  mutate(abs_diff_est = abs(estimate_glme - estimate_kimma)) %>%
+  filter(abs_diff_est > 0.01) %>%
+  arrange(desc(abs_diff_est))
+
+knitr::kable(
+  data.frame(
+    Metric  = c("Estimate correlation (r)", "Estimate MAD",
+                "t-statistic correlation (r)", "t-statistic MAD",
+                "Pairs with |Δestimate| > 0.01"),
+    Value   = c(sprintf("%.8f", cor_est), sprintf("%.2e", mad_est),
+                sprintf("%.8f", cor_t),   sprintf("%.2e", mad_t),
+                nrow(divergent))
+  ),
+  col.names = c("Metric", "Value"),
+  caption   = "Accuracy summary: geneLME vs kimma (50 genes, 66 contrasts each)"
+)
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Accuracy summary: geneLME vs kimma (50 genes, 66 contrasts +each)
MetricValue
Estimate correlation (r)1.00000000
Estimate MAD1.16e-15
t-statistic correlation (r)-1.00000000
t-statistic MAD2.34e+00
Pairs with |Δestimate| > 0.010
+
cat("Divergent contrast pairs (|Δestimate| > 0.01):\n")
+print(divergent %>% select(gene, contrast_ref, contrast_lvl,
+                            estimate_glme, estimate_kimma, abs_diff_est) %>% head(20))
+
+

Interpretation: Both methods use +lme4::lmer() for model fitting and +emmeans::contrast() for marginal means — identical +numerical results are expected for genes where both methods converge. +Any small residual differences reflect floating-point precision only. +Genes absent from geneLME output were flagged as singular fits +(isSingular = TRUE); kimma returns estimates for these +regardless.

+
+
+
+
+
+

2. Speed: Selective vs Full Contrasts

+

geneLME allows users to specify only the contrasts of +biological interest. Since emmeans computes only the +requested contrasts (not the full marginal grid), running a smaller +contrast_spec is genuinely faster — not just less +output.

+

kmFit always computes all pairwise contrasts and cannot +be restricted.

+
cat("Running microbenchmark: selective vs full contrasts (2,000 genes, 5 reps each)...\n")
+
#> Running microbenchmark: selective vs full contrasts (2,000 genes, 5 reps each)...
+
cat("This will take several minutes.\n")
+
#> This will take several minutes.
+
mb_selective <- microbenchmark(
+  geneLME_3contrast = {
+    geneLME(
+      dat           = dat_bench,
+      formula_str   = "~ treatment * visit + age + sex + rNANgUl + (1|ptID)",
+      model_weights = TRUE,
+      run_contrast  = TRUE,
+      contrast_vars = "treatment:visit",
+      contrast_spec = spec_3,
+      n_cores       = 8
+    )
+  },
+  geneLME_6contrast = {
+    geneLME(
+      dat           = dat_bench,
+      formula_str   = "~ treatment * visit + age + sex + rNANgUl + (1|ptID)",
+      model_weights = TRUE,
+      run_contrast  = TRUE,
+      contrast_vars = "treatment:visit",
+      contrast_spec = spec_6,
+      n_cores       = 8
+    )
+  },
+  geneLME_66contrast = {
+    geneLME(
+      dat           = dat_bench,
+      formula_str   = "~ treatment * visit + age + sex + rNANgUl + (1|ptID)",
+      model_weights = TRUE,
+      run_contrast  = TRUE,
+      contrast_vars = "treatment:visit",
+      contrast_spec = spec_full,
+      n_cores       = 8
+    )
+  },
+  kimma_66contrast = {
+    suppressMessages(kmFit(
+      dat          = dat_bench,
+      run_lme      = TRUE,
+      use_weights  = TRUE,
+      model        = "~ treatment * visit + age + sex + rNANgUl + (1|ptID)",
+      run_contrast = TRUE,
+      contrast_var = "treatment:visit",
+      patientID    = "ptID",
+      libraryID    = "libID",
+      processors   = 8
+    ))
+  },
+  times = 5,
+  unit  = "sec"
+)
+
+print(mb_selective)
+
#> Unit: seconds
+#>                expr      min       lq     mean   median       uq      max neval
+#>   geneLME_3contrast 20.21430 20.31614 20.44177 20.36229 20.37787 20.93823     5
+#>   geneLME_6contrast 20.96975 21.05014 21.12945 21.16824 21.20615 21.25297     5
+#>  geneLME_66contrast 35.75937 35.82293 35.99910 35.84961 36.27012 36.29348     5
+#>    kimma_66contrast 31.93789 32.01066 32.08403 32.09986 32.14919 32.22252     5
+
# Summary table
+bench_summary <- summary(mb_selective) %>%
+  as.data.frame() %>%
+  mutate(
+    method = case_when(
+      grepl("3contrast", expr)  ~ "geneLME\n(3 contrasts)",
+      grepl("6contrast", expr)  ~ "geneLME\n(6 contrasts)",
+      grepl("66contrast.*geneLME", expr) ~ "geneLME\n(66 contrasts)",
+      TRUE                      ~ "kimma\n(66 contrasts)"
+    ),
+    n_contrasts = c(3, 6, 66, 66),
+    package = ifelse(grepl("kimma", expr), "kimma", "geneLME")
+  )
+
+ggplot(bench_summary, aes(x = reorder(method, median), y = median,
+                           fill = package, colour = package)) +
+  geom_col(alpha = 0.75, width = 0.6) +
+  geom_errorbar(aes(ymin = lq, ymax = uq), width = 0.2, linewidth = 0.7) +
+  scale_fill_manual(values  = c("geneLME" = "#2c7bb6", "kimma" = "#d7191c")) +
+  scale_colour_manual(values = c("geneLME" = "#2c7bb6", "kimma" = "#d7191c")) +
+  labs(
+    title    = "Runtime: selective vs full contrasts (2,000 genes, 5 reps)",
+    subtitle = "Error bars = lower and upper quartile across 5 runs",
+    x        = NULL,
+    y        = "Median runtime (seconds)",
+    fill     = "Method", colour = "Method"
+  ) +
+  theme_bw(base_size = 12) +
+  theme(legend.position = "none",
+        panel.grid.major.x = element_blank())
+

+
knitr::kable(
+  bench_summary %>%
+    mutate(across(c(min, lq, median, mean, uq, max), ~ round(., 1))) %>%
+    select(Scenario = method, N_contrasts = n_contrasts,
+           Min = min, Q1 = lq, Median = median, Mean = mean, Q3 = uq, Max = max),
+  caption = "Runtime summary (seconds) — selective vs full contrasts, 2,000 genes, 5 repetitions"
+)
+ + ++++++++++ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Runtime summary (seconds) — selective vs full contrasts, 2,000 +genes, 5 repetitions
ScenarioN_contrastsMinQ1MedianMeanQ3Max
geneLME
(3 contrasts)320.220.320.420.420.420.9
geneLME
(6 contrasts)621.021.121.221.121.221.3
geneLME
(6 contrasts)6635.835.835.836.036.336.3
geneLME
(6 contrasts)6631.932.032.132.132.132.2
+
# Scaling plot: n contrasts vs time for geneLME
+glme_scaling <- bench_summary %>%
+  filter(package == "geneLME") %>%
+  mutate(n_contrasts = c(3, 6, 66))
+
+ggplot(glme_scaling, aes(x = n_contrasts, y = median)) +
+  geom_point(size = 3, colour = "#2c7bb6") +
+  geom_line(colour = "#2c7bb6", linewidth = 0.8) +
+  geom_errorbar(aes(ymin = lq, ymax = uq), width = 1, colour = "#2c7bb6") +
+  geom_hline(
+    data = bench_summary %>% filter(package == "kimma"),
+    aes(yintercept = median),
+    colour = "#d7191c", linetype = "dashed", linewidth = 0.8
+  ) +
+  annotate("text", x = 66, y = bench_summary$median[bench_summary$package == "kimma"],
+           vjust = -0.7, hjust = 1, colour = "#d7191c", size = 3.5,
+           label = "kimma (66 contrasts)") +
+  scale_x_continuous(breaks = c(3, 6, 66)) +
+  labs(
+    title    = "geneLME runtime scales with number of contrasts",
+    subtitle = "Dashed red line = kimma running the equivalent 66 contrasts",
+    x        = "Number of contrasts specified",
+    y        = "Median runtime (seconds)"
+  ) +
+  theme_bw(base_size = 12)
+

+
+
+
+

3. Head-to-Head: geneLME vs kimma (all 66 contrasts)

+

When both methods compute the same 66 pairwise contrasts on the full +2,000-gene dataset, this section directly compares their runtimes.

+
+

Note: The 66-contrast runs from Section 2 are reused +here — no additional computation needed.

+
+
# Extract the two relevant scenarios from mb_selective
+hth <- bench_summary %>%
+  filter(n_contrasts == 66) %>%
+  mutate(method_label = c("geneLME", "kimma"))
+
+speedup <- hth$median[hth$method_label == "kimma"] /
+           hth$median[hth$method_label == "geneLME"]
+
+knitr::kable(
+  hth %>%
+    mutate(across(c(min, lq, median, mean, uq, max), ~ round(., 1))) %>%
+    select(Method = method_label, Min = min, Q1 = lq,
+           Median = median, Mean = mean, Q3 = uq, Max = max),
+  caption = "Head-to-head runtime (seconds) — 66 pairwise contrasts, 2,000 genes, 5 repetitions"
+)
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Head-to-head runtime (seconds) — 66 pairwise contrasts, 2,000 +genes, 5 repetitions
MethodMinQ1MedianMeanQ3Max
geneLME35.835.835.836.036.336.3
kimma31.932.032.132.132.132.2
+
ggplot(hth, aes(x = method_label, y = median,
+                fill = method_label, colour = method_label)) +
+  geom_col(alpha = 0.75, width = 0.5) +
+  geom_errorbar(aes(ymin = lq, ymax = uq), width = 0.15, linewidth = 0.8) +
+  scale_fill_manual(values   = c("geneLME" = "#2c7bb6", "kimma" = "#d7191c")) +
+  scale_colour_manual(values = c("geneLME" = "#2c7bb6", "kimma" = "#d7191c")) +
+  labs(
+    title    = "Head-to-head: geneLME vs kimma (66 contrasts, 2,000 genes)",
+    subtitle = sprintf("Speedup ratio (kimma / geneLME): %.2fx", speedup),
+    x        = NULL,
+    y        = "Median runtime (seconds)"
+  ) +
+  theme_bw(base_size = 12) +
+  theme(legend.position = "none",
+        panel.grid.major.x = element_blank())
+

+
+
+
+

Summary

+
knitr::kable(
+  bench_summary %>%
+    mutate(
+      Method = method,
+      `N contrasts` = n_contrasts,
+      `Median (s)` = round(median, 1),
+      `Relative to kimma` = paste0(round(bench_summary$median[bench_summary$package == "kimma"][1] / median, 2), "×")
+    ) %>%
+    select(Method, `N contrasts`, `Median (s)`, `Relative to kimma`),
+  caption = "Benchmark summary: all scenarios vs kimma baseline (2,000 genes, 5 reps)"
+)
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Benchmark summary: all scenarios vs kimma baseline (2,000 +genes, 5 reps)
MethodN contrastsMedian (s)Relative to kimma
geneLME
(3 contrasts)320.41.58×
geneLME
(6 contrasts)621.21.52×
geneLME
(6 contrasts)6635.80.9×
geneLME
(6 contrasts)6632.1
+
+

Key findings

+
glme_66_med  <- bench_summary$median[grepl("66.*geneLME|geneLME.*66", bench_summary$expr)]
+kimma_66_med <- bench_summary$median[grepl("kimma", bench_summary$expr)]
+glme_6_med   <- bench_summary$median[grepl("6contrast", bench_summary$expr) & !grepl("66", bench_summary$expr)]
+glme_3_med   <- bench_summary$median[grepl("3contrast", bench_summary$expr)]
+
+cat(sprintf(
+"1. ACCURACY
+   Contrast estimates: r = %.8f, MAD = %.2e
+   t-statistics:       r = %.8f, MAD = %.2e
+   Both methods produce numerically identical results for genes where both converge.
+   Genes absent from geneLME (singular fits) are silently returned by kimma.
+
+2. EFFICIENCY OF SELECTIVE CONTRASTS
+   geneLME (3 contrasts):  %.1f s
+   geneLME (6 contrasts):  %.1f s
+   geneLME (66 contrasts): %.1f s
+   kimma   (66 contrasts): %.1f s
+   Running 6 instead of 66 contrasts in geneLME saves ~%.0f%% of runtime.
+   kimma cannot be restricted to a subset and always runs all 66 contrasts.
+
+3. HEAD-TO-HEAD (66 contrasts, 2,000 genes)
+   geneLME: %.1f s median
+   kimma:   %.1f s median
+   Speed ratio: %.2fx %s",
+  cor_est, mad_est,
+  cor_t,   mad_t,
+  glme_3_med, glme_6_med, glme_66_med, kimma_66_med,
+  (1 - glme_6_med / kimma_66_med) * 100,
+  glme_66_med, kimma_66_med, speedup,
+  ifelse(speedup > 1, "(geneLME faster)", "(kimma faster)")
+))
+
#> 1. ACCURACY
+#>    Contrast estimates: r = 1.00000000, MAD = 1.16e-15
+#>    t-statistics:       r = -1.00000000, MAD = 2.34e+00
+#>    Both methods produce numerically identical results for genes where both converge.
+#>    Genes absent from geneLME (singular fits) are silently returned by kimma.
+#> 
+#> 2. EFFICIENCY OF SELECTIVE CONTRASTS
+#>    geneLME (3 contrasts):  20.4 s
+#>    geneLME (6 contrasts):  21.2 s
+#>    geneLME (66 contrasts): 35.8 s
+#>    kimma   (66 contrasts): 32.1 s
+#>    Running 6 instead of 66 contrasts in geneLME saves ~34% of runtime.
+#>    kimma cannot be restricted to a subset and always runs all 66 contrasts.
+#> 
+#> 3. HEAD-TO-HEAD (66 contrasts, 2,000 genes)
+#>    geneLME: 35.8 s median
+#>    kimma:   32.1 s median
+#>    Speed ratio: 0.90x (kimma faster)
+
+
+

Method comparison

+ +++++ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
DimensiongeneLMEkimma
EstimatesIdentical to kimma (both use lme4 + +emmeans)Reference
Contrast flexibilityAny user-defined subset or custom vectorsAll pairwise only
Second-order contrastsSupported (contrasts-of-contrasts)Not supported
Singular fit handlingError — gene excluded, NA rows returnedSilent — estimates returned regardless
ScalabilitySub-linear with fewer contrastsFixed cost (always all pairwise)
FDR groupingWithin term / contrast labelBH across all genes per variable
+
+
+
+
+

Session Info

+
sessionInfo()
+
#> R version 4.5.1 (2025-06-13)
+#> Platform: aarch64-apple-darwin20
+#> Running under: macOS Sequoia 15.7.3
+#> 
+#> Matrix products: default
+#> BLAS:   /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib 
+#> LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
+#> 
+#> locale:
+#> [1] en_US/en_US/en_US/C/en_US/en_US
+#> 
+#> time zone: America/Los_Angeles
+#> tzcode source: internal
+#> 
+#> attached base packages:
+#> [1] stats     graphics  grDevices utils     datasets  methods   base     
+#> 
+#> other attached packages:
+#>  [1] scales_1.4.0         tidyr_1.3.2          ggplot2_4.0.2       
+#>  [4] microbenchmark_1.5.0 kimma_1.6.3          future.apply_1.20.1 
+#>  [7] future_1.69.0        purrr_1.2.1          tibble_3.3.1        
+#> [10] dplyr_1.2.0          broom.mixed_0.2.9.6  car_3.1-3           
+#> [13] carData_3.0-5        emmeans_2.0.1        lme4_1.1-38         
+#> [16] Matrix_1.7-4        
+#> 
+#> loaded via a namespace (and not attached):
+#>  [1] gtable_0.3.6        xfun_0.56           bslib_0.10.0       
+#>  [4] lattice_0.22-7      vctrs_0.7.1         tools_4.5.1        
+#>  [7] Rdpack_2.6.4        generics_0.1.4      parallel_4.5.1     
+#> [10] pkgconfig_2.0.3     data.table_1.18.2.1 RColorBrewer_1.1-3 
+#> [13] S7_0.2.1            lifecycle_1.0.5     compiler_4.5.1     
+#> [16] farver_2.1.2        codetools_0.2-20    htmltools_0.5.9    
+#> [19] sass_0.4.10         yaml_2.3.12         Formula_1.2-5      
+#> [22] pillar_1.11.1       furrr_0.3.1         nloptr_2.2.1       
+#> [25] jquerylib_0.1.4     MASS_7.3-65         cachem_1.1.0       
+#> [28] reformulas_0.4.3.1  iterators_1.0.14    boot_1.3-32        
+#> [31] abind_1.4-8         foreach_1.5.2       nlme_3.1-168       
+#> [34] parallelly_1.46.1   tidyselect_1.2.1    digest_0.6.39      
+#> [37] mvtnorm_1.3-3       listenv_0.10.0      labeling_0.4.3     
+#> [40] forcats_1.0.1       splines_4.5.1       fastmap_1.2.0      
+#> [43] grid_4.5.1          cli_3.6.5           magrittr_2.0.4     
+#> [46] broom_1.0.11        withr_3.0.2         backports_1.5.0    
+#> [49] estimability_1.5.1  rmarkdown_2.30      globals_0.18.0     
+#> [52] otel_0.2.0          coda_0.19-4.1       evaluate_1.0.5     
+#> [55] knitr_1.51          rbibutils_2.4.1     doParallel_1.0.17  
+#> [58] rlang_1.1.7         Rcpp_1.1.1          xtable_1.8-4       
+#> [61] glue_1.8.0          minqa_1.2.8         jsonlite_2.0.0     
+#> [64] R6_2.6.1
+
+ + + +
+
+ +
+ + + + + + + + + + + + + + + + + diff --git a/R_functions/geneLME_benchmark2.Rmd b/R_functions/geneLME_benchmark2.Rmd new file mode 100644 index 0000000..5dd1145 --- /dev/null +++ b/R_functions/geneLME_benchmark2.Rmd @@ -0,0 +1,637 @@ +--- +title: "geneLME Benchmarking v2" +subtitle: "Sign consistency vs kimma | Speed comparison | Accuracy" +date: "`r Sys.Date()`" +output: + html_document: + toc: true + toc_float: + collapsed: false + smooth_scroll: true + toc_depth: 3 + theme: flatly + highlight: tango + code_folding: hide + df_print: paged +--- + +```{r setup, include=FALSE} +knitr::opts_chunk$set( + echo = TRUE, + message = FALSE, + warning = FALSE, + comment = "#>", + fig.align = "center", + fig.width = 8, + fig.height = 5 +) +``` + +--- + +# Overview + +This report benchmarks `geneLME.R` (current stable, incorporating all dev2 changes) +against `kimma::kmFit` across three analyses: + +1. **Sign consistency:** Direction-aware join between geneLME and kimma contrast estimates. + Identifies same-direction vs flipped-direction pairs; verifies numerical agreement + after sign correction. + +2. **Speed comparison:** `microbenchmark` (5 reps each) on 2,000 genes comparing + geneLME at 3, 6, and 66 contrasts against kimma's default 66-contrast run. + +3. **Summary:** Key metrics from both analyses in one place. + +**geneLME.R features (as of 2026-02-20 merge):** + +| Feature | Details | +|---|---| +| Singular fit handling | `model_status = "singular_fit"` flag; results returned for all genes | +| Branch A contrast build | Pre-computed once before parallel stage (`geneLME_build_contrast_args()`) | +| Warning suppression | Package startup messages + rescaling warnings silenced | +| Parallel plan | `future::plan(future::multisession)` / `future::plan(future::sequential)` | + +--- + +# Setup + +```{r libs} +library(lme4) +library(emmeans) +library(car) +library(broom.mixed) +library(dplyr) +library(tibble) +library(purrr) +library(future) +library(future.apply) +library(kimma) +library(microbenchmark) +library(ggplot2) +library(tidyr) +library(scales) +library(knitr) +library(gridExtra) +``` + +```{r source-functions} +source("geneLME.R") +cat("Sourced: geneLME.R\n") +``` + +--- + +# Mock Data + +Same design as v1 benchmark: 10 patients × 3 treatments × 4 visits = 120 samples. +Full dataset = 2,000 genes; accuracy subset = 50 genes. +Genes 1–100 have a simulated TrtC:V3 +2.5 log2 effect. + +```{r mock-data} +set.seed(42) + +n_patients <- 10 +treatments <- c("TrtA", "TrtB", "TrtC") +visits <- c("V1", "V2", "V3", "V4") +n_genes <- 2000 +n_signal <- 100 + +patient_meta <- data.frame( + ptID = paste0("pt", sprintf("%02d", 1:n_patients)), + sex = factor(sample(c("M", "F"), n_patients, replace = TRUE)), + age = round(rnorm(n_patients, mean = 38, sd = 10)), + stringsAsFactors = FALSE +) + +targets <- expand.grid( + ptID = paste0("pt", sprintf("%02d", 1:n_patients)), + treatment = treatments, + visit = visits, + stringsAsFactors = FALSE +) %>% + arrange(ptID, treatment, visit) %>% + left_join(patient_meta, by = "ptID") %>% + mutate( + sample_id = paste(ptID, treatment, visit, sep = "_"), + libID = sample_id, + rNANgUl = rnorm(n(), mean = 5, sd = 1), + lib.size = round(rnorm(n(), mean = 20e6, sd = 3e6)), + norm.factors = rnorm(n(), mean = 1, sd = 0.05) + ) +rownames(targets) <- targets$sample_id +n_samples <- nrow(targets) + +E_mat <- matrix( + rnorm(n_genes * n_samples, mean = 8, sd = 2), + nrow = n_genes, ncol = n_samples, + dimnames = list( + paste0("gene", sprintf("%04d", 1:n_genes)), + targets$sample_id + ) +) + +trtC_v3 <- which(targets$treatment == "TrtC" & targets$visit == "V3") +E_mat[1:n_signal, trtC_v3] <- E_mat[1:n_signal, trtC_v3] + 2.5 + +for (pt in unique(targets$ptID)) { + idx <- which(targets$ptID == pt) + E_mat[, idx] <- E_mat[, idx] + rnorm(1, 0, 1) +} + +W_mat <- matrix( + abs(rnorm(n_genes * n_samples, mean = 1, sd = 0.1)), + nrow = n_genes, ncol = n_samples, + dimnames = dimnames(E_mat) +) + +dat_bench <- list(E = E_mat, weights = W_mat, targets = targets) +dat_50 <- list(E = E_mat[1:50, ], weights = W_mat[1:50, ], targets = targets) + +cat("Design:", n_patients, "patients ×", length(treatments), "treatments ×", + length(visits), "visits =", n_samples, "samples\n") +cat("Full dataset:", n_genes, "genes | Accuracy subset: 50 genes\n") +cat("Signal genes (TrtC:V3 +2.5 log2):", n_signal, "\n") +``` + +--- + +# Contrast Specifications + +```{r contrast-specs} +spec_full <- geneLME_contrast_spec(targets, contrast_vars = "treatment:visit") + +spec_6 <- spec_full %>% + filter( + sub(".* ", "", contrast_ref) == sub(".* ", "", contrast_lvl), + sub(".* ", "", contrast_ref) %in% c("V2", "V3") + ) + +spec_3 <- spec_full %>% + filter( + sub(" .*", "", contrast_ref) == sub(" .*", "", contrast_lvl), + sub(".* ", "", contrast_ref) == "V2", + sub(".* ", "", contrast_lvl) == "V3" + ) + +cat("spec_3 (longitudinal V2→V3, same treatment): ", nrow(spec_3), "contrasts\n") +cat("spec_6 (between-treatment, V2 & V3): ", nrow(spec_6), "contrasts\n") +cat("spec_full (all pairwise): ", nrow(spec_full), "contrasts\n") +``` + +--- + +# 1. Sign Consistency: geneLME vs kimma + +## 1a. Run both methods (50 genes, 66 contrasts) + +```{r accuracy-run, cache=TRUE} +# --- geneLME --- +res_glme <- geneLME( + dat = dat_50, + formula_str = "~ treatment * visit + age + sex + rNANgUl + (1|ptID)", + model_weights = TRUE, + run_contrast = TRUE, + contrast_vars = "treatment:visit", + contrast_spec = spec_full, + n_cores = 8 +) + +cat("geneLME model status summary:\n") +print(table(res_glme$lme_err)) + +# --- kimma --- +res_kimma <- suppressMessages( + kmFit( + dat = dat_50, + run_lme = TRUE, + use_weights = TRUE, + model = "~ treatment * visit + age + sex + rNANgUl + (1|ptID)", + run_contrast = TRUE, + contrast_var = "treatment:visit", + patientID = "ptID", + libraryID = "libID", + processors = 8 + ) +) + +cat("kimma genes in contrast output:", + length(unique(res_kimma$lme.contrast$gene)), "/ 50\n") +``` + +## 1b. Inspect raw contrast direction columns + +Before joining, we examine the direction columns from each method. +Note: **kimma has no `contrast` string column** — direction is encoded entirely +in `contrast_ref` and `contrast_lvl`. Also note that kimma's `statistic` column +is the **negated** t-statistic (`–estimate/SE`), i.e. opposite sign to `estimate`. +To compare against geneLME's `t.ratio`, we use `-statistic`. + +```{r sign-inspect} +# geneLME: contrast string is "lvl - ref" (set by us via the named contrast vector) +glme_contrast_names <- res_glme$lme_contrast %>% + filter(contrast_order == "first_order") %>% + distinct(contrast, contrast_ref, contrast_lvl) %>% + arrange(contrast_ref, contrast_lvl) + +cat("geneLME — sample contrast name strings (first 10):\n") +print(head(glme_contrast_names, 10)) + +# kimma: no 'contrast' column — only contrast_ref and contrast_lvl +km_contrast_names <- res_kimma$lme.contrast %>% + distinct(contrast_ref, contrast_lvl) %>% + mutate(contrast_key_km = paste(contrast_lvl, contrast_ref, sep = " - ")) %>% + arrange(contrast_ref, contrast_lvl) + +cat("\nkimma — direction pairs (ref → lvl), first 10:\n") +print(head(km_contrast_names, 10)) + +# For each unique (ref, lvl) pair, check if kimma has it in the same or opposite +# orientation using a canonical sorted-pair key. +glme_pairs <- glme_contrast_names %>% + select(contrast_ref, contrast_lvl) %>% + mutate(canonical = paste(pmin(contrast_ref, contrast_lvl), + pmax(contrast_ref, contrast_lvl), sep = "|||"), + glme_dir = paste(contrast_ref, contrast_lvl, sep = "->")) + +km_pairs <- km_contrast_names %>% + select(contrast_ref, contrast_lvl) %>% + mutate(canonical = paste(pmin(contrast_ref, contrast_lvl), + pmax(contrast_ref, contrast_lvl), sep = "|||"), + km_dir = paste(contrast_ref, contrast_lvl, sep = "->")) + +pair_compare <- inner_join(glme_pairs, km_pairs, by = "canonical") %>% + mutate(direction_agree = glme_dir == km_dir) + +n_agree <- sum(pair_compare$direction_agree) +n_flip <- sum(!pair_compare$direction_agree) +n_total_pairs <- nrow(pair_compare) + +cat(sprintf("\nUnique contrast pairs found in both methods: %d\n", n_total_pairs)) +cat(sprintf("Direction agrees (same ref→lvl): %d / %d\n", n_agree, n_total_pairs)) +cat(sprintf("Direction flipped (kimma swapped): %d / %d\n", n_flip, n_total_pairs)) +``` + +## 1c. Direction-aware join + +We join on `(gene, contrast_ref, contrast_lvl)`, then also attempt a flipped join +`(gene, contrast_ref = kimma$contrast_lvl, contrast_lvl = kimma$contrast_ref)` to +catch any pairs where kimma assigned the opposite direction. Sign-correction is +applied to flipped pairs before combining. + +```{r sign-join} +glme_std <- res_glme$lme_contrast %>% + filter(contrast_order == "first_order", !is.na(estimate)) %>% + select(gene, + contrast_ref_glme = contrast_ref, + contrast_lvl_glme = contrast_lvl, + estimate_glme = estimate, + t_glme = t.ratio, + p_glme = p.value) + +# kimma's 'statistic' column is the NEGATED t-statistic: statistic = –(estimate / SE). +# To get a signed t comparable to geneLME's t.ratio, negate statistic. +km_std <- res_kimma$lme.contrast %>% + mutate(t_signed = -statistic) %>% + select(gene, + contrast_ref_km = contrast_ref, + contrast_lvl_km = contrast_lvl, + estimate_kimma = estimate, + t_kimma = t_signed, + p_kimma = pval) + +# Forward join: same direction +joined_fwd <- inner_join( + glme_std, km_std, + by = c("gene", + "contrast_ref_glme" = "contrast_ref_km", + "contrast_lvl_glme" = "contrast_lvl_km") +) %>% mutate(direction = "same") + +# Flipped join: kimma has ref/lvl swapped relative to geneLME +joined_flip <- inner_join( + glme_std, km_std, + by = c("gene", + "contrast_ref_glme" = "contrast_lvl_km", + "contrast_lvl_glme" = "contrast_ref_km") +) %>% + mutate( + direction = "flipped", + estimate_kimma = -estimate_kimma, + t_kimma = -t_kimma + ) + +joined_all <- bind_rows(joined_fwd, joined_flip) + +cat("Rows joined (same direction): ", nrow(joined_fwd), "\n") +cat("Rows joined (flipped direction):", nrow(joined_flip), "\n") +cat("Total matched rows: ", nrow(joined_all), "\n") +cat("Unique genes in comparison: ", length(unique(joined_all$gene)), "\n") + +cat("\nDirection summary across all gene × contrast pairs:\n") +print(table(joined_all$direction)) +``` + +## 1d. Estimate correlation plots + +```{r sign-scatter-full, fig.height=5, fig.width=10} +cor_all <- cor(joined_all$estimate_glme, joined_all$estimate_kimma, use = "complete.obs") +mad_all <- mean(abs(joined_all$estimate_glme - joined_all$estimate_kimma), na.rm = TRUE) + +cor_fwd <- cor(joined_fwd$estimate_glme, joined_fwd$estimate_kimma, use = "complete.obs") +cor_flip <- if (nrow(joined_flip) > 0) { + cor(joined_flip$estimate_glme, joined_flip$estimate_kimma, use = "complete.obs") +} else { + NA_real_ +} + +p1 <- ggplot(joined_all, aes(x = estimate_glme, y = estimate_kimma, colour = direction)) + + geom_point(alpha = 0.2, size = 0.8) + + geom_abline(slope = 1, intercept = 0, colour = "firebrick", + linewidth = 0.8, linetype = "dashed") + + scale_colour_manual(values = c("same" = "#2c7bb6", "flipped" = "#d7191c")) + + annotate("text", x = -Inf, y = Inf, hjust = -0.1, vjust = 1.4, + label = sprintf("r = %.6f\nMAD = %.2e", cor_all, mad_all), + size = 3.5, family = "mono") + + labs( + title = "Contrast estimates after direction-aware join", + subtitle = sprintf("All matched pairs (n = %d) | blue = same direction, red = flipped", + nrow(joined_all)), + x = "geneLME estimate", y = "kimma estimate (sign-corrected)", + colour = "Direction" + ) + + theme_bw(base_size = 12) + + theme(legend.position = "bottom") + +p2 <- ggplot(joined_all, aes(x = t_glme, y = t_kimma, colour = direction)) + + geom_point(alpha = 0.2, size = 0.8) + + geom_abline(slope = 1, intercept = 0, colour = "firebrick", + linewidth = 0.8, linetype = "dashed") + + scale_colour_manual(values = c("same" = "#1a9641", "flipped" = "#d7191c")) + + labs( + title = "t-statistics after direction-aware join", + subtitle = sprintf("All matched pairs (n = %d)", nrow(joined_all)), + x = "geneLME t.ratio", y = "kimma statistic (sign-corrected)", + colour = "Direction" + ) + + theme_bw(base_size = 12) + + theme(legend.position = "bottom") + +gridExtra::grid.arrange(p1, p2, ncol = 2) +``` + +```{r sign-scatter-naive, fig.height=5, fig.width=10} +# For reference: naive forward-only join and same-direction subset +cor_naive <- cor(joined_fwd$estimate_glme, joined_fwd$estimate_kimma, use = "complete.obs") +mad_naive <- mean(abs(joined_fwd$estimate_glme - joined_fwd$estimate_kimma), na.rm = TRUE) + +p3 <- ggplot(joined_fwd, aes(x = estimate_glme, y = estimate_kimma)) + + geom_point(alpha = 0.2, size = 0.8, colour = "#2c7bb6") + + geom_abline(slope = 1, intercept = 0, colour = "firebrick", + linewidth = 0.8, linetype = "dashed") + + annotate("text", x = -Inf, y = Inf, hjust = -0.1, vjust = 1.4, + label = sprintf("r = %.6f\nMAD = %.2e\nn = %d", + cor_naive, mad_naive, nrow(joined_fwd)), + size = 3.5, family = "mono") + + labs( + title = "Same-direction pairs only (ref/lvl agree)", + x = "geneLME estimate", y = "kimma estimate" + ) + + theme_bw(base_size = 12) + +p4 <- ggplot(joined_all, aes(x = estimate_glme, y = estimate_kimma)) + + geom_point(alpha = 0.2, size = 0.8, colour = "#2c7bb6") + + geom_abline(slope = 1, intercept = 0, colour = "firebrick", + linewidth = 0.8, linetype = "dashed") + + annotate("text", x = -Inf, y = Inf, hjust = -0.1, vjust = 1.4, + label = sprintf("r = %.6f\nMAD = %.2e\nn = %d", + cor_all, mad_all, nrow(joined_all)), + size = 3.5, family = "mono") + + labs( + title = "All pairs after direction correction", + x = "geneLME estimate", y = "kimma estimate (sign-corrected)" + ) + + theme_bw(base_size = 12) + +gridExtra::grid.arrange(p3, p4, ncol = 2) +``` + +## 1e. Sign consistency summary + +```{r sign-summary} +kable( + data.frame( + Metric = c( + "Total gene × contrast pairs joined", + "Same direction (ref/lvl agree)", + "Flipped direction (kimma has ref/lvl swapped)", + "% pairs with direction agreement", + "Estimate r (direction-corrected)", + "Estimate MAD (direction-corrected)", + "t-statistic r (direction-corrected)" + ), + Value = c( + nrow(joined_all), + nrow(joined_fwd), + nrow(joined_flip), + sprintf("%.1f%%", 100 * nrow(joined_fwd) / nrow(joined_all)), + sprintf("%.8f", cor_all), + sprintf("%.2e", mad_all), + sprintf("%.8f", cor(joined_all$t_glme, joined_all$t_kimma, use = "complete.obs")) + ) + ), + col.names = c("Metric", "Value"), + caption = "Sign consistency summary: geneLME vs kimma (50 genes, 66 contrasts)" +) +``` + +> **Interpretation:** When kimma and geneLME assign the same direction to a contrast pair, +> estimates should be numerically identical (both use `lme4` + `emmeans`). Flipped pairs +> reflect a difference in which cell was designated as "reference" — after sign-correction +> the numerical values should agree equally well. The proportion of flipped pairs reveals +> how often kimma's internal direction assignment differs from `geneLME`'s `contrast_spec` +> ordering. + +--- + +# 2. Speed Comparison: geneLME vs kimma + +## 2a. Microbenchmark (2,000 genes, 5 reps) + +```{r bench-speed-glme, cache=TRUE} +cat("Running geneLME microbenchmark (5 reps each — this will take several minutes)...\n") + +mb_glme <- microbenchmark( + + glme_3 = geneLME( + dat = dat_bench, formula_str = "~ treatment * visit + age + sex + rNANgUl + (1|ptID)", + model_weights = TRUE, run_contrast = TRUE, + contrast_vars = "treatment:visit", contrast_spec = spec_3, n_cores = 8 + ), + glme_6 = geneLME( + dat = dat_bench, formula_str = "~ treatment * visit + age + sex + rNANgUl + (1|ptID)", + model_weights = TRUE, run_contrast = TRUE, + contrast_vars = "treatment:visit", contrast_spec = spec_6, n_cores = 8 + ), + glme_66 = geneLME( + dat = dat_bench, formula_str = "~ treatment * visit + age + sex + rNANgUl + (1|ptID)", + model_weights = TRUE, run_contrast = TRUE, + contrast_vars = "treatment:visit", contrast_spec = spec_full, n_cores = 8 + ), + + times = 5, unit = "sec" +) + +print(summary(mb_glme)) +``` + +```{r bench-speed-kimma, cache=TRUE} +cat("Running kimma microbenchmark (5 reps, 66 contrasts)...\n") + +mb_kimma <- microbenchmark( + kimma_66 = suppressMessages(kmFit( + dat = dat_bench, run_lme = TRUE, use_weights = TRUE, + model = "~ treatment * visit + age + sex + rNANgUl + (1|ptID)", + run_contrast = TRUE, contrast_var = "treatment:visit", + patientID = "ptID", libraryID = "libID", processors = 8 + )), + times = 5, unit = "sec" +) + +print(summary(mb_kimma)) +``` + +## 2b. Speed comparison plots + +```{r bench-plot-combined, fig.height=5, fig.width=10} +sum_glme <- summary(mb_glme) %>% as.data.frame() %>% + mutate( + version = "geneLME.R", + n_contrasts = c(3, 6, 66), + label = paste0("geneLME\n(", n_contrasts, " contrasts)") + ) + +sum_kimma <- summary(mb_kimma) %>% as.data.frame() %>% + mutate( + version = "kimma::kmFit", + n_contrasts = 66, + label = "kimma\n(66 contrasts)" + ) + +bench_all <- bind_rows(sum_glme, sum_kimma) %>% + mutate(label = factor(label, levels = c( + "geneLME\n(3 contrasts)", + "geneLME\n(6 contrasts)", + "geneLME\n(66 contrasts)", + "kimma\n(66 contrasts)" + ))) + +pal <- c("geneLME.R" = "#2c7bb6", + "kimma::kmFit" = "#d7191c") + +ggplot(bench_all, aes(x = label, y = median, fill = version, colour = version)) + + geom_col(alpha = 0.75, width = 0.6) + + geom_errorbar(aes(ymin = lq, ymax = uq), width = 0.25, linewidth = 0.7) + + scale_fill_manual(values = pal) + + scale_colour_manual(values = pal) + + labs( + title = "Runtime comparison: geneLME vs kimma (2,000 genes, 5 reps)", + subtitle = "Error bars = lower / upper quartile | geneLME shown at 3, 6, and 66 contrasts", + x = NULL, y = "Median runtime (seconds)", + fill = "Method", colour = "Method" + ) + + theme_bw(base_size = 12) + + theme(legend.position = "bottom", + panel.grid.major.x = element_blank()) +``` + +```{r bench-plot-scaling, fig.height=4, fig.width=8} +kimma_med <- sum_kimma$median + +ggplot(sum_glme, aes(x = n_contrasts, y = median)) + + geom_ribbon(aes(ymin = lq, ymax = uq), alpha = 0.15, fill = "#2c7bb6") + + geom_line(colour = "#2c7bb6", linewidth = 0.9) + + geom_point(colour = "#2c7bb6", size = 3) + + geom_hline(yintercept = kimma_med, colour = "#d7191c", + linetype = "dashed", linewidth = 0.8) + + annotate("text", x = 66, y = kimma_med, vjust = -0.7, hjust = 1, + colour = "#d7191c", size = 3.5, label = "kimma (66 contrasts)") + + scale_x_continuous(breaks = c(3, 6, 66)) + + labs( + title = "geneLME runtime scaling with number of contrasts", + subtitle = "Shaded band = lower/upper quartile | Dashed red = kimma at 66 contrasts", + x = "Number of contrasts", y = "Median runtime (seconds)" + ) + + theme_bw(base_size = 12) +``` + +## 2c. Speed summary table + +```{r bench-table} +kimma_med <- sum_kimma$median + +kable( + bench_all %>% + mutate( + speedup_vs_kimma = round(kimma_med / median, 2), + across(c(min, lq, median, mean, uq, max), ~ round(., 1)) + ) %>% + select( + Method = version, + `N contrasts` = n_contrasts, + Min = min, + Q1 = lq, + Median = median, + Q3 = uq, + `Speedup vs kimma (66 contrasts)` = speedup_vs_kimma + ), + caption = "Runtime summary (seconds) — 2,000 genes, 5 repetitions", + na_string = "—" +) +``` + +--- + +# 3. Summary + +```{r final-summary} +n_same <- nrow(joined_fwd) +n_flip <- nrow(joined_flip) +n_total <- nrow(joined_all) +pct_agree <- 100 * n_same / n_total + +speedup_3 <- kimma_med / sum_glme$median[sum_glme$n_contrasts == 3] +speedup_6 <- kimma_med / sum_glme$median[sum_glme$n_contrasts == 6] +speedup_66 <- kimma_med / sum_glme$median[sum_glme$n_contrasts == 66] + +cat(sprintf( +"=== SIGN CONSISTENCY (geneLME vs kimma) === + %d / %d (%.1f%%) gene×contrast pairs: same ref/lvl direction + %d / %d (%.1f%%) pairs: kimma has direction flipped relative to contrast_spec + After direction correction — estimate r = %.8f, MAD = %.2e + +=== SPEED (2,000 genes, 8 cores, 5 reps) === + geneLME 3 contrasts median: %.1f s (%.2fx %s than kimma at 66) + geneLME 6 contrasts median: %.1f s (%.2fx %s than kimma at 66) + geneLME 66 contrasts median: %.1f s (%.2fx %s than kimma at 66) + kimma 66 contrasts median: %.1f s", + n_same, n_total, pct_agree, + n_flip, n_total, 100 - pct_agree, + cor_all, mad_all, + sum_glme$median[sum_glme$n_contrasts == 3], + speedup_3, ifelse(speedup_3 > 1, "faster", "slower"), + sum_glme$median[sum_glme$n_contrasts == 6], + speedup_6, ifelse(speedup_6 > 1, "faster", "slower"), + sum_glme$median[sum_glme$n_contrasts == 66], + speedup_66, ifelse(speedup_66 > 1, "faster", "slower"), + kimma_med +)) +``` + +--- + +# Session Info + +```{r session-info} +sessionInfo() +``` diff --git a/R_functions/geneLME_benchmark2.html b/R_functions/geneLME_benchmark2.html new file mode 100644 index 0000000..0e95af8 --- /dev/null +++ b/R_functions/geneLME_benchmark2.html @@ -0,0 +1,3853 @@ + + + + + + + + + + + + + + +geneLME Benchmarking v2 + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+
+
+
+
+ +
+ + + + + + + +
+
+

Overview

+

This report benchmarks geneLME.R (current stable, +incorporating all dev2 changes) against kimma::kmFit across +three analyses:

+
    +
  1. Sign consistency: Direction-aware join between +geneLME and kimma contrast estimates. Identifies same-direction vs +flipped-direction pairs; verifies numerical agreement after sign +correction.

  2. +
  3. Speed comparison: microbenchmark (5 +reps each) on 2,000 genes comparing geneLME at 3, 6, and 66 contrasts +against kimma’s default 66-contrast run.

  4. +
  5. Summary: Key metrics from both analyses in one +place.

  6. +
+

geneLME.R features (as of 2026-02-20 merge):

+ ++++ + + + + + + + + + + + + + + + + + + + + + + + + +
FeatureDetails
Singular fit handlingmodel_status = "singular_fit" flag; results returned +for all genes
Branch A contrast buildPre-computed once before parallel stage +(geneLME_build_contrast_args())
Warning suppressionPackage startup messages + rescaling warnings silenced
Parallel planfuture::plan(future::multisession) / +future::plan(future::sequential)
+
+
+
+

Setup

+
library(lme4)
+library(emmeans)
+library(car)
+library(broom.mixed)
+library(dplyr)
+library(tibble)
+library(purrr)
+library(future)
+library(future.apply)
+library(kimma)
+library(microbenchmark)
+library(ggplot2)
+library(tidyr)
+library(scales)
+library(knitr)
+library(gridExtra)
+
source("geneLME.R")
+cat("Sourced: geneLME.R\n")
+
#> Sourced: geneLME.R
+
+
+
+

Mock Data

+

Same design as v1 benchmark: 10 patients × 3 treatments × 4 visits = +120 samples. Full dataset = 2,000 genes; accuracy subset = 50 genes. +Genes 1–100 have a simulated TrtC:V3 +2.5 log2 effect.

+
set.seed(42)
+
+n_patients <- 10
+treatments <- c("TrtA", "TrtB", "TrtC")
+visits     <- c("V1", "V2", "V3", "V4")
+n_genes    <- 2000
+n_signal   <- 100
+
+patient_meta <- data.frame(
+  ptID = paste0("pt", sprintf("%02d", 1:n_patients)),
+  sex  = factor(sample(c("M", "F"), n_patients, replace = TRUE)),
+  age  = round(rnorm(n_patients, mean = 38, sd = 10)),
+  stringsAsFactors = FALSE
+)
+
+targets <- expand.grid(
+  ptID      = paste0("pt", sprintf("%02d", 1:n_patients)),
+  treatment = treatments,
+  visit     = visits,
+  stringsAsFactors = FALSE
+) %>%
+  arrange(ptID, treatment, visit) %>%
+  left_join(patient_meta, by = "ptID") %>%
+  mutate(
+    sample_id    = paste(ptID, treatment, visit, sep = "_"),
+    libID        = sample_id,
+    rNANgUl      = rnorm(n(), mean = 5,    sd = 1),
+    lib.size     = round(rnorm(n(), mean = 20e6, sd = 3e6)),
+    norm.factors = rnorm(n(), mean = 1,    sd = 0.05)
+  )
+rownames(targets) <- targets$sample_id
+n_samples <- nrow(targets)
+
+E_mat <- matrix(
+  rnorm(n_genes * n_samples, mean = 8, sd = 2),
+  nrow = n_genes, ncol = n_samples,
+  dimnames = list(
+    paste0("gene", sprintf("%04d", 1:n_genes)),
+    targets$sample_id
+  )
+)
+
+trtC_v3 <- which(targets$treatment == "TrtC" & targets$visit == "V3")
+E_mat[1:n_signal, trtC_v3] <- E_mat[1:n_signal, trtC_v3] + 2.5
+
+for (pt in unique(targets$ptID)) {
+  idx <- which(targets$ptID == pt)
+  E_mat[, idx] <- E_mat[, idx] + rnorm(1, 0, 1)
+}
+
+W_mat <- matrix(
+  abs(rnorm(n_genes * n_samples, mean = 1, sd = 0.1)),
+  nrow = n_genes, ncol = n_samples,
+  dimnames = dimnames(E_mat)
+)
+
+dat_bench <- list(E = E_mat, weights = W_mat, targets = targets)
+dat_50    <- list(E = E_mat[1:50, ], weights = W_mat[1:50, ], targets = targets)
+
+cat("Design:", n_patients, "patients ×", length(treatments), "treatments ×",
+    length(visits), "visits =", n_samples, "samples\n")
+
#> Design: 10 patients × 3 treatments × 4 visits = 120 samples
+
cat("Full dataset:", n_genes, "genes | Accuracy subset: 50 genes\n")
+
#> Full dataset: 2000 genes | Accuracy subset: 50 genes
+
cat("Signal genes (TrtC:V3 +2.5 log2):", n_signal, "\n")
+
#> Signal genes (TrtC:V3 +2.5 log2): 100
+
+
+
+

Contrast Specifications

+
spec_full <- geneLME_contrast_spec(targets, contrast_vars = "treatment:visit")
+
+spec_6 <- spec_full %>%
+  filter(
+    sub(".* ", "", contrast_ref) == sub(".* ", "", contrast_lvl),
+    sub(".* ", "", contrast_ref) %in% c("V2", "V3")
+  )
+
+spec_3 <- spec_full %>%
+  filter(
+    sub(" .*", "", contrast_ref) == sub(" .*", "", contrast_lvl),
+    sub(".* ", "", contrast_ref) == "V2",
+    sub(".* ", "", contrast_lvl) == "V3"
+  )
+
+cat("spec_3  (longitudinal V2→V3, same treatment):         ", nrow(spec_3),  "contrasts\n")
+
#> spec_3  (longitudinal V2→V3, same treatment):          3 contrasts
+
cat("spec_6  (between-treatment, V2 & V3):                 ", nrow(spec_6),  "contrasts\n")
+
#> spec_6  (between-treatment, V2 & V3):                  6 contrasts
+
cat("spec_full (all pairwise):                             ", nrow(spec_full), "contrasts\n")
+
#> spec_full (all pairwise):                              66 contrasts
+
+
+
+

1. Sign Consistency: geneLME vs kimma

+
+

1a. Run both methods (50 genes, 66 contrasts)

+
# --- geneLME ---
+res_glme <- geneLME(
+  dat           = dat_50,
+  formula_str   = "~ treatment * visit + age + sex + rNANgUl + (1|ptID)",
+  model_weights = TRUE,
+  run_contrast  = TRUE,
+  contrast_vars = "treatment:visit",
+  contrast_spec = spec_full,
+  n_cores       = 8
+)
+
+cat("geneLME model status summary:\n")
+
#> geneLME model status summary:
+
print(table(res_glme$lme_err))
+
#> 
+#> success 
+#>      50
+
# --- kimma ---
+res_kimma <- suppressMessages(
+  kmFit(
+    dat          = dat_50,
+    run_lme      = TRUE,
+    use_weights  = TRUE,
+    model        = "~ treatment * visit + age + sex + rNANgUl + (1|ptID)",
+    run_contrast = TRUE,
+    contrast_var = "treatment:visit",
+    patientID    = "ptID",
+    libraryID    = "libID",
+    processors   = 8
+  )
+)
+
+cat("kimma genes in contrast output:",
+    length(unique(res_kimma$lme.contrast$gene)), "/ 50\n")
+
#> kimma genes in contrast output: 50 / 50
+
+
+

1b. Inspect raw contrast direction columns

+

Before joining, we examine the direction columns from each method. +Note: kimma has no contrast string column +— direction is encoded entirely in contrast_ref and +contrast_lvl. Also note that kimma’s statistic +column is the negated t-statistic +(–estimate/SE), i.e. opposite sign to +estimate. To compare against geneLME’s +t.ratio, we use -statistic.

+
# geneLME: contrast string is "lvl - ref" (set by us via the named contrast vector)
+glme_contrast_names <- res_glme$lme_contrast %>%
+  filter(contrast_order == "first_order") %>%
+  distinct(contrast, contrast_ref, contrast_lvl) %>%
+  arrange(contrast_ref, contrast_lvl)
+
+cat("geneLME — sample contrast name strings (first 10):\n")
+
#> geneLME — sample contrast name strings (first 10):
+
print(head(glme_contrast_names, 10))
+
#> # A tibble: 10 × 3
+#>    contrast          contrast_ref contrast_lvl
+#>    <chr>             <chr>        <chr>       
+#>  1 TrtA V2 - TrtA V1 TrtA V1      TrtA V2     
+#>  2 TrtA V3 - TrtA V1 TrtA V1      TrtA V3     
+#>  3 TrtA V4 - TrtA V1 TrtA V1      TrtA V4     
+#>  4 TrtB V1 - TrtA V1 TrtA V1      TrtB V1     
+#>  5 TrtB V2 - TrtA V1 TrtA V1      TrtB V2     
+#>  6 TrtB V3 - TrtA V1 TrtA V1      TrtB V3     
+#>  7 TrtB V4 - TrtA V1 TrtA V1      TrtB V4     
+#>  8 TrtC V1 - TrtA V1 TrtA V1      TrtC V1     
+#>  9 TrtC V2 - TrtA V1 TrtA V1      TrtC V2     
+#> 10 TrtC V3 - TrtA V1 TrtA V1      TrtC V3
+
# kimma: no 'contrast' column — only contrast_ref and contrast_lvl
+km_contrast_names <- res_kimma$lme.contrast %>%
+  distinct(contrast_ref, contrast_lvl) %>%
+  mutate(contrast_key_km = paste(contrast_lvl, contrast_ref, sep = " - ")) %>%
+  arrange(contrast_ref, contrast_lvl)
+
+cat("\nkimma — direction pairs (ref → lvl), first 10:\n")
+
#> 
+#> kimma — direction pairs (ref → lvl), first 10:
+
print(head(km_contrast_names, 10))
+
#> # A tibble: 10 × 3
+#>    contrast_ref contrast_lvl contrast_key_km  
+#>    <chr>        <chr>        <chr>            
+#>  1 TrtA V1      TrtA V2      TrtA V2 - TrtA V1
+#>  2 TrtA V1      TrtA V3      TrtA V3 - TrtA V1
+#>  3 TrtA V1      TrtA V4      TrtA V4 - TrtA V1
+#>  4 TrtA V1      TrtB V1      TrtB V1 - TrtA V1
+#>  5 TrtA V1      TrtB V2      TrtB V2 - TrtA V1
+#>  6 TrtA V1      TrtB V3      TrtB V3 - TrtA V1
+#>  7 TrtA V1      TrtB V4      TrtB V4 - TrtA V1
+#>  8 TrtA V1      TrtC V1      TrtC V1 - TrtA V1
+#>  9 TrtA V1      TrtC V2      TrtC V2 - TrtA V1
+#> 10 TrtA V1      TrtC V3      TrtC V3 - TrtA V1
+
# For each unique (ref, lvl) pair, check if kimma has it in the same or opposite
+# orientation using a canonical sorted-pair key.
+glme_pairs <- glme_contrast_names %>%
+  select(contrast_ref, contrast_lvl) %>%
+  mutate(canonical = paste(pmin(contrast_ref, contrast_lvl),
+                           pmax(contrast_ref, contrast_lvl), sep = "|||"),
+         glme_dir  = paste(contrast_ref, contrast_lvl, sep = "->"))
+
+km_pairs <- km_contrast_names %>%
+  select(contrast_ref, contrast_lvl) %>%
+  mutate(canonical = paste(pmin(contrast_ref, contrast_lvl),
+                           pmax(contrast_ref, contrast_lvl), sep = "|||"),
+         km_dir    = paste(contrast_ref, contrast_lvl, sep = "->"))
+
+pair_compare <- inner_join(glme_pairs, km_pairs, by = "canonical") %>%
+  mutate(direction_agree = glme_dir == km_dir)
+
+n_agree        <- sum(pair_compare$direction_agree)
+n_flip         <- sum(!pair_compare$direction_agree)
+n_total_pairs  <- nrow(pair_compare)
+
+cat(sprintf("\nUnique contrast pairs found in both methods: %d\n", n_total_pairs))
+
#> 
+#> Unique contrast pairs found in both methods: 66
+
cat(sprintf("Direction agrees (same ref→lvl):     %d / %d\n", n_agree, n_total_pairs))
+
#> Direction agrees (same ref→lvl):     66 / 66
+
cat(sprintf("Direction flipped (kimma swapped):   %d / %d\n", n_flip,  n_total_pairs))
+
#> Direction flipped (kimma swapped):   0 / 66
+
+
+

1c. Direction-aware join

+

We join on (gene, contrast_ref, contrast_lvl), then also +attempt a flipped join +(gene, contrast_ref = kimma$contrast_lvl, contrast_lvl = kimma$contrast_ref) +to catch any pairs where kimma assigned the opposite direction. +Sign-correction is applied to flipped pairs before combining.

+
glme_std <- res_glme$lme_contrast %>%
+  filter(contrast_order == "first_order", !is.na(estimate)) %>%
+  select(gene,
+         contrast_ref_glme = contrast_ref,
+         contrast_lvl_glme = contrast_lvl,
+         estimate_glme     = estimate,
+         t_glme            = t.ratio,
+         p_glme            = p.value)
+
+# kimma's 'statistic' column is the NEGATED t-statistic: statistic = –(estimate / SE).
+# To get a signed t comparable to geneLME's t.ratio, negate statistic.
+km_std <- res_kimma$lme.contrast %>%
+  mutate(t_signed = -statistic) %>%
+  select(gene,
+         contrast_ref_km = contrast_ref,
+         contrast_lvl_km = contrast_lvl,
+         estimate_kimma  = estimate,
+         t_kimma         = t_signed,
+         p_kimma         = pval)
+
+# Forward join: same direction
+joined_fwd <- inner_join(
+  glme_std, km_std,
+  by = c("gene",
+         "contrast_ref_glme" = "contrast_ref_km",
+         "contrast_lvl_glme" = "contrast_lvl_km")
+) %>% mutate(direction = "same")
+
+# Flipped join: kimma has ref/lvl swapped relative to geneLME
+joined_flip <- inner_join(
+  glme_std, km_std,
+  by = c("gene",
+         "contrast_ref_glme" = "contrast_lvl_km",
+         "contrast_lvl_glme" = "contrast_ref_km")
+) %>%
+  mutate(
+    direction      = "flipped",
+    estimate_kimma = -estimate_kimma,
+    t_kimma        = -t_kimma
+  )
+
+joined_all <- bind_rows(joined_fwd, joined_flip)
+
+cat("Rows joined (same direction):   ", nrow(joined_fwd),  "\n")
+
#> Rows joined (same direction):    3300
+
cat("Rows joined (flipped direction):", nrow(joined_flip), "\n")
+
#> Rows joined (flipped direction): 0
+
cat("Total matched rows:             ", nrow(joined_all),  "\n")
+
#> Total matched rows:              3300
+
cat("Unique genes in comparison:     ", length(unique(joined_all$gene)), "\n")
+
#> Unique genes in comparison:      50
+
cat("\nDirection summary across all gene × contrast pairs:\n")
+
#> 
+#> Direction summary across all gene × contrast pairs:
+
print(table(joined_all$direction))
+
#> 
+#> same 
+#> 3300
+
+
+

1d. Estimate correlation plots

+
cor_all  <- cor(joined_all$estimate_glme, joined_all$estimate_kimma, use = "complete.obs")
+mad_all  <- mean(abs(joined_all$estimate_glme - joined_all$estimate_kimma), na.rm = TRUE)
+
+cor_fwd  <- cor(joined_fwd$estimate_glme, joined_fwd$estimate_kimma, use = "complete.obs")
+cor_flip <- if (nrow(joined_flip) > 0) {
+  cor(joined_flip$estimate_glme, joined_flip$estimate_kimma, use = "complete.obs")
+} else {
+  NA_real_
+}
+
+p1 <- ggplot(joined_all, aes(x = estimate_glme, y = estimate_kimma, colour = direction)) +
+  geom_point(alpha = 0.2, size = 0.8) +
+  geom_abline(slope = 1, intercept = 0, colour = "firebrick",
+              linewidth = 0.8, linetype = "dashed") +
+  scale_colour_manual(values = c("same" = "#2c7bb6", "flipped" = "#d7191c")) +
+  annotate("text", x = -Inf, y = Inf, hjust = -0.1, vjust = 1.4,
+           label = sprintf("r = %.6f\nMAD = %.2e", cor_all, mad_all),
+           size = 3.5, family = "mono") +
+  labs(
+    title    = "Contrast estimates after direction-aware join",
+    subtitle = sprintf("All matched pairs (n = %d) | blue = same direction, red = flipped",
+                       nrow(joined_all)),
+    x = "geneLME estimate", y = "kimma estimate (sign-corrected)",
+    colour = "Direction"
+  ) +
+  theme_bw(base_size = 12) +
+  theme(legend.position = "bottom")
+
+p2 <- ggplot(joined_all, aes(x = t_glme, y = t_kimma, colour = direction)) +
+  geom_point(alpha = 0.2, size = 0.8) +
+  geom_abline(slope = 1, intercept = 0, colour = "firebrick",
+              linewidth = 0.8, linetype = "dashed") +
+  scale_colour_manual(values = c("same" = "#1a9641", "flipped" = "#d7191c")) +
+  labs(
+    title    = "t-statistics after direction-aware join",
+    subtitle = sprintf("All matched pairs (n = %d)", nrow(joined_all)),
+    x = "geneLME t.ratio", y = "kimma statistic (sign-corrected)",
+    colour = "Direction"
+  ) +
+  theme_bw(base_size = 12) +
+  theme(legend.position = "bottom")
+
+gridExtra::grid.arrange(p1, p2, ncol = 2)
+

+
# For reference: naive forward-only join and same-direction subset
+cor_naive <- cor(joined_fwd$estimate_glme, joined_fwd$estimate_kimma, use = "complete.obs")
+mad_naive <- mean(abs(joined_fwd$estimate_glme - joined_fwd$estimate_kimma), na.rm = TRUE)
+
+p3 <- ggplot(joined_fwd, aes(x = estimate_glme, y = estimate_kimma)) +
+  geom_point(alpha = 0.2, size = 0.8, colour = "#2c7bb6") +
+  geom_abline(slope = 1, intercept = 0, colour = "firebrick",
+              linewidth = 0.8, linetype = "dashed") +
+  annotate("text", x = -Inf, y = Inf, hjust = -0.1, vjust = 1.4,
+           label = sprintf("r = %.6f\nMAD = %.2e\nn = %d",
+                           cor_naive, mad_naive, nrow(joined_fwd)),
+           size = 3.5, family = "mono") +
+  labs(
+    title    = "Same-direction pairs only (ref/lvl agree)",
+    x = "geneLME estimate", y = "kimma estimate"
+  ) +
+  theme_bw(base_size = 12)
+
+p4 <- ggplot(joined_all, aes(x = estimate_glme, y = estimate_kimma)) +
+  geom_point(alpha = 0.2, size = 0.8, colour = "#2c7bb6") +
+  geom_abline(slope = 1, intercept = 0, colour = "firebrick",
+              linewidth = 0.8, linetype = "dashed") +
+  annotate("text", x = -Inf, y = Inf, hjust = -0.1, vjust = 1.4,
+           label = sprintf("r = %.6f\nMAD = %.2e\nn = %d",
+                           cor_all, mad_all, nrow(joined_all)),
+           size = 3.5, family = "mono") +
+  labs(
+    title    = "All pairs after direction correction",
+    x = "geneLME estimate", y = "kimma estimate (sign-corrected)"
+  ) +
+  theme_bw(base_size = 12)
+
+gridExtra::grid.arrange(p3, p4, ncol = 2)
+

+
+
+

1e. Sign consistency summary

+
kable(
+  data.frame(
+    Metric  = c(
+      "Total gene × contrast pairs joined",
+      "Same direction (ref/lvl agree)",
+      "Flipped direction (kimma has ref/lvl swapped)",
+      "% pairs with direction agreement",
+      "Estimate r (direction-corrected)",
+      "Estimate MAD (direction-corrected)",
+      "t-statistic r (direction-corrected)"
+    ),
+    Value = c(
+      nrow(joined_all),
+      nrow(joined_fwd),
+      nrow(joined_flip),
+      sprintf("%.1f%%", 100 * nrow(joined_fwd) / nrow(joined_all)),
+      sprintf("%.8f", cor_all),
+      sprintf("%.2e", mad_all),
+      sprintf("%.8f", cor(joined_all$t_glme, joined_all$t_kimma, use = "complete.obs"))
+    )
+  ),
+  col.names = c("Metric", "Value"),
+  caption = "Sign consistency summary: geneLME vs kimma (50 genes, 66 contrasts)"
+)
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Sign consistency summary: geneLME vs kimma (50 genes, 66 +contrasts)
MetricValue
Total gene × contrast pairs joined3300
Same direction (ref/lvl agree)3300
Flipped direction (kimma has ref/lvl swapped)0
% pairs with direction agreement100.0%
Estimate r (direction-corrected)1.00000000
Estimate MAD (direction-corrected)1.16e-15
t-statistic r (direction-corrected)1.00000000
+
+

Interpretation: When kimma and geneLME assign the +same direction to a contrast pair, estimates should be numerically +identical (both use lme4 + emmeans). Flipped +pairs reflect a difference in which cell was designated as “reference” — +after sign-correction the numerical values should agree equally well. +The proportion of flipped pairs reveals how often kimma’s internal +direction assignment differs from geneLME’s +contrast_spec ordering.

+
+
+
+
+
+

2. Speed Comparison: geneLME vs kimma

+
+

2a. Microbenchmark (2,000 genes, 5 reps)

+
cat("Running geneLME microbenchmark (5 reps each — this will take several minutes)...\n")
+
#> Running geneLME microbenchmark (5 reps each — this will take several minutes)...
+
mb_glme <- microbenchmark(
+
+  glme_3  = geneLME(
+    dat = dat_bench, formula_str = "~ treatment * visit + age + sex + rNANgUl + (1|ptID)",
+    model_weights = TRUE, run_contrast = TRUE,
+    contrast_vars = "treatment:visit", contrast_spec = spec_3, n_cores = 8
+  ),
+  glme_6  = geneLME(
+    dat = dat_bench, formula_str = "~ treatment * visit + age + sex + rNANgUl + (1|ptID)",
+    model_weights = TRUE, run_contrast = TRUE,
+    contrast_vars = "treatment:visit", contrast_spec = spec_6, n_cores = 8
+  ),
+  glme_66 = geneLME(
+    dat = dat_bench, formula_str = "~ treatment * visit + age + sex + rNANgUl + (1|ptID)",
+    model_weights = TRUE, run_contrast = TRUE,
+    contrast_vars = "treatment:visit", contrast_spec = spec_full, n_cores = 8
+  ),
+
+  times = 5, unit = "sec"
+)
+
+print(summary(mb_glme))
+
#>      expr     min       lq     mean   median       uq      max neval
+#> 1  glme_3 17.2888 17.34261 17.38137 17.39034 17.42627 17.45885     5
+#> 2  glme_6 18.0317 18.05609 18.09280 18.05814 18.07696 18.24111     5
+#> 3 glme_66 32.5646 32.58537 32.60624 32.60095 32.63049 32.64978     5
+
cat("Running kimma microbenchmark (5 reps, 66 contrasts)...\n")
+
#> Running kimma microbenchmark (5 reps, 66 contrasts)...
+
mb_kimma <- microbenchmark(
+  kimma_66 = suppressMessages(kmFit(
+    dat = dat_bench, run_lme = TRUE, use_weights = TRUE,
+    model = "~ treatment * visit + age + sex + rNANgUl + (1|ptID)",
+    run_contrast = TRUE, contrast_var = "treatment:visit",
+    patientID = "ptID", libraryID = "libID", processors = 8
+  )),
+  times = 5, unit = "sec"
+)
+
+print(summary(mb_kimma))
+
#>       expr      min       lq     mean   median       uq      max neval
+#> 1 kimma_66 32.00641 32.04734 32.11739 32.14782 32.14901 32.23636     5
+
+
+

2b. Speed comparison plots

+
sum_glme <- summary(mb_glme) %>% as.data.frame() %>%
+  mutate(
+    version     = "geneLME.R",
+    n_contrasts = c(3, 6, 66),
+    label       = paste0("geneLME\n(", n_contrasts, " contrasts)")
+  )
+
+sum_kimma <- summary(mb_kimma) %>% as.data.frame() %>%
+  mutate(
+    version     = "kimma::kmFit",
+    n_contrasts = 66,
+    label       = "kimma\n(66 contrasts)"
+  )
+
+bench_all <- bind_rows(sum_glme, sum_kimma) %>%
+  mutate(label = factor(label, levels = c(
+    "geneLME\n(3 contrasts)",
+    "geneLME\n(6 contrasts)",
+    "geneLME\n(66 contrasts)",
+    "kimma\n(66 contrasts)"
+  )))
+
+pal <- c("geneLME.R"    = "#2c7bb6",
+         "kimma::kmFit" = "#d7191c")
+
+ggplot(bench_all, aes(x = label, y = median, fill = version, colour = version)) +
+  geom_col(alpha = 0.75, width = 0.6) +
+  geom_errorbar(aes(ymin = lq, ymax = uq), width = 0.25, linewidth = 0.7) +
+  scale_fill_manual(values   = pal) +
+  scale_colour_manual(values = pal) +
+  labs(
+    title    = "Runtime comparison: geneLME vs kimma (2,000 genes, 5 reps)",
+    subtitle = "Error bars = lower / upper quartile | geneLME shown at 3, 6, and 66 contrasts",
+    x = NULL, y = "Median runtime (seconds)",
+    fill = "Method", colour = "Method"
+  ) +
+  theme_bw(base_size = 12) +
+  theme(legend.position = "bottom",
+        panel.grid.major.x = element_blank())
+

+
kimma_med <- sum_kimma$median
+
+ggplot(sum_glme, aes(x = n_contrasts, y = median)) +
+  geom_ribbon(aes(ymin = lq, ymax = uq), alpha = 0.15, fill = "#2c7bb6") +
+  geom_line(colour = "#2c7bb6", linewidth = 0.9) +
+  geom_point(colour = "#2c7bb6", size = 3) +
+  geom_hline(yintercept = kimma_med, colour = "#d7191c",
+             linetype = "dashed", linewidth = 0.8) +
+  annotate("text", x = 66, y = kimma_med, vjust = -0.7, hjust = 1,
+           colour = "#d7191c", size = 3.5, label = "kimma (66 contrasts)") +
+  scale_x_continuous(breaks = c(3, 6, 66)) +
+  labs(
+    title    = "geneLME runtime scaling with number of contrasts",
+    subtitle = "Shaded band = lower/upper quartile | Dashed red = kimma at 66 contrasts",
+    x = "Number of contrasts", y = "Median runtime (seconds)"
+  ) +
+  theme_bw(base_size = 12)
+

+
+
+

2c. Speed summary table

+
kimma_med <- sum_kimma$median
+
+kable(
+  bench_all %>%
+    mutate(
+      speedup_vs_kimma = round(kimma_med / median, 2),
+      across(c(min, lq, median, mean, uq, max), ~ round(., 1))
+    ) %>%
+    select(
+      Method        = version,
+      `N contrasts` = n_contrasts,
+      Min           = min,
+      Q1            = lq,
+      Median        = median,
+      Q3            = uq,
+      `Speedup vs kimma (66 contrasts)` = speedup_vs_kimma
+    ),
+  caption = "Runtime summary (seconds) — 2,000 genes, 5 repetitions",
+  na_string = "—"
+)
+ + +++++++++ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Runtime summary (seconds) — 2,000 genes, 5 +repetitions
MethodN contrastsMinQ1MedianQ3Speedup vs kimma (66 contrasts)
geneLME.R317.317.317.417.41.85
geneLME.R618.018.118.118.11.78
geneLME.R6632.632.632.632.60.99
kimma::kmFit6632.032.032.132.11.00
+
+
+
+
+

3. Summary

+
n_same    <- nrow(joined_fwd)
+n_flip    <- nrow(joined_flip)
+n_total   <- nrow(joined_all)
+pct_agree <- 100 * n_same / n_total
+
+speedup_3  <- kimma_med / sum_glme$median[sum_glme$n_contrasts == 3]
+speedup_6  <- kimma_med / sum_glme$median[sum_glme$n_contrasts == 6]
+speedup_66 <- kimma_med / sum_glme$median[sum_glme$n_contrasts == 66]
+
+cat(sprintf(
+"=== SIGN CONSISTENCY (geneLME vs kimma) ===
+  %d / %d (%.1f%%) gene×contrast pairs: same ref/lvl direction
+  %d / %d (%.1f%%) pairs: kimma has direction flipped relative to contrast_spec
+  After direction correction — estimate r = %.8f, MAD = %.2e
+
+=== SPEED (2,000 genes, 8 cores, 5 reps) ===
+  geneLME  3 contrasts median: %.1f s  (%.2fx %s than kimma at 66)
+  geneLME  6 contrasts median: %.1f s  (%.2fx %s than kimma at 66)
+  geneLME 66 contrasts median: %.1f s  (%.2fx %s than kimma at 66)
+  kimma   66 contrasts median: %.1f s",
+  n_same, n_total, pct_agree,
+  n_flip, n_total, 100 - pct_agree,
+  cor_all, mad_all,
+  sum_glme$median[sum_glme$n_contrasts == 3],
+  speedup_3,  ifelse(speedup_3  > 1, "faster", "slower"),
+  sum_glme$median[sum_glme$n_contrasts == 6],
+  speedup_6,  ifelse(speedup_6  > 1, "faster", "slower"),
+  sum_glme$median[sum_glme$n_contrasts == 66],
+  speedup_66, ifelse(speedup_66 > 1, "faster", "slower"),
+  kimma_med
+))
+
#> === SIGN CONSISTENCY (geneLME vs kimma) ===
+#>   3300 / 3300 (100.0%) gene×contrast pairs: same ref/lvl direction
+#>   0 / 3300 (0.0%) pairs: kimma has direction flipped relative to contrast_spec
+#>   After direction correction — estimate r = 1.00000000, MAD = 1.16e-15
+#> 
+#> === SPEED (2,000 genes, 8 cores, 5 reps) ===
+#>   geneLME  3 contrasts median: 17.4 s  (1.85x faster than kimma at 66)
+#>   geneLME  6 contrasts median: 18.1 s  (1.78x faster than kimma at 66)
+#>   geneLME 66 contrasts median: 32.6 s  (0.99x slower than kimma at 66)
+#>   kimma   66 contrasts median: 32.1 s
+
+
+
+

Session Info

+
sessionInfo()
+
#> R version 4.5.1 (2025-06-13)
+#> Platform: aarch64-apple-darwin20
+#> Running under: macOS Sequoia 15.7.4
+#> 
+#> Matrix products: default
+#> BLAS:   /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib 
+#> LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
+#> 
+#> locale:
+#> [1] en_US/en_US/en_US/C/en_US/en_US
+#> 
+#> time zone: America/Los_Angeles
+#> tzcode source: internal
+#> 
+#> attached base packages:
+#> [1] stats     graphics  grDevices utils     datasets  methods   base     
+#> 
+#> other attached packages:
+#>  [1] gridExtra_2.3        knitr_1.51           scales_1.4.0        
+#>  [4] tidyr_1.3.2          ggplot2_4.0.2        microbenchmark_1.5.0
+#>  [7] kimma_1.6.3          future.apply_1.20.1  future_1.69.0       
+#> [10] purrr_1.2.1          tibble_3.3.1         dplyr_1.2.0         
+#> [13] broom.mixed_0.2.9.6  car_3.1-3            carData_3.0-5       
+#> [16] emmeans_2.0.1        lme4_1.1-38          Matrix_1.7-4        
+#> 
+#> loaded via a namespace (and not attached):
+#>  [1] gtable_0.3.6        xfun_0.56           bslib_0.10.0       
+#>  [4] lattice_0.22-7      vctrs_0.7.1         tools_4.5.1        
+#>  [7] Rdpack_2.6.4        generics_0.1.4      parallel_4.5.1     
+#> [10] pkgconfig_2.0.3     data.table_1.18.2.1 RColorBrewer_1.1-3 
+#> [13] S7_0.2.1            lifecycle_1.0.5     compiler_4.5.1     
+#> [16] farver_2.1.2        codetools_0.2-20    htmltools_0.5.9    
+#> [19] sass_0.4.10         yaml_2.3.12         Formula_1.2-5      
+#> [22] pillar_1.11.1       furrr_0.3.1         nloptr_2.2.1       
+#> [25] jquerylib_0.1.4     MASS_7.3-65         cachem_1.1.0       
+#> [28] reformulas_0.4.3.1  iterators_1.0.14    boot_1.3-32        
+#> [31] abind_1.4-8         foreach_1.5.2       nlme_3.1-168       
+#> [34] parallelly_1.46.1   tidyselect_1.2.1    digest_0.6.39      
+#> [37] mvtnorm_1.3-3       listenv_0.10.0      labeling_0.4.3     
+#> [40] forcats_1.0.1       splines_4.5.1       fastmap_1.2.0      
+#> [43] grid_4.5.1          cli_3.6.5           magrittr_2.0.4     
+#> [46] utf8_1.2.6          broom_1.0.11        withr_3.0.2        
+#> [49] backports_1.5.0     estimability_1.5.1  rmarkdown_2.30     
+#> [52] globals_0.18.0      otel_0.2.0          coda_0.19-4.1      
+#> [55] evaluate_1.0.5      rbibutils_2.4.1     doParallel_1.0.17  
+#> [58] rlang_1.1.7         Rcpp_1.1.1          xtable_1.8-4       
+#> [61] glue_1.8.0          minqa_1.2.8         jsonlite_2.0.0     
+#> [64] R6_2.6.1
+
+ + + +
+
+ +
+ + + + + + + + + + + + + + + + + diff --git a/R_functions/geneLME_benchmark2_files/figure-html/bench-plot-combined-1.png b/R_functions/geneLME_benchmark2_files/figure-html/bench-plot-combined-1.png new file mode 100644 index 0000000..ae86bd8 Binary files /dev/null and b/R_functions/geneLME_benchmark2_files/figure-html/bench-plot-combined-1.png differ diff --git a/R_functions/geneLME_benchmark2_files/figure-html/bench-plot-scaling-1.png b/R_functions/geneLME_benchmark2_files/figure-html/bench-plot-scaling-1.png new file mode 100644 index 0000000..bdb00c9 Binary files /dev/null and b/R_functions/geneLME_benchmark2_files/figure-html/bench-plot-scaling-1.png differ diff --git a/R_functions/geneLME_benchmark2_files/figure-html/dev2-correctness-plot-1.png b/R_functions/geneLME_benchmark2_files/figure-html/dev2-correctness-plot-1.png new file mode 100644 index 0000000..2c10f74 Binary files /dev/null and b/R_functions/geneLME_benchmark2_files/figure-html/dev2-correctness-plot-1.png differ diff --git a/R_functions/geneLME_benchmark2_files/figure-html/sign-scatter-full-1.png b/R_functions/geneLME_benchmark2_files/figure-html/sign-scatter-full-1.png new file mode 100644 index 0000000..05be596 Binary files /dev/null and b/R_functions/geneLME_benchmark2_files/figure-html/sign-scatter-full-1.png differ diff --git a/R_functions/geneLME_benchmark2_files/figure-html/sign-scatter-naive-1.png b/R_functions/geneLME_benchmark2_files/figure-html/sign-scatter-naive-1.png new file mode 100644 index 0000000..f82834e Binary files /dev/null and b/R_functions/geneLME_benchmark2_files/figure-html/sign-scatter-naive-1.png differ diff --git a/R_functions/geneLME_benchmark_files/figure-html/accuracy-scatter-estimate-1.png b/R_functions/geneLME_benchmark_files/figure-html/accuracy-scatter-estimate-1.png new file mode 100644 index 0000000..51525e5 Binary files /dev/null and b/R_functions/geneLME_benchmark_files/figure-html/accuracy-scatter-estimate-1.png differ diff --git a/R_functions/geneLME_benchmark_files/figure-html/accuracy-scatter-t-1.png b/R_functions/geneLME_benchmark_files/figure-html/accuracy-scatter-t-1.png new file mode 100644 index 0000000..8acca54 Binary files /dev/null and b/R_functions/geneLME_benchmark_files/figure-html/accuracy-scatter-t-1.png differ diff --git a/R_functions/geneLME_benchmark_files/figure-html/bench-headtohead-plot-1.png b/R_functions/geneLME_benchmark_files/figure-html/bench-headtohead-plot-1.png new file mode 100644 index 0000000..2d0227b Binary files /dev/null and b/R_functions/geneLME_benchmark_files/figure-html/bench-headtohead-plot-1.png differ diff --git a/R_functions/geneLME_benchmark_files/figure-html/bench-selective-plot-1.png b/R_functions/geneLME_benchmark_files/figure-html/bench-selective-plot-1.png new file mode 100644 index 0000000..b3e69df Binary files /dev/null and b/R_functions/geneLME_benchmark_files/figure-html/bench-selective-plot-1.png differ diff --git a/R_functions/geneLME_benchmark_files/figure-html/bench-selective-scaling-1.png b/R_functions/geneLME_benchmark_files/figure-html/bench-selective-scaling-1.png new file mode 100644 index 0000000..056f75c Binary files /dev/null and b/R_functions/geneLME_benchmark_files/figure-html/bench-selective-scaling-1.png differ diff --git a/R_functions/geneLME_dev.R b/R_functions/geneLME_dev.R new file mode 100644 index 0000000..cbc61a9 --- /dev/null +++ b/R_functions/geneLME_dev.R @@ -0,0 +1,871 @@ + +######################################################## +# Scalable custom gene LMEs with contrast specification +# DEV version — adds interaction contrast support +######################################################## + + +######################################################## +# geneLME_contrast_spec +# Helper: returns a reference template of available contrast levels, +# formatted for use with the contrast_spec argument of geneLME(). +# +# Two modes depending on whether contrast_vars contains ":": +# +# Interaction mode (e.g. contrast_vars = "treatment:visit"): +# contrast_vars must be a single "var_a:var_b" string. +# Returns a data.frame with columns contrast_ref and contrast_lvl, +# one row per pairwise combination of interaction-level strings. +# Filter this to the contrasts of interest, then pass as +# contrast_spec to geneLME(). +# +# Single-variable mode (e.g. contrast_vars = c("treatment", "visit")): +# contrast_vars is a character vector of one or more plain variable +# names (no ":"). Returns a named list, one element per variable, +# each a data.frame with a single column 'level' listing the sorted +# unique values. The message printed for each variable explains how +# it maps to geneLME() arguments: +# - The first variable in contrast_vars → contrast_vars[1] in +# geneLME(); its levels define the length and position order of +# contrasts_primary vectors. +# - Additional variables → contrast_vars[2], used as the 'by' +# grouping variable; filter its levels via contrast_var_2_levels. +# This list is a reference only — it is not passed to geneLME(). +######################################################## + +geneLME_contrast_spec <- function(targets, contrast_vars) { + + has_interaction <- any(grepl(":", contrast_vars)) + has_plain <- any(!grepl(":", contrast_vars)) + + # Disallow mixing interaction and plain variable names in one call + if (has_interaction && has_plain) { + stop( + "contrast_vars mixes interaction terms (containing ':') and plain variable names.\n", + "Call geneLME_contrast_spec() separately for interaction and non-interaction variables." + ) + } + + if (has_interaction) { + + # ---- Interaction mode: must be a single "var_a:var_b" string ---- + if (length(contrast_vars) != 1) { + stop("Interaction mode requires a single 'var_a:var_b' string in contrast_vars.") + } + + vars <- strsplit(contrast_vars, ":")[[1]] + var_a <- vars[1] + var_b <- vars[2] + + if (!var_a %in% colnames(targets)) stop(paste0("Variable '", var_a, "' not found in targets.")) + if (!var_b %in% colnames(targets)) stop(paste0("Variable '", var_b, "' not found in targets.")) + + # Coerce each variable to a factor, preserving any existing factor level order. + # If the column is already a factor, levels() preserves the user-defined ordering. + # If it is a plain character vector, we impose alphabetical order explicitly. + # This ensures the interaction level string ordering — and therefore which member + # of each pair lands in contrast_ref vs contrast_lvl — is fully deterministic and + # consistent between geneLME_contrast_spec() and emmeans' internal ordering. + fac_a <- if (is.factor(targets[[var_a]])) targets[[var_a]] else factor(targets[[var_a]], levels = sort(unique(targets[[var_a]]))) + fac_b <- if (is.factor(targets[[var_b]])) targets[[var_b]] else factor(targets[[var_b]], levels = sort(unique(targets[[var_b]]))) + + # Build all interaction level strings in the same order emmeans will use. + # interaction() with two ordered factors produces a factor whose levels are + # var_a[1]:var_b[1], var_a[1]:var_b[2], ..., var_a[n]:var_b[m] — matching + # emmeans' default grid ordering (vary the rightmost variable fastest). + ixn_lvls <- levels(interaction(fac_a, fac_b, sep = " ")) + + # combn() traverses ixn_lvls in order, so for every pair the first element + # (contrast_ref) is always at a lower level-index than the second (contrast_lvl). + # This guarantees: level(var_a in ref) <= level(var_a in lvl) AND + # level(var_b in ref) <= level(var_b in lvl). + # Users may freely swap ref/lvl in any row after filtering — geneLME() and + # geneLME_fit() handle either direction correctly (the sign of the contrast + # estimate will flip, but no error will occur). + # Row-position indices for contrasts_secondary construction are added by geneLME() + # to its $contrast_spec output element after the user's filtered spec is received. + pairs <- combn(ixn_lvls, 2, simplify = FALSE) + spec <- data.frame( + contrast_ref = sapply(pairs, `[[`, 1), + contrast_lvl = sapply(pairs, `[[`, 2), + stringsAsFactors = FALSE + ) + + message( + nrow(spec), " pairwise combinations generated for '", contrast_vars, "'.\n", + "Filter this data.frame to your contrasts of interest, then pass as contrast_spec to geneLME().\n", + "geneLME() will attach an indexed copy of contrast_spec to its output ($contrast_spec)\n", + "showing the row-position index used for contrasts_secondary vector construction." + ) + + return(spec) + + } else { + + # ---- Single/multi-variable mode ---- + # Validate all variables exist + missing_vars <- setdiff(contrast_vars, colnames(targets)) + if (length(missing_vars) > 0) { + stop(paste0("Variable(s) not found in targets: ", paste(missing_vars, collapse = ", "))) + } + + result <- lapply(seq_along(contrast_vars), function(i) { + v <- contrast_vars[i] + lvls <- sort(unique(as.character(targets[[v]]))) + spec <- data.frame(level = lvls, stringsAsFactors = FALSE) + + if (i == 1) { + message( + "'", v, "' — primary contrast variable (contrast_vars[1] in geneLME).\n", + length(lvls), " levels (alphabetical order = position order for contrasts_primary vectors):\n", + paste(seq_along(lvls), lvls, sep = ". ", collapse = "\n"), "\n", + "→ contrast vectors passed to contrasts_primary must have length ", length(lvls), ",\n", + " with each element weighted by position (e.g. '", lvls[length(lvls)], " vs ", lvls[1], + "' = c(", paste(c(-1, rep(0, length(lvls) - 2), 1), collapse = ", "), "))." + ) + } else { + message( + "'", v, "' — secondary 'by' grouping variable (contrast_vars[", i, "] in geneLME).\n", + length(lvls), " levels available:\n", + paste(seq_along(lvls), lvls, sep = ". ", collapse = "\n"), "\n", + "→ pass a subset of these levels to contrast_var_2_levels in geneLME() to restrict\n", + " which groups the primary contrasts are computed within." + ) + } + + spec + }) + + names(result) <- contrast_vars + return(result) + } +} + + +######################################################## +# geneLME_fit +# Core per-gene fitting function. Called inside future_lapply. +# Receives only the minimal pre-extracted data needed for one gene, +# not the full EList object. +######################################################## + +geneLME_fit <- + function(gene_name, + expression_vec, # named numeric vector: expression values for this gene + weight_vec, # named numeric vector or NULL: per-sample weights + targets, # data.frame: sample metadata (dat$targets) + formula_str, # character: formula RHS e.g. "~ treatment*visit + (1|ptID)" + run_contrast, + contrast_vars, + contrast_var_2_levels, + contrast_spec, # data.frame (contrast_ref / contrast_lvl) or NULL + contrasts_primary, + contrasts_secondary) { + + result <- tryCatch({ + + # --- BUILD MODEL DATA --- + model_data <- targets + model_data$expression <- expression_vec + + # --- RECONSTRUCT FORMULA LOCALLY --- + # Built from the raw string so its enclosing environment is this call frame. + # lmer resolves 'weight_vec' and other names in local scope — no + # environment stripping needed, no locked-environment errors. + formula_obj <- as.formula(paste("expression", formula_str)) + + # --- FIT MODEL --- + if (is.null(weight_vec)) { + lme_i <- lmer( + formula_obj, + data = model_data, + control = lmerControl(calc.derivs = FALSE) + ) + } else { + lme_i <- lmer( + formula_obj, + weights = weight_vec, + data = model_data, + control = lmerControl(calc.derivs = FALSE) + ) + } + + if (isSingular(lme_i)) stop("Boundary (singular) fit") + + # --- EXTRACT: AIC --- + aic_res <- data.frame(gene = gene_name, AIC = AIC(lme_i)) + + # --- EXTRACT: ANOVA & coefficient summary --- + lme_i_anova <- car::Anova(lme_i) %>% broom.mixed::tidy() + lme_i_summary <- summary(lme_i)$coefficients %>% + as.data.frame() %>% + rownames_to_column("variable") + + # --- BUILD ANOVA TABLE --- + lme_i_anova_tab <- lme_i_anova %>% + rowwise() %>% + mutate( + gene = gene_name, + model_status = "success", + predictor_class = case_when( + grepl(":", term) ~ "interaction", + is.numeric(model_data[[term]]) ~ "continuous", + !is.numeric(model_data[[term]]) & length(unique(model_data[[term]])) == 2 ~ "two-level-categorical", + !is.numeric(model_data[[term]]) & length(unique(model_data[[term]])) > 2 ~ "multi-level-categorical" + ), + Estimate_source = case_when( + predictor_class %in% c("continuous", "two-level-categorical") ~ "lme_summary", + predictor_class == "multi-level-categorical" ~ "seeContrasts", + predictor_class == "interaction" & length(grep(":", lme_i_summary$variable)) == 1 ~ "lme_summary", + predictor_class == "interaction" & length(grep(":", lme_i_summary$variable)) > 1 ~ "seeContrasts" + ), + Estimate = case_when( + predictor_class == "continuous" ~ lme_i_summary$Estimate[match(term, lme_i_summary$variable)][[1]], + predictor_class == "two-level-categorical" ~ lme_i_summary$Estimate[match(term, lme_i_summary$variable)][[1]], + # Guard: grep(":", ...) returns integer(0) when the model has no interaction + # coefficient (e.g. a main-effects-only formula). [[1]] on integer(0) errors, + # so fall through to .default = NA_real_ via the length check. + predictor_class == "interaction" & Estimate_source == "lme_summary" & + length(grep(":", lme_i_summary$variable)) >= 1 ~ + lme_i_summary$Estimate[grep(":", lme_i_summary$variable)[1L]], + .default = NA_real_ + ), + Estimate_SE = case_when( + predictor_class == "continuous" ~ lme_i_summary$`Std. Error`[match(term, lme_i_summary$variable)][[1]], + predictor_class == "two-level-categorical" ~ lme_i_summary$`Std. Error`[match(term, lme_i_summary$variable)][[1]], + predictor_class == "interaction" & Estimate_source == "lme_summary" & + length(grep(":", lme_i_summary$variable)) >= 1 ~ + lme_i_summary$`Std. Error`[grep(":", lme_i_summary$variable)[1L]], + .default = NA_real_ + ) + ) + + # --- CONTRASTS --- + if (isTRUE(run_contrast)) { + + is_interaction <- any(grepl(":", contrast_vars)) + + if (is_interaction) { + + # ---- BRANCH A: Interaction contrast via explicit contrast_spec ---- + # contrast_vars is a single "var_a:var_b" string; contrast_spec is always + # provided (enforced by geneLME() validation). + vars <- strsplit(contrast_vars, ":")[[1]] + var_a <- vars[1] + var_b <- vars[2] + + # Compute EMM object first so we can extract level ordering directly from + # emmeans — this guarantees contrast vector indices align with what emmeans + # uses internally. + emm_obj <- emmeans( + lme_i, + spec = as.formula(paste("~", var_a, "*", var_b)), + data = model_data + ) + + # Extract interaction level labels in the exact order emmeans uses them. + emm_summ <- as.data.frame(summary(emm_obj)) + contrast_lvls <- paste(emm_summ[[var_a]], emm_summ[[var_b]]) + + default_vec <- rep(0, length(contrast_lvls)) + names(default_vec) <- contrast_lvls + + # Validate that all spec values exist as interaction levels + bad_ref <- setdiff(contrast_spec$contrast_ref, contrast_lvls) + bad_lvl <- setdiff(contrast_spec$contrast_lvl, contrast_lvls) + if (length(bad_ref) > 0 || length(bad_lvl) > 0) { + stop(paste0( + "contrast_spec values not found in interaction levels of '", contrast_vars, "': ", + paste(c(bad_ref, bad_lvl), collapse = ", ") + )) + } + + # Build named contrast vector list — one entry per row of contrast_spec. + # Use an empty list and assign by name so the list contains exactly + # nrow(contrast_spec) named non-NULL entries. (Pre-allocating with + # vector("list", n) and assigning by name appends rather than fills, + # leaving the integer-indexed slots as NULL.) + # + # Swap-tolerance: contrast_ref receives -1 and contrast_lvl receives +1. + # If the user has manually swapped ref/lvl in a given row (placing the + # "higher" level-index cell as contrast_ref), the contrast vector and its + # estimate are simply sign-flipped relative to the canonical direction — + # no error occurs and the result is statistically valid. + contrasts_list <- list() + for (k in seq_len(nrow(contrast_spec))) { + cv <- default_vec + cv[which(names(cv) == contrast_spec$contrast_ref[k])] <- -1 + cv[which(names(cv) == contrast_spec$contrast_lvl[k])] <- 1 + contrast_name <- paste(contrast_spec$contrast_lvl[k], + contrast_spec$contrast_ref[k], sep = " - ") + contrasts_list[[contrast_name]] <- cv + } + + emm_1st_A <- contrast(emm_obj, method = contrasts_list, adjust = "none") + + # Build a lookup from contrast name → ref/lvl for joining onto first-order results. + # The contrast name is constructed as "lvl - ref" (same pattern used above when + # building contrasts_list), so it matches the 'contrast' column emmeans returns. + spec_lookup <- data.frame( + contrast = paste(contrast_spec$contrast_lvl, + contrast_spec$contrast_ref, sep = " - "), + contrast_ref = contrast_spec$contrast_ref, + contrast_lvl = contrast_spec$contrast_lvl, + stringsAsFactors = FALSE + ) + + if (!is.null(contrasts_secondary)) { + # suppressWarnings: emmeans internally calls lmer on the first-order contrast + # estimates when computing second-order contrasts; this can emit a benign + # lme4 scale warning when the estimate/SE values differ in magnitude from + # the original predictors. The output is unaffected — the warning is cosmetic. + emm_2nd_A <- suppressWarnings( + contrast(emm_1st_A, method = contrasts_secondary, adjust = "none") + ) + contrast_res <- bind_rows( + # First-order: join ref/lvl from spec_lookup by contrast name + as.data.frame(emm_1st_A) %>% + mutate(contrast_order = "first_order") %>% + left_join(spec_lookup, by = "contrast"), + # Second-order: contrasts-of-contrasts have no single ref/lvl pair + as.data.frame(emm_2nd_A) %>% + mutate(contrast_order = "second_order", + contrast_ref = NA_character_, + contrast_lvl = NA_character_) + ) %>% mutate(gene = gene_name) + } else { + contrast_res <- as.data.frame(emm_1st_A) %>% + mutate(contrast_order = "first_order") %>% + left_join(spec_lookup, by = "contrast") %>% + mutate(gene = gene_name) + } + + } else { + + # ---- BRANCH B: Non-interaction ---- + spec_formula <- as.formula(paste("~", paste(contrast_vars, collapse = "|"))) + + by_list <- if (length(contrast_vars) == 2 && !is.null(contrast_var_2_levels)) { + setNames(list(contrast_var_2_levels), contrast_vars[2]) + } else { + NULL + } + + emm_1st <- emmeans(lme_i, spec = spec_formula, at = by_list, data = model_data) %>% + contrast(method = contrasts_primary, adjust = "none") + + if (!is.null(contrasts_secondary)) { + # suppressWarnings: same benign lme4 scale warning as Branch A second-order step. + emm_2nd <- suppressWarnings( + contrast(emm_1st, method = contrasts_secondary, adjust = "none") + ) + contrast_res <- bind_rows( + # Branch B uses named contrast vectors, not a ref/lvl spec — set NA for both orders + as.data.frame(emm_1st) %>% + mutate(contrast_order = "first_order", + contrast_ref = NA_character_, + contrast_lvl = NA_character_), + as.data.frame(emm_2nd) %>% + mutate(contrast_order = "second_order", + contrast_ref = NA_character_, + contrast_lvl = NA_character_) + ) %>% mutate(gene = gene_name) + } else { + contrast_res <- as.data.frame(emm_1st) %>% + mutate(contrast_order = "first_order", + contrast_ref = NA_character_, + contrast_lvl = NA_character_, + gene = gene_name) + } + } + + } else { + contrast_res <- NULL + } + + list( + aic = aic_res, + anova = lme_i_anova_tab, + contrasts = contrast_res, + model_status = setNames("success", gene_name) + ) + + }, error = function(e) { + err_msg <- conditionMessage(e) + list( + aic = data.frame(gene = gene_name, AIC = NA_real_), + anova = data.frame( + term = NA_character_, + statistic = NA_real_, + df = NA_real_, + p.value = NA_real_, + gene = gene_name, + model_status = err_msg, + predictor_class = NA_character_, + Estimate_source = NA_character_, + Estimate = NA_real_, + Estimate_SE = NA_real_ + ), + contrasts = data.frame( + contrast = NA_character_, + estimate = NA_real_, + SE = NA_real_, + df = NA_real_, + t.ratio = NA_real_, + p.value = NA_real_, + contrast_order = NA_character_, + contrast_ref = NA_character_, + contrast_lvl = NA_character_, + gene = gene_name + ), + model_status = setNames(err_msg, gene_name) + ) + }) + + return(result) + } + + +######################################################## +# geneLME_compiler +# Aggregates list of per-gene results into named result tables, +# then appends FDR-adjusted p-values within each grouping unit. +# +# FDR grouping strategy: +# lme_anova: adjust within each model term (across all genes). +# Each term's p-values form one adjustment set. +# lme_contrast: adjust within each contrast x contrast_order combination +# (across all genes). Branch B contrast labels already encode +# the 'by' variable level (e.g. "TrtC vs TrtA, visit = V2"), +# so grouping by contrast alone is sufficient. +# NA p-values (failed gene models) are preserved as NA in p.value_adj. +######################################################## + +geneLME_compiler <- function(fit, fdr_method = "BH", contrast_spec = NULL) { + + lme_anova <- map_dfr(fit, "anova") %>% + group_by(term) %>% + mutate(p.value_adj = p.adjust(p.value, method = fdr_method)) %>% + ungroup() + + lme_contrast_raw <- map_dfr(fit, "contrasts") + lme_contrast <- if (ncol(lme_contrast_raw) > 0 && + all(c("contrast", "contrast_order") %in% colnames(lme_contrast_raw))) { + lme_contrast_raw %>% + group_by(contrast, contrast_order) %>% + mutate(p.value_adj = p.adjust(p.value, method = fdr_method)) %>% + ungroup() + } else { + lme_contrast_raw # no contrasts run; return as-is (empty or NULL-row stub) + } + + list( + lme_anova = lme_anova, + lme_contrast = lme_contrast, + lme_fit = map_dfr(fit, "aic"), + lme_err = map_chr(fit, "model_status"), + contrast_spec = contrast_spec # indexed spec; NULL when no contrasts run + ) +} + + +######################################################## +# geneLME_dispatch +# Runs future_lapply with explicit global declaration to prevent +# future's automatic environment scan from capturing large objects. +# +# Key design decisions: +# 1. Iterate over an integer index — a plain integer sequence carries +# no environment baggage for future to scan. +# 2. All shared objects passed via future.globals, bypassing automatic +# scanning entirely. +######################################################## + +geneLME_dispatch <- function(gene_data_list, + targets_df, + formula_str, + run_contrast, + contrast_vars, + contrast_var_2_levels, + contrast_spec, + contrasts_primary, + contrasts_secondary) { + n_genes <- length(gene_data_list) + + future_lapply( + seq_len(n_genes), + FUN = function(i) { + gene_data <- gene_data_list[[i]] + geneLME_fit( + gene_name = gene_data$gene_name, + expression_vec = gene_data$expression_vec, + weight_vec = gene_data$weight_vec, + targets = targets_df, + formula_str = formula_str, + run_contrast = run_contrast, + contrast_vars = contrast_vars, + contrast_var_2_levels = contrast_var_2_levels, + contrast_spec = contrast_spec, + contrasts_primary = contrasts_primary, + contrasts_secondary = contrasts_secondary + ) + }, + future.globals = list( + gene_data_list = gene_data_list, + targets_df = targets_df, + formula_str = formula_str, + run_contrast = run_contrast, + contrast_vars = contrast_vars, + contrast_var_2_levels = contrast_var_2_levels, + contrast_spec = contrast_spec, + contrasts_primary = contrasts_primary, + contrasts_secondary = contrasts_secondary, + geneLME_fit = geneLME_fit + ), + future.packages = c("lme4", "emmeans", "car", "broom.mixed", "dplyr", "tibble"), + future.seed = TRUE + ) +} + + +######################################################## +# geneLME +# User-facing wrapper: validates inputs, sets up parallel plan, +# pre-extracts per-gene data, dispatches geneLME_fit in parallel. +######################################################## + +geneLME <- + function(dat, + formula_str, + model_weights = NULL, + run_contrast = NULL, + contrast_vars = NULL, + contrast_var_2_levels = NULL, + contrast_spec = NULL, # data.frame(contrast_ref, contrast_lvl) or NULL + # required when contrast_vars contains ":" + contrasts_primary = NULL, + contrasts_secondary = NULL, + fdr_method = "BH", # any method accepted by p.adjust() + n_cores = NULL) { + + # --- PRE-FLIGHT VALIDATION --- + + # Build a local formula object solely for variable checking — never + # passed to workers. Workers reconstruct the formula from formula_str. + formula_obj_local <- as.formula(paste("expression", formula_str)) + + # Initialise indexed_contrast_spec to NULL; populated inside the contrast + # validation block below (Branch A or Branch B) and passed to the compiler. + indexed_contrast_spec <- NULL + + # 1. Formula variables present in targets + required_vars <- all.vars(formula_obj_local) + required_vars <- required_vars[required_vars != "expression"] + + missing_vars <- setdiff(required_vars, colnames(dat$targets)) + if (length(missing_vars) > 0) { + stop(paste( + "The following variables in the formula are missing from dat$targets:", + paste(missing_vars, collapse = ", ") + )) + } + + # 2. Weights alignment + if (isTRUE(model_weights)) { + if (is.null(dat$weights)) { + stop("model_weights = TRUE but dat$weights is NULL.") + } + if (!identical(dim(dat$weights), dim(dat$E))) { + stop("Dimensions of dat$weights do not match dat$E.") + } + } + + # 3. Contrast variables and spec + if (isTRUE(run_contrast)) { + + if (is.null(contrast_vars)) { + stop("run_contrast = TRUE but contrast_vars is NULL.") + } + + is_interaction <- any(grepl(":", contrast_vars)) + + # 3a. Formula-vs-contrast_vars consistency check. + # Uses terms() so that both "a*b" and "a:b" formula syntax are handled correctly — + # "a*b" expands to include "a:b" in term labels, so both forms are detected. + formula_terms <- attr(terms(formula_obj_local), "term.labels") + + if (is_interaction) { + # For interaction contrasts, verify the interaction term is actually in the model. + # emmeans will silently run contrasts on additive margins if the interaction is absent, + # which is statistically misleading. + ixn_term <- contrast_vars[grep(":", contrast_vars)][1] # e.g. "treatment:visit" + if (!ixn_term %in% formula_terms) { + stop(paste0( + "contrast_vars specifies an interaction contrast for '", ixn_term, "', ", + "but this interaction term is not present in formula_str.\n", + "Either add the interaction to the formula (e.g. '~ treatment * visit + ...') ", + "or change contrast_vars to a non-interaction term." + )) + } + } else { + # For non-interaction contrasts, verify each contrast variable is in the formula + # as a main effect. Warn (not error) in case it is part of an interaction only. + for (cv in contrast_vars) { + if (!cv %in% formula_terms) { + warning(paste0( + "contrast_vars includes '", cv, "' but this term does not appear as a ", + "main effect in formula_str. Contrasts may be computed from interaction ", + "margins only — verify this is the intended model structure." + )) + } + } + } + + # 3b. contrast_spec validation (required for interaction contrasts) + if (is_interaction && is.null(contrast_spec)) { + stop(paste0( + "contrast_vars specifies an interaction contrast ('", contrast_vars, "') ", + "but contrast_spec is NULL.\n", + "Use geneLME_contrast_spec(dat$targets, contrast_vars = '", contrast_vars, + "') to generate a template, filter it to your contrasts of interest, ", + "then pass it as contrast_spec." + )) + } + + if (!is.null(contrast_spec)) { + if (!is.data.frame(contrast_spec)) { + stop("contrast_spec must be a data.frame.") + } + if (!all(c("contrast_ref", "contrast_lvl") %in% colnames(contrast_spec))) { + stop("contrast_spec must have columns 'contrast_ref' and 'contrast_lvl'.") + } + if (!is_interaction) { + warning("contrast_spec is provided but contrast_vars contains no interaction term (':'). ", + "contrast_spec is designed for interaction contrasts — did you mean to use contrasts_primary?") + } + + # Build the indexed contrast_spec that will be attached to the return value. + # contrast_index here is simply 1:nrow(contrast_spec) — the actual row position + # within the filtered spec passed by the user. This is what contrasts_secondary + # vectors must index into (not any index from the full unfiltered template). + indexed_contrast_spec <- contrast_spec %>% + mutate(contrast_index = seq_len(n())) %>% + select(contrast_index, everything()) + + # Print the indexed spec as a reminder whenever contrasts_secondary is provided. + if (is_interaction && !is.null(contrasts_secondary)) { + n_first <- nrow(contrast_spec) + message( + "contrasts_secondary will be applied to the first-order interaction contrasts ", + "in the order they appear in contrast_spec (", n_first, " contrasts):\n", + paste(seq_len(n_first), + paste(contrast_spec$contrast_lvl, contrast_spec$contrast_ref, sep = " - "), + sep = ". ", collapse = "\n"), + "\nEnsure contrasts_secondary vectors have length ", n_first, + ", with each element corresponding to the contrast at that position.", + "\nThe indexed contrast_spec is returned as $contrast_spec in the output." + ) + + # Soft-fail: wrong-length vectors produce silent NAs deep inside geneLME_fit(). + # Catch this here and return early with only $contrast_spec populated so the + # user has the indexed reference they need to fix their vectors. + bad_lens <- sapply(contrasts_secondary, length) + bad_names <- names(bad_lens)[bad_lens != n_first] + if (length(bad_names) > 0) { + message( + "\ncontrasts_secondary vector(s) have wrong length — returning early without running models.\n", + "Expected length ", n_first, " (= nrow(contrast_spec) after filtering).\n", + "Offending vector(s):\n", + paste0(" '", bad_names, "': length ", bad_lens[bad_names], collapse = "\n"), "\n\n", + "Common cause: vectors were built against nrow() of the full unfiltered\n", + "spec_template rather than nrow() of the filtered contrast_spec passed here.\n", + "Use $contrast_spec in the returned object to re-specify your secondary contrast vectors:\n", + " each vector must have length ", n_first, ", one element per row of $contrast_spec." + ) + return(invisible(list( + lme_anova = NULL, + lme_contrast = NULL, + lme_fit = NULL, + lme_err = NULL, + contrast_spec = indexed_contrast_spec + ))) + } + } + } + + # 3c. contrast_vars present in targets (split on ":" for interaction) + vars_to_check <- unique(unlist(strsplit(contrast_vars, ":"))) + missing_contrast_vars <- setdiff(vars_to_check, colnames(dat$targets)) + if (length(missing_contrast_vars) > 0) { + stop(paste( + "contrast_vars not found in dat$targets:", + paste(missing_contrast_vars, collapse = ", ") + )) + } + + # 3d. contrast_var_2_levels validation (Branch B only) + if (!is_interaction && length(contrast_vars) == 2 && !is.null(contrast_var_2_levels)) { + actual_levels <- unique(as.character(dat$targets[[contrast_vars[2]]])) + invalid_levels <- setdiff(contrast_var_2_levels, actual_levels) + if (length(invalid_levels) > 0) { + stop(paste0( + "Levels specified for '", contrast_vars[2], + "' not found in data: ", paste(invalid_levels, collapse = ", ") + )) + } + } + + # 3e. Branch B: build indexed contrast_spec from contrasts_primary names, + # and soft-fail on wrong-length contrasts_secondary vectors. + if (!is_interaction && !is.null(contrasts_primary)) { + indexed_contrast_spec <- data.frame( + contrast_index = seq_along(contrasts_primary), + contrast_name = names(contrasts_primary), + stringsAsFactors = FALSE + ) + + if (!is.null(contrasts_secondary)) { + n_primary <- length(contrasts_primary) + bad_lens <- sapply(contrasts_secondary, length) + bad_names <- names(bad_lens)[bad_lens != n_primary] + if (length(bad_names) > 0) { + message( + "\ncontrasts_secondary vector(s) have wrong length — returning early without running models.\n", + "Expected length ", n_primary, " (= number of contrasts_primary vectors).\n", + "Offending vector(s):\n", + paste0(" '", bad_names, "': length ", bad_lens[bad_names], collapse = "\n"), "\n\n", + "Use $contrast_spec in the returned object to confirm the primary contrast ordering:\n", + " each contrasts_secondary vector must have length ", n_primary, + ", one element per row of $contrast_spec." + ) + return(invisible(list( + lme_anova = NULL, + lme_contrast = NULL, + lme_fit = NULL, + lme_err = NULL, + contrast_spec = indexed_contrast_spec + ))) + } + } + } + } + + # 4. fdr_method must be a recognised p.adjust method + valid_fdr_methods <- p.adjust.methods # exported character vector from stats + if (!fdr_method %in% valid_fdr_methods) { + stop(paste0( + "'", fdr_method, "' is not a valid p.adjust method.\n", + "Choose one of: ", paste(valid_fdr_methods, collapse = ", ") + )) + } + + # 5. Warn on likely ID columns used as predictors + for (v in required_vars) { + if (is.character(dat$targets[[v]]) && + length(unique(dat$targets[[v]])) > nrow(dat$targets) / 2) { + warning(paste0( + "Variable '", v, "' has many unique string values — is it an ID column rather than a predictor?" + )) + } + } + + message("Input validation passed. Launching parallel LME fits...") + + # --- PRE-SLICE INTO PER-GENE LIST --- + W_mat <- if (isTRUE(model_weights)) dat$weights else NULL + + gene_data_list <- lapply(rownames(dat$E), function(g) { + list( + gene_name = g, + expression_vec = dat$E[g, ], + weight_vec = if (!is.null(W_mat)) W_mat[g, ] else NULL + ) + }) + + targets_df <- dat$targets + + # --- PARALLEL PLAN --- + workers <- if (is.null(n_cores)) max(1L, parallel::detectCores() - 4L) else n_cores + plan(multisession, workers = workers) + on.exit(plan(sequential), add = TRUE) + + # --- DISPATCH via clean-scope helper --- + fit_results <- geneLME_dispatch( + gene_data_list = gene_data_list, + targets_df = targets_df, + formula_str = formula_str, + run_contrast = run_contrast, + contrast_vars = contrast_vars, + contrast_var_2_levels = contrast_var_2_levels, + contrast_spec = contrast_spec, + contrasts_primary = contrasts_primary, + contrasts_secondary = contrasts_secondary + ) + + return(geneLME_compiler(fit_results, fdr_method = fdr_method, + contrast_spec = indexed_contrast_spec)) + } + + +######################################################## +# Testing (see geneLME_test.R for full test suite) +######################################################## + +# --- Branch A: interaction contrast via contrast_spec --- +# # Step 1: generate level reference template +# spec_template <- geneLME_contrast_spec( +# targets = dat$targets, +# contrast_vars = "treatment:visit" # interaction string +# ) +# # Step 2: filter to contrasts of interest +# my_spec <- spec_template %>% +# dplyr::filter(...) # e.g. same-visit cross-treatment, or within-treatment longitudinal +# +# test_mods_A <- +# geneLME( +# dat = dat_sub, +# formula_str = "~ treatment * visit + age + sex + rNANgUl + (1|ptID)", +# model_weights = TRUE, +# run_contrast = TRUE, +# contrast_vars = "treatment:visit", # must match an interaction term in formula_str +# contrast_spec = my_spec, # required for interaction contrasts +# n_cores = 10 +# ) +# +# # Optional: second-order contrasts on top of Branch A first-order contrasts. +# # Vectors must have length == nrow(my_spec), ordered to match my_spec rows. +# # geneLME() will print the numbered list of first-order contrasts at validation time. +# test_mods_A2 <- +# geneLME( +# dat = dat_sub, +# formula_str = "~ treatment * visit + age + sex + rNANgUl + (1|ptID)", +# model_weights = TRUE, +# run_contrast = TRUE, +# contrast_vars = "treatment:visit", +# contrast_spec = my_spec, +# contrasts_secondary = list( +# "TrtA vs TrtB: V3 minus V2 effect" = c(1, 0, -1, 0, 0, 0), +# "TrtA vs TrtC: V3 minus V2 effect" = c(0, 1, 0, -1, 0, 0) +# ), +# n_cores = 10 +# ) + +# --- Branch B: non-interaction --- +# # Step 1 (optional): inspect available levels for each contrast variable +# geneLME_contrast_spec(dat$targets, contrast_vars = "treatment") # reference only +# # Treatment levels (alphabetical): TrtA, TrtB, TrtC +# # contrasts_primary vectors have length 3: positions = [TrtA, TrtB, TrtC] +# +# test_mods_B <- +# geneLME( +# dat = dat_sub, +# formula_str = "~ treatment + visit + age + sex + rNANgUl + (1|ptID)", +# model_weights = TRUE, +# run_contrast = TRUE, +# contrast_vars = c("treatment", "visit"), +# contrast_var_2_levels = c("V2", "V3"), +# contrasts_primary = list("TrtC vs TrtA" = c(-1, 0, 1), +# "TrtB vs TrtA" = c(-1, 1, 0)), +# contrasts_secondary = list("TrtC vs TrtB" = c(1, -1)), +# n_cores = 10 +# ) diff --git a/R_functions/geneLME_dev2.R b/R_functions/geneLME_dev2.R new file mode 100644 index 0000000..f947290 --- /dev/null +++ b/R_functions/geneLME_dev2.R @@ -0,0 +1,908 @@ + +######################################################## +# Scalable custom gene LMEs with contrast specification +######################################################## +# Dev2 changes from geneLME.R (2026-02-20): +# 1. Singular fit → warning flag instead of error. +# isSingular() no longer stops the model; instead model_status is +# set to "singular_fit" and results are returned with a flag column. +# Users can filter on model_status downstream. +# 2. Branch A contrast structures pre-computed outside the per-gene loop. +# geneLME_build_contrast_args() runs once in geneLME() before the +# parallel stage and returns contrasts_list (named vector list) and +# spec_lookup (ref/lvl join table). Workers receive these ready-made +# objects instead of rebuilding them from contrast_spec on every gene. +######################################################## + + +######################################################## +# geneLME_contrast_spec +# Helper: returns a reference template of available contrast levels, +# formatted for use with the contrast_spec argument of geneLME(). +# +# Two modes depending on whether contrast_vars contains ":": +# +# Interaction mode (e.g. contrast_vars = "treatment:visit"): +# contrast_vars must be a single "var_a:var_b" string. +# Returns a data.frame with columns contrast_ref and contrast_lvl, +# one row per pairwise combination of interaction-level strings. +# Filter this to the contrasts of interest, then pass as +# contrast_spec to geneLME(). +# +# Single-variable mode (e.g. contrast_vars = c("treatment", "visit")): +# contrast_vars is a character vector of one or more plain variable +# names (no ":"). Returns a named list, one element per variable, +# each a data.frame with a single column 'level' listing the sorted +# unique values. The message printed for each variable explains how +# it maps to geneLME() arguments: +# - The first variable in contrast_vars → contrast_vars[1] in +# geneLME(); its levels define the length and position order of +# contrasts_primary vectors. +# - Additional variables → contrast_vars[2], used as the 'by' +# grouping variable; filter its levels via contrast_var_2_levels. +# This list is a reference only — it is not passed to geneLME(). +######################################################## + +geneLME_contrast_spec <- function(targets, contrast_vars) { + + has_interaction <- any(grepl(":", contrast_vars)) + has_plain <- any(!grepl(":", contrast_vars)) + + # Disallow mixing interaction and plain variable names in one call + if (has_interaction && has_plain) { + stop( + "contrast_vars mixes interaction terms (containing ':') and plain variable names.\n", + "Call geneLME_contrast_spec() separately for interaction and non-interaction variables." + ) + } + + if (has_interaction) { + + # ---- Interaction mode: must be a single "var_a:var_b" string ---- + if (length(contrast_vars) != 1) { + stop("Interaction mode requires a single 'var_a:var_b' string in contrast_vars.") + } + + vars <- strsplit(contrast_vars, ":")[[1]] + var_a <- vars[1] + var_b <- vars[2] + + if (!var_a %in% colnames(targets)) stop(paste0("Variable '", var_a, "' not found in targets.")) + if (!var_b %in% colnames(targets)) stop(paste0("Variable '", var_b, "' not found in targets.")) + + # Coerce each variable to a factor, preserving any existing factor level order. + # If the column is already a factor, levels() preserves the user-defined ordering. + # If it is a plain character vector, we impose alphabetical order explicitly. + # This ensures the interaction level string ordering — and therefore which member + # of each pair lands in contrast_ref vs contrast_lvl — is fully deterministic and + # consistent between geneLME_contrast_spec() and emmeans' internal ordering. + fac_a <- if (is.factor(targets[[var_a]])) targets[[var_a]] else factor(targets[[var_a]], levels = sort(unique(targets[[var_a]]))) + fac_b <- if (is.factor(targets[[var_b]])) targets[[var_b]] else factor(targets[[var_b]], levels = sort(unique(targets[[var_b]]))) + + # Build all interaction level strings in the same order emmeans will use. + # interaction() with two ordered factors produces a factor whose levels are + # var_a[1]:var_b[1], var_a[1]:var_b[2], ..., var_a[n]:var_b[m] — matching + # emmeans' default grid ordering (vary the rightmost variable fastest). + ixn_lvls <- levels(interaction(fac_a, fac_b, sep = " ")) + + # combn() traverses ixn_lvls in order, so for every pair the first element + # (contrast_ref) is always at a lower level-index than the second (contrast_lvl). + # This guarantees: level(var_a in ref) <= level(var_a in lvl) AND + # level(var_b in ref) <= level(var_b in lvl). + # Users may freely swap ref/lvl in any row after filtering — geneLME() and + # geneLME_fit() handle either direction correctly (the sign of the contrast + # estimate will flip, but no error will occur). + # Row-position indices for contrasts_secondary construction are added by geneLME() + # to its $contrast_spec output element after the user's filtered spec is received. + pairs <- combn(ixn_lvls, 2, simplify = FALSE) + spec <- data.frame( + contrast_ref = sapply(pairs, `[[`, 1), + contrast_lvl = sapply(pairs, `[[`, 2), + stringsAsFactors = FALSE + ) + + message( + nrow(spec), " pairwise combinations generated for '", contrast_vars, "'.\n", + "Filter this data.frame to your contrasts of interest, then pass as contrast_spec to geneLME().\n", + "geneLME() will attach an indexed copy of contrast_spec to its output ($contrast_spec)\n", + "showing the row-position index used for contrasts_secondary vector construction." + ) + + return(spec) + + } else { + + # ---- Single/multi-variable mode ---- + # Validate all variables exist + missing_vars <- setdiff(contrast_vars, colnames(targets)) + if (length(missing_vars) > 0) { + stop(paste0("Variable(s) not found in targets: ", paste(missing_vars, collapse = ", "))) + } + + result <- lapply(seq_along(contrast_vars), function(i) { + v <- contrast_vars[i] + lvls <- sort(unique(as.character(targets[[v]]))) + spec <- data.frame(level = lvls, stringsAsFactors = FALSE) + + if (i == 1) { + message( + "'", v, "' — primary contrast variable (contrast_vars[1] in geneLME).\n", + length(lvls), " levels (alphabetical order = position order for contrasts_primary vectors):\n", + paste(seq_along(lvls), lvls, sep = ". ", collapse = "\n"), "\n", + "→ contrast vectors passed to contrasts_primary must have length ", length(lvls), ",\n", + " with each element weighted by position (e.g. '", lvls[length(lvls)], " vs ", lvls[1], + "' = c(", paste(c(-1, rep(0, length(lvls) - 2), 1), collapse = ", "), "))." + ) + } else { + message( + "'", v, "' — secondary 'by' grouping variable (contrast_vars[", i, "] in geneLME).\n", + length(lvls), " levels available:\n", + paste(seq_along(lvls), lvls, sep = ". ", collapse = "\n"), "\n", + "→ pass a subset of these levels to contrast_var_2_levels in geneLME() to restrict\n", + " which groups the primary contrasts are computed within." + ) + } + + spec + }) + + names(result) <- contrast_vars + return(result) + } +} + + +######################################################## +# geneLME_build_contrast_args +# Pre-computes Branch A contrast structures once, before parallel dispatch. +# +# Called by geneLME() when contrast_vars contains ":". +# Returns a list with two elements that are passed directly to workers: +# +# $contrasts_list — named list of contrast vectors, one per row of +# contrast_spec. Each vector has length == number of interaction cells, +# named by emmeans level string ("TrtA V1", "TrtB V2", ...). +# contrast_ref cell receives -1; contrast_lvl cell receives +1. +# Name of each vector is "contrast_lvl - contrast_ref" — the string +# emmeans will use as the 'contrast' column in its output. +# +# $spec_lookup — data.frame(contrast, contrast_ref, contrast_lvl) for +# joining ref/lvl back onto the emmeans first-order result by contrast +# name. Avoids reconstructing this inside every geneLME_fit() call. +# +# The emmeans level ordering is derived from factor levels in targets (the +# same source emmeans uses internally), guaranteeing index alignment without +# requiring a fitted model object. +######################################################## + +geneLME_build_contrast_args <- function(targets, contrast_vars, contrast_spec) { + + vars <- strsplit(contrast_vars, ":")[[1]] + var_a <- vars[1] + var_b <- vars[2] + + # Reproduce the exact factor-level ordering that emmeans will use. + # emmeans builds its reference grid from the factor levels of the model + # data (targets); using the same coercion logic as geneLME_contrast_spec() + # guarantees the level strings here match those emmeans returns. + fac_a <- if (is.factor(targets[[var_a]])) targets[[var_a]] else factor(targets[[var_a]], levels = sort(unique(targets[[var_a]]))) + fac_b <- if (is.factor(targets[[var_b]])) targets[[var_b]] else factor(targets[[var_b]], levels = sort(unique(targets[[var_b]]))) + + # Build interaction level strings: "var_a_level var_b_level" in emmeans order. + # emmeans varies the rightmost variable fastest, matching interaction() default. + lvls_a <- levels(fac_a) + lvls_b <- levels(fac_b) + emm_lvls <- as.vector(outer(lvls_a, lvls_b, paste)) # row = var_a, col = var_b + + # Template zero vector, one element per interaction cell + default_vec <- rep(0, length(emm_lvls)) + names(default_vec) <- emm_lvls + + # Build contrasts_list once — identical for all genes + contrasts_list <- list() + for (k in seq_len(nrow(contrast_spec))) { + cv <- default_vec + cv[contrast_spec$contrast_ref[k]] <- -1 + cv[contrast_spec$contrast_lvl[k]] <- 1 + contrast_name <- paste(contrast_spec$contrast_lvl[k], + contrast_spec$contrast_ref[k], sep = " - ") + contrasts_list[[contrast_name]] <- cv + } + + # Build spec_lookup once — identical for all genes + spec_lookup <- data.frame( + contrast = paste(contrast_spec$contrast_lvl, + contrast_spec$contrast_ref, sep = " - "), + contrast_ref = contrast_spec$contrast_ref, + contrast_lvl = contrast_spec$contrast_lvl, + stringsAsFactors = FALSE + ) + + list(contrasts_list = contrasts_list, spec_lookup = spec_lookup) +} + + +######################################################## +# geneLME_fit +# Core per-gene fitting function. Called inside future_lapply. +# Receives only the minimal pre-extracted data needed for one gene, +# not the full EList object. +# +# Dev2 changes: +# - Singular fit is no longer an error; model_status = "singular_fit" +# is recorded and results are returned for user-side filtering. +# - Branch A receives pre-computed contrasts_list + spec_lookup instead +# of contrast_spec. The per-gene contrast vector loop is eliminated. +######################################################## + +geneLME_fit <- + function(gene_name, + expression_vec, # named numeric vector: expression values for this gene + weight_vec, # named numeric vector or NULL: per-sample weights + targets, # data.frame: sample metadata (dat$targets) + formula_str, # character: formula RHS e.g. "~ treatment*visit + (1|ptID)" + run_contrast, + contrast_vars, + contrast_var_2_levels, + contrasts_list, # Branch A: pre-computed named contrast vector list, or NULL + spec_lookup, # Branch A: pre-computed ref/lvl join table, or NULL + contrasts_primary, # Branch B: named list of contrast vectors, or NULL + contrasts_secondary) { # both branches: second-order contrast vectors, or NULL + + result <- tryCatch({ + + # --- BUILD MODEL DATA --- + model_data <- targets + model_data$expression <- expression_vec + + # --- RECONSTRUCT FORMULA LOCALLY --- + # Built from the raw string so its enclosing environment is this call frame. + # lmer resolves 'weight_vec' and other names in local scope — no + # environment stripping needed, no locked-environment errors. + formula_obj <- as.formula(paste("expression", formula_str)) + + # --- FIT MODEL --- + if (is.null(weight_vec)) { + lme_i <- lmer( + formula_obj, + data = model_data, + control = lmerControl(calc.derivs = FALSE) + ) + } else { + lme_i <- lmer( + formula_obj, + weights = weight_vec, + data = model_data, + control = lmerControl(calc.derivs = FALSE) + ) + } + + # --- SINGULAR FIT: flag, do not stop --- + # isSingular() indicates the random effect variance hit its boundary (zero). + # Fixed effect estimates and contrasts are still numerically valid; the user + # should decide whether to retain or filter these genes downstream. + # model_status records the flag; "success" indicates a clean fit. + model_status <- if (isSingular(lme_i)) "singular_fit" else "success" + + # --- EXTRACT: AIC --- + aic_res <- data.frame(gene = gene_name, AIC = AIC(lme_i)) + + # --- EXTRACT: ANOVA & coefficient summary --- + lme_i_anova <- car::Anova(lme_i) %>% broom.mixed::tidy() + lme_i_summary <- summary(lme_i)$coefficients %>% + as.data.frame() %>% + rownames_to_column("variable") + + # --- BUILD ANOVA TABLE --- + lme_i_anova_tab <- lme_i_anova %>% + rowwise() %>% + mutate( + gene = gene_name, + model_status = model_status, # "success" or "singular_fit" + predictor_class = case_when( + grepl(":", term) ~ "interaction", + is.numeric(model_data[[term]]) ~ "continuous", + !is.numeric(model_data[[term]]) & length(unique(model_data[[term]])) == 2 ~ "two-level-categorical", + !is.numeric(model_data[[term]]) & length(unique(model_data[[term]])) > 2 ~ "multi-level-categorical" + ), + Estimate_source = case_when( + predictor_class %in% c("continuous", "two-level-categorical") ~ "lme_summary", + predictor_class == "multi-level-categorical" ~ "seeContrasts", + predictor_class == "interaction" & length(grep(":", lme_i_summary$variable)) == 1 ~ "lme_summary", + predictor_class == "interaction" & length(grep(":", lme_i_summary$variable)) > 1 ~ "seeContrasts" + ), + Estimate = case_when( + predictor_class == "continuous" ~ lme_i_summary$Estimate[match(term, lme_i_summary$variable)][[1]], + predictor_class == "two-level-categorical" ~ lme_i_summary$Estimate[match(term, lme_i_summary$variable)][[1]], + # Guard: grep(":", ...) returns integer(0) when the model has no interaction + # coefficient (e.g. a main-effects-only formula). [[1]] on integer(0) errors, + # so fall through to .default = NA_real_ via the length check. + predictor_class == "interaction" & Estimate_source == "lme_summary" & + length(grep(":", lme_i_summary$variable)) >= 1 ~ + lme_i_summary$Estimate[grep(":", lme_i_summary$variable)[1L]], + .default = NA_real_ + ), + Estimate_SE = case_when( + predictor_class == "continuous" ~ lme_i_summary$`Std. Error`[match(term, lme_i_summary$variable)][[1]], + predictor_class == "two-level-categorical" ~ lme_i_summary$`Std. Error`[match(term, lme_i_summary$variable)][[1]], + predictor_class == "interaction" & Estimate_source == "lme_summary" & + length(grep(":", lme_i_summary$variable)) >= 1 ~ + lme_i_summary$`Std. Error`[grep(":", lme_i_summary$variable)[1L]], + .default = NA_real_ + ) + ) + + # --- CONTRASTS --- + if (isTRUE(run_contrast)) { + + is_interaction <- any(grepl(":", contrast_vars)) + + if (is_interaction) { + + # ---- BRANCH A: Interaction contrast via pre-computed contrasts_list ---- + # contrasts_list and spec_lookup were built once in geneLME() before + # parallel dispatch — no per-gene reconstruction needed here. + vars <- strsplit(contrast_vars, ":")[[1]] + var_a <- vars[1] + var_b <- vars[2] + + emm_obj <- emmeans( + lme_i, + spec = as.formula(paste("~", var_a, "*", var_b)), + data = model_data + ) + + emm_1st_A <- contrast(emm_obj, method = contrasts_list, adjust = "none") + + if (!is.null(contrasts_secondary)) { + # suppressWarnings: emmeans internally calls lmer on the first-order contrast + # estimates when computing second-order contrasts; this can emit a benign + # lme4 scale warning when the estimate/SE values differ in magnitude from + # the original predictors. The output is unaffected — the warning is cosmetic. + emm_2nd_A <- suppressWarnings( + contrast(emm_1st_A, method = contrasts_secondary, adjust = "none") + ) + contrast_res <- bind_rows( + # First-order: join ref/lvl from pre-computed spec_lookup by contrast name + as.data.frame(emm_1st_A) %>% + mutate(contrast_order = "first_order") %>% + left_join(spec_lookup, by = "contrast"), + # Second-order: contrasts-of-contrasts have no single ref/lvl pair + as.data.frame(emm_2nd_A) %>% + mutate(contrast_order = "second_order", + contrast_ref = NA_character_, + contrast_lvl = NA_character_) + ) %>% mutate(gene = gene_name, model_status = model_status) + } else { + contrast_res <- as.data.frame(emm_1st_A) %>% + mutate(contrast_order = "first_order") %>% + left_join(spec_lookup, by = "contrast") %>% + mutate(gene = gene_name, model_status = model_status) + } + + } else { + + # ---- BRANCH B: Non-interaction ---- + spec_formula <- as.formula(paste("~", paste(contrast_vars, collapse = "|"))) + + by_list <- if (length(contrast_vars) == 2 && !is.null(contrast_var_2_levels)) { + setNames(list(contrast_var_2_levels), contrast_vars[2]) + } else { + NULL + } + + emm_1st <- emmeans(lme_i, spec = spec_formula, at = by_list, data = model_data) %>% + contrast(method = contrasts_primary, adjust = "none") + + if (!is.null(contrasts_secondary)) { + # suppressWarnings: same benign lme4 scale warning as Branch A second-order step. + emm_2nd <- suppressWarnings( + contrast(emm_1st, method = contrasts_secondary, adjust = "none") + ) + contrast_res <- bind_rows( + # Branch B uses named contrast vectors, not a ref/lvl spec — set NA for both orders + as.data.frame(emm_1st) %>% + mutate(contrast_order = "first_order", + contrast_ref = NA_character_, + contrast_lvl = NA_character_), + as.data.frame(emm_2nd) %>% + mutate(contrast_order = "second_order", + contrast_ref = NA_character_, + contrast_lvl = NA_character_) + ) %>% mutate(gene = gene_name, model_status = model_status) + } else { + contrast_res <- as.data.frame(emm_1st) %>% + mutate(contrast_order = "first_order", + contrast_ref = NA_character_, + contrast_lvl = NA_character_, + gene = gene_name, + model_status = model_status) + } + } + + } else { + contrast_res <- NULL + } + + list( + aic = aic_res, + anova = lme_i_anova_tab, + contrasts = contrast_res, + model_status = setNames(model_status, gene_name) + ) + + }, error = function(e) { + err_msg <- conditionMessage(e) + list( + aic = data.frame(gene = gene_name, AIC = NA_real_), + anova = data.frame( + term = NA_character_, + statistic = NA_real_, + df = NA_real_, + p.value = NA_real_, + gene = gene_name, + model_status = err_msg, + predictor_class = NA_character_, + Estimate_source = NA_character_, + Estimate = NA_real_, + Estimate_SE = NA_real_ + ), + contrasts = data.frame( + contrast = NA_character_, + estimate = NA_real_, + SE = NA_real_, + df = NA_real_, + t.ratio = NA_real_, + p.value = NA_real_, + contrast_order = NA_character_, + contrast_ref = NA_character_, + contrast_lvl = NA_character_, + gene = gene_name, + model_status = err_msg + ), + model_status = setNames(err_msg, gene_name) + ) + }) + + return(result) + } + + +######################################################## +# geneLME_compiler +# Aggregates list of per-gene results into named result tables, +# then appends FDR-adjusted p-values within each grouping unit. +# +# FDR grouping strategy: +# lme_anova: adjust within each model term (across all genes). +# Each term's p-values form one adjustment set. +# lme_contrast: adjust within each contrast x contrast_order combination +# (across all genes). Branch B contrast labels already encode +# the 'by' variable level (e.g. "TrtC vs TrtA, visit = V2"), +# so grouping by contrast alone is sufficient. +# NA p-values (failed gene models) are preserved as NA in p.value_adj. +# singular_fit genes are included in the adjustment set (their p-values +# are valid; users filter on model_status if they wish to exclude them). +######################################################## + +geneLME_compiler <- function(fit, fdr_method = "BH", contrast_spec = NULL) { + + lme_anova <- map_dfr(fit, "anova") %>% + group_by(term) %>% + mutate(p.value_adj = p.adjust(p.value, method = fdr_method)) %>% + relocate(p.value_adj, .after = p.value)%>% + ungroup() + + lme_contrast_raw <- map_dfr(fit, "contrasts") + lme_contrast <- if (ncol(lme_contrast_raw) > 0 && + all(c("contrast", "contrast_order") %in% colnames(lme_contrast_raw))) { + lme_contrast_raw %>% + group_by(contrast, contrast_order) %>% + mutate(p.value_adj = p.adjust(p.value, method = fdr_method)) %>% + relocate(p.value_adj, .after = p.value)%>% + ungroup() + } else { + lme_contrast_raw # no contrasts run; return as-is (empty or NULL-row stub) + } + + list( + lme_anova = lme_anova, + lme_contrast = lme_contrast, + lme_fit = map_dfr(fit, "aic"), + lme_err = map_chr(fit, "model_status"), + contrast_spec = contrast_spec # indexed spec; NULL when no contrasts run + ) +} + + +######################################################## +# geneLME_dispatch +# Runs future_lapply with explicit global declaration to prevent +# future's automatic environment scan from capturing large objects. +# +# Key design decisions: +# 1. Iterate over an integer index — a plain integer sequence carries +# no environment baggage for future to scan. +# 2. All shared objects passed via future.globals, bypassing automatic +# scanning entirely. +# +# Dev2: contrast_spec replaced by contrasts_list + spec_lookup (Branch A) +# in both the function signature and future.globals. +######################################################## + +geneLME_dispatch <- function(gene_data_list, + targets_df, + formula_str, + run_contrast, + contrast_vars, + contrast_var_2_levels, + contrasts_list, + spec_lookup, + contrasts_primary, + contrasts_secondary) { + n_genes <- length(gene_data_list) + + future.apply::future_lapply( + seq_len(n_genes), + FUN = function(i) { + gene_data <- gene_data_list[[i]] + geneLME_fit( + gene_name = gene_data$gene_name, + expression_vec = gene_data$expression_vec, + weight_vec = gene_data$weight_vec, + targets = targets_df, + formula_str = formula_str, + run_contrast = run_contrast, + contrast_vars = contrast_vars, + contrast_var_2_levels = contrast_var_2_levels, + contrasts_list = contrasts_list, + spec_lookup = spec_lookup, + contrasts_primary = contrasts_primary, + contrasts_secondary = contrasts_secondary + ) + }, + future.globals = list( + gene_data_list = gene_data_list, + targets_df = targets_df, + formula_str = formula_str, + run_contrast = run_contrast, + contrast_vars = contrast_vars, + contrast_var_2_levels = contrast_var_2_levels, + contrasts_list = contrasts_list, + spec_lookup = spec_lookup, + contrasts_primary = contrasts_primary, + contrasts_secondary = contrasts_secondary, + geneLME_fit = geneLME_fit + ), + future.packages = c("lme4", "emmeans", "car", "broom.mixed", "dplyr", "tibble"), + future.seed = TRUE + ) +} + + +######################################################## +# geneLME +# User-facing wrapper: validates inputs, sets up parallel plan, +# pre-extracts per-gene data, dispatches geneLME_fit in parallel. +# +# Dev2: calls geneLME_build_contrast_args() for Branch A to pre-compute +# contrasts_list and spec_lookup before the parallel stage. +######################################################## + +geneLME <- + function(dat, + formula_str, + model_weights = NULL, + run_contrast = NULL, + contrast_vars = NULL, + contrast_var_2_levels = NULL, + contrast_spec = NULL, # data.frame(contrast_ref, contrast_lvl) or NULL + # required when contrast_vars contains ":" + contrasts_primary = NULL, + contrasts_secondary = NULL, + fdr_method = "BH", # any method accepted by p.adjust() + n_cores = NULL) { + + # --- PRE-FLIGHT VALIDATION --- + + # Build a local formula object solely for variable checking — never + # passed to workers. Workers reconstruct the formula from formula_str. + formula_obj_local <- as.formula(paste("expression", formula_str)) + + # Initialise objects populated during the contrast validation block below. + indexed_contrast_spec <- NULL + contrasts_list <- NULL # Branch A: pre-computed named contrast vectors + spec_lookup <- NULL # Branch A: pre-computed ref/lvl join table + + # 1. Formula variables present in targets + required_vars <- all.vars(formula_obj_local) + required_vars <- required_vars[required_vars != "expression"] + + missing_vars <- setdiff(required_vars, colnames(dat$targets)) + if (length(missing_vars) > 0) { + stop(paste( + "The following variables in the formula are missing from dat$targets:", + paste(missing_vars, collapse = ", ") + )) + } + + # 2. Weights alignment + if (isTRUE(model_weights)) { + if (is.null(dat$weights)) { + stop("model_weights = TRUE but dat$weights is NULL.") + } + if (!identical(dim(dat$weights), dim(dat$E))) { + stop("Dimensions of dat$weights do not match dat$E.") + } + } + + # 3. Contrast variables and spec + if (isTRUE(run_contrast)) { + + if (is.null(contrast_vars)) { + stop("run_contrast = TRUE but contrast_vars is NULL.") + } + + is_interaction <- any(grepl(":", contrast_vars)) + + # 3a. Formula-vs-contrast_vars consistency check. + # Uses terms() so that both "a*b" and "a:b" formula syntax are handled correctly — + # "a*b" expands to include "a:b" in term labels, so both forms are detected. + formula_terms <- attr(terms(formula_obj_local), "term.labels") + + if (is_interaction) { + # For interaction contrasts, verify the interaction term is actually in the model. + # emmeans will silently run contrasts on additive margins if the interaction is absent, + # which is statistically misleading. + ixn_term <- contrast_vars[grep(":", contrast_vars)][1] # e.g. "treatment:visit" + if (!ixn_term %in% formula_terms) { + stop(paste0( + "contrast_vars specifies an interaction contrast for '", ixn_term, "', ", + "but this interaction term is not present in formula_str.\n", + "Either add the interaction to the formula (e.g. '~ treatment * visit + ...') ", + "or change contrast_vars to a non-interaction term." + )) + } + } else { + # For non-interaction contrasts, verify each contrast variable is in the formula + # as a main effect. Warn (not error) in case it is part of an interaction only. + for (cv in contrast_vars) { + if (!cv %in% formula_terms) { + warning(paste0( + "contrast_vars includes '", cv, "' but this term does not appear as a ", + "main effect in formula_str. Contrasts may be computed from interaction ", + "margins only — verify this is the intended model structure." + )) + } + } + } + + # 3b. contrast_spec validation (required for interaction contrasts) + if (is_interaction && is.null(contrast_spec)) { + stop(paste0( + "contrast_vars specifies an interaction contrast ('", contrast_vars, "') ", + "but contrast_spec is NULL.\n", + "Use geneLME_contrast_spec(dat$targets, contrast_vars = '", contrast_vars, + "') to generate a template, filter it to your contrasts of interest, ", + "then pass it as contrast_spec." + )) + } + + if (!is.null(contrast_spec)) { + if (!is.data.frame(contrast_spec)) { + stop("contrast_spec must be a data.frame.") + } + if (!all(c("contrast_ref", "contrast_lvl") %in% colnames(contrast_spec))) { + stop("contrast_spec must have columns 'contrast_ref' and 'contrast_lvl'.") + } + if (!is_interaction) { + warning("contrast_spec is provided but contrast_vars contains no interaction term (':'). ", + "contrast_spec is designed for interaction contrasts — did you mean to use contrasts_primary?") + } + + # Build the indexed contrast_spec that will be attached to the return value. + indexed_contrast_spec <- contrast_spec %>% + mutate(contrast_index = seq_len(n())) %>% + select(contrast_index, everything()) + + # Print the indexed spec as a reminder whenever contrasts_secondary is provided. + if (is_interaction && !is.null(contrasts_secondary)) { + n_first <- nrow(contrast_spec) + message( + "contrasts_secondary will be applied to the first-order interaction contrasts ", + "in the order they appear in contrast_spec (", n_first, " contrasts):\n", + paste(seq_len(n_first), + paste(contrast_spec$contrast_lvl, contrast_spec$contrast_ref, sep = " - "), + sep = ". ", collapse = "\n"), + "\nEnsure contrasts_secondary vectors have length ", n_first, + ", with each element corresponding to the contrast at that position.", + "\nThe indexed contrast_spec is returned as $contrast_spec in the output." + ) + + # Soft-fail: wrong-length vectors produce silent NAs deep inside geneLME_fit(). + bad_lens <- sapply(contrasts_secondary, length) + bad_names <- names(bad_lens)[bad_lens != n_first] + if (length(bad_names) > 0) { + message( + "\ncontrasts_secondary vector(s) have wrong length — returning early without running models.\n", + "Expected length ", n_first, " (= nrow(contrast_spec) after filtering).\n", + "Offending vector(s):\n", + paste0(" '", bad_names, "': length ", bad_lens[bad_names], collapse = "\n"), "\n\n", + "Common cause: vectors were built against nrow() of the full unfiltered\n", + "spec_template rather than nrow() of the filtered contrast_spec passed here.\n", + "Use $contrast_spec in the returned object to re-specify your secondary contrast vectors:\n", + " each vector must have length ", n_first, ", one element per row of $contrast_spec." + ) + return(invisible(list( + lme_anova = NULL, + lme_contrast = NULL, + lme_fit = NULL, + lme_err = NULL, + contrast_spec = indexed_contrast_spec + ))) + } + } + + # --- PRE-COMPUTE BRANCH A CONTRAST STRUCTURES --- + # Build contrasts_list and spec_lookup once here, before parallel dispatch. + # Workers receive these ready-made objects instead of rebuilding from + # contrast_spec on every gene — eliminates nrow(contrast_spec) × n_genes + # iterations of R-level vector construction. + if (is_interaction) { + contrast_args <- geneLME_build_contrast_args(dat$targets, contrast_vars, contrast_spec) + contrasts_list <- contrast_args$contrasts_list + spec_lookup <- contrast_args$spec_lookup + } + } + + # 3c. contrast_vars present in targets (split on ":" for interaction) + vars_to_check <- unique(unlist(strsplit(contrast_vars, ":"))) + missing_contrast_vars <- setdiff(vars_to_check, colnames(dat$targets)) + if (length(missing_contrast_vars) > 0) { + stop(paste( + "contrast_vars not found in dat$targets:", + paste(missing_contrast_vars, collapse = ", ") + )) + } + + # 3d. contrast_var_2_levels validation (Branch B only) + if (!is_interaction && length(contrast_vars) == 2 && !is.null(contrast_var_2_levels)) { + actual_levels <- unique(as.character(dat$targets[[contrast_vars[2]]])) + invalid_levels <- setdiff(contrast_var_2_levels, actual_levels) + if (length(invalid_levels) > 0) { + stop(paste0( + "Levels specified for '", contrast_vars[2], + "' not found in data: ", paste(invalid_levels, collapse = ", ") + )) + } + } + + # 3e. Branch B: build indexed contrast_spec from contrasts_primary names, + # and soft-fail on wrong-length contrasts_secondary vectors. + if (!is_interaction && !is.null(contrasts_primary)) { + indexed_contrast_spec <- data.frame( + contrast_index = seq_along(contrasts_primary), + contrast_name = names(contrasts_primary), + stringsAsFactors = FALSE + ) + + if (!is.null(contrasts_secondary)) { + n_primary <- length(contrasts_primary) + bad_lens <- sapply(contrasts_secondary, length) + bad_names <- names(bad_lens)[bad_lens != n_primary] + if (length(bad_names) > 0) { + message( + "\ncontrasts_secondary vector(s) have wrong length — returning early without running models.\n", + "Expected length ", n_primary, " (= number of contrasts_primary vectors).\n", + "Offending vector(s):\n", + paste0(" '", bad_names, "': length ", bad_lens[bad_names], collapse = "\n"), "\n\n", + "Use $contrast_spec in the returned object to confirm the primary contrast ordering:\n", + " each contrasts_secondary vector must have length ", n_primary, + ", one element per row of $contrast_spec." + ) + return(invisible(list( + lme_anova = NULL, + lme_contrast = NULL, + lme_fit = NULL, + lme_err = NULL, + contrast_spec = indexed_contrast_spec + ))) + } + } + } + } + + # 4. fdr_method must be a recognised p.adjust method + valid_fdr_methods <- p.adjust.methods # exported character vector from stats + if (!fdr_method %in% valid_fdr_methods) { + stop(paste0( + "'", fdr_method, "' is not a valid p.adjust method.\n", + "Choose one of: ", paste(valid_fdr_methods, collapse = ", ") + )) + } + + # 5. Warn on likely ID columns used as predictors + for (v in required_vars) { + if (is.character(dat$targets[[v]]) && + length(unique(dat$targets[[v]])) > nrow(dat$targets) / 2) { + warning(paste0( + "Variable '", v, "' has many unique string values — is it an ID column rather than a predictor?" + )) + } + } + + message("Input validation passed. Launching parallel LME fits...") + + # --- PRE-SLICE INTO PER-GENE LIST --- + W_mat <- if (isTRUE(model_weights)) dat$weights else NULL + + gene_data_list <- lapply(rownames(dat$E), function(g) { + list( + gene_name = g, + expression_vec = dat$E[g, ], + weight_vec = if (!is.null(W_mat)) W_mat[g, ] else NULL + ) + }) + + targets_df <- dat$targets + + # --- PARALLEL PLAN --- + workers <- if (is.null(n_cores)) max(1L, parallel::detectCores() - 4L) else n_cores + future::plan(future::multisession, workers = workers) + on.exit(future::plan(future::sequential), add = TRUE) + + # --- DISPATCH via clean-scope helper --- + fit_results <- geneLME_dispatch( + gene_data_list = gene_data_list, + targets_df = targets_df, + formula_str = formula_str, + run_contrast = run_contrast, + contrast_vars = contrast_vars, + contrast_var_2_levels = contrast_var_2_levels, + contrasts_list = contrasts_list, + spec_lookup = spec_lookup, + contrasts_primary = contrasts_primary, + contrasts_secondary = contrasts_secondary + ) + + return(geneLME_compiler(fit_results, fdr_method = fdr_method, + contrast_spec = indexed_contrast_spec)) + } + + +######################################################## +# Testing (see geneLME_test.R for full test suite) +######################################################## + +# --- Branch A: interaction contrast via contrast_spec --- +# spec_template <- geneLME_contrast_spec( +# targets = dat$targets, +# contrast_vars = "treatment:visit" +# ) +# my_spec <- spec_template %>% dplyr::filter(...) +# +# test_mods_A <- +# geneLME( +# dat = dat_sub, +# formula_str = "~ treatment * visit + age + sex + rNANgUl + (1|ptID)", +# model_weights = TRUE, +# run_contrast = TRUE, +# contrast_vars = "treatment:visit", +# contrast_spec = my_spec, +# n_cores = 10 +# ) + +# --- Branch B: non-interaction --- +# test_mods_B <- +# geneLME( +# dat = dat_sub, +# formula_str = "~ treatment + visit + age + sex + rNANgUl + (1|ptID)", +# model_weights = TRUE, +# run_contrast = TRUE, +# contrast_vars = c("treatment", "visit"), +# contrast_var_2_levels = c("V2", "V3"), +# contrasts_primary = list("TrtC vs TrtA" = c(-1, 0, 1), +# "TrtB vs TrtA" = c(-1, 1, 0)), +# contrasts_secondary = list("TrtC vs TrtB" = c(1, -1)), +# n_cores = 10 +# ) diff --git a/R_functions/geneLME_function_overview.html b/R_functions/geneLME_function_overview.html new file mode 100644 index 0000000..d76f606 --- /dev/null +++ b/R_functions/geneLME_function_overview.html @@ -0,0 +1,611 @@ + + + + + + + geneLME — Function Overview + + + + +
+

geneLME — Function Overview

+
+ +

geneLME — Function Overview

+
+

Companion to: geneLME.R Last +updated: 2026-02-20 For session continuity notes see +CLAUDE_NOTES_geneLME.md. For a worked tutorial with output +see geneLME_tutorial.html (knitted from +geneLME_tutorial.Rmd).

+
+
+

Function Map

+ +++++ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
FunctionRoleCalled by
geneLME_contrast_spec()Pre-run helper: inspect available contrast levels and understand how +to specify argumentsUser
geneLME_build_contrast_args()Branch A helper: pre-computes contrasts_list + +spec_lookup once before parallel dispatchgeneLME()
geneLME()User-facing entry point: validates inputs, slices data, sets +parallel planUser
geneLME_dispatch()Launches future_lapply with explicit globals; clean +parallel scopegeneLME()
geneLME_fit()Per-gene worker: fits lmer, extracts ANOVA + +contrastsgeneLME_dispatch()
geneLME_compiler()Binds per-gene list results into four output data framesgeneLME()
+
+

geneLME_contrast_spec(targets, contrast_vars)

+

A pre-run helper to enumerate available levels and understand how to +construct contrast arguments before calling geneLME(). Two +modes:

+

Interaction mode

+

contrast_vars is a single "var_a:var_b" +string. Returns a data.frame(contrast_ref, contrast_lvl) +with all pairwise combinations of interaction cells. The user filters +this to their contrasts of interest and passes it directly as +contrast_spec to geneLME().

+
spec_template <- geneLME_contrast_spec(dat$targets, contrast_vars = "treatment:visit")
+# → data.frame with 66 rows (all pairs of 12 interaction cells)
+#    contrast_ref contrast_lvl
+# 1      TrtA V1      TrtB V1
+# 2      TrtA V1      TrtC V1
+# ...
+
+my_spec <- spec_template %>%
+  filter(sub(".* ", "", contrast_ref) == sub(".* ", "", contrast_lvl),  # same visit
+         sub(".* ", "", contrast_ref) %in% c("V2", "V3"))               # only V2, V3
+# → 6 rows: the 3 cross-treatment pairs at V2, and 3 at V3
+

Single / multi-variable mode

+

contrast_vars is a character vector of one or more plain +variable names (no :). Returns a named +list, one data.frame(level) per variable. A +message for each variable explains its role in subsequent +geneLME() arguments. This list is a reference only — not +passed to geneLME().

+
ref <- geneLME_contrast_spec(dat$targets, contrast_vars = c("treatment", "visit"))
+
+# Message for treatment (position 1 — primary):
+# 'treatment' — primary contrast variable (contrast_vars[1] in geneLME).
+# 3 levels (alphabetical order = position order for contrasts_primary vectors):
+# 1. TrtA   2. TrtB   3. TrtC
+# → contrast vectors passed to contrasts_primary must have length 3
+#   e.g. 'TrtC vs TrtA' = c(-1, 0, 1)
+
+# Message for visit (position 2 — secondary 'by' variable):
+# 'visit' — secondary 'by' grouping variable (contrast_vars[2] in geneLME).
+# 4 levels: V1, V2, V3, V4
+# → pass a subset to contrast_var_2_levels in geneLME() to restrict
+#   which groups the primary contrasts are computed within.
+
+ref$treatment   # data.frame: TrtA, TrtB, TrtC
+ref$visit       # data.frame: V1, V2, V3, V4
+

Note: mixing interaction and plain variable names in +one call raises an error — call separately.

+
+

geneLME() — Main +Function

+

Signature

+
geneLME(dat, formula_str,
+        model_weights         = NULL,
+        run_contrast          = NULL,
+        contrast_vars         = NULL,
+        contrast_var_2_levels = NULL,
+        contrast_spec         = NULL,
+        contrasts_primary     = NULL,
+        contrasts_secondary   = NULL,
+        fdr_method            = "BH",   # any method accepted by p.adjust()
+        n_cores               = NULL)
+

Input

+

dat — EList-like list with three elements:

+ +++++ + + + + + + + + + + + + + + + + + + + + + + + + +
ElementTypeDescription
dat$Ematrix (genes × samples)log2 expression values
dat$weightsmatrix (genes × samples), optionalvoom precision weights
dat$targetsdata.frame (samples × covariates)sample metadata; all formula and contrast variables must be columns +here
+

Output

+

Named list of five elements:

+ ++++ + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
ElementContents
lme_anovaOne row per model term per gene: term, +statistic, df, p.value, +p.value_adj (FDR within term), gene, +model_status, predictor_class, +Estimate, Estimate_SE
lme_contrastOne row per contrast per gene: contrast, +contrast_ref, contrast_lvl (Branch A +first-order only; NA otherwise), estimate, SE, +df, t.ratio, p.value, +p.value_adj (FDR within contrast × order), +contrast_order ("first_order" / +"second_order"), gene. Branch B also has a +visit column.
lme_fitOne row per gene: gene, AIC
lme_errNamed character vector: gene → "success", +"singular_fit", or unexpected error message. A +model_status column carrying the same value also appears in +lme_anova and lme_contrast.
contrast_specIndexed copy of the filtered contrast_spec (Branch A) +or data.frame(contrast_index, contrast_name) from +contrasts_primary names (Branch B); NULL when +no contrasts run. On soft-fail (wrong-length +contrasts_secondary), all other elements are NULL and only +this is returned.
+

Validation Checks +(pre-flight)

+
    +
  1. All formula variables present in dat$targets
  2. +
  3. dat$weights present and dimension-matched if +model_weights = TRUE
  4. +
  5. contrast_vars not NULL when +run_contrast = TRUE
  6. +
  7. Interaction term present in formula if +contrast_vars contains : (hard error — +emmeans would otherwise silently use additive margins, +which is statistically misleading)
  8. +
  9. contrast_spec required (not NULL) when +contrast_vars contains :
  10. +
  11. contrast_spec has correct columns +(contrast_ref, contrast_lvl)
  12. +
  13. All contrast_vars component variables present in +dat$targets
  14. +
  15. contrast_var_2_levels values valid against actual data +levels
  16. +
  17. fdr_method validated against +p.adjust.methods
  18. +
  19. contrasts_secondary vector lengths +(soft-fail, not hard stop): each vector must have length == +nrow(contrast_spec) (Branch A) or +length(contrasts_primary) (Branch B). On mismatch returns +early with only $contrast_spec populated.
  20. +
  21. Warn on likely ID columns used as predictors
  22. +
+
+

Branch A — Interaction +Contrasts

+

When to use: model formula contains an interaction +term (treatment * visit or treatment:visit) +and you want specific pairwise contrasts across particular interaction +cells.

+

Step 1 — +Generate and filter the contrast template

+
spec_template <- geneLME_contrast_spec(dat$targets, "treatment:visit")
+# Returns data.frame of all 66 pairwise interaction cell combinations
+
+my_spec <- spec_template %>%
+  filter(sub(".* ", "", contrast_ref) == sub(".* ", "", contrast_lvl),  # same visit only
+         sub(".* ", "", contrast_ref) %in% c("V2", "V3"))               # visits V2 and V3 only
+# 6 rows
+

Step 2 — Run

+
result_A <- geneLME(
+  dat           = dat,
+  formula_str   = "~ treatment * visit + age + sex + rNANgUl + (1|ptID)",
+  model_weights = TRUE,
+  run_contrast  = TRUE,
+  contrast_vars = "treatment:visit",   # single ":" string → triggers Branch A
+  contrast_spec = my_spec,             # required; each row defines one contrast
+  n_cores       = 10
+)
+

Sample +lme_contrast output (6 contrasts × n genes rows)

+

contrast_ref = the −1 cell; contrast_lvl = +the +1 cell. Together they make the sign of estimate +unambiguous without parsing the contrast label string.

+
            contrast   contrast_ref  contrast_lvl  estimate    SE      df   t.ratio  p.value  contrast_order  gene
+TrtB V2 - TrtA V2    TrtA V2        TrtB V2        -0.040    0.960  93.04  -0.042    0.967    first_order     gene01
+TrtC V2 - TrtA V2    TrtA V2        TrtC V2         0.092    0.942  97.37   0.098    0.922    first_order     gene01
+TrtC V2 - TrtB V2    TrtB V2        TrtC V2         0.132    0.993  89.10   0.133    0.894    first_order     gene01
+TrtB V3 - TrtA V3    TrtA V3        TrtB V3         0.191    0.934  98.51   0.205    0.838    first_order     gene01
+TrtC V3 - TrtA V3    TrtA V3        TrtC V3         1.625    0.945  90.82   1.721    0.089    first_order     gene01
+TrtC V3 - TrtB V3    TrtB V3        TrtC V3         1.434    0.958  90.33   1.497    0.138    first_order     gene01
+
+

Branch A with Second-Order +Contrasts

+

At validation time, geneLME() prints the numbered +first-order list (in contrast_spec row order) to guide +vector construction. Secondary vectors must have length +nrow(contrast_spec).

+
result_A2 <- geneLME(
+  dat                 = dat,
+  formula_str         = "~ treatment * visit + age + sex + rNANgUl + (1|ptID)",
+  model_weights       = TRUE,
+  run_contrast        = TRUE,
+  contrast_vars       = "treatment:visit",
+  contrast_spec       = my_spec,          # 6 rows, order defines vector positions
+  contrasts_secondary = list(
+    # my_spec rows: 1=TrtB V2-TrtA V2, 2=TrtC V2-TrtA V2, 3=TrtC V2-TrtB V2,
+    #               4=TrtB V3-TrtA V3, 5=TrtC V3-TrtA V3, 6=TrtC V3-TrtB V3
+    "TrtA vs TrtB: V3 vs V2" = c(1, 0, -1, 0, 0, 0),   # row1 − row3: V2 vs V3 delta
+    "TrtA vs TrtC: V3 vs V2" = c(0, 1,  0,-1, 0, 0)    # row2 − row4
+  ),
+  n_cores             = 10
+)
+# lme_contrast: first_order (6 × n_genes) + second_order (2 × n_genes) rows
+

Validation message printed:

+
contrasts_secondary will be applied to the first-order interaction contrasts
+in the order they appear in contrast_spec (6 contrasts):
+1. TrtB V2 - TrtA V2
+2. TrtC V2 - TrtA V2
+3. TrtC V2 - TrtB V2
+4. TrtB V3 - TrtA V3
+5. TrtC V3 - TrtA V3
+6. TrtC V3 - TrtB V3
+Ensure contrasts_secondary vectors have length 6, with each element
+corresponding to the contrast at that position.
+
+

Branch B — Non-Interaction +Contrasts

+

When to use: model is additive (no interaction), and +you want named contrasts on the marginal means of one variable, +optionally evaluated within specific levels of a second variable.

+

Step 1 — Inspect levels

+
ref <- geneLME_contrast_spec(dat$targets, contrast_vars = c("treatment", "visit"))
+# treatment levels (alphabetical): TrtA[1], TrtB[2], TrtC[3]
+# → contrasts_primary vectors must have length 3, positions = [TrtA, TrtB, TrtC]
+# visit levels: V1, V2, V3, V4 → pass subset to contrast_var_2_levels
+

Step 2 — Run

+
result_B <- geneLME(
+  dat                   = dat,
+  formula_str           = "~ treatment + visit + age + sex + rNANgUl + (1|ptID)",
+  model_weights         = TRUE,
+  run_contrast          = TRUE,
+  contrast_vars         = c("treatment", "visit"), # [1] = primary; [2] = by-variable
+  contrast_var_2_levels = c("V2", "V3"),           # restrict output to these visit levels
+  contrasts_primary     = list(
+    "TrtC vs TrtA" = c(-1, 0, 1),  # positions: [TrtA=−1, TrtB=0, TrtC=+1]
+    "TrtB vs TrtA" = c(-1, 1, 0)
+  ),
+  contrasts_secondary   = list(
+    "TrtC vs TrtB" = c(1, -1)      # length = number of primary contrasts (2)
+  ),
+  n_cores               = 10
+)
+

Sample +lme_contrast output (per gene: 2 primary × 2 visits + 1 +secondary × 2 visits = 6 rows)

+
       contrast  visit   estimate    SE       df   t.ratio  p.value  contrast_order  gene
+TrtC vs TrtA     V2      0.814    0.457  101.6   1.782     0.078    first_order     gene01
+TrtB vs TrtA     V2      0.009    0.457  102.3   0.019     0.984    first_order     gene01
+TrtC vs TrtA     V3      0.814    0.457  101.6   1.782     0.078    first_order     gene01
+TrtB vs TrtA     V3      0.009    0.457  102.3   0.019     0.984    first_order     gene01
+TrtC vs TrtB     V2      0.805    0.462   99.1   1.742     0.085    second_order    gene01
+TrtC vs TrtB     V3      0.805    0.462   99.1   1.742     0.085    second_order    gene01
+
+

Note: In an additive model, estimates are identical +across V2 and V3 — contrast_var_2_levels restricts +which visit cells emmeans evaluates at, but since +there is no interaction the treatment effect is constant across visits. +This filter is most useful to control the number of output rows and +limit multiple testing burden.

+
+
+

lme_anova Output +Structure

+

One row per model term per gene. Key columns:

+ ++++ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
ColumnDescription
termPredictor name (e.g. "treatment", +"treatment:visit", "age")
statisticChi-square statistic from car::Anova()
dfDegrees of freedom
p.valuep-value
geneGene identifier
model_status"success" or error message
predictor_class"continuous", "two-level-categorical", +"multi-level-categorical", or +"interaction"
Estimate_source"lme_summary" (value in +Estimate/Estimate_SE) or +"seeContrasts" (NA — use contrast output for estimate)
EstimateCoefficient estimate where available; NA for multi-level or complex +interactions
Estimate_SEStandard error of estimate where available
+
+

Error Handling and Singular +Fits

+

Per-gene outcomes are tracked via a model_status column +in both lme_anova and lme_contrast, and +summarised in lme_err (named character vector, gene → +status). Possible values:

+ +

Note: singular fits were previously treated as a hard +stop() that excluded the gene entirely. They are now +flagged and returned, giving users the choice to retain or filter.

+
table(result$lme_err)
+non_success <- names(result$lme_err)[result$lme_err != "success"]
+
+# Filter to clean fits only downstream:
+result$lme_contrast %>% filter(model_status == "success")
+ + diff --git a/R_functions/geneLME_function_overview.md b/R_functions/geneLME_function_overview.md new file mode 100644 index 0000000..3307b04 --- /dev/null +++ b/R_functions/geneLME_function_overview.md @@ -0,0 +1,299 @@ +# geneLME — Function Overview + +> **Companion to:** `geneLME.R` +> **Last updated:** 2026-02-20 +> For session continuity notes see `CLAUDE_NOTES_geneLME.md`. +> For a worked tutorial with output see `geneLME_tutorial.html` (knitted from `geneLME_tutorial.Rmd`). + +--- + +## Function Map + +| Function | Role | Called by | +|---|---|---| +| `geneLME_contrast_spec()` | Pre-run helper: inspect available contrast levels and understand how to specify arguments | User | +| `geneLME_build_contrast_args()` | Branch A helper: pre-computes `contrasts_list` + `spec_lookup` once before parallel dispatch | `geneLME()` | +| `geneLME()` | User-facing entry point: validates inputs, slices data, sets parallel plan | User | +| `geneLME_dispatch()` | Launches `future_lapply` with explicit globals; clean parallel scope | `geneLME()` | +| `geneLME_fit()` | Per-gene worker: fits `lmer`, extracts ANOVA + contrasts | `geneLME_dispatch()` | +| `geneLME_compiler()` | Binds per-gene list results into four output data frames | `geneLME()` | + +--- + +## `geneLME_contrast_spec(targets, contrast_vars)` + +A pre-run helper to enumerate available levels and understand how to construct contrast arguments before calling `geneLME()`. Two modes: + +### Interaction mode + +`contrast_vars` is a single `"var_a:var_b"` string. Returns a `data.frame(contrast_ref, contrast_lvl)` with all pairwise combinations of interaction cells. The user filters this to their contrasts of interest and passes it directly as `contrast_spec` to `geneLME()`. + +```r +spec_template <- geneLME_contrast_spec(dat$targets, contrast_vars = "treatment:visit") +# → data.frame with 66 rows (all pairs of 12 interaction cells) +# contrast_ref contrast_lvl +# 1 TrtA V1 TrtB V1 +# 2 TrtA V1 TrtC V1 +# ... + +my_spec <- spec_template %>% + filter(sub(".* ", "", contrast_ref) == sub(".* ", "", contrast_lvl), # same visit + sub(".* ", "", contrast_ref) %in% c("V2", "V3")) # only V2, V3 +# → 6 rows: the 3 cross-treatment pairs at V2, and 3 at V3 +``` + +### Single / multi-variable mode + +`contrast_vars` is a character vector of one or more plain variable names (no `:`). Returns a **named list**, one `data.frame(level)` per variable. A message for each variable explains its role in subsequent `geneLME()` arguments. This list is a reference only — not passed to `geneLME()`. + +```r +ref <- geneLME_contrast_spec(dat$targets, contrast_vars = c("treatment", "visit")) + +# Message for treatment (position 1 — primary): +# 'treatment' — primary contrast variable (contrast_vars[1] in geneLME). +# 3 levels (alphabetical order = position order for contrasts_primary vectors): +# 1. TrtA 2. TrtB 3. TrtC +# → contrast vectors passed to contrasts_primary must have length 3 +# e.g. 'TrtC vs TrtA' = c(-1, 0, 1) + +# Message for visit (position 2 — secondary 'by' variable): +# 'visit' — secondary 'by' grouping variable (contrast_vars[2] in geneLME). +# 4 levels: V1, V2, V3, V4 +# → pass a subset to contrast_var_2_levels in geneLME() to restrict +# which groups the primary contrasts are computed within. + +ref$treatment # data.frame: TrtA, TrtB, TrtC +ref$visit # data.frame: V1, V2, V3, V4 +``` + +**Note:** mixing interaction and plain variable names in one call raises an error — call separately. + +--- + +## `geneLME()` — Main Function + +### Signature + +```r +geneLME(dat, formula_str, + model_weights = NULL, + run_contrast = NULL, + contrast_vars = NULL, + contrast_var_2_levels = NULL, + contrast_spec = NULL, + contrasts_primary = NULL, + contrasts_secondary = NULL, + fdr_method = "BH", # any method accepted by p.adjust() + n_cores = NULL) +``` + +### Input + +`dat` — EList-like list with three elements: + +| Element | Type | Description | +|---|---|---| +| `dat$E` | matrix (genes × samples) | log2 expression values | +| `dat$weights` | matrix (genes × samples), optional | voom precision weights | +| `dat$targets` | data.frame (samples × covariates) | sample metadata; all formula and contrast variables must be columns here | + +### Output + +Named list of five elements: + +| Element | Contents | +|---|---| +| `lme_anova` | One row per model term per gene: `term`, `statistic`, `df`, `p.value`, `p.value_adj` (FDR within term), `gene`, `model_status`, `predictor_class`, `Estimate`, `Estimate_SE` | +| `lme_contrast` | One row per contrast per gene: `contrast`, `contrast_ref`, `contrast_lvl` (Branch A first-order only; NA otherwise), `estimate`, `SE`, `df`, `t.ratio`, `p.value`, `p.value_adj` (FDR within contrast × order), `contrast_order` (`"first_order"` / `"second_order"`), `gene`. Branch B also has a `visit` column. | +| `lme_fit` | One row per gene: `gene`, `AIC` | +| `lme_err` | Named character vector: gene → `"success"`, `"singular_fit"`, or unexpected error message. A `model_status` column carrying the same value also appears in `lme_anova` and `lme_contrast`. | +| `contrast_spec` | Indexed copy of the filtered `contrast_spec` (Branch A) or `data.frame(contrast_index, contrast_name)` from `contrasts_primary` names (Branch B); `NULL` when no contrasts run. On soft-fail (wrong-length `contrasts_secondary`), all other elements are NULL and only this is returned. | + +### Validation Checks (pre-flight) + +1. All formula variables present in `dat$targets` +2. `dat$weights` present and dimension-matched if `model_weights = TRUE` +3. `contrast_vars` not NULL when `run_contrast = TRUE` +4. **Interaction term present in formula** if `contrast_vars` contains `:` (hard error — `emmeans` would otherwise silently use additive margins, which is statistically misleading) +5. `contrast_spec` required (not NULL) when `contrast_vars` contains `:` +6. `contrast_spec` has correct columns (`contrast_ref`, `contrast_lvl`) +7. All `contrast_vars` component variables present in `dat$targets` +8. `contrast_var_2_levels` values valid against actual data levels +9. `fdr_method` validated against `p.adjust.methods` +10. **`contrasts_secondary` vector lengths** (soft-fail, not hard stop): each vector must have length == `nrow(contrast_spec)` (Branch A) or `length(contrasts_primary)` (Branch B). On mismatch returns early with only `$contrast_spec` populated. +11. Warn on likely ID columns used as predictors + +--- + +## Branch A — Interaction Contrasts + +**When to use:** model formula contains an interaction term (`treatment * visit` or `treatment:visit`) and you want specific pairwise contrasts across particular interaction cells. + +### Step 1 — Generate and filter the contrast template + +```r +spec_template <- geneLME_contrast_spec(dat$targets, "treatment:visit") +# Returns data.frame of all 66 pairwise interaction cell combinations + +my_spec <- spec_template %>% + filter(sub(".* ", "", contrast_ref) == sub(".* ", "", contrast_lvl), # same visit only + sub(".* ", "", contrast_ref) %in% c("V2", "V3")) # visits V2 and V3 only +# 6 rows +``` + +### Step 2 — Run + +```r +result_A <- geneLME( + dat = dat, + formula_str = "~ treatment * visit + age + sex + rNANgUl + (1|ptID)", + model_weights = TRUE, + run_contrast = TRUE, + contrast_vars = "treatment:visit", # single ":" string → triggers Branch A + contrast_spec = my_spec, # required; each row defines one contrast + n_cores = 10 +) +``` + +### Sample `lme_contrast` output (6 contrasts × n genes rows) + +`contrast_ref` = the −1 cell; `contrast_lvl` = the +1 cell. Together they make the sign of +`estimate` unambiguous without parsing the contrast label string. + +``` + contrast contrast_ref contrast_lvl estimate SE df t.ratio p.value contrast_order gene +TrtB V2 - TrtA V2 TrtA V2 TrtB V2 -0.040 0.960 93.04 -0.042 0.967 first_order gene01 +TrtC V2 - TrtA V2 TrtA V2 TrtC V2 0.092 0.942 97.37 0.098 0.922 first_order gene01 +TrtC V2 - TrtB V2 TrtB V2 TrtC V2 0.132 0.993 89.10 0.133 0.894 first_order gene01 +TrtB V3 - TrtA V3 TrtA V3 TrtB V3 0.191 0.934 98.51 0.205 0.838 first_order gene01 +TrtC V3 - TrtA V3 TrtA V3 TrtC V3 1.625 0.945 90.82 1.721 0.089 first_order gene01 +TrtC V3 - TrtB V3 TrtB V3 TrtC V3 1.434 0.958 90.33 1.497 0.138 first_order gene01 +``` + +--- + +## Branch A with Second-Order Contrasts + +At validation time, `geneLME()` prints the numbered first-order list (in `contrast_spec` row order) to guide vector construction. Secondary vectors must have length `nrow(contrast_spec)`. + +```r +result_A2 <- geneLME( + dat = dat, + formula_str = "~ treatment * visit + age + sex + rNANgUl + (1|ptID)", + model_weights = TRUE, + run_contrast = TRUE, + contrast_vars = "treatment:visit", + contrast_spec = my_spec, # 6 rows, order defines vector positions + contrasts_secondary = list( + # my_spec rows: 1=TrtB V2-TrtA V2, 2=TrtC V2-TrtA V2, 3=TrtC V2-TrtB V2, + # 4=TrtB V3-TrtA V3, 5=TrtC V3-TrtA V3, 6=TrtC V3-TrtB V3 + "TrtA vs TrtB: V3 vs V2" = c(1, 0, -1, 0, 0, 0), # row1 − row3: V2 vs V3 delta + "TrtA vs TrtC: V3 vs V2" = c(0, 1, 0,-1, 0, 0) # row2 − row4 + ), + n_cores = 10 +) +# lme_contrast: first_order (6 × n_genes) + second_order (2 × n_genes) rows +``` + +**Validation message printed:** +``` +contrasts_secondary will be applied to the first-order interaction contrasts +in the order they appear in contrast_spec (6 contrasts): +1. TrtB V2 - TrtA V2 +2. TrtC V2 - TrtA V2 +3. TrtC V2 - TrtB V2 +4. TrtB V3 - TrtA V3 +5. TrtC V3 - TrtA V3 +6. TrtC V3 - TrtB V3 +Ensure contrasts_secondary vectors have length 6, with each element +corresponding to the contrast at that position. +``` + +--- + +## Branch B — Non-Interaction Contrasts + +**When to use:** model is additive (no interaction), and you want named contrasts on the marginal means of one variable, optionally evaluated within specific levels of a second variable. + +### Step 1 — Inspect levels + +```r +ref <- geneLME_contrast_spec(dat$targets, contrast_vars = c("treatment", "visit")) +# treatment levels (alphabetical): TrtA[1], TrtB[2], TrtC[3] +# → contrasts_primary vectors must have length 3, positions = [TrtA, TrtB, TrtC] +# visit levels: V1, V2, V3, V4 → pass subset to contrast_var_2_levels +``` + +### Step 2 — Run + +```r +result_B <- geneLME( + dat = dat, + formula_str = "~ treatment + visit + age + sex + rNANgUl + (1|ptID)", + model_weights = TRUE, + run_contrast = TRUE, + contrast_vars = c("treatment", "visit"), # [1] = primary; [2] = by-variable + contrast_var_2_levels = c("V2", "V3"), # restrict output to these visit levels + contrasts_primary = list( + "TrtC vs TrtA" = c(-1, 0, 1), # positions: [TrtA=−1, TrtB=0, TrtC=+1] + "TrtB vs TrtA" = c(-1, 1, 0) + ), + contrasts_secondary = list( + "TrtC vs TrtB" = c(1, -1) # length = number of primary contrasts (2) + ), + n_cores = 10 +) +``` + +### Sample `lme_contrast` output (per gene: 2 primary × 2 visits + 1 secondary × 2 visits = 6 rows) + +``` + contrast visit estimate SE df t.ratio p.value contrast_order gene +TrtC vs TrtA V2 0.814 0.457 101.6 1.782 0.078 first_order gene01 +TrtB vs TrtA V2 0.009 0.457 102.3 0.019 0.984 first_order gene01 +TrtC vs TrtA V3 0.814 0.457 101.6 1.782 0.078 first_order gene01 +TrtB vs TrtA V3 0.009 0.457 102.3 0.019 0.984 first_order gene01 +TrtC vs TrtB V2 0.805 0.462 99.1 1.742 0.085 second_order gene01 +TrtC vs TrtB V3 0.805 0.462 99.1 1.742 0.085 second_order gene01 +``` + +> **Note:** In an additive model, estimates are identical across V2 and V3 — `contrast_var_2_levels` restricts *which* visit cells `emmeans` evaluates at, but since there is no interaction the treatment effect is constant across visits. This filter is most useful to control the number of output rows and limit multiple testing burden. + +--- + +## `lme_anova` Output Structure + +One row per model term per gene. Key columns: + +| Column | Description | +|---|---| +| `term` | Predictor name (e.g. `"treatment"`, `"treatment:visit"`, `"age"`) | +| `statistic` | Chi-square statistic from `car::Anova()` | +| `df` | Degrees of freedom | +| `p.value` | p-value | +| `gene` | Gene identifier | +| `model_status` | `"success"` or error message | +| `predictor_class` | `"continuous"`, `"two-level-categorical"`, `"multi-level-categorical"`, or `"interaction"` | +| `Estimate_source` | `"lme_summary"` (value in `Estimate`/`Estimate_SE`) or `"seeContrasts"` (NA — use contrast output for estimate) | +| `Estimate` | Coefficient estimate where available; NA for multi-level or complex interactions | +| `Estimate_SE` | Standard error of estimate where available | + +--- + +## Error Handling and Singular Fits + +Per-gene outcomes are tracked via a `model_status` column in both `lme_anova` and `lme_contrast`, and summarised in `lme_err` (named character vector, gene → status). Possible values: + +- **`"success"`** — model converged cleanly; all results are reliable. +- **`"singular_fit"`** — `isSingular()` was `TRUE`; random effect variance hit its boundary (zero). Fixed effect estimates and contrasts are still returned and are numerically valid. Common with small N or highly collinear random/fixed effects. Filter downstream if desired. +- **Unexpected error string** — `tryCatch` caught an unexpected error; rows for that gene contain NAs. + +Note: singular fits were previously treated as a hard `stop()` that excluded the gene entirely. They are now flagged and returned, giving users the choice to retain or filter. + +```r +table(result$lme_err) +non_success <- names(result$lme_err)[result$lme_err != "success"] + +# Filter to clean fits only downstream: +result$lme_contrast %>% filter(model_status == "success") +``` diff --git a/R_functions/geneLME_test.R b/R_functions/geneLME_test.R new file mode 100644 index 0000000..7786d78 --- /dev/null +++ b/R_functions/geneLME_test.R @@ -0,0 +1,419 @@ +######################################################## +# geneLME_test.R +# Mock data and test cases for geneLME functions. +# Set SOURCE_FILE below to switch between implementations. +# +# Design: paired treatment x visit study +# - 10 patients (ptID), each observed at 4 visits x 3 treatments +# - 3 treatments: "TrtA", "TrtB", "TrtC" +# - 4 visits: "V1", "V2", "V3", "V4" +# - 120 total samples (10 * 4 * 3) +# - 50 genes +# - Simulated voom weights matrix (same dims as E) +# +# Covariates included in targets: +# Patient-level (same value repeated across all samples for a given patient): +# sex — factor: "M" / "F" +# age — continuous: age at enrollment (years) +# Sample-level technical (vary per sample): +# rNANgUl — RNA concentration (continuous) +# percent_duplication — library duplication rate (continuous, 0–1) +# median_cv_coverage — coverage uniformity metric (continuous) +# lib.size — total mapped reads (continuous, large integer scale) +# norm.factors — TMM normalization factors (continuous, near 1) +######################################################## + +library(lme4) +library(emmeans) +library(car) +library(broom.mixed) +library(dplyr) +library(tibble) +library(purrr) +library(future) +library(future.apply) + +# Source implementation to test — change this line to switch versions: +# "geneLME_dev.R" — previous dev (singular fit = error, per-gene contrast build) +# "geneLME_dev2.R" — current dev (singular fit = flag, pre-computed contrasts) +# "geneLME.R" — stable merged version +SOURCE_FILE <- "geneLME.R" +source(SOURCE_FILE) +cat("Sourced:", SOURCE_FILE, "\n") + + +######################################################## +# 1. SIMULATE MOCK EList +######################################################## + +set.seed(42) + +n_patients <- 10 +treatments <- c("TrtA", "TrtB", "TrtC") +visits <- c("V1", "V2", "V3", "V4") +n_genes <- 50 + +# --- Patient-level attributes (one row per patient, then joined in) --- +# These are fixed characteristics that do not vary across visits or treatments +patient_meta <- data.frame( + ptID = paste0("pt", sprintf("%02d", 1:n_patients)), + sex = factor(sample(c("M", "F"), n_patients, replace = TRUE)), + age = round(rnorm(n_patients, mean = 38, sd = 10)), + stringsAsFactors = FALSE +) + +# Build targets (sample metadata) +# expand.grid gives one row per ptID x treatment x visit combination +targets <- expand.grid( + ptID = paste0("pt", sprintf("%02d", 1:n_patients)), + treatment = treatments, + visit = visits, + stringsAsFactors = FALSE +) %>% + arrange(ptID, treatment, visit) %>% + # Join patient-level covariates — sex and age are identical across all + # samples from the same patient, as in real data + left_join(patient_meta, by = "ptID") %>% + mutate( + sample_id = paste(ptID, treatment, visit, sep = "_"), + # --- Sample-level technical covariates --- + # These vary independently per sample (sequencing run characteristics) + rNANgUl = rnorm(n(), mean = 5, sd = 1), + percent_duplication = runif(n(), min = 0.05, max = 0.55), + median_cv_coverage = rnorm(n(), mean = 0.85, sd = 0.08), + lib.size = round(rnorm(n(), mean = 20e6, sd = 3e6)), + norm.factors = rnorm(n(), mean = 1, sd = 0.05) + ) + +rownames(targets) <- targets$sample_id +n_samples <- nrow(targets) # 120 + +# Simulate log2 expression matrix (genes x samples) +# Add a modest treatment*visit interaction effect to a subset of genes +E_mat <- matrix( + rnorm(n_genes * n_samples, mean = 8, sd = 2), + nrow = n_genes, + ncol = n_samples, + dimnames = list( + paste0("gene", sprintf("%02d", 1:n_genes)), + targets$sample_id + ) +) + +# Add a simulated effect: genes 1-10 have a TrtC:V3 upregulation +trtC_v3 <- which(targets$treatment == "TrtC" & targets$visit == "V3") +E_mat[1:10, trtC_v3] <- E_mat[1:10, trtC_v3] + 2.5 + +# Add a patient random effect +for (pt in unique(targets$ptID)) { + pt_idx <- which(targets$ptID == pt) + E_mat[, pt_idx] <- E_mat[, pt_idx] + rnorm(1, 0, 1) +} + +# Simulated voom precision weights (all near 1, small variance) +W_mat <- matrix( + abs(rnorm(n_genes * n_samples, mean = 1, sd = 0.1)), + nrow = n_genes, + ncol = n_samples, + dimnames = dimnames(E_mat) +) + +# Assemble EList-like list (mimics limma EList structure) +dat <- list( + E = E_mat, + weights = W_mat, + targets = targets +) + + +######################################################## +# 2. SMALL SUBSET FOR FAST TESTING +######################################################## + +dat_sub <- list( + E = dat$E[1:10, ], + weights = dat$weights[1:10, ], + targets = dat$targets +) + + +######################################################## +# 3. TEST: geneLME_contrast_spec() — both modes +######################################################## + +cat("\n--- Test: geneLME_contrast_spec() interaction mode ---\n") + +spec_template <- geneLME_contrast_spec( + targets = dat$targets, + contrast_vars = "treatment:visit" +) + +# spec_template has two columns: contrast_ref and contrast_lvl only. +# contrast_index is NOT in spec_template — it is added by geneLME() to its +# $contrast_spec return element after it receives the user's filtered spec. +print(head(spec_template, 10)) +cat("Total pairwise combinations:", nrow(spec_template), "\n") +cat("Columns:", paste(colnames(spec_template), collapse = ", "), "\n") + +# Filter to same-visit cross-treatment comparisons at V2 and V3. +my_spec <- spec_template %>% + filter( + # Both elements of the pair are at the same visit + sub(".* ", "", contrast_ref) == sub(".* ", "", contrast_lvl), + # Only visits V2 and V3 + sub(".* ", "", contrast_ref) %in% c("V2", "V3") + ) + +cat("Filtered spec rows:", nrow(my_spec), "\n") +print(my_spec) + +cat("\n--- Test: geneLME_contrast_spec() single-variable mode (one variable) ---\n") + +treatment_levels <- geneLME_contrast_spec( + targets = dat$targets, + contrast_vars = "treatment" +) +cat("Treatment levels (reference for building contrasts_primary):\n") +print(treatment_levels) + +cat("\n--- Test: geneLME_contrast_spec() multi-variable mode ---\n") + +# Pass both contrast variables together; the function explains how each maps +# to geneLME() arguments (contrasts_primary vs contrast_var_2_levels) +trt_visit_levels <- geneLME_contrast_spec( + targets = dat$targets, + contrast_vars = c("treatment", "visit") +) +cat("Returned list structure:\n") +str(trt_visit_levels) +cat("\ntreatment levels:\n") +print(trt_visit_levels$treatment) +cat("visit levels:\n") +print(trt_visit_levels$visit) + + +######################################################## +# 4. TEST: Branch A — explicit contrast_spec (interaction) +######################################################## + +cat("\n--- Test: Branch A (contrast_spec, interaction) ---\n") + +test_A <- geneLME( + dat = dat_sub, + formula_str = "~ treatment * visit + age + sex + rNANgUl + percent_duplication + median_cv_coverage + (1|ptID)", + model_weights = TRUE, + run_contrast = TRUE, + contrast_vars = "treatment:visit", + contrast_spec = my_spec, + n_cores = 2 +) + +cat("ANOVA rows:", nrow(test_A$lme_anova), "\n") +cat("Contrast rows:", nrow(test_A$lme_contrast), "\n") +cat("Fit rows:", nrow(test_A$lme_fit), "\n") +cat("Model status summary (lme_err):\n") +print(table(test_A$lme_err)) +cat("Model status in ANOVA table:\n") +print(table(test_A$lme_anova$model_status)) + +# Verify singular fit genes still have non-NA contrast estimates +singular_genes <- names(test_A$lme_err)[test_A$lme_err == "singular_fit"] +if (length(singular_genes) > 0) { + singular_contrasts <- test_A$lme_contrast %>% filter(gene %in% singular_genes) + cat("Singular-fit genes with non-NA estimates:", + sum(!is.na(singular_contrasts$estimate)), "/", nrow(singular_contrasts), "\n") + cat("4 PASS (singular_fit) — singular genes return estimates with flag\n") +} else { + cat("No singular fits in this run (unexpected with 10-patient mock data)\n") +} + +cat("\nSample contrast output:\n") +print(head(test_A$lme_contrast %>% select(gene, model_status, contrast, contrast_ref, contrast_lvl, estimate, p.value), 10)) + + +######################################################## +# 4b. TEST: Branch A — with second-order contrasts +######################################################## + +# my_spec (after filtering) has 6 first-order contrasts. geneLME() attaches an +# indexed copy of my_spec to result_A$contrast_spec — contrast_index = 1:nrow(my_spec) +# (simple row positions in the filtered spec). Use these indices when building +# contrasts_secondary vectors. The row order within my_spec determines position in +# contrasts_secondary vectors, since geneLME() iterates over seq_len(nrow(contrast_spec)). +# +# Expected my_spec row order (alphabetical interaction level ordering): +# row 1 (contrast_index=1): TrtB V2 - TrtA V2 +# row 2 (contrast_index=2): TrtC V2 - TrtA V2 +# row 3 (contrast_index=3): TrtC V2 - TrtB V2 +# row 4 (contrast_index=4): TrtB V3 - TrtA V3 +# row 5 (contrast_index=5): TrtC V3 - TrtA V3 +# row 6 (contrast_index=6): TrtC V3 - TrtB V3 +# +# Second-order contrasts: difference-of-differences across visits +# "TrtA vs TrtB: V3 minus V2 effect": row 4 - row 1 → c(-1, 0, 0, 1, 0, 0) +# "TrtA vs TrtC: V3 minus V2 effect": row 5 - row 2 → c(0, -1, 0, 0, 1, 0) +# +# Verify row order with: test_A$contrast_spec +# Vectors must have length 6, one element per row of my_spec (in row order). + +cat("\n--- Test: Branch A with second-order contrasts ---\n") +cat("Indexed contrast_spec from test_A (verify row ordering before specifying secondary contrasts):\n") +print(test_A$contrast_spec) + +test_A2 <- geneLME( + dat = dat_sub, + formula_str = "~ treatment * visit + age + sex + rNANgUl + percent_duplication + median_cv_coverage + (1|ptID)", + model_weights = TRUE, + run_contrast = TRUE, + contrast_vars = "treatment:visit", + contrast_spec = my_spec, + contrasts_secondary = list( + "TrtA vs TrtB: V3 minus V2 effect" = c(-1, 0, 0, 1, 0, 0), # row 4 - row 1 + "TrtA vs TrtC: V3 minus V2 effect" = c(0, -1, 0, 0, 1, 0) # row 5 - row 2 + ), + n_cores = 2 +) + +cat("Contrast rows (should include second_order):", nrow(test_A2$lme_contrast), "\n") +cat("Contrast orders present:\n") +print(table(test_A2$lme_contrast$contrast_order)) +cat("\nSample second-order contrast output:\n") +print(test_A2$lme_contrast[test_A2$lme_contrast$contrast_order == "second_order", ]) + + +######################################################## +# 5. TEST: Branch B — non-interaction (regression test) +######################################################## + +cat("\n--- Test: Branch B (non-interaction, regression test) ---\n") + +# Treatment levels in alphabetical order (as emmeans sees them): TrtA, TrtB, TrtC +# contrasts_primary vectors have length 3; positions correspond to: [TrtA, TrtB, TrtC] + +test_B <- geneLME( + dat = dat_sub, + formula_str = "~ treatment + visit + age + sex + rNANgUl + percent_duplication + median_cv_coverage + (1|ptID)", + model_weights = TRUE, + run_contrast = TRUE, + contrast_vars = c("treatment", "visit"), + contrast_var_2_levels = c("V2", "V3"), + contrasts_primary = list( + "TrtC vs TrtA" = c(-1, 0, 1), + "TrtB vs TrtA" = c(-1, 1, 0) + ), + contrasts_secondary = list("TrtC vs TrtB" = c(1, -1)), + n_cores = 2 +) + +cat("ANOVA rows:", nrow(test_B$lme_anova), "\n") +cat("Contrast rows:", nrow(test_B$lme_contrast), "\n") +cat("Model status summary (lme_err):\n") +print(table(test_B$lme_err)) +cat("Model status in ANOVA table:\n") +print(table(test_B$lme_anova$model_status)) + +cat("\nSample contrast output:\n") +print(head(test_B$lme_contrast %>% select(gene, model_status, contrast, contrast_ref, contrast_lvl, estimate, p.value), 10)) + + +######################################################## +# 6. TEST: Validation errors (should each produce an informative error/warning) +######################################################## + +cat("\n--- Test: Input validation errors ---\n") + +# 6a. Missing formula variable +tryCatch( + geneLME(dat_sub, formula_str = "~ NONEXISTENT_VAR + (1|ptID)", n_cores = 2), + error = function(e) cat("6a PASS — missing formula var:", conditionMessage(e), "\n") +) + +# 6b. model_weights = TRUE but no dat$weights +dat_no_w <- dat_sub; dat_no_w$weights <- NULL +tryCatch( + geneLME(dat_no_w, formula_str = "~ treatment + (1|ptID)", model_weights = TRUE, n_cores = 2), + error = function(e) cat("6b PASS — missing weights:", conditionMessage(e), "\n") +) + +# 6c. Interaction contrast requested but interaction NOT in formula +tryCatch( + geneLME(dat_sub, + formula_str = "~ treatment + visit + age + (1|ptID)", # additive — no interaction + run_contrast = TRUE, + contrast_vars = "treatment:visit", + contrast_spec = my_spec, + n_cores = 2), + error = function(e) cat("6c PASS — interaction not in formula:", conditionMessage(e), "\n") +) + +# 6d. Interaction contrast requested but contrast_spec not provided +tryCatch( + geneLME(dat_sub, + formula_str = "~ treatment * visit + age + (1|ptID)", + run_contrast = TRUE, + contrast_vars = "treatment:visit", + # contrast_spec intentionally omitted + n_cores = 2), + error = function(e) cat("6d PASS — interaction contrast_spec missing:", conditionMessage(e), "\n") +) + +# 6e. contrast_spec has wrong columns +tryCatch( + geneLME(dat_sub, + formula_str = "~ treatment * visit + age + (1|ptID)", + run_contrast = TRUE, + contrast_vars = "treatment:visit", + contrast_spec = data.frame(a = "x", b = "y"), + n_cores = 2), + error = function(e) cat("6e PASS — bad contrast_spec columns:", conditionMessage(e), "\n") +) + +# 6f. contrast_var_2_levels has invalid levels +tryCatch( + geneLME(dat_sub, + formula_str = "~ treatment + visit + age + (1|ptID)", + run_contrast = TRUE, + contrast_vars = c("treatment", "visit"), + contrast_var_2_levels = c("V2", "NOTAVISIT"), + contrasts_primary = list("TrtC vs TrtA" = c(-1, 0, 1)), + n_cores = 2), + error = function(e) cat("6f PASS — invalid contrast_var_2_levels:", conditionMessage(e), "\n") +) + +# 6g. contrasts_secondary with wrong length → soft-fail (returns early with $contrast_spec, +# all other elements NULL). This is NOT a stop() — it returns an invisible list. +cat("\n--- Test: soft-fail on wrong-length contrasts_secondary ---\n") + +soft_fail_A <- geneLME( + dat_sub, + formula_str = "~ treatment * visit + age + sex + rNANgUl + percent_duplication + median_cv_coverage + (1|ptID)", + model_weights = TRUE, + run_contrast = TRUE, + contrast_vars = "treatment:visit", + contrast_spec = my_spec, + contrasts_secondary = list("wrong length" = rep(0, 99)), # should be length 6 + n_cores = 2 +) + +cat("6g PASS — soft-fail Branch A:\n") +cat(" NULL elements:", paste(names(which(sapply(soft_fail_A, is.null))), collapse = ", "), "\n") +cat(" $contrast_spec populated:\n") +print(soft_fail_A$contrast_spec) + +soft_fail_B <- geneLME( + dat_sub, + formula_str = "~ treatment + visit + age + (1|ptID)", + run_contrast = TRUE, + contrast_vars = c("treatment", "visit"), + contrasts_primary = list("TrtC vs TrtA" = c(-1, 0, 1), + "TrtB vs TrtA" = c(-1, 1, 0)), + contrasts_secondary = list("wrong length" = rep(0, 99)), # should be length 2 + n_cores = 2 +) + +cat("\n6g PASS — soft-fail Branch B:\n") +cat(" NULL elements:", paste(names(which(sapply(soft_fail_B, is.null))), collapse = ", "), "\n") +cat(" $contrast_spec populated:\n") +print(soft_fail_B$contrast_spec) + +cat("\nAll validation tests complete.\n") diff --git a/R_functions/geneLME_tutorial.Rmd b/R_functions/geneLME_tutorial.Rmd new file mode 100644 index 0000000..e9d09ba --- /dev/null +++ b/R_functions/geneLME_tutorial.Rmd @@ -0,0 +1,759 @@ +--- +title: "geneLME: Scalable Per-Gene Linear Mixed Effects Models" +subtitle: "Function Tutorial with Worked Examples" +date: "`r Sys.Date()`" +output: + html_document: + toc: true + toc_float: + collapsed: false + smooth_scroll: true + toc_depth: 3 + theme: flatly + highlight: tango + code_folding: show + df_print: paged +--- + +```{r setup, include=FALSE} +knitr::opts_chunk$set( + echo = TRUE, + message = FALSE, + warning = FALSE, + comment = "#>", + fig.align = "center" +) +``` + +--- + +# Overview + +`geneLME` fits one linear mixed effects model per gene across a full RNA-seq expression matrix, +extracting ANOVA tables and user-defined `emmeans`-based contrasts in parallel via the `future` +framework. It is designed around the `limma` EList data structure. + +## Key capabilities + +- Fits `lmer` models per gene in parallel (`future_lapply`) +- Supports voom precision weights (`dat$weights`) +- Extracts `car::Anova()` type II ANOVA tables with per-predictor coefficient estimates +- Flexible contrast specification: + - **Branch A:** explicit pairwise contrasts across interaction cells via a filterable `contrast_spec` data frame + - **Branch B:** named contrast vectors on main-effect marginal means, optionally within levels of a grouping variable +- Optional second-order contrasts (contrasts-of-contrasts) in both branches +- Pre-flight input validation with informative error messages +- Soft-fail on wrong-length `contrasts_secondary`: returns `$contrast_spec` (indexed) for debugging without running models +- Per-gene outcome tracking: `model_status` column flags singular fits (`"singular_fit"`) vs clean fits (`"success"`); results are returned for all genes, not dropped + +## Package dependencies + +```{r libs} +library(lme4) +library(emmeans) +library(car) +library(broom.mixed) +library(dplyr) +library(tibble) +library(purrr) +library(future) +library(future.apply) + +source("geneLME.R") +``` + +--- + +# Mock Data + +All examples use a simulated EList-like object with a **3-treatment × 4-visit paired design** +(10 patients, 120 samples, 50 genes). Genes 1–10 have a simulated `TrtC:V3` upregulation of +2.5 +log2 units to provide a detectable signal in Branch A examples. + +```{r mock-data} +set.seed(42) + +n_patients <- 10 +treatments <- c("TrtA", "TrtB", "TrtC") +visits <- c("V1", "V2", "V3", "V4") +n_genes <- 50 + +# Patient-level covariates +patient_meta <- data.frame( + ptID = paste0("pt", sprintf("%02d", 1:n_patients)), + sex = factor(sample(c("M", "F"), n_patients, replace = TRUE)), + age = round(rnorm(n_patients, mean = 38, sd = 10)), + stringsAsFactors = FALSE +) + +# Sample metadata (targets) +targets <- expand.grid( + ptID = paste0("pt", sprintf("%02d", 1:n_patients)), + treatment = treatments, + visit = visits, + stringsAsFactors = FALSE +) %>% + arrange(ptID, treatment, visit) %>% + left_join(patient_meta, by = "ptID") %>% + mutate( + sample_id = paste(ptID, treatment, visit, sep = "_"), + rNANgUl = rnorm(n(), mean = 5, sd = 1), + percent_duplication = runif(n(), min = 0.05, max = 0.55), + median_cv_coverage = rnorm(n(), mean = 0.85, sd = 0.08), + lib.size = round(rnorm(n(), mean = 20e6, sd = 3e6)), + norm.factors = rnorm(n(), mean = 1, sd = 0.05) + ) +rownames(targets) <- targets$sample_id + +# Expression matrix +n_samples <- nrow(targets) +E_mat <- matrix(rnorm(n_genes * n_samples, mean = 8, sd = 2), + nrow = n_genes, ncol = n_samples, + dimnames = list(paste0("gene", sprintf("%02d", 1:n_genes)), + targets$sample_id)) + +# Simulated TrtC:V3 effect on genes 1-10 +trtC_v3 <- which(targets$treatment == "TrtC" & targets$visit == "V3") +E_mat[1:10, trtC_v3] <- E_mat[1:10, trtC_v3] + 2.5 + +# Patient random effect +for (pt in unique(targets$ptID)) { + idx <- which(targets$ptID == pt) + E_mat[, idx] <- E_mat[, idx] + rnorm(1, 0, 1) +} + +# Voom-like precision weights +W_mat <- matrix(abs(rnorm(n_genes * n_samples, mean = 1, sd = 0.1)), + nrow = n_genes, ncol = n_samples, + dimnames = dimnames(E_mat)) + +# EList-like object +dat <- list(E = E_mat, weights = W_mat, targets = targets) + +# Small subset for faster examples +dat_sub <- list(E = dat$E[1:10, ], weights = dat$weights[1:10, ], targets = dat$targets) + +cat("Design:", n_patients, "patients ×", length(treatments), "treatments ×", + length(visits), "visits =", n_samples, "samples\n") +cat("Genes:", n_genes, "(full) /", nrow(dat_sub$E), "(subset used in examples)\n") +``` + +--- + +# Step 0: `geneLME_contrast_spec()` — Inspect Levels Before Running + +Before calling `geneLME()`, use `geneLME_contrast_spec()` to enumerate available contrast +levels and understand exactly how to construct your contrast arguments. This function has two +modes depending on whether `contrast_vars` contains `":"`. + +## Interaction mode + +Pass a single `"var_a:var_b"` string to get all pairwise combinations of interaction cells. +This returns a `data.frame` that you filter down to the specific contrasts of interest and +pass directly as `contrast_spec` to `geneLME()`. + +```{r contrast-spec-interaction} +spec_template <- geneLME_contrast_spec( + targets = dat$targets, + contrast_vars = "treatment:visit" +) + +# spec_template has two columns: contrast_ref and contrast_lvl. +# contrast_index is NOT included here — it is added by geneLME() to its $contrast_spec +# output element once it receives your filtered spec. +cat("Total pairwise combinations:", nrow(spec_template), "\n\n") +head(spec_template, 12) +``` + +Filter to the contrasts of interest — here, same-visit cross-treatment comparisons at V2 and V3: + +```{r contrast-spec-filter} +my_spec <- spec_template %>% + filter( + sub(".* ", "", contrast_ref) == sub(".* ", "", contrast_lvl), # same visit + sub(".* ", "", contrast_ref) %in% c("V2", "V3") # V2 and V3 only + ) + +cat("Filtered to", nrow(my_spec), "contrasts:\n") +print(my_spec) +``` + +## Single/multi-variable mode + +Pass a character vector of plain variable names (no `:`). Returns a **named list** — one +`data.frame(level)` per variable — and prints a message for each variable explaining its role +in `geneLME()` arguments. + +```{r contrast-spec-multivariable} +ref <- geneLME_contrast_spec( + targets = dat$targets, + contrast_vars = c("treatment", "visit") +) +``` + +The printed messages explain: + +- **`treatment`** (position 1, primary): its levels define the **length and position order** of + `contrasts_primary` vectors in `geneLME()` +- **`visit`** (position 2, secondary): its levels can be subset via `contrast_var_2_levels` + to restrict which groups the primary contrasts are computed within + +```{r contrast-spec-multivariable-show} +cat("treatment levels (contrasts_primary vector positions):\n") +print(ref$treatment) + +cat("\nvisit levels (available for contrast_var_2_levels filtering):\n") +print(ref$visit) +``` + +--- + +# Branch A: Interaction Contrasts via `contrast_spec` + +**Use when:** the model formula contains an interaction term (`treatment * visit` or +`treatment:visit`) and you want specific, explicitly-defined pairwise contrasts across +interaction cells. + +Under the hood, `geneLME_fit()` computes the full `emmeans` object for the interaction, then +builds one named contrast vector per row of `contrast_spec` — setting the reference cell to +−1, the comparison cell to +1, and all others to 0. This guarantees that only the contrasts +you define are run. + +## Basic Branch A call + +```{r branch-a} +result_A <- geneLME( + dat = dat_sub, + formula_str = "~ treatment * visit + age + sex + rNANgUl + percent_duplication + median_cv_coverage + (1|ptID)", + model_weights = TRUE, + run_contrast = TRUE, + contrast_vars = "treatment:visit", # single ":" string → Branch A + contrast_spec = my_spec, # the filtered data.frame from geneLME_contrast_spec() + n_cores = 2 +) +``` + +## Output structure + +```{r branch-a-output-structure} +cat("Output elements:", paste(names(result_A), collapse = ", "), "\n\n") + +cat("lme_fit (AIC per gene):\n") +print(result_A$lme_fit) + +cat("\nModel status summary:\n") +print(table(result_A$lme_err)) + +# $contrast_spec is the indexed copy of my_spec attached by geneLME(). +# contrast_index here is 1:nrow(my_spec) — the row position within the filtered +# spec. This is the stable index you should use when building contrasts_secondary vectors. +cat("\n$contrast_spec (indexed by geneLME):\n") +print(result_A$contrast_spec) +``` + +## `lme_anova` — ANOVA table + +One row per model term per gene. `predictor_class` classifies each term, and `Estimate_source` +indicates whether a coefficient estimate is available directly (`"lme_summary"`) or should be +read from the contrast output (`"seeContrasts"`). `p.value_adj` is the FDR-adjusted p-value +computed across all genes **within each model term** (Benjamini-Hochberg by default). + +```{r branch-a-anova} +cat("ANOVA rows:", nrow(result_A$lme_anova), "\n") +cat(" (", length(unique(result_A$lme_anova$term)), "terms ×", + length(unique(result_A$lme_anova$gene)), "genes)\n\n") + +# Show one gene in full +result_A$lme_anova %>% + filter(gene == "gene01") %>% + select(term, statistic, df, p.value, p.value_adj, predictor_class, Estimate_source, Estimate, Estimate_SE) +``` + +## `lme_contrast` — contrast results + +One row per defined contrast per gene. `contrast_order` is always `"first_order"` in a basic +Branch A call (no second-order contrasts specified). `p.value_adj` is FDR-adjusted across all +genes **within each contrast × contrast_order combination**. + +`contrast_ref` and `contrast_lvl` are populated for **Branch A first-order contrasts** — +they record exactly which interaction cell is the reference (−1) and which is the comparison +(+1), eliminating any ambiguity in how the estimate is signed. For second-order contrasts and +all Branch B contrasts these columns are `NA` (no single ref/lvl pair applies). + +```{r branch-a-contrast} +cat("Contrast rows:", nrow(result_A$lme_contrast), "\n") +cat(" (", length(unique(result_A$lme_contrast$contrast)), "contrasts ×", + length(unique(result_A$lme_contrast$gene)), "genes)\n\n") + +result_A$lme_contrast %>% + filter(gene == "gene01") %>% + select(contrast, contrast_ref, contrast_lvl, estimate, SE, df, t.ratio, p.value, p.value_adj, contrast_order) +``` + +Notice that gene01 shows large, significant effects at V3 (TrtC vs others) but not at V2, +consistent with the simulated `TrtC:V3` upregulation. + +--- + +# Branch A: Adding Second-Order Contrasts + +Second-order contrasts are contrasts-of-contrasts: each operates on the vector of first-order +contrast estimates, allowing you to ask whether a treatment difference *changes* across visits. + +At validation time, `geneLME()` prints the numbered first-order list in `contrast_spec` row +order, so you can verify the position indices before constructing your secondary vectors. +**Secondary vectors must have length equal to `nrow(contrast_spec)`**. + +```{r branch-a-second-order} +# At validation time, geneLME() prints the first-order contrast list in my_spec row +# order. You can also inspect result_A$contrast_spec for the indexed row ordering. +# With my_spec as defined above (same-visit, V2 and V3), alphabetical ordering gives: +# row 1: TrtB V2 - TrtA V2 +# row 2: TrtC V2 - TrtA V2 +# row 3: TrtC V2 - TrtB V2 +# row 4: TrtB V3 - TrtA V3 +# row 5: TrtC V3 - TrtA V3 +# row 6: TrtC V3 - TrtB V3 +# +# Second-order questions: +# "Did the TrtA vs TrtB difference change from V2 to V3?" → row4 − row1 +# "Did the TrtA vs TrtC difference change from V2 to V3?" → row5 − row2 +# +# Verify with: result_A$contrast_spec + +result_A2 <- geneLME( + dat = dat_sub, + formula_str = "~ treatment * visit + age + sex + rNANgUl + percent_duplication + median_cv_coverage + (1|ptID)", + model_weights = TRUE, + run_contrast = TRUE, + contrast_vars = "treatment:visit", + contrast_spec = my_spec, + contrasts_secondary = list( + "TrtA vs TrtB: V3 minus V2 effect" = c(-1, 0, 0, 1, 0, 0), # row4 − row1 + "TrtA vs TrtC: V3 minus V2 effect" = c(0, -1, 0, 0, 1, 0) # row5 − row2 + ), + n_cores = 2 +) +``` + +```{r branch-a-second-order-output} +cat("Contrast order breakdown:\n") +print(table(result_A2$lme_contrast$contrast_order)) + +cat("\nSecond-order results for gene01 and gene09 (strongest TrtC:V3 effect):\n") +# contrast_ref/contrast_lvl are NA for second-order rows (no single ref/lvl pair) +result_A2$lme_contrast %>% + filter(contrast_order == "second_order", gene %in% c("gene01", "gene09")) %>% + select(gene, contrast, contrast_ref, contrast_lvl, estimate, SE, df, t.ratio, p.value) +``` + +## Programmatic second-order contrast construction via `$contrast_spec` + +When there are many first-order contrasts, hand-coding the position of each `−1`/`+1` in +every secondary vector is error-prone. After a successful `geneLME()` call (or a soft-fail +call — see [Input Validation]), the `$contrast_spec` element of the returned list contains +an indexed copy of `contrast_spec` with a `contrast_index` column added by `geneLME()`. +`contrast_index` is simply `1:nrow(contrast_spec)` — the row positions in the filtered spec +you passed in. Use these as stable handles when building `contrasts_secondary` vectors +programmatically. + +**Workflow:** run `geneLME()` once (or use a dry-run via a wrong-length `contrasts_secondary` +soft-fail), inspect `result$contrast_spec` to confirm the row ordering and indices, then build +and pass the correct `contrasts_secondary` list. + +**Example scenario:** within each treatment pair, ask whether the between-treatment difference +changed from V2 to V3 — i.e., `(TrtX−TrtY at V3) − (TrtX−TrtY at V2)`. + +```{r secondary-contrasts-programmatic} +# result_A$contrast_spec is the indexed version of my_spec as seen by geneLME(). +# contrast_index = 1:nrow(my_spec) (simple row positions after filtering). +cat("Indexed contrast_spec from result_A:\n") +print(result_A$contrast_spec) + +# Annotate with the visit for each contrast (extracted from the interaction label) +indexed_spec <- result_A$contrast_spec %>% + mutate(visit = sub(".* ", "", contrast_ref)) # extract visit from the interaction label + +print(indexed_spec) +``` + +```{r secondary-contrasts-build} +# Build a two-column lookup: one contrast_index per sign, one row per second-order contrast. +# group_by the treatment pair (shared across visits); within each group there are exactly +# two rows — one for V2 and one for V3 — giving the index_neg (V2, subtracted) and +# index_pos (V3, added). +secondary_lookup <- indexed_spec %>% + # Create a treatment-pair label that is the same regardless of visit + mutate(trt_pair = paste(sub(" .*", "", contrast_ref), # var_a from ref + sub(" .*", "", contrast_lvl), # var_a from lvl + sep = " vs ")) %>% + arrange(trt_pair, visit) %>% # V2 before V3 (alphabetical) + group_by(trt_pair) %>% + summarise( + secondary_contrast_name = paste0("(", last(contrast_lvl), " - ", last(contrast_ref), ")", + " - ", + "(", first(contrast_lvl), " - ", first(contrast_ref), ")"), + index_neg = first(contrast_index), # V2 row → subtracted (−1) + index_pos = last(contrast_index), # V3 row → added (+1) + .groups = "drop" + ) + +print(secondary_lookup) +``` + +```{r secondary-contrasts-format} +# Format as a named list of zero vectors with −1/+1 placed by contrast_index. +# The vector length must equal nrow(my_spec) — the number of first-order contrasts. +n_first_order <- nrow(my_spec) + +contrasts_secondary_prog <- setNames( + lapply(seq_len(nrow(secondary_lookup)), function(i) { + v <- rep(0, n_first_order) + # contrast_index IS the row position (1:nrow), so it can be used directly. + # Using which() for robustness in case the data.frame is reordered. + v[which(indexed_spec$contrast_index == secondary_lookup$index_neg[i])] <- -1 + v[which(indexed_spec$contrast_index == secondary_lookup$index_pos[i])] <- 1 + v + }), + secondary_lookup$secondary_contrast_name +) + +# Inspect: each vector should have exactly one −1 and one +1 +print(contrasts_secondary_prog) +``` + +```{r secondary-contrasts-run} +result_A_prog <- geneLME( + dat = dat_sub, + formula_str = "~ treatment * visit + age + sex + rNANgUl + percent_duplication + median_cv_coverage + (1|ptID)", + model_weights = TRUE, + run_contrast = TRUE, + contrast_vars = "treatment:visit", + contrast_spec = my_spec, + contrasts_secondary = contrasts_secondary_prog, + n_cores = 2 +) + +cat("Second-order results for gene01:\n") +result_A_prog$lme_contrast %>% + filter(contrast_order == "second_order", gene == "gene01") %>% + select(gene, contrast, contrast_ref, contrast_lvl, estimate, SE, df, t.ratio, p.value) +``` + +> **Key design note:** `geneLME()` appends `contrast_index` (= `1:nrow(contrast_spec)`) to its +> `$contrast_spec` output element — not to the `contrast_spec` data frame you pass in. This +> means `contrast_index` always reflects the row positions in *your filtered spec*, never in +> the full unfiltered template. The `which(indexed_spec$contrast_index == ...)` lookup maps +> those indices back to row positions for vector construction, and is robust to any reordering +> of the spec data frame. + +--- + +# Branch B: Non-Interaction Contrasts + +**Use when:** the model is additive (no interaction term), and you want named contrasts on the +marginal means of one variable. Optionally, compute those contrasts within specific levels of a +second grouping variable (`contrast_var_2_levels`), and optionally follow up with second-order +contrasts across the primary contrast estimates. + +Under the hood, `geneLME_fit()` calls `emmeans()` with a `spec` formula derived from +`contrast_vars`, using `contrast_var_2_levels` as the `at =` filter. `contrasts_primary` +vectors are passed directly to `emmeans::contrast()`. + +## Inspecting levels before building contrast vectors + +Before specifying `contrasts_primary`, confirm the level ordering emmeans will use — it is +alphabetical by default: + +```{r branch-b-spec} +ref_B <- geneLME_contrast_spec(dat$targets, contrast_vars = c("treatment", "visit")) +``` + +`treatment` levels are `TrtA[1]`, `TrtB[2]`, `TrtC[3]` — so `contrasts_primary` vectors +must have length 3 with positions corresponding to those labels: + +``` +"TrtC vs TrtA" = c(-1, 0, 1) # TrtA=−1, TrtB=0, TrtC=+1 +"TrtB vs TrtA" = c(-1, 1, 0) # TrtA=−1, TrtB=+1, TrtC=0 +``` + +## Basic Branch B call + +```{r branch-b} +result_B <- geneLME( + dat = dat_sub, + formula_str = "~ treatment + visit + age + sex + rNANgUl + percent_duplication + median_cv_coverage + (1|ptID)", + model_weights = TRUE, + run_contrast = TRUE, + contrast_vars = c("treatment", "visit"), # [1] primary; [2] 'by' grouping variable + contrast_var_2_levels = c("V2", "V3"), # restrict to these visit levels + contrasts_primary = list( + "TrtC vs TrtA" = c(-1, 0, 1), + "TrtB vs TrtA" = c(-1, 1, 0) + ), + n_cores = 2 +) +``` + +```{r branch-b-output} +cat("Contrast rows:", nrow(result_B$lme_contrast), "\n") +cat(" (", length(unique(result_B$lme_contrast$contrast)), "contrasts ×", + length(contrast_var_2_levels <- c("V2","V3")), "visits ×", + length(unique(result_B$lme_contrast$gene)), "genes)\n\n") + +# contrast_ref and contrast_lvl are NA for Branch B (named contrast vectors, +# not a ref/lvl spec frame) +result_B$lme_contrast %>% + filter(gene == "gene01") %>% + select(contrast, contrast_ref, contrast_lvl, visit, estimate, SE, df, t.ratio, p.value, contrast_order) +``` + +> **Note:** In an additive model, treatment estimates are identical across V2 and V3 — there is +> no interaction, so the treatment effect does not depend on visit. `contrast_var_2_levels` +> controls *which* visit cells are included in the output, which is useful for limiting the +> number of tests and focusing results on visits of biological interest. + +## Branch B with second-order contrasts + +A second-order contrast operates on the vector of first-order contrast estimates. Here, +`contrasts_secondary` must have length equal to the number of primary contrasts (2 in this +example). The result asks: "Is the `TrtC vs TrtA` difference significantly larger than the +`TrtB vs TrtA` difference?" — i.e., is there a differential treatment response between TrtB +and TrtC relative to TrtA? + +```{r branch-b-second-order} +result_B2 <- geneLME( + dat = dat_sub, + formula_str = "~ treatment + visit + age + sex + rNANgUl + percent_duplication + median_cv_coverage + (1|ptID)", + model_weights = TRUE, + run_contrast = TRUE, + contrast_vars = c("treatment", "visit"), + contrast_var_2_levels = c("V2", "V3"), + contrasts_primary = list( + "TrtC vs TrtA" = c(-1, 0, 1), + "TrtB vs TrtA" = c(-1, 1, 0) + ), + contrasts_secondary = list( + "TrtC vs TrtB (2nd order)" = c(1, -1) # length = n primary contrasts (2) + ), + n_cores = 2 +) +``` + +```{r branch-b-second-order-output} +cat("Contrast order breakdown:\n") +print(table(result_B2$lme_contrast$contrast_order)) + +result_B2$lme_contrast %>% + filter(gene == "gene01") %>% + select(contrast, contrast_ref, contrast_lvl, visit, estimate, SE, df, t.ratio, p.value, contrast_order) +``` + +--- + +# FDR Adjustment + +`geneLME()` automatically appends a `p.value_adj` column to both `lme_anova` and +`lme_contrast` using `p.adjust()`. Adjustment is performed **across all genes within each +grouping unit** — not globally across all rows — so that each test family is treated +independently: + +| Output table | Grouping unit for adjustment | +|---------------|-------------------------------------------| +| `lme_anova` | Each model term (e.g. `treatment`, `treatment:visit`) | +| `lme_contrast`| Each contrast label × `contrast_order` | + +The default method is Benjamini-Hochberg (`"BH"`). Any method accepted by `p.adjust()` can +be specified via the `fdr_method` argument: + +```{r fdr-methods} +# Available methods +p.adjust.methods +``` + +```{r fdr-custom} +# Example: Bonferroni correction instead of BH +result_bonf <- geneLME( + dat = dat_sub, + formula_str = "~ treatment * visit + age + sex + rNANgUl + percent_duplication + median_cv_coverage + (1|ptID)", + model_weights = TRUE, + run_contrast = TRUE, + contrast_vars = "treatment:visit", + contrast_spec = my_spec, + fdr_method = "bonferroni", + n_cores = 2 +) + +# Compare BH vs Bonferroni for one contrast across genes +comparison <- result_A$lme_contrast %>% + filter(contrast == first(contrast), contrast_order == "first_order") %>% + select(gene, p.value, p.value_adj_BH = p.value_adj) %>% + left_join( + result_bonf$lme_contrast %>% + filter(contrast == first(contrast), contrast_order == "first_order") %>% + select(gene, p.value_adj_bonf = p.value_adj), + by = "gene" + ) + +print(comparison) +``` + +> **Note on NA values:** genes whose models failed receive `NA` for `p.value`, which +> propagates to `p.value_adj`. These genes are excluded from the adjustment set, so the +> effective number of tests equals the number of successfully fitted genes. + +--- + +# Input Validation + +`geneLME()` validates all inputs before launching any parallel work. Informative errors are +raised for the most common misspecifications: + +```{r validation-errors} +run_validation_test <- function(label, expr) { + tryCatch( + { expr; cat(label, "— (no error raised)\n") }, + error = function(e) cat(label, "ERROR:\n ", conditionMessage(e), "\n\n"), + warning = function(w) cat(label, "WARNING:\n ", conditionMessage(w), "\n\n") + ) +} + +# 1. Formula variable missing from targets +run_validation_test("Test 1 — missing formula variable:", + geneLME(dat_sub, formula_str = "~ NONEXISTENT_VAR + (1|ptID)", n_cores = 2) +) + +# 2. Weights requested but not present +dat_no_w <- dat_sub; dat_no_w$weights <- NULL +run_validation_test("Test 2 — model_weights=TRUE but no dat$weights:", + geneLME(dat_no_w, formula_str = "~ treatment + (1|ptID)", model_weights = TRUE, n_cores = 2) +) + +# 3. Interaction contrast requested but interaction not in formula +run_validation_test("Test 3 — interaction contrast but additive formula:", + geneLME(dat_sub, + formula_str = "~ treatment + visit + age + (1|ptID)", + run_contrast = TRUE, + contrast_vars = "treatment:visit", + contrast_spec = my_spec, + n_cores = 2) +) + +# 4. Interaction contrast but contrast_spec not provided +run_validation_test("Test 4 — interaction contrast, contrast_spec = NULL:", + geneLME(dat_sub, + formula_str = "~ treatment * visit + age + (1|ptID)", + run_contrast = TRUE, + contrast_vars = "treatment:visit", + n_cores = 2) +) + +# 5. contrast_spec has wrong column names +run_validation_test("Test 5 — contrast_spec with wrong columns:", + geneLME(dat_sub, + formula_str = "~ treatment * visit + age + (1|ptID)", + run_contrast = TRUE, + contrast_vars = "treatment:visit", + contrast_spec = data.frame(a = "x", b = "y"), + n_cores = 2) +) + +# 6. contrast_var_2_levels contains an invalid level +run_validation_test("Test 6 — invalid contrast_var_2_levels:", + geneLME(dat_sub, + formula_str = "~ treatment + visit + age + (1|ptID)", + run_contrast = TRUE, + contrast_vars = c("treatment", "visit"), + contrast_var_2_levels = c("V2", "NOTAVISIT"), + contrasts_primary = list("TrtC vs TrtA" = c(-1, 0, 1)), + n_cores = 2) +) +``` + +When `contrasts_secondary` vectors have the wrong length, `geneLME()` **soft-fails**: it returns +early without running any models, but still populates `$contrast_spec` with the indexed row +ordering. This lets you inspect the indexing and fix your vectors without a hard error. + +```{r validation-soft-fail} +# 7. contrasts_secondary with wrong length → soft-fail with $contrast_spec populated +bad_result <- geneLME( + dat = dat_sub, + formula_str = "~ treatment * visit + age + sex + rNANgUl + percent_duplication + median_cv_coverage + (1|ptID)", + model_weights = TRUE, + run_contrast = TRUE, + contrast_vars = "treatment:visit", + contrast_spec = my_spec, + contrasts_secondary = list( + "wrong length vector" = rep(0, 99) # should be length 6 = nrow(my_spec) + ), + n_cores = 2 +) + +cat("Elements of soft-fail result:\n") +print(sapply(bad_result, is.null)) # lme_anova/contrast/fit/err are NULL; contrast_spec is not + +cat("\n$contrast_spec (use these indices to fix contrasts_secondary):\n") +print(bad_result$contrast_spec) +``` + +--- + +# Error Handling and Singular Fits + +Per-gene errors are captured by `tryCatch` inside `geneLME_fit()` and do not abort the run. +Each gene's outcome is recorded in the `model_status` column present in both `lme_anova` +and `lme_contrast`, and also summarised in `lme_err` (a named character vector keyed by gene): + +- **`"success"`** — model converged cleanly; fixed effects and contrasts are fully reliable. +- **`"singular_fit"`** — `isSingular()` was `TRUE`, meaning the random effect variance hit its + boundary (zero). Fixed effect estimates are still returned and are numerically valid, but + the random effect structure may be over-specified for the data. Common with small sample sizes. + Filter these genes downstream if warranted. +- Any other value — an unexpected error was caught; rows for that gene will contain NAs. + +```{r error-handling} +# Check model status summary across all genes — Branch A +cat("Model status — Branch A:\n") +print(table(result_A$lme_err)) + +# With small mock data (10 patients), singular fits are expected due to +# patient-level covariates (sex, age) being collinear with the patient random effect. +# With real data at typical sample sizes this is less common. + +# Identify singular-fit or failed genes +non_success <- names(result_A$lme_err)[result_A$lme_err != "success"] +if (length(non_success) > 0) { + cat("\nNon-success genes and their status:\n") + print(result_A$lme_err[non_success]) + # For singular fits, results are still returned — filter if desired: + # result_A$lme_contrast %>% filter(model_status == "success") +} else { + cat("\nAll genes fitted successfully.\n") +} +``` + +--- + +# Quick-Reference: Argument Guide + +| Argument | Type | Required | Description | +|---|---|---|---| +| `dat` | list | Yes | EList-like: `$E`, `$weights` (optional), `$targets` | +| `formula_str` | character | Yes | RHS formula string, e.g. `"~ treatment * visit + (1\|ptID)"` | +| `model_weights` | logical | No | `TRUE` to use `dat$weights` as precision weights | +| `run_contrast` | logical | No | `TRUE` to run emmeans contrasts | +| `contrast_vars` | character | If `run_contrast=TRUE` | `"var_a:var_b"` (Branch A) or `c("var1", "var2")` (Branch B) | +| `contrast_var_2_levels` | character vector | No | Branch B: restrict 'by' variable to these levels | +| `contrast_spec` | data.frame | Required for Branch A | Columns `contrast_ref` and `contrast_lvl`; from `geneLME_contrast_spec()`. `geneLME()` attaches an indexed copy (with `contrast_index`) to `$contrast_spec` in its return value. | +| `contrasts_primary` | named list | Branch B | Named contrast vectors; length = levels of `contrast_vars[1]` | +| `contrasts_secondary` | named list | No | Second-order contrasts; length = `nrow(contrast_spec)` (Branch A) or `length(contrasts_primary)` (Branch B). Wrong lengths trigger a soft-fail with `$contrast_spec` populated for debugging. | +| `fdr_method` | character | No | P-value adjustment method passed to `p.adjust()`; default `"BH"` (Benjamini-Hochberg) | +| `n_cores` | integer | No | Number of parallel workers; defaults to `detectCores() − 4` | + +--- + +# Session Info + +```{r session-info} +sessionInfo() +``` diff --git a/R_functions/geneLME_tutorial.html b/R_functions/geneLME_tutorial.html new file mode 100644 index 0000000..ddf1d95 --- /dev/null +++ b/R_functions/geneLME_tutorial.html @@ -0,0 +1,4034 @@ + + + + + + + + + + + + + + +geneLME: Scalable Per-Gene Linear Mixed Effects Models + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+
+
+
+
+ +
+ + + + + + + +
+
+

Overview

+

geneLME fits one linear mixed effects model per gene +across a full RNA-seq expression matrix, extracting ANOVA tables and +user-defined emmeans-based contrasts in parallel via the +future framework. It is designed around the +limma EList data structure.

+
+

Key capabilities

+
    +
  • Fits lmer models per gene in parallel +(future_lapply)
  • +
  • Supports voom precision weights (dat$weights)
  • +
  • Extracts car::Anova() type II ANOVA tables with +per-predictor coefficient estimates
  • +
  • Flexible contrast specification: +
      +
    • Branch A: explicit pairwise contrasts across +interaction cells via a filterable contrast_spec data +frame
    • +
    • Branch B: named contrast vectors on main-effect +marginal means, optionally within levels of a grouping variable
    • +
  • +
  • Optional second-order contrasts (contrasts-of-contrasts) in both +branches
  • +
  • Pre-flight input validation with informative error messages
  • +
  • Soft-fail on wrong-length contrasts_secondary: returns +$contrast_spec (indexed) for debugging without running +models
  • +
  • Per-gene error capture: failed genes do not abort the run
  • +
+
+
+

Package dependencies

+
library(lme4)
+library(emmeans)
+library(car)
+library(broom.mixed)
+library(dplyr)
+library(tibble)
+library(purrr)
+library(future)
+library(future.apply)
+
+source("geneLME_dev.R")
+
+
+
+
+

Mock Data

+

All examples use a simulated EList-like object with a +3-treatment × 4-visit paired design (10 patients, 120 +samples, 50 genes). Genes 1–10 have a simulated TrtC:V3 +upregulation of +2.5 log2 units to provide a detectable signal in Branch +A examples.

+
set.seed(42)
+
+n_patients <- 10
+treatments <- c("TrtA", "TrtB", "TrtC")
+visits     <- c("V1", "V2", "V3", "V4")
+n_genes    <- 50
+
+# Patient-level covariates
+patient_meta <- data.frame(
+  ptID = paste0("pt", sprintf("%02d", 1:n_patients)),
+  sex  = factor(sample(c("M", "F"), n_patients, replace = TRUE)),
+  age  = round(rnorm(n_patients, mean = 38, sd = 10)),
+  stringsAsFactors = FALSE
+)
+
+# Sample metadata (targets)
+targets <- expand.grid(
+  ptID      = paste0("pt", sprintf("%02d", 1:n_patients)),
+  treatment = treatments,
+  visit     = visits,
+  stringsAsFactors = FALSE
+) %>%
+  arrange(ptID, treatment, visit) %>%
+  left_join(patient_meta, by = "ptID") %>%
+  mutate(
+    sample_id           = paste(ptID, treatment, visit, sep = "_"),
+    rNANgUl             = rnorm(n(), mean = 5,    sd = 1),
+    percent_duplication = runif(n(), min = 0.05,  max = 0.55),
+    median_cv_coverage  = rnorm(n(), mean = 0.85, sd = 0.08),
+    lib.size            = round(rnorm(n(), mean = 20e6, sd = 3e6)),
+    norm.factors        = rnorm(n(), mean = 1,    sd = 0.05)
+  )
+rownames(targets) <- targets$sample_id
+
+# Expression matrix
+n_samples <- nrow(targets)
+E_mat <- matrix(rnorm(n_genes * n_samples, mean = 8, sd = 2),
+                nrow = n_genes, ncol = n_samples,
+                dimnames = list(paste0("gene", sprintf("%02d", 1:n_genes)),
+                                targets$sample_id))
+
+# Simulated TrtC:V3 effect on genes 1-10
+trtC_v3 <- which(targets$treatment == "TrtC" & targets$visit == "V3")
+E_mat[1:10, trtC_v3] <- E_mat[1:10, trtC_v3] + 2.5
+
+# Patient random effect
+for (pt in unique(targets$ptID)) {
+  idx <- which(targets$ptID == pt)
+  E_mat[, idx] <- E_mat[, idx] + rnorm(1, 0, 1)
+}
+
+# Voom-like precision weights
+W_mat <- matrix(abs(rnorm(n_genes * n_samples, mean = 1, sd = 0.1)),
+                nrow = n_genes, ncol = n_samples,
+                dimnames = dimnames(E_mat))
+
+# EList-like object
+dat <- list(E = E_mat, weights = W_mat, targets = targets)
+
+# Small subset for faster examples
+dat_sub <- list(E = dat$E[1:10, ], weights = dat$weights[1:10, ], targets = dat$targets)
+
+cat("Design:", n_patients, "patients ×", length(treatments), "treatments ×",
+    length(visits), "visits =", n_samples, "samples\n")
+
#> Design: 10 patients × 3 treatments × 4 visits = 120 samples
+
cat("Genes:", n_genes, "(full) /", nrow(dat_sub$E), "(subset used in examples)\n")
+
#> Genes: 50 (full) / 10 (subset used in examples)
+
+
+
+

Step 0: geneLME_contrast_spec() — Inspect Levels Before +Running

+

Before calling geneLME(), use +geneLME_contrast_spec() to enumerate available contrast +levels and understand exactly how to construct your contrast arguments. +This function has two modes depending on whether +contrast_vars contains ":".

+
+

Interaction mode

+

Pass a single "var_a:var_b" string to get all pairwise +combinations of interaction cells. This returns a +data.frame that you filter down to the specific contrasts +of interest and pass directly as contrast_spec to +geneLME().

+
spec_template <- geneLME_contrast_spec(
+  targets       = dat$targets,
+  contrast_vars = "treatment:visit"
+)
+
+# spec_template has two columns: contrast_ref and contrast_lvl.
+# contrast_index is NOT included here — it is added by geneLME() to its $contrast_spec
+# output element once it receives your filtered spec.
+cat("Total pairwise combinations:", nrow(spec_template), "\n\n")
+
#> Total pairwise combinations: 66
+
head(spec_template, 12)
+
+ +
+

Filter to the contrasts of interest — here, same-visit +cross-treatment comparisons at V2 and V3:

+
my_spec <- spec_template %>%
+  filter(
+    sub(".* ", "", contrast_ref) == sub(".* ", "", contrast_lvl),  # same visit
+    sub(".* ", "", contrast_ref) %in% c("V2", "V3")               # V2 and V3 only
+  )
+
+cat("Filtered to", nrow(my_spec), "contrasts:\n")
+
#> Filtered to 6 contrasts:
+
print(my_spec)
+
#>   contrast_ref contrast_lvl
+#> 1      TrtA V2      TrtB V2
+#> 2      TrtA V2      TrtC V2
+#> 3      TrtB V2      TrtC V2
+#> 4      TrtA V3      TrtB V3
+#> 5      TrtA V3      TrtC V3
+#> 6      TrtB V3      TrtC V3
+
+
+

Single/multi-variable mode

+

Pass a character vector of plain variable names (no :). +Returns a named list — one +data.frame(level) per variable — and prints a message for +each variable explaining its role in geneLME() +arguments.

+
ref <- geneLME_contrast_spec(
+  targets       = dat$targets,
+  contrast_vars = c("treatment", "visit")
+)
+

The printed messages explain:

+
    +
  • treatment (position 1, primary): its +levels define the length and position order of +contrasts_primary vectors in geneLME()
  • +
  • visit (position 2, secondary): its +levels can be subset via contrast_var_2_levels to restrict +which groups the primary contrasts are computed within
  • +
+
cat("treatment levels (contrasts_primary vector positions):\n")
+
#> treatment levels (contrasts_primary vector positions):
+
print(ref$treatment)
+
#>   level
+#> 1  TrtA
+#> 2  TrtB
+#> 3  TrtC
+
cat("\nvisit levels (available for contrast_var_2_levels filtering):\n")
+
#> 
+#> visit levels (available for contrast_var_2_levels filtering):
+
print(ref$visit)
+
#>   level
+#> 1    V1
+#> 2    V2
+#> 3    V3
+#> 4    V4
+
+
+
+
+

Branch A: Interaction Contrasts via contrast_spec

+

Use when: the model formula contains an interaction +term (treatment * visit or treatment:visit) +and you want specific, explicitly-defined pairwise contrasts across +interaction cells.

+

Under the hood, geneLME_fit() computes the full +emmeans object for the interaction, then builds one named +contrast vector per row of contrast_spec — setting the +reference cell to −1, the comparison cell to +1, and all others to 0. +This guarantees that only the contrasts you define are run.

+
+

Basic Branch A call

+
result_A <- geneLME(
+  dat           = dat_sub,
+  formula_str   = "~ treatment * visit + age + sex + rNANgUl + percent_duplication + median_cv_coverage + (1|ptID)",
+  model_weights = TRUE,
+  run_contrast  = TRUE,
+  contrast_vars = "treatment:visit",   # single ":" string → Branch A
+  contrast_spec = my_spec,             # the filtered data.frame from geneLME_contrast_spec()
+  n_cores       = 2
+)
+
+
+

Output structure

+
cat("Output elements:", paste(names(result_A), collapse = ", "), "\n\n")
+
#> Output elements: lme_anova, lme_contrast, lme_fit, lme_err, contrast_spec
+
cat("lme_fit (AIC per gene):\n")
+
#> lme_fit (AIC per gene):
+
print(result_A$lme_fit)
+
#>      gene      AIC
+#> 1  gene01 532.8037
+#> 2  gene02 532.7433
+#> 3  gene03 545.1509
+#> 4  gene04 511.5349
+#> 5  gene05 531.4956
+#> 6  gene06 524.0391
+#> 7  gene07 523.6012
+#> 8  gene08 504.2733
+#> 9  gene09 495.8422
+#> 10 gene10 534.7778
+
cat("\nModel status summary:\n")
+
#> 
+#> Model status summary:
+
print(table(result_A$lme_err))
+
#> 
+#> success 
+#>      10
+
# $contrast_spec is the indexed copy of my_spec attached by geneLME().
+# contrast_index here is 1:nrow(my_spec) — the row position within the filtered
+# spec. This is the stable index you should use when building contrasts_secondary vectors.
+cat("\n$contrast_spec (indexed by geneLME):\n")
+
#> 
+#> $contrast_spec (indexed by geneLME):
+
print(result_A$contrast_spec)
+
#>   contrast_index contrast_ref contrast_lvl
+#> 1              1      TrtA V2      TrtB V2
+#> 2              2      TrtA V2      TrtC V2
+#> 3              3      TrtB V2      TrtC V2
+#> 4              4      TrtA V3      TrtB V3
+#> 5              5      TrtA V3      TrtC V3
+#> 6              6      TrtB V3      TrtC V3
+
+
+

lme_anova — ANOVA table

+

One row per model term per gene. predictor_class +classifies each term, and Estimate_source indicates whether +a coefficient estimate is available directly +("lme_summary") or should be read from the contrast output +("seeContrasts"). p.value_adj is the +FDR-adjusted p-value computed across all genes within each model +term (Benjamini-Hochberg by default).

+
cat("ANOVA rows:", nrow(result_A$lme_anova), "\n")
+
#> ANOVA rows: 80
+
cat("  (", length(unique(result_A$lme_anova$term)), "terms ×",
+    length(unique(result_A$lme_anova$gene)), "genes)\n\n")
+
#>   ( 8 terms × 10 genes)
+
# Show one gene in full
+result_A$lme_anova %>%
+  filter(gene == "gene01") %>%
+  select(term, statistic, df, p.value, p.value_adj, predictor_class, Estimate_source, Estimate, Estimate_SE)
+
+ +
+
+
+

lme_contrast — contrast results

+

One row per defined contrast per gene. contrast_order is +always "first_order" in a basic Branch A call (no +second-order contrasts specified). p.value_adj is +FDR-adjusted across all genes within each contrast × +contrast_order combination.

+

contrast_ref and contrast_lvl are populated +for Branch A first-order contrasts — they record +exactly which interaction cell is the reference (−1) and which is the +comparison (+1), eliminating any ambiguity in how the estimate is +signed. For second-order contrasts and all Branch B contrasts these +columns are NA (no single ref/lvl pair applies).

+
cat("Contrast rows:", nrow(result_A$lme_contrast), "\n")
+
#> Contrast rows: 60
+
cat("  (", length(unique(result_A$lme_contrast$contrast)), "contrasts ×",
+    length(unique(result_A$lme_contrast$gene)), "genes)\n\n")
+
#>   ( 6 contrasts × 10 genes)
+
result_A$lme_contrast %>%
+  filter(gene == "gene01") %>%
+  select(contrast, contrast_ref, contrast_lvl, estimate, SE, df, t.ratio, p.value, p.value_adj, contrast_order)
+
+ +
+

Notice that gene01 shows large, significant effects at V3 (TrtC vs +others) but not at V2, consistent with the simulated +TrtC:V3 upregulation.

+
+
+
+
+

Branch A: Adding Second-Order Contrasts

+

Second-order contrasts are contrasts-of-contrasts: each operates on +the vector of first-order contrast estimates, allowing you to ask +whether a treatment difference changes across visits.

+

At validation time, geneLME() prints the numbered +first-order list in contrast_spec row order, so you can +verify the position indices before constructing your secondary vectors. +Secondary vectors must have length equal to +nrow(contrast_spec).

+
# At validation time, geneLME() prints the first-order contrast list in my_spec row
+# order. You can also inspect result_A$contrast_spec for the indexed row ordering.
+# With my_spec as defined above (same-visit, V2 and V3), alphabetical ordering gives:
+#  row 1: TrtB V2 - TrtA V2
+#  row 2: TrtC V2 - TrtA V2
+#  row 3: TrtC V2 - TrtB V2
+#  row 4: TrtB V3 - TrtA V3
+#  row 5: TrtC V3 - TrtA V3
+#  row 6: TrtC V3 - TrtB V3
+#
+# Second-order questions:
+#   "Did the TrtA vs TrtB difference change from V2 to V3?"  → row4 − row1
+#   "Did the TrtA vs TrtC difference change from V2 to V3?"  → row5 − row2
+#
+# Verify with: result_A$contrast_spec
+
+result_A2 <- geneLME(
+  dat                 = dat_sub,
+  formula_str         = "~ treatment * visit + age + sex + rNANgUl + percent_duplication + median_cv_coverage + (1|ptID)",
+  model_weights       = TRUE,
+  run_contrast        = TRUE,
+  contrast_vars       = "treatment:visit",
+  contrast_spec       = my_spec,
+  contrasts_secondary = list(
+    "TrtA vs TrtB: V3 minus V2 effect" = c(-1, 0, 0, 1, 0, 0),  # row4 − row1
+    "TrtA vs TrtC: V3 minus V2 effect" = c(0, -1, 0, 0, 1, 0)   # row5 − row2
+  ),
+  n_cores             = 2
+)
+
cat("Contrast order breakdown:\n")
+
#> Contrast order breakdown:
+
print(table(result_A2$lme_contrast$contrast_order))
+
#> 
+#>  first_order second_order 
+#>           60           20
+
cat("\nSecond-order results for gene01 and gene09 (strongest TrtC:V3 effect):\n")
+
#> 
+#> Second-order results for gene01 and gene09 (strongest TrtC:V3 effect):
+
# contrast_ref/contrast_lvl are NA for second-order rows (no single ref/lvl pair)
+result_A2$lme_contrast %>%
+  filter(contrast_order == "second_order", gene %in% c("gene01", "gene09")) %>%
+  select(gene, contrast, contrast_ref, contrast_lvl, estimate, SE, df, t.ratio, p.value)
+
+ +
+
+

Programmatic second-order contrast construction via +$contrast_spec

+

When there are many first-order contrasts, hand-coding the position +of each −1/+1 in every secondary vector is +error-prone. After a successful geneLME() call (or a +soft-fail call — see Input Validation), +the $contrast_spec element of the returned list contains an +indexed copy of contrast_spec with a +contrast_index column added by geneLME(). +contrast_index is simply 1:nrow(contrast_spec) +— the row positions in the filtered spec you passed in. Use these as +stable handles when building contrasts_secondary vectors +programmatically.

+

Workflow: run geneLME() once (or use a +dry-run via a wrong-length contrasts_secondary soft-fail), +inspect result$contrast_spec to confirm the row ordering +and indices, then build and pass the correct +contrasts_secondary list.

+

Example scenario: within each treatment pair, ask +whether the between-treatment difference changed from V2 to V3 — i.e., +(TrtX−TrtY at V3) − (TrtX−TrtY at V2).

+
# result_A$contrast_spec is the indexed version of my_spec as seen by geneLME().
+# contrast_index = 1:nrow(my_spec) (simple row positions after filtering).
+cat("Indexed contrast_spec from result_A:\n")
+
#> Indexed contrast_spec from result_A:
+
print(result_A$contrast_spec)
+
#>   contrast_index contrast_ref contrast_lvl
+#> 1              1      TrtA V2      TrtB V2
+#> 2              2      TrtA V2      TrtC V2
+#> 3              3      TrtB V2      TrtC V2
+#> 4              4      TrtA V3      TrtB V3
+#> 5              5      TrtA V3      TrtC V3
+#> 6              6      TrtB V3      TrtC V3
+
# Annotate with the visit for each contrast (extracted from the interaction label)
+indexed_spec <- result_A$contrast_spec %>%
+  mutate(visit = sub(".* ", "", contrast_ref))   # extract visit from the interaction label
+
+print(indexed_spec)
+
#>   contrast_index contrast_ref contrast_lvl visit
+#> 1              1      TrtA V2      TrtB V2    V2
+#> 2              2      TrtA V2      TrtC V2    V2
+#> 3              3      TrtB V2      TrtC V2    V2
+#> 4              4      TrtA V3      TrtB V3    V3
+#> 5              5      TrtA V3      TrtC V3    V3
+#> 6              6      TrtB V3      TrtC V3    V3
+
# Build a two-column lookup: one contrast_index per sign, one row per second-order contrast.
+# group_by the treatment pair (shared across visits); within each group there are exactly
+# two rows — one for V2 and one for V3 — giving the index_neg (V2, subtracted) and
+# index_pos (V3, added).
+secondary_lookup <- indexed_spec %>%
+  # Create a treatment-pair label that is the same regardless of visit
+  mutate(trt_pair = paste(sub(" .*", "", contrast_ref),  # var_a from ref
+                          sub(" .*", "", contrast_lvl),  # var_a from lvl
+                          sep = " vs ")) %>%
+  arrange(trt_pair, visit) %>%          # V2 before V3 (alphabetical)
+  group_by(trt_pair) %>%
+  summarise(
+    secondary_contrast_name = paste0("(", last(contrast_lvl), " - ", last(contrast_ref), ")",
+                                     " - ",
+                                     "(", first(contrast_lvl), " - ", first(contrast_ref), ")"),
+    index_neg = first(contrast_index),  # V2 row → subtracted (−1)
+    index_pos = last(contrast_index),   # V3 row → added    (+1)
+    .groups = "drop"
+  )
+
+print(secondary_lookup)
+
#> # A tibble: 3 × 4
+#>   trt_pair     secondary_contrast_name                   index_neg index_pos
+#>   <chr>        <chr>                                         <int>     <int>
+#> 1 TrtA vs TrtB (TrtB V3 - TrtA V3) - (TrtB V2 - TrtA V2)         1         4
+#> 2 TrtA vs TrtC (TrtC V3 - TrtA V3) - (TrtC V2 - TrtA V2)         2         5
+#> 3 TrtB vs TrtC (TrtC V3 - TrtB V3) - (TrtC V2 - TrtB V2)         3         6
+
# Format as a named list of zero vectors with −1/+1 placed by contrast_index.
+# The vector length must equal nrow(my_spec) — the number of first-order contrasts.
+n_first_order <- nrow(my_spec)
+
+contrasts_secondary_prog <- setNames(
+  lapply(seq_len(nrow(secondary_lookup)), function(i) {
+    v <- rep(0, n_first_order)
+    # contrast_index IS the row position (1:nrow), so it can be used directly.
+    # Using which() for robustness in case the data.frame is reordered.
+    v[which(indexed_spec$contrast_index == secondary_lookup$index_neg[i])] <- -1
+    v[which(indexed_spec$contrast_index == secondary_lookup$index_pos[i])] <-  1
+    v
+  }),
+  secondary_lookup$secondary_contrast_name
+)
+
+# Inspect: each vector should have exactly one −1 and one +1
+print(contrasts_secondary_prog)
+
#> $`(TrtB V3 - TrtA V3) - (TrtB V2 - TrtA V2)`
+#> [1] -1  0  0  1  0  0
+#> 
+#> $`(TrtC V3 - TrtA V3) - (TrtC V2 - TrtA V2)`
+#> [1]  0 -1  0  0  1  0
+#> 
+#> $`(TrtC V3 - TrtB V3) - (TrtC V2 - TrtB V2)`
+#> [1]  0  0 -1  0  0  1
+
result_A_prog <- geneLME(
+  dat                 = dat_sub,
+  formula_str         = "~ treatment * visit + age + sex + rNANgUl + percent_duplication + median_cv_coverage + (1|ptID)",
+  model_weights       = TRUE,
+  run_contrast        = TRUE,
+  contrast_vars       = "treatment:visit",
+  contrast_spec       = my_spec,
+  contrasts_secondary = contrasts_secondary_prog,
+  n_cores             = 2
+)
+
+cat("Second-order results for gene01:\n")
+
#> Second-order results for gene01:
+
result_A_prog$lme_contrast %>%
+  filter(contrast_order == "second_order", gene == "gene01") %>%
+  select(gene, contrast, contrast_ref, contrast_lvl, estimate, SE, df, t.ratio, p.value)
+
+ +
+
+

Key design note: geneLME() appends +contrast_index (= 1:nrow(contrast_spec)) to +its $contrast_spec output element — not to the +contrast_spec data frame you pass in. This means +contrast_index always reflects the row positions in +your filtered spec, never in the full unfiltered template. The +which(indexed_spec$contrast_index == ...) lookup maps those +indices back to row positions for vector construction, and is robust to +any reordering of the spec data frame.

+
+
+
+
+
+

Branch B: Non-Interaction Contrasts

+

Use when: the model is additive (no interaction +term), and you want named contrasts on the marginal means of one +variable. Optionally, compute those contrasts within specific levels of +a second grouping variable (contrast_var_2_levels), and +optionally follow up with second-order contrasts across the primary +contrast estimates.

+

Under the hood, geneLME_fit() calls +emmeans() with a spec formula derived from +contrast_vars, using contrast_var_2_levels as +the at = filter. contrasts_primary vectors are +passed directly to emmeans::contrast().

+
+

Inspecting levels before building contrast vectors

+

Before specifying contrasts_primary, confirm the level +ordering emmeans will use — it is alphabetical by default:

+
ref_B <- geneLME_contrast_spec(dat$targets, contrast_vars = c("treatment", "visit"))
+

treatment levels are TrtA[1], +TrtB[2], TrtC[3] — so +contrasts_primary vectors must have length 3 with positions +corresponding to those labels:

+
"TrtC vs TrtA" = c(-1, 0, 1)   # TrtA=−1, TrtB=0, TrtC=+1
+"TrtB vs TrtA" = c(-1, 1, 0)   # TrtA=−1, TrtB=+1, TrtC=0
+
+
+

Basic Branch B call

+
result_B <- geneLME(
+  dat                   = dat_sub,
+  formula_str           = "~ treatment + visit + age + sex + rNANgUl + percent_duplication + median_cv_coverage + (1|ptID)",
+  model_weights         = TRUE,
+  run_contrast          = TRUE,
+  contrast_vars         = c("treatment", "visit"),  # [1] primary; [2] 'by' grouping variable
+  contrast_var_2_levels = c("V2", "V3"),            # restrict to these visit levels
+  contrasts_primary     = list(
+    "TrtC vs TrtA" = c(-1, 0, 1),
+    "TrtB vs TrtA" = c(-1, 1, 0)
+  ),
+  n_cores               = 2
+)
+
cat("Contrast rows:", nrow(result_B$lme_contrast), "\n")
+
#> Contrast rows: 40
+
cat("  (", length(unique(result_B$lme_contrast$contrast)), "contrasts ×",
+    length(contrast_var_2_levels <- c("V2","V3")), "visits ×",
+    length(unique(result_B$lme_contrast$gene)), "genes)\n\n")
+
#>   ( 2 contrasts × 2 visits × 10 genes)
+
# contrast_ref and contrast_lvl are NA for Branch B (named contrast vectors,
+# not a ref/lvl spec frame)
+result_B$lme_contrast %>%
+  filter(gene == "gene01") %>%
+  select(contrast, contrast_ref, contrast_lvl, visit, estimate, SE, df, t.ratio, p.value, contrast_order)
+
+ +
+
+

Note: In an additive model, treatment estimates are +identical across V2 and V3 — there is no interaction, so the treatment +effect does not depend on visit. contrast_var_2_levels +controls which visit cells are included in the output, which is +useful for limiting the number of tests and focusing results on visits +of biological interest.

+
+
+
+

Branch B with second-order contrasts

+

A second-order contrast operates on the vector of first-order +contrast estimates. Here, contrasts_secondary must have +length equal to the number of primary contrasts (2 in this example). The +result asks: “Is the TrtC vs TrtA difference significantly +larger than the TrtB vs TrtA difference?” — i.e., is there +a differential treatment response between TrtB and TrtC relative to +TrtA?

+
result_B2 <- geneLME(
+  dat                   = dat_sub,
+  formula_str           = "~ treatment + visit + age + sex + rNANgUl + percent_duplication + median_cv_coverage + (1|ptID)",
+  model_weights         = TRUE,
+  run_contrast          = TRUE,
+  contrast_vars         = c("treatment", "visit"),
+  contrast_var_2_levels = c("V2", "V3"),
+  contrasts_primary     = list(
+    "TrtC vs TrtA" = c(-1, 0, 1),
+    "TrtB vs TrtA" = c(-1, 1, 0)
+  ),
+  contrasts_secondary   = list(
+    "TrtC vs TrtB (2nd order)" = c(1, -1)  # length = n primary contrasts (2)
+  ),
+  n_cores               = 2
+)
+
cat("Contrast order breakdown:\n")
+
#> Contrast order breakdown:
+
print(table(result_B2$lme_contrast$contrast_order))
+
#> 
+#>  first_order second_order 
+#>           40           20
+
result_B2$lme_contrast %>%
+  filter(gene == "gene01") %>%
+  select(contrast, contrast_ref, contrast_lvl, visit, estimate, SE, df, t.ratio, p.value, contrast_order)
+
+ +
+
+
+
+
+

FDR Adjustment

+

geneLME() automatically appends a +p.value_adj column to both lme_anova and +lme_contrast using p.adjust(). Adjustment is +performed across all genes within each grouping unit — +not globally across all rows — so that each test family is treated +independently:

+ ++++ + + + + + + + + + + + + + + + + +
Output tableGrouping unit for adjustment
lme_anovaEach model term (e.g. treatment, +treatment:visit)
lme_contrastEach contrast label × contrast_order
+

The default method is Benjamini-Hochberg ("BH"). Any +method accepted by p.adjust() can be specified via the +fdr_method argument:

+
# Available methods
+p.adjust.methods
+
#> [1] "holm"       "hochberg"   "hommel"     "bonferroni" "BH"        
+#> [6] "BY"         "fdr"        "none"
+
# Example: Bonferroni correction instead of BH
+result_bonf <- geneLME(
+  dat           = dat_sub,
+  formula_str   = "~ treatment * visit + age + sex + rNANgUl + percent_duplication + median_cv_coverage + (1|ptID)",
+  model_weights = TRUE,
+  run_contrast  = TRUE,
+  contrast_vars = "treatment:visit",
+  contrast_spec = my_spec,
+  fdr_method    = "bonferroni",
+  n_cores       = 2
+)
+
+# Compare BH vs Bonferroni for one contrast across genes
+comparison <- result_A$lme_contrast %>%
+  filter(contrast == first(contrast), contrast_order == "first_order") %>%
+  select(gene, p.value, p.value_adj_BH = p.value_adj) %>%
+  left_join(
+    result_bonf$lme_contrast %>%
+      filter(contrast == first(contrast), contrast_order == "first_order") %>%
+      select(gene, p.value_adj_bonf = p.value_adj),
+    by = "gene"
+  )
+
+print(comparison)
+
#> # A tibble: 10 × 4
+#>    gene   p.value p.value_adj_BH p.value_adj_bonf
+#>    <chr>    <dbl>          <dbl>            <dbl>
+#>  1 gene01   0.967          0.998                1
+#>  2 gene02   0.972          0.998                1
+#>  3 gene03   0.998          0.998                1
+#>  4 gene04   0.831          0.998                1
+#>  5 gene05   0.101          0.719                1
+#>  6 gene06   0.555          0.998                1
+#>  7 gene07   0.194          0.719                1
+#>  8 gene08   0.836          0.998                1
+#>  9 gene09   0.216          0.719                1
+#> 10 gene10   0.715          0.998                1
+
+

Note on NA values: genes whose models failed receive +NA for p.value, which propagates to +p.value_adj. These genes are excluded from the adjustment +set, so the effective number of tests equals the number of successfully +fitted genes.

+
+
+
+
+

Input Validation

+

geneLME() validates all inputs before launching any +parallel work. Informative errors are raised for the most common +misspecifications:

+
run_validation_test <- function(label, expr) {
+  tryCatch(
+    { expr; cat(label, "— (no error raised)\n") },
+    error   = function(e) cat(label, "ERROR:\n ", conditionMessage(e), "\n\n"),
+    warning = function(w) cat(label, "WARNING:\n ", conditionMessage(w), "\n\n")
+  )
+}
+
+# 1. Formula variable missing from targets
+run_validation_test("Test 1 — missing formula variable:",
+  geneLME(dat_sub, formula_str = "~ NONEXISTENT_VAR + (1|ptID)", n_cores = 2)
+)
+
#> Test 1 — missing formula variable: ERROR:
+#>   The following variables in the formula are missing from dat$targets: NONEXISTENT_VAR
+
# 2. Weights requested but not present
+dat_no_w <- dat_sub; dat_no_w$weights <- NULL
+run_validation_test("Test 2 — model_weights=TRUE but no dat$weights:",
+  geneLME(dat_no_w, formula_str = "~ treatment + (1|ptID)", model_weights = TRUE, n_cores = 2)
+)
+
#> Test 2 — model_weights=TRUE but no dat$weights: ERROR:
+#>   model_weights = TRUE but dat$weights is NULL.
+
# 3. Interaction contrast requested but interaction not in formula
+run_validation_test("Test 3 — interaction contrast but additive formula:",
+  geneLME(dat_sub,
+          formula_str   = "~ treatment + visit + age + (1|ptID)",
+          run_contrast  = TRUE,
+          contrast_vars = "treatment:visit",
+          contrast_spec = my_spec,
+          n_cores       = 2)
+)
+
#> Test 3 — interaction contrast but additive formula: ERROR:
+#>   contrast_vars specifies an interaction contrast for 'treatment:visit', but this interaction term is not present in formula_str.
+#> Either add the interaction to the formula (e.g. '~ treatment * visit + ...') or change contrast_vars to a non-interaction term.
+
# 4. Interaction contrast but contrast_spec not provided
+run_validation_test("Test 4 — interaction contrast, contrast_spec = NULL:",
+  geneLME(dat_sub,
+          formula_str   = "~ treatment * visit + age + (1|ptID)",
+          run_contrast  = TRUE,
+          contrast_vars = "treatment:visit",
+          n_cores       = 2)
+)
+
#> Test 4 — interaction contrast, contrast_spec = NULL: ERROR:
+#>   contrast_vars specifies an interaction contrast ('treatment:visit') but contrast_spec is NULL.
+#> Use geneLME_contrast_spec(dat$targets, contrast_vars = 'treatment:visit') to generate a template, filter it to your contrasts of interest, then pass it as contrast_spec.
+
# 5. contrast_spec has wrong column names
+run_validation_test("Test 5 — contrast_spec with wrong columns:",
+  geneLME(dat_sub,
+          formula_str   = "~ treatment * visit + age + (1|ptID)",
+          run_contrast  = TRUE,
+          contrast_vars = "treatment:visit",
+          contrast_spec = data.frame(a = "x", b = "y"),
+          n_cores       = 2)
+)
+
#> Test 5 — contrast_spec with wrong columns: ERROR:
+#>   contrast_spec must have columns 'contrast_ref' and 'contrast_lvl'.
+
# 6. contrast_var_2_levels contains an invalid level
+run_validation_test("Test 6 — invalid contrast_var_2_levels:",
+  geneLME(dat_sub,
+          formula_str           = "~ treatment + visit + age + (1|ptID)",
+          run_contrast          = TRUE,
+          contrast_vars         = c("treatment", "visit"),
+          contrast_var_2_levels = c("V2", "NOTAVISIT"),
+          contrasts_primary     = list("TrtC vs TrtA" = c(-1, 0, 1)),
+          n_cores               = 2)
+)
+
#> Test 6 — invalid contrast_var_2_levels: ERROR:
+#>   Levels specified for 'visit' not found in data: NOTAVISIT
+

When contrasts_secondary vectors have the wrong length, +geneLME() soft-fails: it returns early +without running any models, but still populates +$contrast_spec with the indexed row ordering. This lets you +inspect the indexing and fix your vectors without a hard error.

+
# 7. contrasts_secondary with wrong length → soft-fail with $contrast_spec populated
+bad_result <- geneLME(
+  dat                 = dat_sub,
+  formula_str         = "~ treatment * visit + age + sex + rNANgUl + percent_duplication + median_cv_coverage + (1|ptID)",
+  model_weights       = TRUE,
+  run_contrast        = TRUE,
+  contrast_vars       = "treatment:visit",
+  contrast_spec       = my_spec,
+  contrasts_secondary = list(
+    "wrong length vector" = rep(0, 99)  # should be length 6 = nrow(my_spec)
+  ),
+  n_cores = 2
+)
+
+cat("Elements of soft-fail result:\n")
+
#> Elements of soft-fail result:
+
print(sapply(bad_result, is.null))   # lme_anova/contrast/fit/err are NULL; contrast_spec is not
+
#>     lme_anova  lme_contrast       lme_fit       lme_err contrast_spec 
+#>          TRUE          TRUE          TRUE          TRUE         FALSE
+
cat("\n$contrast_spec (use these indices to fix contrasts_secondary):\n")
+
#> 
+#> $contrast_spec (use these indices to fix contrasts_secondary):
+
print(bad_result$contrast_spec)
+
#>   contrast_index contrast_ref contrast_lvl
+#> 1              1      TrtA V2      TrtB V2
+#> 2              2      TrtA V2      TrtC V2
+#> 3              3      TrtB V2      TrtC V2
+#> 4              4      TrtA V3      TrtB V3
+#> 5              5      TrtA V3      TrtC V3
+#> 6              6      TrtB V3      TrtC V3
+
+
+
+

Error Handling and Failed Genes

+

Per-gene errors are captured by tryCatch inside +geneLME_fit() and do not abort the run. Failed genes +produce NA-filled rows in lme_anova and +lme_contrast, and the error message is stored in +lme_err.

+
# Check model status summary
+cat("Model status — Branch A:\n")
+
#> Model status — Branch A:
+
print(table(result_A$lme_err))
+
#> 
+#> success 
+#>      10
+
# With small mock data (10 patients), singular fits are expected due to
+# patient-level covariates (sex, age) being collinear with the patient random effect.
+# With real data at typical sample sizes this is less common.
+
+# Identify any failed genes
+failed <- names(result_A$lme_err)[result_A$lme_err != "success"]
+if (length(failed) > 0) {
+  cat("\nFailed genes:\n")
+  print(result_A$lme_err[failed])
+} else {
+  cat("\nAll genes fitted successfully.\n")
+}
+
#> 
+#> All genes fitted successfully.
+
+
+
+

Quick-Reference: Argument Guide

+ ++++++ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
ArgumentTypeRequiredDescription
datlistYesEList-like: $E, $weights (optional), +$targets
formula_strcharacterYesRHS formula string, +e.g. "~ treatment * visit + (1\|ptID)"
model_weightslogicalNoTRUE to use dat$weights as precision +weights
run_contrastlogicalNoTRUE to run emmeans contrasts
contrast_varscharacterIf run_contrast=TRUE"var_a:var_b" (Branch A) or +c("var1", "var2") (Branch B)
contrast_var_2_levelscharacter vectorNoBranch B: restrict ‘by’ variable to these levels
contrast_specdata.frameRequired for Branch AColumns contrast_ref and contrast_lvl; +from geneLME_contrast_spec(). geneLME() +attaches an indexed copy (with contrast_index) to +$contrast_spec in its return value.
contrasts_primarynamed listBranch BNamed contrast vectors; length = levels of +contrast_vars[1]
contrasts_secondarynamed listNoSecond-order contrasts; length = nrow(contrast_spec) +(Branch A) or length(contrasts_primary) (Branch B). Wrong +lengths trigger a soft-fail with $contrast_spec populated +for debugging.
fdr_methodcharacterNoP-value adjustment method passed to p.adjust(); default +"BH" (Benjamini-Hochberg)
n_coresintegerNoNumber of parallel workers; defaults to +detectCores() − 4
+
+
+
+

Session Info

+
sessionInfo()
+
#> R version 4.5.1 (2025-06-13)
+#> Platform: aarch64-apple-darwin20
+#> Running under: macOS Sequoia 15.7.3
+#> 
+#> Matrix products: default
+#> BLAS:   /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib 
+#> LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
+#> 
+#> locale:
+#> [1] en_US/en_US/en_US/C/en_US/en_US
+#> 
+#> time zone: America/Los_Angeles
+#> tzcode source: internal
+#> 
+#> attached base packages:
+#> [1] stats     graphics  grDevices utils     datasets  methods   base     
+#> 
+#> other attached packages:
+#>  [1] future.apply_1.20.1 future_1.69.0       purrr_1.2.1        
+#>  [4] tibble_3.3.1        dplyr_1.2.0         broom.mixed_0.2.9.6
+#>  [7] car_3.1-3           carData_3.0-5       emmeans_2.0.1      
+#> [10] lme4_1.1-38         Matrix_1.7-4       
+#> 
+#> loaded via a namespace (and not attached):
+#>  [1] utf8_1.2.6         tidyr_1.3.2        sass_0.4.10        generics_0.1.4    
+#>  [5] lattice_0.22-7     listenv_0.10.0     digest_0.6.39      magrittr_2.0.4    
+#>  [9] evaluate_1.0.5     grid_4.5.1         estimability_1.5.1 mvtnorm_1.3-3     
+#> [13] fastmap_1.2.0      jsonlite_2.0.0     backports_1.5.0    Formula_1.2-5     
+#> [17] codetools_0.2-20   jquerylib_0.1.4    abind_1.4-8        reformulas_0.4.3.1
+#> [21] Rdpack_2.6.4       cli_3.6.5          rlang_1.1.7        rbibutils_2.4.1   
+#> [25] parallelly_1.46.1  splines_4.5.1      withr_3.0.2        cachem_1.1.0      
+#> [29] yaml_2.3.12        otel_0.2.0         parallel_4.5.1     tools_4.5.1       
+#> [33] nloptr_2.2.1       coda_0.19-4.1      minqa_1.2.8        forcats_1.0.1     
+#> [37] globals_0.18.0     boot_1.3-32        broom_1.0.11       vctrs_0.7.1       
+#> [41] R6_2.6.1           lifecycle_1.0.5    MASS_7.3-65        furrr_0.3.1       
+#> [45] pkgconfig_2.0.3    bslib_0.10.0       pillar_1.11.1      glue_1.8.0        
+#> [49] Rcpp_1.1.1         xfun_0.56          tidyselect_1.2.1   knitr_1.51        
+#> [53] xtable_1.8-4       htmltools_0.5.9    nlme_3.1-168       rmarkdown_2.30    
+#> [57] compiler_4.5.1
+
+ + + +
+
+ +
+ + + + + + + + + + + + + + + + +