diff --git a/Claud_next_steps.rtf b/Claud_next_steps.rtf new file mode 100644 index 0000000..beb6944 --- /dev/null +++ b/Claud_next_steps.rtf @@ -0,0 +1,38 @@ +{\rtf1\ansi\ansicpg1252\cocoartf2822 +\cocoatextscaling0\cocoaplatform0{\fonttbl\f0\fswiss\fcharset0 Helvetica;\f1\fmodern\fcharset0 Courier;} +{\colortbl;\red255\green255\blue255;\red38\green38\blue38;\red255\green255\blue255;} +{\*\expandedcolortbl;;\cssrgb\c20000\c20000\c20000;\cssrgb\c100000\c100000\c100000;} +\margl1440\margr1440\vieww13240\viewh11160\viewkind0 +\pard\tx720\tx1440\tx2160\tx2880\tx3600\tx4320\tx5040\tx5760\tx6480\tx7200\tx7920\tx8640\pardirnatural\partightenfactor0 + +\f0\fs24 \cf0 # Notes for next session\ +Check your CLAUDE_NOTES and familiarize yourself with geneLME_dev.R and geneLME_dev2.R\ +\ +\ +Pick up with: verifying and validating the dev2 version. I have tested this and still encountered errors stemming from functions not explicitly called from required packages (eg. Line 849: plan(multisession, workers=workers). I corrected this where I could find it, but please review and make sure my specifications are valid and correct all such instances. If needed, should we add explicit package calling to anyplace that non base functions are called?\ +\ +I\'92d also like you to please edit the functions to silence internal warnings such as \'93 +\f1\fs26 \cf2 \cb3 \expnd0\expndtw0\kerning0 +\outl0\strokewidth0 \strokec2 ## Warning: package \'91lme4\'92 was built under R version 4.5.2\'94 +\f0\fs24 \cf0 \cb1 \kerning1\expnd0\expndtw0 \outl0\strokewidth0 or other messages about package versioning. \ +\ +I\'92ve also encountered the warnings \ +\'93 +\f1\fs26 \cf2 \cb3 \expnd0\expndtw0\kerning0 +\outl0\strokewidth0 \strokec2 Warning: Some predictor variables are on very different scales: consider rescaling. \ +\pard\pardeftab720\partightenfactor0 +\cf2 \cb3 \strokec2 ## You may also use (g)lmerControl(autoscale = TRUE) to improve numerical stability.\ +\pard\tx720\tx1440\tx2160\tx2880\tx3600\tx4320\tx5040\tx5760\tx6480\tx7200\tx7920\tx8640\pardirnatural\partightenfactor0 + +\f0\fs24 \cf0 \cb1 \kerning1\expnd0\expndtw0 \outl0\strokewidth0 \'93\ +\ +This may be good rational for setting \'93 +\f1\fs26 \cf2 \cb3 \expnd0\expndtw0\kerning0 +lmerControl(autoscale = TRUE)\'94 +\f0\fs24 \cf0 \cb1 \kerning1\expnd0\expndtw0 internally, though I had thought this was the default behavior in lme4 now. If indeed it is, then we should certainly silence that. \ +\ +Once the dev2 version has been more thoroughly debugged, it can be merged into the main geneLME.R script.\ +\ +\ +\ +} \ No newline at end of file diff --git a/R_functions/CLAUDE_NOTES_geneLME.md b/R_functions/CLAUDE_NOTES_geneLME.md new file mode 100644 index 0000000..fa73f11 --- /dev/null +++ b/R_functions/CLAUDE_NOTES_geneLME.md @@ -0,0 +1,569 @@ +# geneLME Project — Claude Session Notes + +> **How to use this file:** At the start of each session, ask Claude to read this file. +> Claude should update the "Current Status" and "Session Log" sections at the end of any +> session where meaningful progress is made. + +--- + +## Project Overview + +**Goal:** A scalable, parallelized per-gene linear mixed effects (LME) modelling function for +RNA-seq expression data stored in a `limma` EList (`dat`) object. Key features: +- Fits one `lmer` model per gene in parallel via `future_lapply` +- Extracts ANOVA table, coefficient estimates, and emmeans-based contrasts +- Supports weighted models (voom weights from `dat$weights`) +- Supports flexible contrast specification including interaction terms + +**Active worktree:** `.claude/worktrees/reverent-satoshi/` +**Key files:** +- `R_functions/geneLME.R` — current stable (merged from dev2 2026-02-20; fully featured) +- `R_functions/geneLME_dev.R` — former dev version (superseded; kept for reference) +- `R_functions/geneLME_dev2.R` — former dev2 (now merged into geneLME.R; kept for reference) +- `R_functions/geneLME_test.R` — mock data and full test suite (SOURCE_FILE variable selects version) +- `R_functions/geneLME_benchmark.Rmd` — v1 benchmark (geneLME vs kimma) +- `R_functions/geneLME_benchmark2.Rmd` — v2 benchmark (sign consistency + speed vs kimma; dev2 correctness section removed post-merge) +- `R_functions/CLAUDE_NOTES_geneLME.md` — this file + +--- + +## Architecture Summary + +The code is split into five functions: + +| Function | Role | +|---|---| +| `geneLME_contrast_spec()` | Helper: generates a contrast level reference template. Two modes (see below). | +| `geneLME_build_contrast_args()` | Branch A helper: pre-computes `contrasts_list` + `spec_lookup` once before parallel dispatch. | +| `geneLME_fit()` | Per-gene worker: fits lmer, extracts ANOVA + contrasts. Runs inside `future_lapply`. | +| `geneLME_dispatch()` | Parallel launcher: calls `future_lapply` with explicit `future.globals` to avoid environment scanning overhead. | +| `geneLME_compiler()` | Aggregator: binds per-gene results into tidy data frames. | +| `geneLME()` | User-facing wrapper: validates inputs, pre-slices data, sets parallel plan, calls dispatch + compiler. | + +### Key Design Decisions Made + +1. **Per-gene pre-slicing** — `dat$E` and `dat$weights` are sliced into a `gene_data_list` of small bundles + *before* entering the parallel stage. This avoids serializing the full EList to every worker. +2. **Formula passed as string** — `formula_str` (a plain character string) is passed to workers, which + reconstruct the formula locally. Avoids locked-environment errors from passing formula objects with + environment references across futures. +3. **Explicit `future.globals`** — `geneLME_dispatch()` explicitly lists all globals rather than relying + on future's automatic environment scan, preventing accidental capture of large objects. +4. **`on.exit(plan(sequential))`** — restores the user's parallel plan after the run completes. +5. **Singular fit flagged, not errored** — `isSingular()` is checked and sets `model_status = "singular_fit"`. + Results are returned for all genes; users filter downstream on `model_status`. Fixed effect estimates + and contrasts are numerically valid even for singular fits — only the random effect structure is unreliable. + +--- + +## Contrast System (`geneLME_dev.R`) + +Two contrast branches in `geneLME_fit()`: + +### Branch A: Interaction contrasts (`contrast_vars` contains `":"`) +- `contrast_vars` is a single string e.g. `"treatment:visit"`. +- **`contrast_spec`** (required): a data.frame with columns `contrast_ref` and `contrast_lvl`. + Each row defines one pairwise contrast across interaction cells. Contrast vectors are built from the + emmeans level ordering to guarantee index alignment. +- Use `geneLME_contrast_spec(dat$targets, contrast_vars = "treatment:visit")` to generate the full + pairwise template, then filter to the rows of interest before passing as `contrast_spec`. +- The formula must contain the corresponding interaction term (`treatment:visit` or `treatment*visit`); + `geneLME()` validation enforces this — emmeans would otherwise silently use additive margins. + +### Branch B: Non-interaction contrasts +- `contrast_vars` is a character vector of 1–2 variable names. +- `contrasts_primary`: named list of contrast vectors. Vector length must match the number of levels + of `contrast_vars[1]`, in alphabetical order (or the order emmeans sees them). +- `contrasts_secondary`: optional second-order contrast (contrast-of-contrasts). +- `contrast_var_2_levels`: optional filter on levels of the second `by` variable. + +### Helper: `geneLME_contrast_spec(targets, contrast_vars)` +Two modes determined by whether `contrast_vars` contains `":"`: + +- **Interaction mode** (`"treatment:visit"`): returns a data.frame with columns `contrast_ref` and + `contrast_lvl` **only** (no `contrast_index`). One row per pairwise combination of interaction-level + strings. User filters and passes as `contrast_spec` to `geneLME()`. + - `contrast_index` is NOT part of the `geneLME_contrast_spec()` output. Instead, `geneLME()` attaches + an indexed copy of the user's filtered spec to its return value as `$contrast_spec`, with + `contrast_index = 1:nrow(contrast_spec)` (simple row positions within the filtered spec). + This is the index users should reference when building `contrasts_secondary` vectors. + - Ordering guarantee: both `var_a` and `var_b` components of `contrast_ref` are always at a + level-index ≤ the corresponding component in `contrast_lvl` (factor level order for factors, + alphabetical for plain characters). This is enforced by explicit factor coercion before + `interaction(...) %>% levels()`. + - Swap tolerance: users may manually swap ref/lvl in any row after filtering. `geneLME_fit()` + handles either direction — the contrast estimate sign flips, but no error occurs. +- **Single-variable mode** (`"treatment"`): returns a data.frame with a single column `level`, listing + unique sorted values of that variable. This is a reference only — used when building `contrasts_primary` + vectors for Branch B. Not passed to `geneLME()` directly. + +--- + +## Function Signatures (current `geneLME_dev.R`) + +```r +geneLME_contrast_spec(targets, contrast_vars) +# Returns: data.frame(contrast_ref, contrast_lvl) [interaction mode] +# or named list of data.frame(level) [single/multi-variable mode] + +geneLME_fit(gene_name, expression_vec, weight_vec, targets, + formula_str, run_contrast, contrast_vars, + contrast_var_2_levels, contrast_spec, + contrasts_primary, contrasts_secondary) + +geneLME_dispatch(gene_data_list, targets_df, formula_str, + run_contrast, contrast_vars, contrast_var_2_levels, + contrast_spec, contrasts_primary, contrasts_secondary) + +geneLME_compiler(fit, fdr_method = "BH", contrast_spec = NULL) +# Returns: list(lme_anova, lme_contrast, lme_fit, lme_err, contrast_spec) + +geneLME(dat, formula_str, + model_weights = NULL, + run_contrast = NULL, + contrast_vars = NULL, + contrast_var_2_levels = NULL, + contrast_spec = NULL, # required when contrast_vars contains ":" + contrasts_primary = NULL, + contrasts_secondary = NULL, + fdr_method = "BH", # any method accepted by p.adjust() + n_cores = NULL) +# Returns: list(lme_anova, lme_contrast, lme_fit, lme_err, contrast_spec) +# $lme_contrast columns include contrast_ref and contrast_lvl: +# - Branch A first-order: populated from contrast_spec (unambiguous sign reference) +# - Branch A second-order, Branch B (all): NA (no single ref/lvl pair) +# $contrast_spec: indexed copy of the filtered contrast_spec (Branch A) or +# data.frame(contrast_index, contrast_name) from contrasts_primary (Branch B); +# NULL when no contrasts run. +# On soft-fail (wrong-length contrasts_secondary): all elements NULL except $contrast_spec. +``` + +Note: `contrast_by` parameter was removed in session 2026-02-19 (Branch A2 eliminated). + +--- + +## Validation Checks in `geneLME()` + +1. All formula variables present in `dat$targets` +2. `dat$weights` present and dimension-matched if `model_weights = TRUE` +3. `contrast_vars` not NULL when `run_contrast = TRUE` +4. **Formula-vs-`contrast_vars` consistency:** + - Interaction contrasts: interaction term must exist in `formula_str` (uses `attr(terms(), "term.labels")`) + - Non-interaction contrasts: each `contrast_vars` element warned if absent as main effect +5. `contrast_spec` required (not NULL) when `contrast_vars` contains `":"` +6. `contrast_spec` column names validated (`contrast_ref`, `contrast_lvl`) +7. All `contrast_vars` components present in `dat$targets` +8. `contrast_var_2_levels` values valid against actual data levels +9. `fdr_method` validated against `p.adjust.methods` +10. `contrasts_secondary` vector lengths validated (**soft-fail**, not hard stop): + - Branch A: each vector must have length == `nrow(contrast_spec)` (the filtered spec, NOT the full template) + - Branch B: each vector must have length == `length(contrasts_primary)` + - On mismatch: emits a `message()` naming offending vectors (observed vs expected length), + then returns `invisible(list(lme_anova=NULL, lme_contrast=NULL, lme_fit=NULL, lme_err=NULL, contrast_spec=indexed_contrast_spec))` + - `$contrast_spec` in the returned list contains the indexed spec so the user can inspect + row ordering and fix their vectors without needing to re-run everything +11. Warn on likely ID columns used as predictors + +--- + +## Current Status + +**Date of last update:** 2026-02-20 (Session 15) + +**Stage:** `geneLME_dev2.R` merged into `geneLME.R` (now the single active stable file). +All tests pass. Warning suppression added. Documentation updated. + +**Benchmark results (2,000 genes, 5 reps, 8 cores) — geneLME_benchmark2.html (Session 15):** + +| Metric | Result | +|---|---| +| Sign consistency (geneLME vs kimma) | 3300/3300 (100%) same direction — no flipped pairs | +| Estimate r (direction-corrected) | 1.00000000 | +| Estimate MAD (direction-corrected) | 1.16e-15 (effectively 0 — floating point only) | +| geneLME 3 contrasts median | 17.4 s (1.85× faster than kimma) | +| geneLME 6 contrasts median | 18.1 s (1.78× faster than kimma) | +| geneLME 66 contrasts median | 32.6 s (0.99× — effectively equal to kimma) | +| kimma (66 contrasts) median | 32.1 s | + +**Key findings:** +- geneLME and kimma always agree on contrast direction (100%) — no sign inversion issue +- kimma's `statistic` column is the **negated** t-statistic (opposite sign to estimate) +- geneLME is ~1.8× faster than kimma when running fewer contrasts (3–6); equal speed at 66 +- Warning suppression confirmed working — no package version or rescaling warnings in output + +**Possible future tasks:** +1. Build runtime scaling estimates: cores × genes × contrasts → expected runtime on this machine + +--- + +## Known Issues / Open Questions + +- [x] `geneLME.R` (stable) — interaction contrast support merged 2026-02-20 +- [x] `geneLME_dev2.R` — tested (all tests pass) and benchmarked (geneLME_benchmark2.html generated) +- **kimma `statistic` column:** kimma's `$lme.contrast$statistic` is `–(estimate/SE)` — the **negated** + t-statistic (opposite sign to estimate). To compare against geneLME `t.ratio`, use `-statistic`. + This was discovered during benchmark2 sign-consistency investigation. +- [ ] `contrast_spec` level ordering with non-alphabetical factors: explicit factor coercion is now + in place in `geneLME_contrast_spec()` (uses existing factor levels if present, otherwise + alphabetical). However, `emmeans` must also see the same level ordering — emmeans inherits + factor levels from the model data, so as long as `dat$targets` columns have matching factor + levels, alignment is guaranteed. This has not been explicitly tested with a non-alphabetical + user-defined factor ordering; worth a targeted test if this use case arises. +- [ ] With only 10 patients, `lmer` fits are singular (patient random effect collinear with + patient-level covariates sex/age). All 10 genes show `isSingular = TRUE` in tests. This is + expected with small mock N and does NOT affect correctness testing. Real data with larger N + will not have this issue. +- [ ] Branch B `contrasts_primary` vector construction: users must supply vectors whose length matches + the number of levels of `contrast_vars[1]` in the order emmeans sees them (alphabetical by + default). This is implicit and could be a source of user error. Consider adding a runtime check + that compares vector length to actual emmeans level count. +- [ ] Non-alphabetical factor level ordering: explicit factor coercion is now in place in + `geneLME_contrast_spec()` (preserves existing factor levels, otherwise alphabetical). + Alignment with `emmeans` internal ordering has not been explicitly tested with a + user-defined non-alphabetical factor; worth a targeted test if this arises. + +--- + +## Bugs Fixed + +### Session 2026-02-19 — Bug 1: Branch A1 returned 0-row contrast result +**Root cause:** `vector("list", n)` pre-allocates n NULL slots indexed by integer. Subsequent +named assignment (`contrasts_list[["name"]] <- cv`) *appends* rather than filling existing slots, +leaving the first n slots as unnamed NULLs. `contrast()` silently skipped NULLs, leaving a +malformed result that `as.data.frame()` could not coerce. +**Fix:** Build `contrasts_list` as `list()` and assign each element by name from the start. + +### Session 2026-02-19 — Bug 2: Branch A2 `"Nonconforming number of contrast coefficients"` +**Root cause:** The `"second"` direction (`var_b | var_a`) was incorrectly passing `contrast_var_2_levels` +keyed to `var_a`, injecting visit-level strings as treatment levels. Additionally, `contrasts_primary` +has length matching var_a's level count (3) but var_b has 4 levels — structurally incompatible. +**Resolution:** Branch A2 (`contrast_by`) was removed entirely per user decision (see Session Log). + +--- + +## Session Log + +### Session 15 — 2026-02-20 +- **Completed merge of `geneLME_dev2.R` → `geneLME.R`** (picked up where Session 14 left off). + Remaining edits applied to `geneLME()` function body: + - Added `contrasts_list <- NULL` and `spec_lookup <- NULL` initialisations. + - Added `geneLME_build_contrast_args()` pre-compute block (inside `run_contrast`, outside + `!is.null(contrast_spec)` — called only when `is_interaction`). + - Updated `plan()` calls: `plan(multisession)` → `future::plan(future::multisession, workers = workers)`; + `on.exit(plan(sequential))` → `on.exit(future::plan(future::sequential), add = TRUE)`. + - Updated `geneLME_dispatch()` call: `contrast_spec` → `contrasts_list` + `spec_lookup`. +- **Warning suppression** (from `Claud_next_steps.rtf`): + - **lmer() rescaling warning** (`"Some predictor variables are on very different scales"`): + Added `check.scaleX = "ignore"` to a shared `lme_ctrl <- lmerControl(...)` object used + for both weighted and unweighted `lmer()` calls in `geneLME_fit()`. + - **Package version warnings** (`"package 'lme4' was built under R version X.Y.Z"`): + Two-pronged approach: + 1. Removed `future.packages = c(...)` from `future_lapply`; added explicit + `suppressPackageStartupMessages(suppressWarnings({ library(lme4); ... }))` block at + the top of `geneLME_fit()` — gives full control over how packages load on workers. + 2. Wrapped `future_lapply(...)` in `withCallingHandlers()` in `geneLME_dispatch()` with + a handler that muffles any warning matching `"was built under R version"`, catching + warnings that the future framework re-raises on the main process after collecting worker results. +- **All `geneLME_test.R` tests pass** against merged `geneLME.R` (SOURCE_FILE updated to `"geneLME.R"`): + Branch A (10/10, 80 ANOVA + 60 contrast), Branch A2 (80 total rows, 60+20 orders), + Branch B (10/10, 70+60), validation 6a–6f (all PASS), soft-fail 6g (both branches PASS). + Worker-sourced package-version warnings no longer appear in test output. +- **Updated `geneLME_tutorial.Rmd`:** + - `source("geneLME_dev.R")` → `source("geneLME.R")` + - Key capabilities bullet: updated from "per-gene error capture" to `model_status` flagging description + - Error Handling section: replaced old stop-based singular fit description with new `model_status` + flagging explanation; updated code to check `!= "success"` for all non-success genes (not just failed) + and added filter example (`result$lme_contrast %>% filter(model_status == "success")`). +- **Updated `geneLME_function_overview.md`:** + - Companion file reference: `geneLME_dev.R` → `geneLME.R`; date updated to 2026-02-20 + - Function Map: added `geneLME_build_contrast_args()` row + - `lme_err` output table entry: updated to describe `"singular_fit"` as a valid value + - Error Handling section: rewrote to describe `model_status` flagging; removed old + `"Boundary (singular) fit"` stop-based description +- **Updated `CLAUDE_NOTES_geneLME.md`:** + - Key files: `geneLME_dev2.R` now "former dev2 (merged); kept for reference" + - Architecture: `geneLME_build_contrast_args()` added to function table + - Design decision 5: updated from "error" to "flagged, not errored" + - Current Status: stage updated to reflect completed merge +- **Re-ran `geneLME_benchmark2.Rmd`** against merged `geneLME.R` only (dev2 correctness section + removed; report now focuses solely on geneLME vs kimma). Fresh results (Session 15): + - Sign consistency: 3300/3300 (100%) same direction, r=1.00000000, MAD=1.16e-15 + - Speed: geneLME 1.85× faster than kimma at 3 contrasts, 1.78× at 6, effectively equal (0.99×) at 66 + - Rendered `geneLME_function_overview.html` via pandoc for easy review + +### Session 14 — 2026-02-20 +- **Confirmed `geneLME.R` (stable) is correct** — does NOT contain dev2 changes (no + `geneLME_build_contrast_args`, no `singular_fit` flagging); the concern about overwrite was unfounded. +- **Audited `geneLME_dev2.R` for unqualified package calls** — all non-base function calls + are correctly handled: `future::plan()`, `future::multisession`, `future::sequential`, and + `future.apply::future_lapply()` are fully qualified in `geneLME()`/`geneLME_dispatch()`; + worker-side calls (`lmer`, `emmeans`, etc.) rely on `future.packages` which loads all required + packages on each worker. No bare unqualified calls that would fail outside an interactive session. +- **All `geneLME_test.R` tests pass** against `geneLME_dev2.R`: + Branch A (10/10 success, 80 ANOVA + 60 contrast rows), Branch A second-order (60+20 rows), + Branch B (10/10 success, 70+60 rows), all 6 validation error tests (6a–6f), soft-fail (6g). +- **Debugged and fixed `geneLME_benchmark2.Rmd`:** + 1. `sign-inspect` chunk: kimma has no `contrast` string column → rewrote to compare + `contrast_ref`/`contrast_lvl` pairs directly using canonical pair keys. + 2. Kimma t-statistic: discovered kimma's `statistic` column is the **negated** t-statistic + (`–estimate/SE`), not the absolute value. Fixed reconstruction to `-statistic`. + 3. Inline `` `r ≈ 1.0` `` parsed as R code → converted to plain text. + 4. Multi-line `if/else` → wrapped in braces to prevent premature statement termination. + 5. `stable_success`: `lme_err` has no names → derived success genes from contrast table instead. +- **`geneLME_benchmark2.html` generated successfully** (44/44 chunks, no errors). +- **Benchmark results:** See Current Status table above. Summary: + - 100% direction agreement with kimma (no flipped pairs) + - dev2 estimates identical to stable (r=1.0, MAD=0) + - dev2 ~1.1–1.2× faster than stable; equal speed to kimma at 66 contrasts + +### Session 13 — 2026-02-20 +- **Created `geneLME_benchmark2.Rmd`** — v2 benchmarking report covering: + 1. **Sign consistency (Section 1):** Direction-aware join between geneLME and kimma. + Forward join (same ref/lvl) + flipped join (kimma ref/lvl swapped) combined. + Scatter plots of estimates and t-statistics before and after direction correction. + Summary table reports % pairs with direction agreement and corrected accuracy metrics. + 2. **dev2 correctness (Section 2):** geneLME.R vs geneLME_dev2.R on same 50-gene, + 6-contrast problem. Joined on (gene, contrast); reports r, MAD, max|Δ| for estimates. + Also shows model_status distribution (singular_fit vs success) in dev2 output. + 3. **Speed comparison (Section 3):** Microbenchmark (5 reps each) sourcing stable and + dev2 sequentially to avoid namespace collision. Bar chart + scaling line plot + comparing all three methods (stable, dev2, kimma) at 3/6/66 contrasts. + Summary table includes dev2 speedup ratios vs stable and vs kimma. +- Requires `gridExtra` package (added to libs chunk) + +### Session 12 — 2026-02-20 +- **Created `geneLME_dev2.R`** with two changes vs `geneLME.R`: + 1. **Singular fit → flag:** `isSingular()` check changed from `stop()` to setting + `model_status = "singular_fit"`. Results returned for all genes; users filter + downstream on `model_status`. `model_status` column now appears in both + `lme_anova` and `lme_contrast` output. + 2. **Pre-computed Branch A contrast structures:** New `geneLME_build_contrast_args()` + helper called once in `geneLME()` before parallel dispatch. Returns `contrasts_list` + (named vector list) and `spec_lookup` (ref/lvl join table). Workers receive these + ready-made; the per-gene `for` loop building contrast vectors is eliminated. + `geneLME_dispatch()` and `geneLME_fit()` signatures updated accordingly + (`contrast_spec` parameter replaced by `contrasts_list` + `spec_lookup`). + - Level ordering in `geneLME_build_contrast_args()` uses the same factor-coercion + logic as `geneLME_contrast_spec()` (preserves existing factor levels; alphabetical + otherwise), ensuring alignment with emmeans' internal ordering without requiring + a fitted model object. +- **Updated `geneLME_test.R`:** + - Added `SOURCE_FILE` variable at top for easy switching between implementations + - Test 4: added singular-fit verification block (checks non-NA estimates for flagged genes) + - Tests 4 and 5: updated output display to show `model_status` column + - Note: with 10-patient mock data all genes are expected to be `singular_fit`, not errors + +### Session 11 — 2026-02-20 +- Merged `geneLME_dev.R` → `geneLME.R` as the new stable version +- Added merge changelog to `geneLME.R` header +- Removed stale "PICKUP HERE" comment and dev-only header label +- Updated CLAUDE_NOTES key files and current status sections + +### Session 1 — 2026-02-19 (first session) +- Identified correct active worktree (`reverent-satoshi`) +- Created `CLAUDE_NOTES_geneLME.md` (this file) +- Created `geneLME_test.R` with mock EList: + - 10 patients × 3 treatments × 4 visits = 120 samples, 50 genes + - Patient-level covariates: `sex` (factor), `age` (continuous) + - Sample-level technical covariates: `rNANgUl`, `percent_duplication`, + `median_cv_coverage`, `lib.size`, `norm.factors` +- Ran full test suite; identified and fixed two bugs in `geneLME_dev.R` +- All three contrast branches (A1, A2, B) passed initially after fixes + +### Session 10 — 2026-02-19 (tenth context window) +- **Created `geneLME_benchmark.Rmd`** (→ `geneLME_benchmark.html`): kimma vs geneLME benchmarking report + - **Mock data:** 10 patients × 3 treatments × 4 visits = 120 samples, 2000 genes; genes 1–100 + have +2.5 log2 TrtC:V3 effect; `libID = sample_id` added to targets for kimma compatibility + - **Contrast subsets defined:** + - `spec_3` (3): longitudinal V2→V3 within-treatment + - `spec_6` (6): between-treatment within-visit at V2 and V3 + - `spec_full` (66): all pairwise — equivalent to kimma's default output + - **Section 1 (Accuracy):** both methods on 50-gene subset with all 66 contrasts. + Results joined on `(gene, contrast_ref, contrast_lvl)` — both methods now have these columns. + Reports Pearson r and MAD for estimates and t-statistics; scatter plots included. + Note: geneLME excludes singular-fit genes (treated as error); kimma returns estimates silently. + - **Section 2 (Selective contrast efficiency):** `microbenchmark(times=5)` on 2000-gene full + dataset comparing geneLME with 3/6/66 contrasts and kimma with 66 contrasts. Bar chart + + scaling plot showing linear runtime scaling with n_contrasts for geneLME. + - **Section 3 (Head-to-head):** reuses 66-contrast runs from Section 2 to compare + geneLME vs kimma directly at equal contrast count. Reports speedup ratio. + - **kimma interface notes:** + - `libraryID = "libID"` required — added `libID` column to targets + - `contrast_var = "treatment:visit"` runs all 66 pairwise automatically + - Output: `$lme.contrast` with columns `gene, contrast_ref, contrast_lvl, estimate, statistic, pval, FDR, std.error, df` + - `statistic` = t-ratio (same as geneLME `t.ratio`); `pval` = p-value; `std.error` = SE + - **Key finding expected:** estimates identical (r ≈ 1.0, MAD ≈ 0) for shared non-singular genes; + geneLME faster when running fewer contrasts; head-to-head speed depends on parallelism overhead + +### Session 9 — 2026-02-19 (ninth context window) +- **Added `contrast_ref` and `contrast_lvl` to `lme_contrast` output** in `geneLME_fit()`: + - Branch A first-order: built a `spec_lookup` data.frame keyed on the contrast name string + (`paste(contrast_lvl, contrast_ref, sep = " - ")`), then `left_join`-ed onto the emmeans + first-order result by `"contrast"`. This ensures `contrast_ref` = the −1 cell and + `contrast_lvl` = the +1 cell, eliminating sign ambiguity. + - Branch A second-order: `contrast_ref = NA_character_, contrast_lvl = NA_character_` + (no single ref/lvl pair applies to a contrast-of-contrasts). + - Branch B (all rows): `contrast_ref = NA_character_, contrast_lvl = NA_character_` + (Branch B uses named coefficient vectors, not a ref/lvl spec). + - Error stub data.frame in `tryCatch` handler: added both NA columns for schema consistency. +- **Tutorial Rmd updated:** added `contrast_ref`/`contrast_lvl` to `select()` calls in all + contrast output display chunks; added explanatory prose in the `lme_contrast` description. +- **CLAUDE_NOTES updated:** added columns to return value documentation. +- Next up: kimma benchmarking (2000-gene mock, speed + estimate comparison). + +### Session 8 — 2026-02-19 (eighth context window) +- **Redesigned `contrast_index` system** — root cause was that `contrast_index` in the + `geneLME_contrast_spec()` output was being misused: users built `contrasts_secondary` vectors + using the full template's row count as the vector length (234 elements) instead of the filtered + spec's row count. `which(my_spec$contrast_index == ...)` correctly identified positions within + `my_spec`, but the outer `rep(0, nrow(full_template))` produced vectors of the wrong length. +- **`geneLME_contrast_spec()` output simplified:** removed `contrast_index` column entirely. + Output is now a two-column data.frame (`contrast_ref`, `contrast_lvl`) only. +- **`geneLME()` now appends `$contrast_spec` to its return value:** + - Branch A: `contrast_spec %>% mutate(contrast_index = seq_len(n())) %>% select(contrast_index, everything())` + — `contrast_index` is `1:nrow(contrast_spec)` (row positions within the *filtered* spec the user passed in) + - Branch B: `data.frame(contrast_index = seq_along(contrasts_primary), contrast_name = names(contrasts_primary))` + - NULL when no contrasts are run +- **Soft-fail on wrong-length `contrasts_secondary`** (changed from hard `stop()`): + - Emits an informative `message()` naming offending vectors, observed length, expected length + - Returns `invisible(list(lme_anova=NULL, lme_contrast=NULL, lme_fit=NULL, lme_err=NULL, contrast_spec=indexed_contrast_spec))` + - User can inspect `result$contrast_spec` to fix their vectors without running any models +- **`geneLME_compiler()` updated:** now accepts `contrast_spec = NULL` argument; includes it in + return list; added column-existence guard for no-contrast runs (prevented crash on empty contrast tibble) +- **Tutorial Rmd updated:** + - `contrast-spec-interaction` chunk: added comment explaining `contrast_index` is NOT in + `spec_template`, is added by `geneLME()` to `$contrast_spec` + - Added `$contrast_spec` display to the Branch A output structure chunk + - Branch A second-order chunk: updated comments to reference `result_A$contrast_spec` for + row-order verification + - Programmatic secondary contrast section: rewritten to use `result_A$contrast_spec` (the + indexed spec from `geneLME()`) instead of `my_spec$contrast_index` + - Added soft-fail validation test (Test 7) to Input Validation section + - Quick-reference table: updated `contrast_spec` description; updated `contrasts_secondary` + description to mention soft-fail + - Key capabilities: added soft-fail bullet +- **`geneLME_test.R` updated:** + - Test 3: removed `contrast_index` range check; updated comment to explain columns + - Test 4b: updated comments to reference `test_A$contrast_spec` for indexed ordering; + corrected `contrasts_secondary` vectors to match actual alphabetical row ordering + (V2 rows 1–3 before V3 rows 4–6); added `print(test_A$contrast_spec)` before test_A2 call + - Added test 6g: soft-fail validation for both Branch A and Branch B wrong-length vectors + +### Session 7 — 2026-02-19 (seventh context window) +- **Root-caused the real-data NA/warning failure:** + - User shared real-data `contrasts_secondary` vectors: length 234 (= nrow of full + unfiltered spec_template). The filtered `contrast_spec` passed to `geneLME()` had far + fewer rows. emmeans throws "Nonconforming number of contrast coefficients" which is + silently caught by `tryCatch` inside `geneLME_fit()`, producing NA-filled rows. + - The programmatic builder from Session 4 used `rep(0, nrow(defined_contrasts_spec))` + where `defined_contrasts_spec` referred to the full template — should be + `nrow(filtered_spec)`. The `which(my_spec$contrast_index == ...)` index lookup was + correct; only the vector length was wrong. +- **Fix: added `contrasts_secondary` length validation to `geneLME()` pre-flight:** + - Branch A (interaction): checked that every vector in `contrasts_secondary` has length + == `nrow(contrast_spec)` (the filtered spec passed in, NOT the full template). + Error message names the offending vectors, shows observed vs expected length, and + explains the common cause (full template vs filtered spec confusion). + - Branch B (non-interaction): checked that every vector has length == + `length(contrasts_primary)`. + - Both checks fire before any parallel work launches — so the user gets a clear, + immediate error instead of silent NAs in results. +- User's real-data fix: rebuild `contrasts_secondary` vectors using + `rep(0, nrow(defined_contrasts_geneLME))` (filtered spec row count) and + `which(defined_contrasts_geneLME$contrast_index == ...)` for index lookup. + +### Session 6 — 2026-02-19 (sixth context window) +- **Diagnosed and fixed second-order contrast warning + spurious NA issue:** + - Symptom: rescaling warnings ("Some predictor variables are on very different scales") + and occasional NA-filled outputs when running with `contrasts_secondary`. + - Root cause (warnings): `emmeans::contrast()` called on a first-order contrast object + internally re-invokes lme4 machinery on the first-order estimate/SE values, which can be + on different scales from the original predictors → benign lme4 scale warning, no effect + on results. Confirmed output is correct by single-gene isolated test. + - Root cause (NAs): not a code bug — was a stochastic singular-fit outcome in the small + mock data (10 patients), not reproducible when re-run. `isSingular()` check is correct. + - Fix 1: wrapped second-order `contrast()` calls in `suppressWarnings()` in **both** Branch A + and Branch B of `geneLME_fit()`, with explanatory comment. Warning is cosmetic only. + - Fix 2 (latent bug): both Branch A and Branch B were gating second-order contrasts with + `length(contrasts_list) > 1` / `length(contrasts_primary) > 1`. This incorrectly + prevented second-order contrasts when only one first-order contrast existed. Changed + both to simply `!is.null(contrasts_secondary)` — the user's provision of + `contrasts_secondary` is the correct and sufficient gate. +- Verified fix: all 10 genes succeed, 60 first-order + 20 second-order rows, all + `p.value_adj` non-NA, zero warnings emitted. + +### Session 5 — 2026-02-19 (fifth context window) +- Added FDR-adjusted p-values (`p.value_adj`) to both `lme_anova` and `lme_contrast` outputs: + - **`geneLME_compiler()`**: now accepts `fdr_method = "BH"` argument; applies `p.adjust()` + within grouped sets after binding all per-gene results: + - `lme_anova`: grouped by `term` — each model term adjusted independently across genes + - `lme_contrast`: grouped by `contrast × contrast_order` — each contrast (including + first- and second-order separately) adjusted independently across genes + - **`geneLME()`**: added `fdr_method = "BH"` argument (any `p.adjust.methods` value valid); + validated against `p.adjust.methods` in pre-flight checks; passed through to compiler + - NA p-values (failed genes) propagate to `p.value_adj` as NA — they are excluded from the + effective adjustment set automatically by `p.adjust()` +- Updated `geneLME_tutorial.Rmd`: + - Added `p.value_adj` to `select()` calls in ANOVA and contrast output display chunks + - Updated descriptions of `lme_anova` and `lme_contrast` to explain grouping logic + - Added new "FDR Adjustment" section (before Input Validation) with grouping table, + method list, and BH vs Bonferroni comparison example + - Added `fdr_method` row to quick-reference argument table +- CLAUDE_NOTES: updated function signatures (added `geneLME_compiler` signature, `fdr_method` + arg to `geneLME`), validation check list (item 9), and current status + +### Session 4 — 2026-02-19 (fourth context window) +- User provided a real-data pattern for programmatic secondary contrast construction using + `contrast_index`. Reviewed approach and identified improvements: + - Replaced `group_by` + `reframe` with `last()`/`first()` with a more robust `summarise` + producing explicit `index_neg`/`index_pos` columns (two-column wide shape) + - Removed unused `nest` + `rowwise` + dead `secondary_contrast` list column + - Replaced opaque `[[i]][1,1][[1]]` nested indexing with `which(my_spec$contrast_index == ...)` + for transparent index-to-row-position mapping + - Replaced `for` loop with `setNames(lapply(...))` pattern consistent with how + `contrasts_primary` is specified elsewhere +- Added new tutorial section "Programmatic second-order contrast construction via `contrast_index`" + to `geneLME_tutorial.Rmd` (between Branch A second-order and Branch B sections) +- Updated quick-reference table to include `contrast_index` in `contrast_spec` column description + +### Session 3 — 2026-02-19 (third context window) +- Three improvements to `geneLME_contrast_spec()` interaction mode: + 1. **Factor-level ordering enforced**: both `var_a` and `var_b` are explicitly coerced to + factors before calling `interaction()`. Existing factor levels are preserved; plain character + columns get alphabetical ordering imposed. This guarantees `contrast_ref` component levels ≤ + `contrast_lvl` component levels (for both variables), consistent with emmeans' internal ordering. + 2. **Swap tolerance documented**: `geneLME_fit()` Branch A already handled swapped ref/lvl + gracefully (sign flip, no error). Added explicit comment in the contrast vector construction + loop documenting this behaviour. + 3. **`contrast_index` column added**: `geneLME_contrast_spec()` interaction mode now returns + a `contrast_index` column (integer, 1-based) as the first column. This index is stable — it + reflects the row's position in the *full unfiltered template* and survives user `filter()` calls. + Intended for downstream use (e.g. joining back to results, building `contrasts_secondary` + with explicit index reference). Tutorial example to be added in a future session. +- Updated `geneLME_test.R`: + - Added `contrast_index` range check in test 3 + - Updated test 4b comments to reflect corrected row ordering (alphabetical interaction levels: + V2 rows before V3 rows) and corrected `contrasts_secondary` vectors accordingly +- Both `geneLME_dev.R` and `geneLME.R` preserved separately (user preference) + +### Session 2 — 2026-02-19 (same day, second context window) +- Reviewed all three branches with user; demonstrated outputs +- **User decision:** eliminate Branch A2 (`contrast_by`) entirely + - Rationale: specifying contrasts in both directions is error-prone; pairwise contrasts + for the second variable can be computationally excessive with many levels +- **Design changes implemented:** + 1. Extended `geneLME_contrast_spec()` to two modes: + - Interaction mode: returns `contrast_ref`/`contrast_lvl` pairs (unchanged) + - Single-variable mode: returns `level` reference frame (new) + 2. Added formula-vs-`contrast_vars` consistency check using `attr(terms(), "term.labels")`: + - Interaction contrast on additive model → hard error + - Non-interaction contrast var absent as main effect → warning + 3. Made `contrast_spec` required (not optional) when `contrast_vars` contains `":"` + 4. Removed `contrast_by` parameter from all four functions + 5. Removed Branch A2 from `geneLME_fit()`; Branch A is now solely the `contrast_spec` approach + 6. Updated `geneLME_test.R`: removed A2 test; added single-variable `geneLME_contrast_spec()` + tests; added 4 new validation error tests (6c–6f) +- Final test run: all 8 tests pass (2 positive + 6 validation errors) + - Branch A: 10/10 genes `success`, 80 ANOVA rows, 60 contrast rows + - Branch B: 10/10 genes `success`, 70 ANOVA rows, 60 contrast rows + - Validation 6a–6f: all produce expected informative errors diff --git a/R_functions/geneLME.R b/R_functions/geneLME.R index 268da68..0f9d98e 100644 --- a/R_functions/geneLME.R +++ b/R_functions/geneLME.R @@ -2,292 +2,961 @@ ######################################################## # Scalable custom gene LMEs with contrast specification ######################################################## +# Merged from geneLME_dev.R (2026-02-20): +# - Added geneLME_contrast_spec() helper +# - Added Branch A: interaction contrast support via contrast_spec +# - Added FDR adjustment (fdr_method argument) +# - Added $contrast_spec in return value +# - Soft-fail on wrong-length contrasts_secondary +# - Added contrast_ref / contrast_lvl columns to lme_contrast output +# +# Merged from geneLME_dev2.R (2026-02-20): +# - Singular fit → model_status flag ("singular_fit") instead of error. +# Results returned for all genes; filter downstream on model_status. +# model_status column now present in both lme_anova and lme_contrast. +# - New geneLME_build_contrast_args() helper: pre-computes Branch A +# contrasts_list and spec_lookup once before parallel dispatch, +# eliminating nrow(contrast_spec) × n_genes per-gene rebuilds. +# - geneLME_fit() signature: contrast_spec replaced by contrasts_list +# and spec_lookup (Branch A). geneLME_dispatch() updated accordingly. +# - Benchmarked (geneLME_benchmark2.html): dev2 estimates identical to +# previous stable (r=1.0, MAD=0); ~1.1-1.2x faster; 100% direction +# agreement with kimma; equal speed to kimma at 66 contrasts. +######################################################## + + +######################################################## +# geneLME_contrast_spec +# Helper: returns a reference template of available contrast levels, +# formatted for use with the contrast_spec argument of geneLME(). +# +# Two modes depending on whether contrast_vars contains ":": +# +# Interaction mode (e.g. contrast_vars = "treatment:visit"): +# contrast_vars must be a single "var_a:var_b" string. +# Returns a data.frame with columns contrast_ref and contrast_lvl, +# one row per pairwise combination of interaction-level strings. +# Filter this to the contrasts of interest, then pass as +# contrast_spec to geneLME(). +# +# Single-variable mode (e.g. contrast_vars = c("treatment", "visit")): +# contrast_vars is a character vector of one or more plain variable +# names (no ":"). Returns a named list, one element per variable, +# each a data.frame with a single column 'level' listing the sorted +# unique values. The message printed for each variable explains how +# it maps to geneLME() arguments: +# - The first variable in contrast_vars → contrast_vars[1] in +# geneLME(); its levels define the length and position order of +# contrasts_primary vectors. +# - Additional variables → contrast_vars[2], used as the 'by' +# grouping variable; filter its levels via contrast_var_2_levels. +# This list is a reference only — it is not passed to geneLME(). +######################################################## + +geneLME_contrast_spec <- function(targets, contrast_vars) { + + has_interaction <- any(grepl(":", contrast_vars)) + has_plain <- any(!grepl(":", contrast_vars)) + + # Disallow mixing interaction and plain variable names in one call + if (has_interaction && has_plain) { + stop( + "contrast_vars mixes interaction terms (containing ':') and plain variable names.\n", + "Call geneLME_contrast_spec() separately for interaction and non-interaction variables." + ) + } + + if (has_interaction) { + + # ---- Interaction mode: must be a single "var_a:var_b" string ---- + if (length(contrast_vars) != 1) { + stop("Interaction mode requires a single 'var_a:var_b' string in contrast_vars.") + } + + vars <- strsplit(contrast_vars, ":")[[1]] + var_a <- vars[1] + var_b <- vars[2] + + if (!var_a %in% colnames(targets)) stop(paste0("Variable '", var_a, "' not found in targets.")) + if (!var_b %in% colnames(targets)) stop(paste0("Variable '", var_b, "' not found in targets.")) + + # Coerce each variable to a factor, preserving any existing factor level order. + # If the column is already a factor, levels() preserves the user-defined ordering. + # If it is a plain character vector, we impose alphabetical order explicitly. + # This ensures the interaction level string ordering — and therefore which member + # of each pair lands in contrast_ref vs contrast_lvl — is fully deterministic and + # consistent between geneLME_contrast_spec() and emmeans' internal ordering. + fac_a <- if (is.factor(targets[[var_a]])) targets[[var_a]] else factor(targets[[var_a]], levels = sort(unique(targets[[var_a]]))) + fac_b <- if (is.factor(targets[[var_b]])) targets[[var_b]] else factor(targets[[var_b]], levels = sort(unique(targets[[var_b]]))) + + # Build all interaction level strings in the same order emmeans will use. + # interaction() with two ordered factors produces a factor whose levels are + # var_a[1]:var_b[1], var_a[1]:var_b[2], ..., var_a[n]:var_b[m] — matching + # emmeans' default grid ordering (vary the rightmost variable fastest). + ixn_lvls <- levels(interaction(fac_a, fac_b, sep = " ")) + + # combn() traverses ixn_lvls in order, so for every pair the first element + # (contrast_ref) is always at a lower level-index than the second (contrast_lvl). + # This guarantees: level(var_a in ref) <= level(var_a in lvl) AND + # level(var_b in ref) <= level(var_b in lvl). + # Users may freely swap ref/lvl in any row after filtering — geneLME() and + # geneLME_fit() handle either direction correctly (the sign of the contrast + # estimate will flip, but no error will occur). + # Row-position indices for contrasts_secondary construction are added by geneLME() + # to its $contrast_spec output element after the user's filtered spec is received. + pairs <- combn(ixn_lvls, 2, simplify = FALSE) + spec <- data.frame( + contrast_ref = sapply(pairs, `[[`, 1), + contrast_lvl = sapply(pairs, `[[`, 2), + stringsAsFactors = FALSE + ) + + message( + nrow(spec), " pairwise combinations generated for '", contrast_vars, "'.\n", + "Filter this data.frame to your contrasts of interest, then pass as contrast_spec to geneLME().\n", + "geneLME() will attach an indexed copy of contrast_spec to its output ($contrast_spec)\n", + "showing the row-position index used for contrasts_secondary vector construction." + ) + + return(spec) + + } else { + + # ---- Single/multi-variable mode ---- + # Validate all variables exist + missing_vars <- setdiff(contrast_vars, colnames(targets)) + if (length(missing_vars) > 0) { + stop(paste0("Variable(s) not found in targets: ", paste(missing_vars, collapse = ", "))) + } + + result <- lapply(seq_along(contrast_vars), function(i) { + v <- contrast_vars[i] + lvls <- sort(unique(as.character(targets[[v]]))) + spec <- data.frame(level = lvls, stringsAsFactors = FALSE) + + if (i == 1) { + message( + "'", v, "' — primary contrast variable (contrast_vars[1] in geneLME).\n", + length(lvls), " levels (alphabetical order = position order for contrasts_primary vectors):\n", + paste(seq_along(lvls), lvls, sep = ". ", collapse = "\n"), "\n", + "→ contrast vectors passed to contrasts_primary must have length ", length(lvls), ",\n", + " with each element weighted by position (e.g. '", lvls[length(lvls)], " vs ", lvls[1], + "' = c(", paste(c(-1, rep(0, length(lvls) - 2), 1), collapse = ", "), "))." + ) + } else { + message( + "'", v, "' — secondary 'by' grouping variable (contrast_vars[", i, "] in geneLME).\n", + length(lvls), " levels available:\n", + paste(seq_along(lvls), lvls, sep = ". ", collapse = "\n"), "\n", + "→ pass a subset of these levels to contrast_var_2_levels in geneLME() to restrict\n", + " which groups the primary contrasts are computed within." + ) + } + + spec + }) + + names(result) <- contrast_vars + return(result) + } +} + + +######################################################## +# geneLME_build_contrast_args +# Pre-computes Branch A contrast structures once, before parallel dispatch. +# +# Called by geneLME() when contrast_vars contains ":". +# Returns a list with two elements that are passed directly to workers: +# +# $contrasts_list — named list of contrast vectors, one per row of +# contrast_spec. Each vector has length == number of interaction cells, +# named by emmeans level string ("TrtA V1", "TrtB V2", ...). +# contrast_ref cell receives -1; contrast_lvl cell receives +1. +# Name of each vector is "contrast_lvl - contrast_ref" — the string +# emmeans will use as the 'contrast' column in its output. +# +# $spec_lookup — data.frame(contrast, contrast_ref, contrast_lvl) for +# joining ref/lvl back onto the emmeans first-order result by contrast +# name. Avoids reconstructing this inside every geneLME_fit() call. +# +# The emmeans level ordering is derived from factor levels in targets (the +# same source emmeans uses internally), guaranteeing index alignment without +# requiring a fitted model object. +######################################################## + +geneLME_build_contrast_args <- function(targets, contrast_vars, contrast_spec) { + + vars <- strsplit(contrast_vars, ":")[[1]] + var_a <- vars[1] + var_b <- vars[2] + + # Reproduce the exact factor-level ordering that emmeans will use. + # emmeans builds its reference grid from the factor levels of the model + # data (targets); using the same coercion logic as geneLME_contrast_spec() + # guarantees the level strings here match those emmeans returns. + fac_a <- if (is.factor(targets[[var_a]])) targets[[var_a]] else factor(targets[[var_a]], levels = sort(unique(targets[[var_a]]))) + fac_b <- if (is.factor(targets[[var_b]])) targets[[var_b]] else factor(targets[[var_b]], levels = sort(unique(targets[[var_b]]))) + + # Build interaction level strings: "var_a_level var_b_level" in emmeans order. + # emmeans varies the rightmost variable fastest, matching interaction() default. + lvls_a <- levels(fac_a) + lvls_b <- levels(fac_b) + emm_lvls <- as.vector(outer(lvls_a, lvls_b, paste)) # row = var_a, col = var_b + + # Template zero vector, one element per interaction cell + default_vec <- rep(0, length(emm_lvls)) + names(default_vec) <- emm_lvls + + # Build contrasts_list once — identical for all genes + contrasts_list <- list() + for (k in seq_len(nrow(contrast_spec))) { + cv <- default_vec + cv[contrast_spec$contrast_ref[k]] <- -1 + cv[contrast_spec$contrast_lvl[k]] <- 1 + contrast_name <- paste(contrast_spec$contrast_lvl[k], + contrast_spec$contrast_ref[k], sep = " - ") + contrasts_list[[contrast_name]] <- cv + } + + # Build spec_lookup once — identical for all genes + spec_lookup <- data.frame( + contrast = paste(contrast_spec$contrast_lvl, + contrast_spec$contrast_ref, sep = " - "), + contrast_ref = contrast_spec$contrast_ref, + contrast_lvl = contrast_spec$contrast_lvl, + stringsAsFactors = FALSE + ) + + list(contrasts_list = contrasts_list, spec_lookup = spec_lookup) +} + ######################################################## -# Starting point / framework +# geneLME_fit +# Core per-gene fitting function. Called inside future_lapply. +# Receives only the minimal pre-extracted data needed for one gene, +# not the full EList object. +# +# Singular fit is no longer an error; model_status = "singular_fit" +# is recorded and results are returned for user-side filtering. +# Branch A receives pre-computed contrasts_list + spec_lookup instead +# of contrast_spec. The per-gene contrast vector loop is eliminated. ######################################################## +geneLME_fit <- + function(gene_name, + expression_vec, # named numeric vector: expression values for this gene + weight_vec, # named numeric vector or NULL: per-sample weights + targets, # data.frame: sample metadata (dat$targets) + formula_str, # character: formula RHS e.g. "~ treatment*visit + (1|ptID)" + run_contrast, + contrast_vars, + contrast_var_2_levels, + contrasts_list, # Branch A: pre-computed named contrast vector list, or NULL + spec_lookup, # Branch A: pre-computed ref/lvl join table, or NULL + contrasts_primary, # Branch B: named list of contrast vectors, or NULL + contrasts_secondary) { # both branches: second-order contrast vectors, or NULL + + # Load worker-side packages quietly. + # suppressPackageStartupMessages: silences "Loading required package: Matrix" etc. + # suppressWarnings: silences "package 'lme4' was built under R version X.Y.Z". + suppressPackageStartupMessages(suppressWarnings({ + library(lme4) + library(emmeans) + library(car) + library(broom.mixed) + library(dplyr) + library(tibble) + })) -geneLME_fit<- - function(i, - dat=dat, - formula_obj=formula_obj, - model_weights=model_weights, - run_contrast=run_contrast, - contrast_vars = contrast_vars, - contrast_var_2_levels=contrast_var_2_levels, - contrasts_primary = contrasts_primary, - contrasts_secondary = contrasts_secondary){ - - gene_i <- rownames(dat$E)[i] - - # Initialize an empty status - model_status <- "success" - - # Wrap the entire process in tryCatch result <- tryCatch({ - + + # --- BUILD MODEL DATA --- + model_data <- targets + model_data$expression <- expression_vec + + # --- RECONSTRUCT FORMULA LOCALLY --- + # Built from the raw string so its enclosing environment is this call frame. + # lmer resolves 'weight_vec' and other names in local scope — no + # environment stripping needed, no locked-environment errors. + formula_obj <- as.formula(paste("expression", formula_str)) + # --- FIT MODEL --- - # We use a localized data frame to avoid passing the whole 'dat' object to emmeans later - model_data <- dat$targets - model_data$expression <- dat$E[i, ] - - if(is.null(model_weights)){ - lme_i <- lmer( - formula_obj, - data = model_data, - control = lmerControl(calc.derivs = FALSE, autoscale = TRUE)) - - } else if(!is.null(model_weights) & model_weights){ - - model_data$current_weights <- dat$weights[i, ] - - lme_i <- lmer( - formula_obj, - weights = current_weights, - data = model_data, - control = lmerControl(calc.derivs = FALSE, autoscale = TRUE) - ) + # check.scaleX = "ignore": silences the "predictor variables on very + # different scales" warning. This is expected with mixed RNA-seq covariates + # (e.g. lib.size vs binary treatment indicator) and does not affect model fit. + lme_ctrl <- lmerControl(calc.derivs = FALSE, check.scaleX = "ignore") + if (is.null(weight_vec)) { + lme_i <- lmer(formula_obj, data = model_data, control = lme_ctrl) + } else { + lme_i <- lmer(formula_obj, weights = weight_vec, data = model_data, control = lme_ctrl) } - - # Check for singularity manually to be explicit - if(isSingular(lme_i)) stop("Boundary (singular) fit") - - # --- EXTRACT DATA (The only parts we keep) --- - - # 1. AIC - aic_res <- data.frame(gene = gene_i, AIC = AIC(lme_i)) - - # 2. ANOVA & Summary - lme_i_anova <- car::Anova(lme_i) %>% broom.mixed::tidy() - lme_i_summary <- summary(lme_i)$coefficients %>% - as.data.frame() %>% + + # --- SINGULAR FIT: flag, do not stop --- + # isSingular() indicates the random effect variance hit its boundary (zero). + # Fixed effect estimates and contrasts are still numerically valid; the user + # should decide whether to retain or filter these genes downstream. + # model_status records the flag; "success" indicates a clean fit. + model_status <- if (isSingular(lme_i)) "singular_fit" else "success" + + # --- EXTRACT: AIC --- + aic_res <- data.frame(gene = gene_name, AIC = AIC(lme_i)) + + # --- EXTRACT: ANOVA & coefficient summary --- + lme_i_anova <- car::Anova(lme_i) %>% broom.mixed::tidy() + lme_i_summary <- summary(lme_i)$coefficients %>% + as.data.frame() %>% rownames_to_column("variable") - - # 3. Process ANOVA Table + + # --- BUILD ANOVA TABLE --- lme_i_anova_tab <- lme_i_anova %>% - rowwise()%>% + rowwise() %>% mutate( - gene = gene_i, + gene = gene_name, + model_status = model_status, # "success" or "singular_fit" predictor_class = case_when( - grepl(":", term) ~ "interaction", - #!term %in% colnames(model_data) ~ "other", - is.numeric(model_data[[term]]) ~ "continuous", - !(is.numeric(model_data[[term]])) & length(unique(model_data[[term]])) == 2 ~ "two-level-categorical", - !(is.numeric(model_data[[term]])) & length(unique(model_data[[term]])) > 2 ~ "multi-level-categorical" + grepl(":", term) ~ "interaction", + is.numeric(model_data[[term]]) ~ "continuous", + !is.numeric(model_data[[term]]) & length(unique(model_data[[term]])) == 2 ~ "two-level-categorical", + !is.numeric(model_data[[term]]) & length(unique(model_data[[term]])) > 2 ~ "multi-level-categorical" ), Estimate_source = case_when( - predictor_class %in% c("continuous", "two-level-categorical") ~ "lme_summary", - predictor_class == "multi-level-categorical" ~ "seeContrasts", - predictor_class == "interaction" & - length(grep(":", lme_i_summary$variable)) == 1 ~ "lme_summary", - predictor_class == "interaction" & - length(grep(":", lme_i_summary$variable)) > 1 ~ "seeContrasts" + predictor_class %in% c("continuous", "two-level-categorical") ~ "lme_summary", + predictor_class == "multi-level-categorical" ~ "seeContrasts", + predictor_class == "interaction" & length(grep(":", lme_i_summary$variable)) == 1 ~ "lme_summary", + predictor_class == "interaction" & length(grep(":", lme_i_summary$variable)) > 1 ~ "seeContrasts" ), Estimate = case_when( - predictor_class == "continuous" ~ lme_i_summary$Estimate[match(term, lme_i_summary$variable)][[1]], + predictor_class == "continuous" ~ lme_i_summary$Estimate[match(term, lme_i_summary$variable)][[1]], predictor_class == "two-level-categorical" ~ lme_i_summary$Estimate[match(term, lme_i_summary$variable)][[1]], - predictor_class == "interaction" & Estimate_source=="lme_summary" ~ lme_i_summary$Estimate[grep(":", lme_i_summary$variable)][[1]], - predictor_class == "interaction" & Estimate_source == "seeContrasts" ~ NA, - .default = NA + # Guard: grep(":", ...) returns integer(0) when the model has no interaction + # coefficient (e.g. a main-effects-only formula). [[1]] on integer(0) errors, + # so fall through to .default = NA_real_ via the length check. + predictor_class == "interaction" & Estimate_source == "lme_summary" & + length(grep(":", lme_i_summary$variable)) >= 1 ~ + lme_i_summary$Estimate[grep(":", lme_i_summary$variable)[1L]], + .default = NA_real_ ), Estimate_SE = case_when( - predictor_class == "continuous" ~ lme_i_summary$`Std. Error`[match(term, lme_i_summary$variable)][[1]], + predictor_class == "continuous" ~ lme_i_summary$`Std. Error`[match(term, lme_i_summary$variable)][[1]], predictor_class == "two-level-categorical" ~ lme_i_summary$`Std. Error`[match(term, lme_i_summary$variable)][[1]], - predictor_class == "interaction" & Estimate_source=="lme_summary" ~ lme_i_summary$`Std. Error`[grep(":", lme_i_summary$variable)][[1]], - predictor_class == "interaction" & Estimate_source == "seeContrasts" ~ NA, - .default = NA + predictor_class == "interaction" & Estimate_source == "lme_summary" & + length(grep(":", lme_i_summary$variable)) >= 1 ~ + lme_i_summary$`Std. Error`[grep(":", lme_i_summary$variable)[1L]], + .default = NA_real_ ) ) - - # 4. Contrasts - - # --- CONTRASTS (Dynamic) --- - - if(!is.null(run_contrast)){ - - spec_formula <- as.formula(paste("~", paste(contrast_vars, collapse = "|"))) - - if(length(contrast_vars)==2 & !is.null(contrast_var_2_levels)){ # set levels of second term if multiple terms provided. - by_list<-set_names(list(contrast_var_2_levels), paste(contrast_vars[2]))} else { - by_list<-NULL - } - - emm_1st <- emmeans(lme_i, spec = spec_formula, at=by_list, data=model_data) %>% - contrast(method = contrasts_primary, adjust = "none") - - if(!is.null(contrasts_secondary) & length(contrasts_primary)>1){ - - emm_2nd <- contrast(emm_1st, method = contrasts_secondary, adjust = "none") - - contrast_res <- bind_rows( - as.data.frame(emm_1st) %>% mutate(contrast_order = "first_order"), - as.data.frame(emm_2nd) %>% mutate(contrast_order = "second_order") - ) %>% mutate(gene = gene_i) + + # --- CONTRASTS --- + if (isTRUE(run_contrast)) { + + is_interaction <- any(grepl(":", contrast_vars)) + + if (is_interaction) { + + # ---- BRANCH A: Interaction contrast via pre-computed contrasts_list ---- + # contrasts_list and spec_lookup were built once in geneLME() before + # parallel dispatch — no per-gene reconstruction needed here. + vars <- strsplit(contrast_vars, ":")[[1]] + var_a <- vars[1] + var_b <- vars[2] + + emm_obj <- emmeans( + lme_i, + spec = as.formula(paste("~", var_a, "*", var_b)), + data = model_data + ) + + emm_1st_A <- contrast(emm_obj, method = contrasts_list, adjust = "none") + + if (!is.null(contrasts_secondary)) { + # suppressWarnings: emmeans internally calls lmer on the first-order contrast + # estimates when computing second-order contrasts; this can emit a benign + # lme4 scale warning when the estimate/SE values differ in magnitude from + # the original predictors. The output is unaffected — the warning is cosmetic. + emm_2nd_A <- suppressWarnings( + contrast(emm_1st_A, method = contrasts_secondary, adjust = "none") + ) + contrast_res <- bind_rows( + # First-order: join ref/lvl from pre-computed spec_lookup by contrast name + as.data.frame(emm_1st_A) %>% + mutate(contrast_order = "first_order") %>% + left_join(spec_lookup, by = "contrast"), + # Second-order: contrasts-of-contrasts have no single ref/lvl pair + as.data.frame(emm_2nd_A) %>% + mutate(contrast_order = "second_order", + contrast_ref = NA_character_, + contrast_lvl = NA_character_) + ) %>% mutate(gene = gene_name, model_status = model_status) + } else { + contrast_res <- as.data.frame(emm_1st_A) %>% + mutate(contrast_order = "first_order") %>% + left_join(spec_lookup, by = "contrast") %>% + mutate(gene = gene_name, model_status = model_status) + } + + } else { + + # ---- BRANCH B: Non-interaction ---- + spec_formula <- as.formula(paste("~", paste(contrast_vars, collapse = "|"))) + + by_list <- if (length(contrast_vars) == 2 && !is.null(contrast_var_2_levels)) { + setNames(list(contrast_var_2_levels), contrast_vars[2]) + } else { + NULL + } + + emm_1st <- emmeans(lme_i, spec = spec_formula, at = by_list, data = model_data) %>% + contrast(method = contrasts_primary, adjust = "none") + + if (!is.null(contrasts_secondary)) { + # suppressWarnings: same benign lme4 scale warning as Branch A second-order step. + emm_2nd <- suppressWarnings( + contrast(emm_1st, method = contrasts_secondary, adjust = "none") + ) + contrast_res <- bind_rows( + # Branch B uses named contrast vectors, not a ref/lvl spec — set NA for both orders + as.data.frame(emm_1st) %>% + mutate(contrast_order = "first_order", + contrast_ref = NA_character_, + contrast_lvl = NA_character_), + as.data.frame(emm_2nd) %>% + mutate(contrast_order = "second_order", + contrast_ref = NA_character_, + contrast_lvl = NA_character_) + ) %>% mutate(gene = gene_name, model_status = model_status) + } else { + contrast_res <- as.data.frame(emm_1st) %>% + mutate(contrast_order = "first_order", + contrast_ref = NA_character_, + contrast_lvl = NA_character_, + gene = gene_name, + model_status = model_status) + } + } + } else { - contrast_res <- - as.data.frame(emm_1st) %>% mutate(contrast_order = "first_order") - }} else {contrast_res=NULL} - - # Return list of results. Once this function returns, 'lme_i' is destroyed. - list(aic = aic_res, anova = lme_i_anova_tab, contrasts = contrast_res, model_status=setNames(c("success"), gene_i)) - + contrast_res <- NULL + } + + list( + aic = aic_res, + anova = lme_i_anova_tab, + contrasts = contrast_res, + model_status = setNames(model_status, gene_name) + ) + }, error = function(e) { - # This block triggers for actual errors AND our forced 'singular fit' stop - err_msg <- as.character(e$message) - + err_msg <- conditionMessage(e) list( - aic = data.frame(gene = gene_i, AIC = NA, model_status = err_msg), - # Return a 1-row ANOVA skeleton so the gene is not lost - + aic = data.frame(gene = gene_name, AIC = NA_real_), anova = data.frame( - term = NA, statistic=NA, df=NA, p.value =NA, gene=gene_i, - predictor_class=NA, Estimate_source=NA, Estimate=NA, Estimate_SE=NA, - model_status = err_msg), - # Return a 1-row Contrast skeleton - contrasts=data.frame(contrast =NA, estimate=NA, SE=NA, df=NA, t.ratio=NA, p.value=NA, - contrast_order=NA, gene=gene_i, model_status = err_msg), - model_status= setNames(paste(err_msg), gene_i) + term = NA_character_, + statistic = NA_real_, + df = NA_real_, + p.value = NA_real_, + gene = gene_name, + model_status = err_msg, + predictor_class = NA_character_, + Estimate_source = NA_character_, + Estimate = NA_real_, + Estimate_SE = NA_real_ + ), + contrasts = data.frame( + contrast = NA_character_, + estimate = NA_real_, + SE = NA_real_, + df = NA_real_, + t.ratio = NA_real_, + p.value = NA_real_, + contrast_order = NA_character_, + contrast_ref = NA_character_, + contrast_lvl = NA_character_, + gene = gene_name + ), + model_status = setNames(err_msg, gene_name) ) }) - - + return(result) } +######################################################## +# geneLME_compiler +# Aggregates list of per-gene results into named result tables, +# then appends FDR-adjusted p-values within each grouping unit. +# +# FDR grouping strategy: +# lme_anova: adjust within each model term (across all genes). +# Each term's p-values form one adjustment set. +# lme_contrast: adjust within each contrast x contrast_order combination +# (across all genes). Branch B contrast labels already encode +# the 'by' variable level (e.g. "TrtC vs TrtA, visit = V2"), +# so grouping by contrast alone is sufficient. +# NA p-values (failed gene models) are preserved as NA in p.value_adj. +# singular_fit genes are included in the adjustment set (their p-values +# are valid; users filter on model_status if they wish to exclude them). +######################################################## + +geneLME_compiler <- function(fit, fdr_method = "BH", contrast_spec = NULL) { + + lme_anova <- map_dfr(fit, "anova") %>% + group_by(term) %>% + mutate(p.value_adj = p.adjust(p.value, method = fdr_method)) %>% + relocate(p.value_adj, .after = p.value)%>% + ungroup() + + lme_contrast_raw <- map_dfr(fit, "contrasts") + lme_contrast <- if (ncol(lme_contrast_raw) > 0 && + all(c("contrast", "contrast_order") %in% colnames(lme_contrast_raw))) { + lme_contrast_raw %>% + group_by(contrast, contrast_order) %>% + mutate(p.value_adj = p.adjust(p.value, method = fdr_method)) %>% + relocate(p.value_adj, .after = p.value)%>% + ungroup() + } else { + lme_contrast_raw # no contrasts run; return as-is (empty or NULL-row stub) + } + + list( + lme_anova = lme_anova, + lme_contrast = lme_contrast, + lme_fit = map_dfr(fit, "aic"), + lme_err = map_chr(fit, "model_status"), + contrast_spec = contrast_spec # indexed spec; NULL when no contrasts run + ) +} + + +######################################################## +# geneLME_dispatch +# Runs future_lapply with explicit global declaration to prevent +# future's automatic environment scan from capturing large objects. +# +# Key design decisions: +# 1. Iterate over an integer index — a plain integer sequence carries +# no environment baggage for future to scan. +# 2. All shared objects passed via future.globals, bypassing automatic +# scanning entirely. +# +# contrast_spec replaced by contrasts_list + spec_lookup (Branch A) +# in both the function signature and future.globals. +######################################################## -# Separate function for compiling results +geneLME_dispatch <- function(gene_data_list, + targets_df, + formula_str, + run_contrast, + contrast_vars, + contrast_var_2_levels, + contrasts_list, + spec_lookup, + contrasts_primary, + contrasts_secondary) { + n_genes <- length(gene_data_list) -geneLME_compiler<-function(fit){ - list(lme_anova_tab= map_dfr(fit, "anova"), # gather anova tables - lme_contrast=map_dfr(fit, "contrasts"), # gather contrast tables - lme_fit=map_dfr(fit, "aic"), # gather model fit - lme_err=map_chr(fit, "model_status") %>% # extract the status string for each gene & preserve names - set_names(map_chr(fit, ~ names(.x$model_status)))) + # Suppress "package 'X' was built under R version Y" warnings that + # future re-raises on the main process after collecting worker results. + # These are cosmetic version-mismatch notices — not actionable errors. + withCallingHandlers( + future.apply::future_lapply( + seq_len(n_genes), + FUN = function(i) { + gene_data <- gene_data_list[[i]] + geneLME_fit( + gene_name = gene_data$gene_name, + expression_vec = gene_data$expression_vec, + weight_vec = gene_data$weight_vec, + targets = targets_df, + formula_str = formula_str, + run_contrast = run_contrast, + contrast_vars = contrast_vars, + contrast_var_2_levels = contrast_var_2_levels, + contrasts_list = contrasts_list, + spec_lookup = spec_lookup, + contrasts_primary = contrasts_primary, + contrasts_secondary = contrasts_secondary + ) + }, + future.globals = list( + gene_data_list = gene_data_list, + targets_df = targets_df, + formula_str = formula_str, + run_contrast = run_contrast, + contrast_vars = contrast_vars, + contrast_var_2_levels = contrast_var_2_levels, + contrasts_list = contrasts_list, + spec_lookup = spec_lookup, + contrasts_primary = contrasts_primary, + contrasts_secondary = contrasts_secondary, + geneLME_fit = geneLME_fit + ), + future.seed = TRUE + ), + warning = function(w) { + if (grepl("was built under R version", conditionMessage(w))) { + invokeRestart("muffleWarning") + } + }) } +######################################################## +# geneLME +# User-facing wrapper: validates inputs, sets up parallel plan, +# pre-extracts per-gene data, dispatches geneLME_fit in parallel. +# +# Calls geneLME_build_contrast_args() for Branch A to pre-compute +# contrasts_list and spec_lookup before the parallel stage. +######################################################## + +geneLME <- + function(dat, + formula_str, + model_weights = NULL, + run_contrast = NULL, + contrast_vars = NULL, + contrast_var_2_levels = NULL, + contrast_spec = NULL, # data.frame(contrast_ref, contrast_lvl) or NULL + # required when contrast_vars contains ":" + contrasts_primary = NULL, + contrasts_secondary = NULL, + fdr_method = "BH", # any method accepted by p.adjust() + n_cores = NULL) { + + # --- PRE-FLIGHT VALIDATION --- + + # Build a local formula object solely for variable checking — never + # passed to workers. Workers reconstruct the formula from formula_str. + formula_obj_local <- as.formula(paste("expression", formula_str)) + + # Initialise objects populated during the contrast validation block below. + indexed_contrast_spec <- NULL + contrasts_list <- NULL # Branch A: pre-computed named contrast vectors + spec_lookup <- NULL # Branch A: pre-computed ref/lvl join table + + # 1. Formula variables present in targets + required_vars <- all.vars(formula_obj_local) + required_vars <- required_vars[required_vars != "expression"] + + missing_vars <- setdiff(required_vars, colnames(dat$targets)) + if (length(missing_vars) > 0) { + stop(paste( + "The following variables in the formula are missing from dat$targets:", + paste(missing_vars, collapse = ", ") + )) + } + + # 2. Weights alignment + if (isTRUE(model_weights)) { + if (is.null(dat$weights)) { + stop("model_weights = TRUE but dat$weights is NULL.") + } + if (!identical(dim(dat$weights), dim(dat$E))) { + stop("Dimensions of dat$weights do not match dat$E.") + } + } + + # 3. Contrast variables and spec + if (isTRUE(run_contrast)) { -geneLME <- - function(dat, - formula_str, - model_weights=NULL, - run_contrast=NULL, - contrast_vars=NULL, - contrast_var_2_levels=NULL, - contrasts_primary=NULL, - contrasts_secondary=NULL, - n_cores = NULL) { - -#. --- PRE-FLIGHT DATA VALIDATION --- - - # 1. Check Formula Variables - # Extract all variables from the formula (excluding the LHS 'expression') - required_vars <- all.vars(formula_obj) - required_vars <- required_vars[required_vars != "expression"] - - missing_vars <- setdiff(required_vars, colnames(dat$targets)) - if (length(missing_vars) > 0) { - stop(paste("The following variables in the formula are missing from dat$targets:", - paste(missing_vars, collapse = ", "))) + if (is.null(contrast_vars)) { + stop("run_contrast = TRUE but contrast_vars is NULL.") } - - # 2. Check Weights Alignment - if (!is.null(model_weights)) { - if (is.null(dat$weights)) { - stop("model_weights is specified, but dat$weights is missing.") + + is_interaction <- any(grepl(":", contrast_vars)) + + # 3a. Formula-vs-contrast_vars consistency check. + # Uses terms() so that both "a*b" and "a:b" formula syntax are handled correctly — + # "a*b" expands to include "a:b" in term labels, so both forms are detected. + formula_terms <- attr(terms(formula_obj_local), "term.labels") + + if (is_interaction) { + # For interaction contrasts, verify the interaction term is actually in the model. + # emmeans will silently run contrasts on additive margins if the interaction is absent, + # which is statistically misleading. + ixn_term <- contrast_vars[grep(":", contrast_vars)][1] # e.g. "treatment:visit" + if (!ixn_term %in% formula_terms) { + stop(paste0( + "contrast_vars specifies an interaction contrast for '", ixn_term, "', ", + "but this interaction term is not present in formula_str.\n", + "Either add the interaction to the formula (e.g. '~ treatment * visit + ...') ", + "or change contrast_vars to a non-interaction term." + )) } - if (nrow(dat$weights) != nrow(dat$E) || ncol(dat$weights) != ncol(dat$E)) { - stop("Dimensions of dat$weights do not match dat$E. Weights must be a matrix of same size.") + } else { + # For non-interaction contrasts, verify each contrast variable is in the formula + # as a main effect. Warn (not error) in case it is part of an interaction only. + for (cv in contrast_vars) { + if (!cv %in% formula_terms) { + warning(paste0( + "contrast_vars includes '", cv, "' but this term does not appear as a ", + "main effect in formula_str. Contrasts may be computed from interaction ", + "margins only — verify this is the intended model structure." + )) + } } } - - # 3. Check Contrast Variables & Levels - if (!is.null(run_contrast)) { - # Verify contrast_vars exist in targets - missing_contrast_vars <- setdiff(contrast_vars, colnames(dat$targets)) - if (length(missing_contrast_vars) > 0) { - stop(paste("contrast_vars missing from dat$targets:", - paste(missing_contrast_vars, collapse = ", "))) + + # 3b. contrast_spec validation (required for interaction contrasts) + if (is_interaction && is.null(contrast_spec)) { + stop(paste0( + "contrast_vars specifies an interaction contrast ('", contrast_vars, "') ", + "but contrast_spec is NULL.\n", + "Use geneLME_contrast_spec(dat$targets, contrast_vars = '", contrast_vars, + "') to generate a template, filter it to your contrasts of interest, ", + "then pass it as contrast_spec." + )) + } + + if (!is.null(contrast_spec)) { + if (!is.data.frame(contrast_spec)) { + stop("contrast_spec must be a data.frame.") + } + if (!all(c("contrast_ref", "contrast_lvl") %in% colnames(contrast_spec))) { + stop("contrast_spec must have columns 'contrast_ref' and 'contrast_lvl'.") + } + if (!is_interaction) { + warning("contrast_spec is provided but contrast_vars contains no interaction term (':'). ", + "contrast_spec is designed for interaction contrasts — did you mean to use contrasts_primary?") } - - # Verify Level Filtering for the 'by' variable - if (length(contrast_vars) == 2 && !is.null(contrast_var_2_levels)) { - actual_levels <- unique(as.character(dat$targets[[contrast_vars[2]]])) - invalid_levels <- setdiff(contrast_var_2_levels, actual_levels) - - if (length(invalid_levels) > 0) { - stop(paste0("The specified levels for '", contrast_vars[2], - "' do not exist in the data: ", paste(invalid_levels, collapse = ", "))) + + # Build the indexed contrast_spec that will be attached to the return value. + # contrast_index here is simply 1:nrow(contrast_spec) — the actual row position + # within the filtered spec passed by the user. This is what contrasts_secondary + # vectors must index into (not any index from the full unfiltered template). + indexed_contrast_spec <- contrast_spec %>% + mutate(contrast_index = seq_len(n())) %>% + select(contrast_index, everything()) + + # Print the indexed spec as a reminder whenever contrasts_secondary is provided. + if (is_interaction && !is.null(contrasts_secondary)) { + n_first <- nrow(contrast_spec) + message( + "contrasts_secondary will be applied to the first-order interaction contrasts ", + "in the order they appear in contrast_spec (", n_first, " contrasts):\n", + paste(seq_len(n_first), + paste(contrast_spec$contrast_lvl, contrast_spec$contrast_ref, sep = " - "), + sep = ". ", collapse = "\n"), + "\nEnsure contrasts_secondary vectors have length ", n_first, + ", with each element corresponding to the contrast at that position.", + "\nThe indexed contrast_spec is returned as $contrast_spec in the output." + ) + + # Soft-fail: wrong-length vectors produce silent NAs deep inside geneLME_fit(). + # Catch this here and return early with only $contrast_spec populated so the + # user has the indexed reference they need to fix their vectors. + bad_lens <- sapply(contrasts_secondary, length) + bad_names <- names(bad_lens)[bad_lens != n_first] + if (length(bad_names) > 0) { + message( + "\ncontrasts_secondary vector(s) have wrong length — returning early without running models.\n", + "Expected length ", n_first, " (= nrow(contrast_spec) after filtering).\n", + "Offending vector(s):\n", + paste0(" '", bad_names, "': length ", bad_lens[bad_names], collapse = "\n"), "\n\n", + "Common cause: vectors were built against nrow() of the full unfiltered\n", + "spec_template rather than nrow() of the filtered contrast_spec passed here.\n", + "Use $contrast_spec in the returned object to re-specify your secondary contrast vectors:\n", + " each vector must have length ", n_first, ", one element per row of $contrast_spec." + ) + return(invisible(list( + lme_anova = NULL, + lme_contrast = NULL, + lme_fit = NULL, + lme_err = NULL, + contrast_spec = indexed_contrast_spec + ))) } } } - - # 4. Check for Character columns that should be Factors - # emmeans and lmer handle strings, but it's safer to warn if a predictor has 100+ unique strings - for (v in required_vars) { - if (is.character(dat$targets[[v]]) && length(unique(dat$targets[[v]])) > (nrow(dat$targets)/2)) { - warning(paste("Variable '", v, "' has a very high number of unique strings. Is it a unique ID instead of a factor?")) + + # --- PRE-COMPUTE BRANCH A CONTRAST STRUCTURES --- + # Build contrasts_list and spec_lookup once here, before parallel dispatch. + # Workers receive these ready-made objects instead of rebuilding from + # contrast_spec on every gene — eliminates nrow(contrast_spec) × n_genes + # iterations of R-level vector construction. + if (is_interaction) { + contrast_args <- geneLME_build_contrast_args(dat$targets, contrast_vars, contrast_spec) + contrasts_list <- contrast_args$contrasts_list + spec_lookup <- contrast_args$spec_lookup + } + + # 3c. contrast_vars present in targets (split on ":" for interaction) + vars_to_check <- unique(unlist(strsplit(contrast_vars, ":"))) + missing_contrast_vars <- setdiff(vars_to_check, colnames(dat$targets)) + if (length(missing_contrast_vars) > 0) { + stop(paste( + "contrast_vars not found in dat$targets:", + paste(missing_contrast_vars, collapse = ", ") + )) + } + + # 3d. contrast_var_2_levels validation (Branch B only) + if (!is_interaction && length(contrast_vars) == 2 && !is.null(contrast_var_2_levels)) { + actual_levels <- unique(as.character(dat$targets[[contrast_vars[2]]])) + invalid_levels <- setdiff(contrast_var_2_levels, actual_levels) + if (length(invalid_levels) > 0) { + stop(paste0( + "Levels specified for '", contrast_vars[2], + "' not found in data: ", paste(invalid_levels, collapse = ", ") + )) } } - - message("Input check complete: Data formatting looks good. Starting parallel LME fit...") - - - - - # Set Parallel Plan - workers <- if(is.null(n_cores)) parallel::detectCores() - 4 else n_cores - plan(multisession, workers = workers) - - # Convert string formula to object - - formula_obj <- as.formula(paste0("expression", formula_str)) - - # Run in parallel - # We pass the contrast specs into the function - fit_results <- future_lapply( - 1:nrow(dat$E), - FUN = function(x){ - geneLME_fit(i = x, - dat = dat, - formula_obj = formula_obj, - model_weights=model_weights, - run_contrast=run_contrast, - contrast_vars = contrast_vars, - contrast_var_2_levels=contrast_var_2_levels, - contrasts_primary = contrasts_primary, - contrasts_secondary = contrasts_secondary)}, - future.seed = TRUE - ) - - return(geneLME_compiler(fit_results)) -} + + # 3e. Branch B: build indexed contrast_spec from contrasts_primary names, + # and soft-fail on wrong-length contrasts_secondary vectors. + if (!is_interaction && !is.null(contrasts_primary)) { + indexed_contrast_spec <- data.frame( + contrast_index = seq_along(contrasts_primary), + contrast_name = names(contrasts_primary), + stringsAsFactors = FALSE + ) + + if (!is.null(contrasts_secondary)) { + n_primary <- length(contrasts_primary) + bad_lens <- sapply(contrasts_secondary, length) + bad_names <- names(bad_lens)[bad_lens != n_primary] + if (length(bad_names) > 0) { + message( + "\ncontrasts_secondary vector(s) have wrong length — returning early without running models.\n", + "Expected length ", n_primary, " (= number of contrasts_primary vectors).\n", + "Offending vector(s):\n", + paste0(" '", bad_names, "': length ", bad_lens[bad_names], collapse = "\n"), "\n\n", + "Use $contrast_spec in the returned object to confirm the primary contrast ordering:\n", + " each contrasts_secondary vector must have length ", n_primary, + ", one element per row of $contrast_spec." + ) + return(invisible(list( + lme_anova = NULL, + lme_contrast = NULL, + lme_fit = NULL, + lme_err = NULL, + contrast_spec = indexed_contrast_spec + ))) + } + } + } + } + + # 4. fdr_method must be a recognised p.adjust method + valid_fdr_methods <- p.adjust.methods # exported character vector from stats + if (!fdr_method %in% valid_fdr_methods) { + stop(paste0( + "'", fdr_method, "' is not a valid p.adjust method.\n", + "Choose one of: ", paste(valid_fdr_methods, collapse = ", ") + )) + } + + # 5. Warn on likely ID columns used as predictors + for (v in required_vars) { + if (is.character(dat$targets[[v]]) && + length(unique(dat$targets[[v]])) > nrow(dat$targets) / 2) { + warning(paste0( + "Variable '", v, "' has many unique string values — is it an ID column rather than a predictor?" + )) + } + } + + message("Input validation passed. Launching parallel LME fits...") + + # --- PRE-SLICE INTO PER-GENE LIST --- + W_mat <- if (isTRUE(model_weights)) dat$weights else NULL + + gene_data_list <- lapply(rownames(dat$E), function(g) { + list( + gene_name = g, + expression_vec = dat$E[g, ], + weight_vec = if (!is.null(W_mat)) W_mat[g, ] else NULL + ) + }) + + targets_df <- dat$targets + + # --- PARALLEL PLAN --- + workers <- if (is.null(n_cores)) max(1L, parallel::detectCores() - 4L) else n_cores + future::plan(future::multisession, workers = workers) + on.exit(future::plan(future::sequential), add = TRUE) + + # --- DISPATCH via clean-scope helper --- + fit_results <- geneLME_dispatch( + gene_data_list = gene_data_list, + targets_df = targets_df, + formula_str = formula_str, + run_contrast = run_contrast, + contrast_vars = contrast_vars, + contrast_var_2_levels = contrast_var_2_levels, + contrasts_list = contrasts_list, + spec_lookup = spec_lookup, + contrasts_primary = contrasts_primary, + contrasts_secondary = contrasts_secondary + ) + + return(geneLME_compiler(fit_results, fdr_method = fdr_method, + contrast_spec = indexed_contrast_spec)) + } ######################################################## -# Testing +# Testing (see geneLME_test.R for full test suite) ######################################################## -# dat_sub<-RNAetc::subset_voom(dat, gene_keep = head(rownames(dat$E), 100)) -# -# test_mods<- + +# --- Branch A: interaction contrast via contrast_spec --- +# # Step 1: generate level reference template +# spec_template <- geneLME_contrast_spec( +# targets = dat$targets, +# contrast_vars = "treatment:visit" # interaction string +# ) +# # Step 2: filter to contrasts of interest +# my_spec <- spec_template %>% +# dplyr::filter(...) # e.g. same-visit cross-treatment, or within-treatment longitudinal +# +# test_mods_A <- # geneLME( -# dat = dat_sub, -# formula_str = "~ treatment*visit + rNANgUl + percent_duplication + median_cv_coverage + (1|ptID)", -# model_weights = TRUE, -# run_contrast=TRUE, -# contrast_vars = c("treatment", "visit"), -# contrast_var_2_levels=c("-1", "S9", "S13"), -# contrasts_primary = list("Dual vs Placebo"=c(-1, 0, 1), "SLIT vs Placebo"=c(-1, 1, 0)), -# contrasts_secondary = list("Second Order Test"= c(1, -1)), -# n_cores = 10 -# +# dat = dat_sub, +# formula_str = "~ treatment * visit + age + sex + rNANgUl + (1|ptID)", +# model_weights = TRUE, +# run_contrast = TRUE, +# contrast_vars = "treatment:visit", # must match an interaction term in formula_str +# contrast_spec = my_spec, # required for interaction contrasts +# n_cores = 10 +# ) +# +# # Optional: second-order contrasts on top of Branch A first-order contrasts. +# # Vectors must have length == nrow(my_spec), ordered to match my_spec rows. +# # geneLME() will print the numbered list of first-order contrasts at validation time. +# test_mods_A2 <- +# geneLME( +# dat = dat_sub, +# formula_str = "~ treatment * visit + age + sex + rNANgUl + (1|ptID)", +# model_weights = TRUE, +# run_contrast = TRUE, +# contrast_vars = "treatment:visit", +# contrast_spec = my_spec, +# contrasts_secondary = list( +# "TrtA vs TrtB: V3 minus V2 effect" = c(1, 0, -1, 0, 0, 0), +# "TrtA vs TrtC: V3 minus V2 effect" = c(0, 1, 0, -1, 0, 0) +# ), +# n_cores = 10 # ) - -#PICKUP HERE: - -# needs to debug scaling up and hadling of parallel tasks. We didn't sort this out perfectlt. -# need to modify how contrasts are handled to be able to specify multiple levels of interaction contrasts. - +# --- Branch B: non-interaction --- +# # Step 1 (optional): inspect available levels for each contrast variable +# geneLME_contrast_spec(dat$targets, contrast_vars = "treatment") # reference only +# # Treatment levels (alphabetical): TrtA, TrtB, TrtC +# # contrasts_primary vectors have length 3: positions = [TrtA, TrtB, TrtC] +# +# test_mods_B <- +# geneLME( +# dat = dat_sub, +# formula_str = "~ treatment + visit + age + sex + rNANgUl + (1|ptID)", +# model_weights = TRUE, +# run_contrast = TRUE, +# contrast_vars = c("treatment", "visit"), +# contrast_var_2_levels = c("V2", "V3"), +# contrasts_primary = list("TrtC vs TrtA" = c(-1, 0, 1), +# "TrtB vs TrtA" = c(-1, 1, 0)), +# contrasts_secondary = list("TrtC vs TrtB" = c(1, -1)), +# n_cores = 10 +# ) diff --git a/R_functions/geneLME_benchmark.Rmd b/R_functions/geneLME_benchmark.Rmd new file mode 100644 index 0000000..2f27d20 --- /dev/null +++ b/R_functions/geneLME_benchmark.Rmd @@ -0,0 +1,635 @@ +--- +title: "geneLME vs kimma: Benchmarking Report" +subtitle: "Estimate accuracy and computational efficiency" +date: "`r Sys.Date()`" +output: + html_document: + toc: true + toc_float: + collapsed: false + smooth_scroll: true + toc_depth: 3 + theme: flatly + highlight: tango + code_folding: hide + df_print: paged +--- + +```{r setup, include=FALSE} +knitr::opts_chunk$set( + echo = TRUE, + message = FALSE, + warning = FALSE, + comment = "#>", + fig.align = "center", + fig.width = 8, + fig.height = 5 +) +``` + +--- + +# Overview + +This report benchmarks `geneLME()` against `kimma::kmFit()`, a published package for +per-gene linear mixed effects modelling of RNA-seq data. Both functions use the same +underlying engines (`lme4` for model fitting, `emmeans` for contrast estimation), so +they should produce statistically identical results — but differ in contrast specification +philosophy and computational efficiency. + +**Three questions addressed:** + +1. **Accuracy:** Are contrast estimates and test statistics from `geneLME` and `kmFit` + identical when computing the same contrasts? +2. **Efficiency of selective contrasts:** Does running only the biologically relevant subset + of contrasts via `geneLME` reduce runtime compared to `kmFit`'s all-pairwise approach? +3. **Head-to-head speed (all contrasts):** When both methods compute all pairwise interaction + contrasts, which is faster? + +**Key design difference:** + +| Feature | `geneLME` | `kmFit` | +|---|---|---| +| Contrast specification | User-defined subset via `contrast_spec` | All pairwise automatically | +| Singular fit handling | Treated as error → NA | Estimates returned (no check) | +| Parallelisation | `future_lapply` (multisession) | Native parallel via `processors` | +| Second-order contrasts | Supported | Not supported | +| FDR adjustment | Within-term / within-contrast grouping | BH across all genes per variable | + +--- + +# Setup + +```{r libs} +library(lme4) +library(emmeans) +library(car) +library(broom.mixed) +library(dplyr) +library(tibble) +library(purrr) +library(future) +library(future.apply) +library(kimma) +library(microbenchmark) +library(ggplot2) +library(tidyr) +library(scales) + +source("geneLME_dev.R") +``` + +--- + +# Mock Data (2,000 genes) + +The benchmark uses a simulated dataset matching a realistic paired longitudinal RNA-seq +design: **10 patients × 3 treatments × 4 visits = 120 samples, 2,000 genes**. + +Genes 1–100 have a simulated `TrtC:V3` upregulation of +2.5 log2 units to provide a +detectable interaction signal. A patient-level random effect is added to all genes. + +```{r mock-data} +set.seed(42) + +n_patients <- 10 +treatments <- c("TrtA", "TrtB", "TrtC") +visits <- c("V1", "V2", "V3", "V4") +n_genes <- 2000 +n_signal <- 100 # genes with simulated TrtC:V3 effect + +# Patient-level covariates +patient_meta <- data.frame( + ptID = paste0("pt", sprintf("%02d", 1:n_patients)), + sex = factor(sample(c("M", "F"), n_patients, replace = TRUE)), + age = round(rnorm(n_patients, mean = 38, sd = 10)), + stringsAsFactors = FALSE +) + +# Sample metadata (targets) — includes libID for kimma compatibility +targets <- expand.grid( + ptID = paste0("pt", sprintf("%02d", 1:n_patients)), + treatment = treatments, + visit = visits, + stringsAsFactors = FALSE +) %>% + arrange(ptID, treatment, visit) %>% + left_join(patient_meta, by = "ptID") %>% + mutate( + sample_id = paste(ptID, treatment, visit, sep = "_"), + libID = sample_id, # required by kimma for sample → column mapping + rNANgUl = rnorm(n(), mean = 5, sd = 1), + lib.size = round(rnorm(n(), mean = 20e6, sd = 3e6)), + norm.factors = rnorm(n(), mean = 1, sd = 0.05) + ) +rownames(targets) <- targets$sample_id + +n_samples <- nrow(targets) + +# Expression matrix +E_mat <- matrix( + rnorm(n_genes * n_samples, mean = 8, sd = 2), + nrow = n_genes, ncol = n_samples, + dimnames = list( + paste0("gene", sprintf("%04d", 1:n_genes)), + targets$sample_id + ) +) + +# Simulated TrtC:V3 effect on signal genes +trtC_v3 <- which(targets$treatment == "TrtC" & targets$visit == "V3") +E_mat[1:n_signal, trtC_v3] <- E_mat[1:n_signal, trtC_v3] + 2.5 + +# Patient random effect +for (pt in unique(targets$ptID)) { + idx <- which(targets$ptID == pt) + E_mat[, idx] <- E_mat[, idx] + rnorm(1, 0, 1) +} + +# Voom-like precision weights +W_mat <- matrix( + abs(rnorm(n_genes * n_samples, mean = 1, sd = 0.1)), + nrow = n_genes, ncol = n_samples, + dimnames = dimnames(E_mat) +) + +# Full 2000-gene dataset (used for benchmarking) +dat_bench <- list(E = E_mat, weights = W_mat, targets = targets) + +# 50-gene subset (used for accuracy comparison — faster repeated runs) +dat_50 <- list( + E = E_mat[1:50, ], + weights = W_mat[1:50, ], + targets = targets +) + +cat("Design:", n_patients, "patients ×", length(treatments), "treatments ×", + length(visits), "visits =", n_samples, "samples\n") +cat("Full dataset:", n_genes, "genes\n") +cat("Accuracy subset:", nrow(dat_50$E), "genes\n") +cat("Signal genes (TrtC:V3 +2.5 log2):", n_signal, "\n") +``` + +--- + +# Contrast Specification + +## All pairwise interaction contrasts (66 total) + +`kmFit` always computes all pairwise contrasts when given `contrast_var = "treatment:visit"`. +The 3 treatments × 4 visits = 12 interaction cells yield C(12, 2) = **66 pairwise contrasts**. + +`geneLME` uses an explicit `contrast_spec` data frame. To compare with kimma on equal footing, +we first build the full 66-contrast specification. + +```{r contrast-spec-full} +spec_full <- geneLME_contrast_spec( + targets = dat_bench$targets, + contrast_vars = "treatment:visit" +) + +cat("Full pairwise spec: ", nrow(spec_full), "contrasts\n") +``` + +## Selective contrast subsets + +For the efficiency demonstration, we also define two biologically focused subsets: + +```{r contrast-spec-subsets} +# Subset A: between-treatment comparisons within the same visit (V2 and V3 only) +spec_6 <- spec_full %>% + filter( + sub(".* ", "", contrast_ref) == sub(".* ", "", contrast_lvl), # same visit + sub(".* ", "", contrast_ref) %in% c("V2", "V3") # V2 and V3 + ) + +# Subset B: longitudinal (V2→V3) within the same treatment +spec_3 <- spec_full %>% + filter( + sub(" .*", "", contrast_ref) == sub(" .*", "", contrast_lvl), # same treatment + sub(".* ", "", contrast_ref) == "V2", + sub(".* ", "", contrast_lvl) == "V3" + ) + +cat("Subset A (between-treatment, V2 & V3):", nrow(spec_6), "contrasts\n") +cat("Subset B (longitudinal V2→V3, same treatment):", nrow(spec_3), "contrasts\n") +cat("Full set (all pairwise):", nrow(spec_full), "contrasts\n") + +knitr::kable( + data.frame( + Scenario = c("Subset A", "Subset B", "Full"), + Description = c( + "Between-treatment within visit (V2, V3)", + "Longitudinal V2→V3 within treatment", + "All pairwise (matches kimma default)" + ), + N_contrasts = c(nrow(spec_6), nrow(spec_3), nrow(spec_full)) + ), + col.names = c("Scenario", "Description", "N contrasts"), + caption = "Contrast subsets used in benchmarks" +) +``` + +--- + +# 1. Accuracy Comparison + +Both methods run on the **50-gene subset** with the **full 66 pairwise contrasts**, +ensuring they compute exactly the same comparisons. + +> **Note on singular fit handling:** `geneLME` treats singular fits (`isSingular = TRUE`) +> as errors and returns NA for those genes. `kmFit` does not perform this check and returns +> estimates regardless. Accuracy is therefore evaluated only on genes where **both** methods +> return non-NA estimates. + +## Run both methods (50 genes, all 66 contrasts) + +```{r accuracy-run, cache=TRUE} +# geneLME — full 66-contrast spec +res_geneLME_50 <- geneLME( + dat = dat_50, + formula_str = "~ treatment * visit + age + sex + rNANgUl + (1|ptID)", + model_weights = TRUE, + run_contrast = TRUE, + contrast_vars = "treatment:visit", + contrast_spec = spec_full, + n_cores = 8 +) + +# kimma — full pairwise (equivalent to all 66 contrasts) +res_kimma_50 <- suppressMessages( + kmFit( + dat = dat_50, + run_lme = TRUE, + use_weights = TRUE, + model = "~ treatment * visit + age + sex + rNANgUl + (1|ptID)", + run_contrast = TRUE, + contrast_var = "treatment:visit", + patientID = "ptID", + libraryID = "libID", + processors = 8 + ) +) + +cat("geneLME: ", sum(res_geneLME_50$lme_err == "success"), "/ 50 genes succeeded\n") +cat("kimma: ", length(unique(res_kimma_50$lme.contrast$gene)), "/ 50 genes in contrast output\n") +``` + +## Result alignment + +```{r accuracy-join} +# Standardise geneLME columns +glme <- res_geneLME_50$lme_contrast %>% + filter(contrast_order == "first_order", !is.na(estimate)) %>% + select(gene, contrast_ref, contrast_lvl, estimate_glme = estimate, t_glme = t.ratio, + p_glme = p.value) + +# Standardise kimma columns +# kimma contrast direction: estimate = lvl - ref (positive = lvl > ref) +# same convention as geneLME +km <- res_kimma_50$lme.contrast %>% + select(gene, contrast_ref, contrast_lvl, estimate_kimma = estimate, + t_kimma = statistic, p_kimma = pval) + +# Join on gene + ref/lvl pair (both methods use the same "ref - lvl" convention) +joined <- inner_join(glme, km, by = c("gene", "contrast_ref", "contrast_lvl")) + +cat("Matched rows (gene × contrast, both methods non-NA):", nrow(joined), "\n") +cat("Unique genes in comparison:", length(unique(joined$gene)), "\n") +cat("Unique contrasts per gene:", nrow(joined) / length(unique(joined$gene)), "\n") +``` + +## Estimate correlation + +```{r accuracy-scatter-estimate, fig.height=5, fig.width=8} +cor_est <- cor(joined$estimate_glme, joined$estimate_kimma, use = "complete.obs") +mad_est <- mean(abs(joined$estimate_glme - joined$estimate_kimma), na.rm = TRUE) + +ggplot(joined, aes(x = estimate_glme, y = estimate_kimma)) + + geom_point(alpha = 0.15, size = 0.8, colour = "#2c7bb6") + + geom_abline(slope = 1, intercept = 0, colour = "firebrick", linewidth = 0.8, linetype = "dashed") + + annotate("text", x = -Inf, y = Inf, hjust = -0.1, vjust = 1.4, + label = sprintf("r = %.6f\nMAD = %.2e", cor_est, mad_est), + size = 4, family = "mono") + + labs( + title = "Contrast estimates: geneLME vs kimma", + subtitle = sprintf("50 genes × 66 contrasts | n = %d matched pairs", nrow(joined)), + x = "geneLME estimate", + y = "kimma estimate" + ) + + theme_bw(base_size = 12) +``` + +## t-statistic correlation + +```{r accuracy-scatter-t, fig.height=5, fig.width=8} +cor_t <- cor(joined$t_glme, joined$t_kimma, use = "complete.obs") +mad_t <- mean(abs(joined$t_glme - joined$t_kimma), na.rm = TRUE) + +ggplot(joined, aes(x = t_glme, y = t_kimma)) + + geom_point(alpha = 0.15, size = 0.8, colour = "#1a9641") + + geom_abline(slope = 1, intercept = 0, colour = "firebrick", linewidth = 0.8, linetype = "dashed") + + annotate("text", x = -Inf, y = Inf, hjust = -0.1, vjust = 1.4, + label = sprintf("r = %.6f\nMAD = %.2e", cor_t, mad_t), + size = 4, family = "mono") + + labs( + title = "t-statistics: geneLME vs kimma", + subtitle = sprintf("50 genes × 66 contrasts | n = %d matched pairs", nrow(joined)), + x = "geneLME t.ratio", + y = "kimma statistic" + ) + + theme_bw(base_size = 12) +``` + +## Accuracy summary + +```{r accuracy-summary} +# Any divergent pairs (|estimate difference| > 0.01)? +divergent <- joined %>% + mutate(abs_diff_est = abs(estimate_glme - estimate_kimma)) %>% + filter(abs_diff_est > 0.01) %>% + arrange(desc(abs_diff_est)) + +knitr::kable( + data.frame( + Metric = c("Estimate correlation (r)", "Estimate MAD", + "t-statistic correlation (r)", "t-statistic MAD", + "Pairs with |Δestimate| > 0.01"), + Value = c(sprintf("%.8f", cor_est), sprintf("%.2e", mad_est), + sprintf("%.8f", cor_t), sprintf("%.2e", mad_t), + nrow(divergent)) + ), + col.names = c("Metric", "Value"), + caption = "Accuracy summary: geneLME vs kimma (50 genes, 66 contrasts each)" +) +``` + +```{r accuracy-divergent, eval=nrow(divergent) > 0} +cat("Divergent contrast pairs (|Δestimate| > 0.01):\n") +print(divergent %>% select(gene, contrast_ref, contrast_lvl, + estimate_glme, estimate_kimma, abs_diff_est) %>% head(20)) +``` + +> **Interpretation:** Both methods use `lme4::lmer()` for model fitting and `emmeans::contrast()` +> for marginal means — identical numerical results are expected for genes where both methods +> converge. Any small residual differences reflect floating-point precision only. +> Genes absent from geneLME output were flagged as singular fits (`isSingular = TRUE`); +> kimma returns estimates for these regardless. + +--- + +# 2. Speed: Selective vs Full Contrasts + +`geneLME` allows users to specify only the contrasts of biological interest. Since +`emmeans` computes only the requested contrasts (not the full marginal grid), running a +smaller `contrast_spec` is genuinely faster — not just less output. + +`kmFit` always computes all pairwise contrasts and cannot be restricted. + +```{r bench-selective, cache=TRUE} +cat("Running microbenchmark: selective vs full contrasts (2,000 genes, 5 reps each)...\n") +cat("This will take several minutes.\n") + +mb_selective <- microbenchmark( + geneLME_3contrast = { + geneLME( + dat = dat_bench, + formula_str = "~ treatment * visit + age + sex + rNANgUl + (1|ptID)", + model_weights = TRUE, + run_contrast = TRUE, + contrast_vars = "treatment:visit", + contrast_spec = spec_3, + n_cores = 8 + ) + }, + geneLME_6contrast = { + geneLME( + dat = dat_bench, + formula_str = "~ treatment * visit + age + sex + rNANgUl + (1|ptID)", + model_weights = TRUE, + run_contrast = TRUE, + contrast_vars = "treatment:visit", + contrast_spec = spec_6, + n_cores = 8 + ) + }, + geneLME_66contrast = { + geneLME( + dat = dat_bench, + formula_str = "~ treatment * visit + age + sex + rNANgUl + (1|ptID)", + model_weights = TRUE, + run_contrast = TRUE, + contrast_vars = "treatment:visit", + contrast_spec = spec_full, + n_cores = 8 + ) + }, + kimma_66contrast = { + suppressMessages(kmFit( + dat = dat_bench, + run_lme = TRUE, + use_weights = TRUE, + model = "~ treatment * visit + age + sex + rNANgUl + (1|ptID)", + run_contrast = TRUE, + contrast_var = "treatment:visit", + patientID = "ptID", + libraryID = "libID", + processors = 8 + )) + }, + times = 5, + unit = "sec" +) + +print(mb_selective) +``` + +```{r bench-selective-plot, fig.height=5, fig.width=8} +# Summary table +bench_summary <- summary(mb_selective) %>% + as.data.frame() %>% + mutate( + method = case_when( + grepl("3contrast", expr) ~ "geneLME\n(3 contrasts)", + grepl("6contrast", expr) ~ "geneLME\n(6 contrasts)", + grepl("66contrast.*geneLME", expr) ~ "geneLME\n(66 contrasts)", + TRUE ~ "kimma\n(66 contrasts)" + ), + n_contrasts = c(3, 6, 66, 66), + package = ifelse(grepl("kimma", expr), "kimma", "geneLME") + ) + +ggplot(bench_summary, aes(x = reorder(method, median), y = median, + fill = package, colour = package)) + + geom_col(alpha = 0.75, width = 0.6) + + geom_errorbar(aes(ymin = lq, ymax = uq), width = 0.2, linewidth = 0.7) + + scale_fill_manual(values = c("geneLME" = "#2c7bb6", "kimma" = "#d7191c")) + + scale_colour_manual(values = c("geneLME" = "#2c7bb6", "kimma" = "#d7191c")) + + labs( + title = "Runtime: selective vs full contrasts (2,000 genes, 5 reps)", + subtitle = "Error bars = lower and upper quartile across 5 runs", + x = NULL, + y = "Median runtime (seconds)", + fill = "Method", colour = "Method" + ) + + theme_bw(base_size = 12) + + theme(legend.position = "none", + panel.grid.major.x = element_blank()) +``` + +```{r bench-selective-table} +knitr::kable( + bench_summary %>% + mutate(across(c(min, lq, median, mean, uq, max), ~ round(., 1))) %>% + select(Scenario = method, N_contrasts = n_contrasts, + Min = min, Q1 = lq, Median = median, Mean = mean, Q3 = uq, Max = max), + caption = "Runtime summary (seconds) — selective vs full contrasts, 2,000 genes, 5 repetitions" +) +``` + +```{r bench-selective-scaling, fig.height=4, fig.width=7} +# Scaling plot: n contrasts vs time for geneLME +glme_scaling <- bench_summary %>% + filter(package == "geneLME") %>% + mutate(n_contrasts = c(3, 6, 66)) + +ggplot(glme_scaling, aes(x = n_contrasts, y = median)) + + geom_point(size = 3, colour = "#2c7bb6") + + geom_line(colour = "#2c7bb6", linewidth = 0.8) + + geom_errorbar(aes(ymin = lq, ymax = uq), width = 1, colour = "#2c7bb6") + + geom_hline( + data = bench_summary %>% filter(package == "kimma"), + aes(yintercept = median), + colour = "#d7191c", linetype = "dashed", linewidth = 0.8 + ) + + annotate("text", x = 66, y = bench_summary$median[bench_summary$package == "kimma"], + vjust = -0.7, hjust = 1, colour = "#d7191c", size = 3.5, + label = "kimma (66 contrasts)") + + scale_x_continuous(breaks = c(3, 6, 66)) + + labs( + title = "geneLME runtime scales with number of contrasts", + subtitle = "Dashed red line = kimma running the equivalent 66 contrasts", + x = "Number of contrasts specified", + y = "Median runtime (seconds)" + ) + + theme_bw(base_size = 12) +``` + +--- + +# 3. Head-to-Head: geneLME vs kimma (all 66 contrasts) + +When both methods compute the same 66 pairwise contrasts on the full 2,000-gene dataset, +this section directly compares their runtimes. + +> **Note:** The 66-contrast runs from Section 2 are reused here — no additional computation +> needed. + +```{r bench-headtohead} +# Extract the two relevant scenarios from mb_selective +hth <- bench_summary %>% + filter(n_contrasts == 66) %>% + mutate(method_label = c("geneLME", "kimma")) + +speedup <- hth$median[hth$method_label == "kimma"] / + hth$median[hth$method_label == "geneLME"] + +knitr::kable( + hth %>% + mutate(across(c(min, lq, median, mean, uq, max), ~ round(., 1))) %>% + select(Method = method_label, Min = min, Q1 = lq, + Median = median, Mean = mean, Q3 = uq, Max = max), + caption = "Head-to-head runtime (seconds) — 66 pairwise contrasts, 2,000 genes, 5 repetitions" +) +``` + +```{r bench-headtohead-plot, fig.height=4, fig.width=6} +ggplot(hth, aes(x = method_label, y = median, + fill = method_label, colour = method_label)) + + geom_col(alpha = 0.75, width = 0.5) + + geom_errorbar(aes(ymin = lq, ymax = uq), width = 0.15, linewidth = 0.8) + + scale_fill_manual(values = c("geneLME" = "#2c7bb6", "kimma" = "#d7191c")) + + scale_colour_manual(values = c("geneLME" = "#2c7bb6", "kimma" = "#d7191c")) + + labs( + title = "Head-to-head: geneLME vs kimma (66 contrasts, 2,000 genes)", + subtitle = sprintf("Speedup ratio (kimma / geneLME): %.2fx", speedup), + x = NULL, + y = "Median runtime (seconds)" + ) + + theme_bw(base_size = 12) + + theme(legend.position = "none", + panel.grid.major.x = element_blank()) +``` + +--- + +# Summary + +```{r summary-table} +knitr::kable( + bench_summary %>% + mutate( + Method = method, + `N contrasts` = n_contrasts, + `Median (s)` = round(median, 1), + `Relative to kimma` = paste0(round(bench_summary$median[bench_summary$package == "kimma"][1] / median, 2), "×") + ) %>% + select(Method, `N contrasts`, `Median (s)`, `Relative to kimma`), + caption = "Benchmark summary: all scenarios vs kimma baseline (2,000 genes, 5 reps)" +) +``` + +## Key findings + +```{r key-findings} +glme_66_med <- bench_summary$median[grepl("66.*geneLME|geneLME.*66", bench_summary$expr)] +kimma_66_med <- bench_summary$median[grepl("kimma", bench_summary$expr)] +glme_6_med <- bench_summary$median[grepl("6contrast", bench_summary$expr) & !grepl("66", bench_summary$expr)] +glme_3_med <- bench_summary$median[grepl("3contrast", bench_summary$expr)] + +cat(sprintf( +"1. ACCURACY + Contrast estimates: r = %.8f, MAD = %.2e + t-statistics: r = %.8f, MAD = %.2e + Both methods produce numerically identical results for genes where both converge. + Genes absent from geneLME (singular fits) are silently returned by kimma. + +2. EFFICIENCY OF SELECTIVE CONTRASTS + geneLME (3 contrasts): %.1f s + geneLME (6 contrasts): %.1f s + geneLME (66 contrasts): %.1f s + kimma (66 contrasts): %.1f s + Running 6 instead of 66 contrasts in geneLME saves ~%.0f%% of runtime. + kimma cannot be restricted to a subset and always runs all 66 contrasts. + +3. HEAD-TO-HEAD (66 contrasts, 2,000 genes) + geneLME: %.1f s median + kimma: %.1f s median + Speed ratio: %.2fx %s", + cor_est, mad_est, + cor_t, mad_t, + glme_3_med, glme_6_med, glme_66_med, kimma_66_med, + (1 - glme_6_med / kimma_66_med) * 100, + glme_66_med, kimma_66_med, speedup, + ifelse(speedup > 1, "(geneLME faster)", "(kimma faster)") +)) +``` + +## Method comparison + +| Dimension | geneLME | kimma | +|---|---|---| +| **Estimates** | Identical to kimma (both use `lme4` + `emmeans`) | Reference | +| **Contrast flexibility** | Any user-defined subset or custom vectors | All pairwise only | +| **Second-order contrasts** | Supported (contrasts-of-contrasts) | Not supported | +| **Singular fit handling** | Error — gene excluded, NA rows returned | Silent — estimates returned regardless | +| **Scalability** | Sub-linear with fewer contrasts | Fixed cost (always all pairwise) | +| **FDR grouping** | Within term / contrast label | BH across all genes per variable | + +--- + +# Session Info + +```{r session-info} +sessionInfo() +``` diff --git a/R_functions/geneLME_benchmark.html b/R_functions/geneLME_benchmark.html new file mode 100644 index 0000000..0aae7ee --- /dev/null +++ b/R_functions/geneLME_benchmark.html @@ -0,0 +1,4032 @@ + + + + +
+ + + + + + + + + +This report benchmarks geneLME() against
+kimma::kmFit(), a published package for per-gene linear
+mixed effects modelling of RNA-seq data. Both functions use the same
+underlying engines (lme4 for model fitting,
+emmeans for contrast estimation), so they should produce
+statistically identical results — but differ in contrast specification
+philosophy and computational efficiency.
Three questions addressed:
+geneLME and kmFit identical
+when computing the same contrasts?geneLME reduce runtime compared to kmFit’s
+all-pairwise approach?Key design difference:
+| Feature | +geneLME |
+kmFit |
+
|---|---|---|
| Contrast specification | +User-defined subset via contrast_spec |
+All pairwise automatically | +
| Singular fit handling | +Treated as error → NA | +Estimates returned (no check) | +
| Parallelisation | +future_lapply (multisession) |
+Native parallel via processors |
+
| Second-order contrasts | +Supported | +Not supported | +
| FDR adjustment | +Within-term / within-contrast grouping | +BH across all genes per variable | +
library(lme4)
+library(emmeans)
+library(car)
+library(broom.mixed)
+library(dplyr)
+library(tibble)
+library(purrr)
+library(future)
+library(future.apply)
+library(kimma)
+library(microbenchmark)
+library(ggplot2)
+library(tidyr)
+library(scales)
+
+source("geneLME_dev.R")The benchmark uses a simulated dataset matching a realistic paired +longitudinal RNA-seq design: 10 patients × 3 treatments × 4 +visits = 120 samples, 2,000 genes.
+Genes 1–100 have a simulated TrtC:V3 upregulation of
++2.5 log2 units to provide a detectable interaction signal. A
+patient-level random effect is added to all genes.
set.seed(42)
+
+n_patients <- 10
+treatments <- c("TrtA", "TrtB", "TrtC")
+visits <- c("V1", "V2", "V3", "V4")
+n_genes <- 2000
+n_signal <- 100 # genes with simulated TrtC:V3 effect
+
+# Patient-level covariates
+patient_meta <- data.frame(
+ ptID = paste0("pt", sprintf("%02d", 1:n_patients)),
+ sex = factor(sample(c("M", "F"), n_patients, replace = TRUE)),
+ age = round(rnorm(n_patients, mean = 38, sd = 10)),
+ stringsAsFactors = FALSE
+)
+
+# Sample metadata (targets) — includes libID for kimma compatibility
+targets <- expand.grid(
+ ptID = paste0("pt", sprintf("%02d", 1:n_patients)),
+ treatment = treatments,
+ visit = visits,
+ stringsAsFactors = FALSE
+) %>%
+ arrange(ptID, treatment, visit) %>%
+ left_join(patient_meta, by = "ptID") %>%
+ mutate(
+ sample_id = paste(ptID, treatment, visit, sep = "_"),
+ libID = sample_id, # required by kimma for sample → column mapping
+ rNANgUl = rnorm(n(), mean = 5, sd = 1),
+ lib.size = round(rnorm(n(), mean = 20e6, sd = 3e6)),
+ norm.factors = rnorm(n(), mean = 1, sd = 0.05)
+ )
+rownames(targets) <- targets$sample_id
+
+n_samples <- nrow(targets)
+
+# Expression matrix
+E_mat <- matrix(
+ rnorm(n_genes * n_samples, mean = 8, sd = 2),
+ nrow = n_genes, ncol = n_samples,
+ dimnames = list(
+ paste0("gene", sprintf("%04d", 1:n_genes)),
+ targets$sample_id
+ )
+)
+
+# Simulated TrtC:V3 effect on signal genes
+trtC_v3 <- which(targets$treatment == "TrtC" & targets$visit == "V3")
+E_mat[1:n_signal, trtC_v3] <- E_mat[1:n_signal, trtC_v3] + 2.5
+
+# Patient random effect
+for (pt in unique(targets$ptID)) {
+ idx <- which(targets$ptID == pt)
+ E_mat[, idx] <- E_mat[, idx] + rnorm(1, 0, 1)
+}
+
+# Voom-like precision weights
+W_mat <- matrix(
+ abs(rnorm(n_genes * n_samples, mean = 1, sd = 0.1)),
+ nrow = n_genes, ncol = n_samples,
+ dimnames = dimnames(E_mat)
+)
+
+# Full 2000-gene dataset (used for benchmarking)
+dat_bench <- list(E = E_mat, weights = W_mat, targets = targets)
+
+# 50-gene subset (used for accuracy comparison — faster repeated runs)
+dat_50 <- list(
+ E = E_mat[1:50, ],
+ weights = W_mat[1:50, ],
+ targets = targets
+)
+
+cat("Design:", n_patients, "patients ×", length(treatments), "treatments ×",
+ length(visits), "visits =", n_samples, "samples\n")#> Design: 10 patients × 3 treatments × 4 visits = 120 samples
+
+#> Full dataset: 2000 genes
+
+#> Accuracy subset: 50 genes
+
+#> Signal genes (TrtC:V3 +2.5 log2): 100
+kmFit always computes all pairwise contrasts when given
+contrast_var = "treatment:visit". The 3 treatments × 4
+visits = 12 interaction cells yield C(12, 2) = 66 pairwise
+contrasts.
geneLME uses an explicit contrast_spec data
+frame. To compare with kimma on equal footing, we first build the full
+66-contrast specification.
spec_full <- geneLME_contrast_spec(
+ targets = dat_bench$targets,
+ contrast_vars = "treatment:visit"
+)
+
+cat("Full pairwise spec: ", nrow(spec_full), "contrasts\n")#> Full pairwise spec: 66 contrasts
+For the efficiency demonstration, we also define two biologically +focused subsets:
+# Subset A: between-treatment comparisons within the same visit (V2 and V3 only)
+spec_6 <- spec_full %>%
+ filter(
+ sub(".* ", "", contrast_ref) == sub(".* ", "", contrast_lvl), # same visit
+ sub(".* ", "", contrast_ref) %in% c("V2", "V3") # V2 and V3
+ )
+
+# Subset B: longitudinal (V2→V3) within the same treatment
+spec_3 <- spec_full %>%
+ filter(
+ sub(" .*", "", contrast_ref) == sub(" .*", "", contrast_lvl), # same treatment
+ sub(".* ", "", contrast_ref) == "V2",
+ sub(".* ", "", contrast_lvl) == "V3"
+ )
+
+cat("Subset A (between-treatment, V2 & V3):", nrow(spec_6), "contrasts\n")#> Subset A (between-treatment, V2 & V3): 6 contrasts
+
+#> Subset B (longitudinal V2→V3, same treatment): 3 contrasts
+
+#> Full set (all pairwise): 66 contrasts
+knitr::kable(
+ data.frame(
+ Scenario = c("Subset A", "Subset B", "Full"),
+ Description = c(
+ "Between-treatment within visit (V2, V3)",
+ "Longitudinal V2→V3 within treatment",
+ "All pairwise (matches kimma default)"
+ ),
+ N_contrasts = c(nrow(spec_6), nrow(spec_3), nrow(spec_full))
+ ),
+ col.names = c("Scenario", "Description", "N contrasts"),
+ caption = "Contrast subsets used in benchmarks"
+)| Scenario | +Description | +N contrasts | +
|---|---|---|
| Subset A | +Between-treatment within visit (V2, V3) | +6 | +
| Subset B | +Longitudinal V2→V3 within treatment | +3 | +
| Full | +All pairwise (matches kimma default) | +66 | +
Both methods run on the 50-gene subset with the +full 66 pairwise contrasts, ensuring they compute +exactly the same comparisons.
+++Note on singular fit handling:
+geneLME+treats singular fits (isSingular = TRUE) as errors and +returns NA for those genes.kmFitdoes not perform this +check and returns estimates regardless. Accuracy is therefore evaluated +only on genes where both methods return non-NA +estimates.
# geneLME — full 66-contrast spec
+res_geneLME_50 <- geneLME(
+ dat = dat_50,
+ formula_str = "~ treatment * visit + age + sex + rNANgUl + (1|ptID)",
+ model_weights = TRUE,
+ run_contrast = TRUE,
+ contrast_vars = "treatment:visit",
+ contrast_spec = spec_full,
+ n_cores = 8
+)
+
+# kimma — full pairwise (equivalent to all 66 contrasts)
+res_kimma_50 <- suppressMessages(
+ kmFit(
+ dat = dat_50,
+ run_lme = TRUE,
+ use_weights = TRUE,
+ model = "~ treatment * visit + age + sex + rNANgUl + (1|ptID)",
+ run_contrast = TRUE,
+ contrast_var = "treatment:visit",
+ patientID = "ptID",
+ libraryID = "libID",
+ processors = 8
+ )
+)
+
+cat("geneLME: ", sum(res_geneLME_50$lme_err == "success"), "/ 50 genes succeeded\n")#> geneLME: 50 / 50 genes succeeded
+
+#> kimma: 50 / 50 genes in contrast output
+# Standardise geneLME columns
+glme <- res_geneLME_50$lme_contrast %>%
+ filter(contrast_order == "first_order", !is.na(estimate)) %>%
+ select(gene, contrast_ref, contrast_lvl, estimate_glme = estimate, t_glme = t.ratio,
+ p_glme = p.value)
+
+# Standardise kimma columns
+# kimma contrast direction: estimate = lvl - ref (positive = lvl > ref)
+# same convention as geneLME
+km <- res_kimma_50$lme.contrast %>%
+ select(gene, contrast_ref, contrast_lvl, estimate_kimma = estimate,
+ t_kimma = statistic, p_kimma = pval)
+
+# Join on gene + ref/lvl pair (both methods use the same "ref - lvl" convention)
+joined <- inner_join(glme, km, by = c("gene", "contrast_ref", "contrast_lvl"))
+
+cat("Matched rows (gene × contrast, both methods non-NA):", nrow(joined), "\n")#> Matched rows (gene × contrast, both methods non-NA): 3300
+
+#> Unique genes in comparison: 50
+
+#> Unique contrasts per gene: 66
+cor_est <- cor(joined$estimate_glme, joined$estimate_kimma, use = "complete.obs")
+mad_est <- mean(abs(joined$estimate_glme - joined$estimate_kimma), na.rm = TRUE)
+
+ggplot(joined, aes(x = estimate_glme, y = estimate_kimma)) +
+ geom_point(alpha = 0.15, size = 0.8, colour = "#2c7bb6") +
+ geom_abline(slope = 1, intercept = 0, colour = "firebrick", linewidth = 0.8, linetype = "dashed") +
+ annotate("text", x = -Inf, y = Inf, hjust = -0.1, vjust = 1.4,
+ label = sprintf("r = %.6f\nMAD = %.2e", cor_est, mad_est),
+ size = 4, family = "mono") +
+ labs(
+ title = "Contrast estimates: geneLME vs kimma",
+ subtitle = sprintf("50 genes × 66 contrasts | n = %d matched pairs", nrow(joined)),
+ x = "geneLME estimate",
+ y = "kimma estimate"
+ ) +
+ theme_bw(base_size = 12)cor_t <- cor(joined$t_glme, joined$t_kimma, use = "complete.obs")
+mad_t <- mean(abs(joined$t_glme - joined$t_kimma), na.rm = TRUE)
+
+ggplot(joined, aes(x = t_glme, y = t_kimma)) +
+ geom_point(alpha = 0.15, size = 0.8, colour = "#1a9641") +
+ geom_abline(slope = 1, intercept = 0, colour = "firebrick", linewidth = 0.8, linetype = "dashed") +
+ annotate("text", x = -Inf, y = Inf, hjust = -0.1, vjust = 1.4,
+ label = sprintf("r = %.6f\nMAD = %.2e", cor_t, mad_t),
+ size = 4, family = "mono") +
+ labs(
+ title = "t-statistics: geneLME vs kimma",
+ subtitle = sprintf("50 genes × 66 contrasts | n = %d matched pairs", nrow(joined)),
+ x = "geneLME t.ratio",
+ y = "kimma statistic"
+ ) +
+ theme_bw(base_size = 12)# Any divergent pairs (|estimate difference| > 0.01)?
+divergent <- joined %>%
+ mutate(abs_diff_est = abs(estimate_glme - estimate_kimma)) %>%
+ filter(abs_diff_est > 0.01) %>%
+ arrange(desc(abs_diff_est))
+
+knitr::kable(
+ data.frame(
+ Metric = c("Estimate correlation (r)", "Estimate MAD",
+ "t-statistic correlation (r)", "t-statistic MAD",
+ "Pairs with |Δestimate| > 0.01"),
+ Value = c(sprintf("%.8f", cor_est), sprintf("%.2e", mad_est),
+ sprintf("%.8f", cor_t), sprintf("%.2e", mad_t),
+ nrow(divergent))
+ ),
+ col.names = c("Metric", "Value"),
+ caption = "Accuracy summary: geneLME vs kimma (50 genes, 66 contrasts each)"
+)| Metric | +Value | +
|---|---|
| Estimate correlation (r) | +1.00000000 | +
| Estimate MAD | +1.16e-15 | +
| t-statistic correlation (r) | +-1.00000000 | +
| t-statistic MAD | +2.34e+00 | +
| Pairs with |Δestimate| > 0.01 | +0 | +
cat("Divergent contrast pairs (|Δestimate| > 0.01):\n")
+print(divergent %>% select(gene, contrast_ref, contrast_lvl,
+ estimate_glme, estimate_kimma, abs_diff_est) %>% head(20))++Interpretation: Both methods use +
+lme4::lmer()for model fitting and +emmeans::contrast()for marginal means — identical +numerical results are expected for genes where both methods converge. +Any small residual differences reflect floating-point precision only. +Genes absent from geneLME output were flagged as singular fits +(isSingular = TRUE); kimma returns estimates for these +regardless.
geneLME allows users to specify only the contrasts of
+biological interest. Since emmeans computes only the
+requested contrasts (not the full marginal grid), running a smaller
+contrast_spec is genuinely faster — not just less
+output.
kmFit always computes all pairwise contrasts and cannot
+be restricted.
#> Running microbenchmark: selective vs full contrasts (2,000 genes, 5 reps each)...
+
+#> This will take several minutes.
+mb_selective <- microbenchmark(
+ geneLME_3contrast = {
+ geneLME(
+ dat = dat_bench,
+ formula_str = "~ treatment * visit + age + sex + rNANgUl + (1|ptID)",
+ model_weights = TRUE,
+ run_contrast = TRUE,
+ contrast_vars = "treatment:visit",
+ contrast_spec = spec_3,
+ n_cores = 8
+ )
+ },
+ geneLME_6contrast = {
+ geneLME(
+ dat = dat_bench,
+ formula_str = "~ treatment * visit + age + sex + rNANgUl + (1|ptID)",
+ model_weights = TRUE,
+ run_contrast = TRUE,
+ contrast_vars = "treatment:visit",
+ contrast_spec = spec_6,
+ n_cores = 8
+ )
+ },
+ geneLME_66contrast = {
+ geneLME(
+ dat = dat_bench,
+ formula_str = "~ treatment * visit + age + sex + rNANgUl + (1|ptID)",
+ model_weights = TRUE,
+ run_contrast = TRUE,
+ contrast_vars = "treatment:visit",
+ contrast_spec = spec_full,
+ n_cores = 8
+ )
+ },
+ kimma_66contrast = {
+ suppressMessages(kmFit(
+ dat = dat_bench,
+ run_lme = TRUE,
+ use_weights = TRUE,
+ model = "~ treatment * visit + age + sex + rNANgUl + (1|ptID)",
+ run_contrast = TRUE,
+ contrast_var = "treatment:visit",
+ patientID = "ptID",
+ libraryID = "libID",
+ processors = 8
+ ))
+ },
+ times = 5,
+ unit = "sec"
+)
+
+print(mb_selective)#> Unit: seconds
+#> expr min lq mean median uq max neval
+#> geneLME_3contrast 20.21430 20.31614 20.44177 20.36229 20.37787 20.93823 5
+#> geneLME_6contrast 20.96975 21.05014 21.12945 21.16824 21.20615 21.25297 5
+#> geneLME_66contrast 35.75937 35.82293 35.99910 35.84961 36.27012 36.29348 5
+#> kimma_66contrast 31.93789 32.01066 32.08403 32.09986 32.14919 32.22252 5
+# Summary table
+bench_summary <- summary(mb_selective) %>%
+ as.data.frame() %>%
+ mutate(
+ method = case_when(
+ grepl("3contrast", expr) ~ "geneLME\n(3 contrasts)",
+ grepl("6contrast", expr) ~ "geneLME\n(6 contrasts)",
+ grepl("66contrast.*geneLME", expr) ~ "geneLME\n(66 contrasts)",
+ TRUE ~ "kimma\n(66 contrasts)"
+ ),
+ n_contrasts = c(3, 6, 66, 66),
+ package = ifelse(grepl("kimma", expr), "kimma", "geneLME")
+ )
+
+ggplot(bench_summary, aes(x = reorder(method, median), y = median,
+ fill = package, colour = package)) +
+ geom_col(alpha = 0.75, width = 0.6) +
+ geom_errorbar(aes(ymin = lq, ymax = uq), width = 0.2, linewidth = 0.7) +
+ scale_fill_manual(values = c("geneLME" = "#2c7bb6", "kimma" = "#d7191c")) +
+ scale_colour_manual(values = c("geneLME" = "#2c7bb6", "kimma" = "#d7191c")) +
+ labs(
+ title = "Runtime: selective vs full contrasts (2,000 genes, 5 reps)",
+ subtitle = "Error bars = lower and upper quartile across 5 runs",
+ x = NULL,
+ y = "Median runtime (seconds)",
+ fill = "Method", colour = "Method"
+ ) +
+ theme_bw(base_size = 12) +
+ theme(legend.position = "none",
+ panel.grid.major.x = element_blank())knitr::kable(
+ bench_summary %>%
+ mutate(across(c(min, lq, median, mean, uq, max), ~ round(., 1))) %>%
+ select(Scenario = method, N_contrasts = n_contrasts,
+ Min = min, Q1 = lq, Median = median, Mean = mean, Q3 = uq, Max = max),
+ caption = "Runtime summary (seconds) — selective vs full contrasts, 2,000 genes, 5 repetitions"
+)| Scenario | +N_contrasts | +Min | +Q1 | +Median | +Mean | +Q3 | +Max | +
|---|---|---|---|---|---|---|---|
| geneLME | ++ | + | + | + | + | + | + |
| (3 contrasts) | +3 | +20.2 | +20.3 | +20.4 | +20.4 | +20.4 | +20.9 | +
| geneLME | ++ | + | + | + | + | + | + |
| (6 contrasts) | +6 | +21.0 | +21.1 | +21.2 | +21.1 | +21.2 | +21.3 | +
| geneLME | ++ | + | + | + | + | + | + |
| (6 contrasts) | +66 | +35.8 | +35.8 | +35.8 | +36.0 | +36.3 | +36.3 | +
| geneLME | ++ | + | + | + | + | + | + |
| (6 contrasts) | +66 | +31.9 | +32.0 | +32.1 | +32.1 | +32.1 | +32.2 | +
# Scaling plot: n contrasts vs time for geneLME
+glme_scaling <- bench_summary %>%
+ filter(package == "geneLME") %>%
+ mutate(n_contrasts = c(3, 6, 66))
+
+ggplot(glme_scaling, aes(x = n_contrasts, y = median)) +
+ geom_point(size = 3, colour = "#2c7bb6") +
+ geom_line(colour = "#2c7bb6", linewidth = 0.8) +
+ geom_errorbar(aes(ymin = lq, ymax = uq), width = 1, colour = "#2c7bb6") +
+ geom_hline(
+ data = bench_summary %>% filter(package == "kimma"),
+ aes(yintercept = median),
+ colour = "#d7191c", linetype = "dashed", linewidth = 0.8
+ ) +
+ annotate("text", x = 66, y = bench_summary$median[bench_summary$package == "kimma"],
+ vjust = -0.7, hjust = 1, colour = "#d7191c", size = 3.5,
+ label = "kimma (66 contrasts)") +
+ scale_x_continuous(breaks = c(3, 6, 66)) +
+ labs(
+ title = "geneLME runtime scales with number of contrasts",
+ subtitle = "Dashed red line = kimma running the equivalent 66 contrasts",
+ x = "Number of contrasts specified",
+ y = "Median runtime (seconds)"
+ ) +
+ theme_bw(base_size = 12)When both methods compute the same 66 pairwise contrasts on the full +2,000-gene dataset, this section directly compares their runtimes.
+++Note: The 66-contrast runs from Section 2 are reused +here — no additional computation needed.
+
# Extract the two relevant scenarios from mb_selective
+hth <- bench_summary %>%
+ filter(n_contrasts == 66) %>%
+ mutate(method_label = c("geneLME", "kimma"))
+
+speedup <- hth$median[hth$method_label == "kimma"] /
+ hth$median[hth$method_label == "geneLME"]
+
+knitr::kable(
+ hth %>%
+ mutate(across(c(min, lq, median, mean, uq, max), ~ round(., 1))) %>%
+ select(Method = method_label, Min = min, Q1 = lq,
+ Median = median, Mean = mean, Q3 = uq, Max = max),
+ caption = "Head-to-head runtime (seconds) — 66 pairwise contrasts, 2,000 genes, 5 repetitions"
+)| Method | +Min | +Q1 | +Median | +Mean | +Q3 | +Max | +
|---|---|---|---|---|---|---|
| geneLME | +35.8 | +35.8 | +35.8 | +36.0 | +36.3 | +36.3 | +
| kimma | +31.9 | +32.0 | +32.1 | +32.1 | +32.1 | +32.2 | +
ggplot(hth, aes(x = method_label, y = median,
+ fill = method_label, colour = method_label)) +
+ geom_col(alpha = 0.75, width = 0.5) +
+ geom_errorbar(aes(ymin = lq, ymax = uq), width = 0.15, linewidth = 0.8) +
+ scale_fill_manual(values = c("geneLME" = "#2c7bb6", "kimma" = "#d7191c")) +
+ scale_colour_manual(values = c("geneLME" = "#2c7bb6", "kimma" = "#d7191c")) +
+ labs(
+ title = "Head-to-head: geneLME vs kimma (66 contrasts, 2,000 genes)",
+ subtitle = sprintf("Speedup ratio (kimma / geneLME): %.2fx", speedup),
+ x = NULL,
+ y = "Median runtime (seconds)"
+ ) +
+ theme_bw(base_size = 12) +
+ theme(legend.position = "none",
+ panel.grid.major.x = element_blank())knitr::kable(
+ bench_summary %>%
+ mutate(
+ Method = method,
+ `N contrasts` = n_contrasts,
+ `Median (s)` = round(median, 1),
+ `Relative to kimma` = paste0(round(bench_summary$median[bench_summary$package == "kimma"][1] / median, 2), "×")
+ ) %>%
+ select(Method, `N contrasts`, `Median (s)`, `Relative to kimma`),
+ caption = "Benchmark summary: all scenarios vs kimma baseline (2,000 genes, 5 reps)"
+)| Method | +N contrasts | +Median (s) | +Relative to kimma | +
|---|---|---|---|
| geneLME | ++ | + | + |
| (3 contrasts) | +3 | +20.4 | +1.58× | +
| geneLME | ++ | + | + |
| (6 contrasts) | +6 | +21.2 | +1.52× | +
| geneLME | ++ | + | + |
| (6 contrasts) | +66 | +35.8 | +0.9× | +
| geneLME | ++ | + | + |
| (6 contrasts) | +66 | +32.1 | +1× | +
glme_66_med <- bench_summary$median[grepl("66.*geneLME|geneLME.*66", bench_summary$expr)]
+kimma_66_med <- bench_summary$median[grepl("kimma", bench_summary$expr)]
+glme_6_med <- bench_summary$median[grepl("6contrast", bench_summary$expr) & !grepl("66", bench_summary$expr)]
+glme_3_med <- bench_summary$median[grepl("3contrast", bench_summary$expr)]
+
+cat(sprintf(
+"1. ACCURACY
+ Contrast estimates: r = %.8f, MAD = %.2e
+ t-statistics: r = %.8f, MAD = %.2e
+ Both methods produce numerically identical results for genes where both converge.
+ Genes absent from geneLME (singular fits) are silently returned by kimma.
+
+2. EFFICIENCY OF SELECTIVE CONTRASTS
+ geneLME (3 contrasts): %.1f s
+ geneLME (6 contrasts): %.1f s
+ geneLME (66 contrasts): %.1f s
+ kimma (66 contrasts): %.1f s
+ Running 6 instead of 66 contrasts in geneLME saves ~%.0f%% of runtime.
+ kimma cannot be restricted to a subset and always runs all 66 contrasts.
+
+3. HEAD-TO-HEAD (66 contrasts, 2,000 genes)
+ geneLME: %.1f s median
+ kimma: %.1f s median
+ Speed ratio: %.2fx %s",
+ cor_est, mad_est,
+ cor_t, mad_t,
+ glme_3_med, glme_6_med, glme_66_med, kimma_66_med,
+ (1 - glme_6_med / kimma_66_med) * 100,
+ glme_66_med, kimma_66_med, speedup,
+ ifelse(speedup > 1, "(geneLME faster)", "(kimma faster)")
+))#> 1. ACCURACY
+#> Contrast estimates: r = 1.00000000, MAD = 1.16e-15
+#> t-statistics: r = -1.00000000, MAD = 2.34e+00
+#> Both methods produce numerically identical results for genes where both converge.
+#> Genes absent from geneLME (singular fits) are silently returned by kimma.
+#>
+#> 2. EFFICIENCY OF SELECTIVE CONTRASTS
+#> geneLME (3 contrasts): 20.4 s
+#> geneLME (6 contrasts): 21.2 s
+#> geneLME (66 contrasts): 35.8 s
+#> kimma (66 contrasts): 32.1 s
+#> Running 6 instead of 66 contrasts in geneLME saves ~34% of runtime.
+#> kimma cannot be restricted to a subset and always runs all 66 contrasts.
+#>
+#> 3. HEAD-TO-HEAD (66 contrasts, 2,000 genes)
+#> geneLME: 35.8 s median
+#> kimma: 32.1 s median
+#> Speed ratio: 0.90x (kimma faster)
+| Dimension | +geneLME | +kimma | +
|---|---|---|
| Estimates | +Identical to kimma (both use lme4 +
+emmeans) |
+Reference | +
| Contrast flexibility | +Any user-defined subset or custom vectors | +All pairwise only | +
| Second-order contrasts | +Supported (contrasts-of-contrasts) | +Not supported | +
| Singular fit handling | +Error — gene excluded, NA rows returned | +Silent — estimates returned regardless | +
| Scalability | +Sub-linear with fewer contrasts | +Fixed cost (always all pairwise) | +
| FDR grouping | +Within term / contrast label | +BH across all genes per variable | +
#> R version 4.5.1 (2025-06-13)
+#> Platform: aarch64-apple-darwin20
+#> Running under: macOS Sequoia 15.7.3
+#>
+#> Matrix products: default
+#> BLAS: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib
+#> LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
+#>
+#> locale:
+#> [1] en_US/en_US/en_US/C/en_US/en_US
+#>
+#> time zone: America/Los_Angeles
+#> tzcode source: internal
+#>
+#> attached base packages:
+#> [1] stats graphics grDevices utils datasets methods base
+#>
+#> other attached packages:
+#> [1] scales_1.4.0 tidyr_1.3.2 ggplot2_4.0.2
+#> [4] microbenchmark_1.5.0 kimma_1.6.3 future.apply_1.20.1
+#> [7] future_1.69.0 purrr_1.2.1 tibble_3.3.1
+#> [10] dplyr_1.2.0 broom.mixed_0.2.9.6 car_3.1-3
+#> [13] carData_3.0-5 emmeans_2.0.1 lme4_1.1-38
+#> [16] Matrix_1.7-4
+#>
+#> loaded via a namespace (and not attached):
+#> [1] gtable_0.3.6 xfun_0.56 bslib_0.10.0
+#> [4] lattice_0.22-7 vctrs_0.7.1 tools_4.5.1
+#> [7] Rdpack_2.6.4 generics_0.1.4 parallel_4.5.1
+#> [10] pkgconfig_2.0.3 data.table_1.18.2.1 RColorBrewer_1.1-3
+#> [13] S7_0.2.1 lifecycle_1.0.5 compiler_4.5.1
+#> [16] farver_2.1.2 codetools_0.2-20 htmltools_0.5.9
+#> [19] sass_0.4.10 yaml_2.3.12 Formula_1.2-5
+#> [22] pillar_1.11.1 furrr_0.3.1 nloptr_2.2.1
+#> [25] jquerylib_0.1.4 MASS_7.3-65 cachem_1.1.0
+#> [28] reformulas_0.4.3.1 iterators_1.0.14 boot_1.3-32
+#> [31] abind_1.4-8 foreach_1.5.2 nlme_3.1-168
+#> [34] parallelly_1.46.1 tidyselect_1.2.1 digest_0.6.39
+#> [37] mvtnorm_1.3-3 listenv_0.10.0 labeling_0.4.3
+#> [40] forcats_1.0.1 splines_4.5.1 fastmap_1.2.0
+#> [43] grid_4.5.1 cli_3.6.5 magrittr_2.0.4
+#> [46] broom_1.0.11 withr_3.0.2 backports_1.5.0
+#> [49] estimability_1.5.1 rmarkdown_2.30 globals_0.18.0
+#> [52] otel_0.2.0 coda_0.19-4.1 evaluate_1.0.5
+#> [55] knitr_1.51 rbibutils_2.4.1 doParallel_1.0.17
+#> [58] rlang_1.1.7 Rcpp_1.1.1 xtable_1.8-4
+#> [61] glue_1.8.0 minqa_1.2.8 jsonlite_2.0.0
+#> [64] R6_2.6.1
+This report benchmarks geneLME.R (current stable,
+incorporating all dev2 changes) against kimma::kmFit across
+three analyses:
Sign consistency: Direction-aware join between +geneLME and kimma contrast estimates. Identifies same-direction vs +flipped-direction pairs; verifies numerical agreement after sign +correction.
Speed comparison: microbenchmark (5
+reps each) on 2,000 genes comparing geneLME at 3, 6, and 66 contrasts
+against kimma’s default 66-contrast run.
Summary: Key metrics from both analyses in one +place.
geneLME.R features (as of 2026-02-20 merge):
+| Feature | +Details | +
|---|---|
| Singular fit handling | +model_status = "singular_fit" flag; results returned
+for all genes |
+
| Branch A contrast build | +Pre-computed once before parallel stage
+(geneLME_build_contrast_args()) |
+
| Warning suppression | +Package startup messages + rescaling warnings silenced | +
| Parallel plan | +future::plan(future::multisession) /
+future::plan(future::sequential) |
+
library(lme4)
+library(emmeans)
+library(car)
+library(broom.mixed)
+library(dplyr)
+library(tibble)
+library(purrr)
+library(future)
+library(future.apply)
+library(kimma)
+library(microbenchmark)
+library(ggplot2)
+library(tidyr)
+library(scales)
+library(knitr)
+library(gridExtra)#> Sourced: geneLME.R
+Same design as v1 benchmark: 10 patients × 3 treatments × 4 visits = +120 samples. Full dataset = 2,000 genes; accuracy subset = 50 genes. +Genes 1–100 have a simulated TrtC:V3 +2.5 log2 effect.
+set.seed(42)
+
+n_patients <- 10
+treatments <- c("TrtA", "TrtB", "TrtC")
+visits <- c("V1", "V2", "V3", "V4")
+n_genes <- 2000
+n_signal <- 100
+
+patient_meta <- data.frame(
+ ptID = paste0("pt", sprintf("%02d", 1:n_patients)),
+ sex = factor(sample(c("M", "F"), n_patients, replace = TRUE)),
+ age = round(rnorm(n_patients, mean = 38, sd = 10)),
+ stringsAsFactors = FALSE
+)
+
+targets <- expand.grid(
+ ptID = paste0("pt", sprintf("%02d", 1:n_patients)),
+ treatment = treatments,
+ visit = visits,
+ stringsAsFactors = FALSE
+) %>%
+ arrange(ptID, treatment, visit) %>%
+ left_join(patient_meta, by = "ptID") %>%
+ mutate(
+ sample_id = paste(ptID, treatment, visit, sep = "_"),
+ libID = sample_id,
+ rNANgUl = rnorm(n(), mean = 5, sd = 1),
+ lib.size = round(rnorm(n(), mean = 20e6, sd = 3e6)),
+ norm.factors = rnorm(n(), mean = 1, sd = 0.05)
+ )
+rownames(targets) <- targets$sample_id
+n_samples <- nrow(targets)
+
+E_mat <- matrix(
+ rnorm(n_genes * n_samples, mean = 8, sd = 2),
+ nrow = n_genes, ncol = n_samples,
+ dimnames = list(
+ paste0("gene", sprintf("%04d", 1:n_genes)),
+ targets$sample_id
+ )
+)
+
+trtC_v3 <- which(targets$treatment == "TrtC" & targets$visit == "V3")
+E_mat[1:n_signal, trtC_v3] <- E_mat[1:n_signal, trtC_v3] + 2.5
+
+for (pt in unique(targets$ptID)) {
+ idx <- which(targets$ptID == pt)
+ E_mat[, idx] <- E_mat[, idx] + rnorm(1, 0, 1)
+}
+
+W_mat <- matrix(
+ abs(rnorm(n_genes * n_samples, mean = 1, sd = 0.1)),
+ nrow = n_genes, ncol = n_samples,
+ dimnames = dimnames(E_mat)
+)
+
+dat_bench <- list(E = E_mat, weights = W_mat, targets = targets)
+dat_50 <- list(E = E_mat[1:50, ], weights = W_mat[1:50, ], targets = targets)
+
+cat("Design:", n_patients, "patients ×", length(treatments), "treatments ×",
+ length(visits), "visits =", n_samples, "samples\n")#> Design: 10 patients × 3 treatments × 4 visits = 120 samples
+
+#> Full dataset: 2000 genes | Accuracy subset: 50 genes
+
+#> Signal genes (TrtC:V3 +2.5 log2): 100
+spec_full <- geneLME_contrast_spec(targets, contrast_vars = "treatment:visit")
+
+spec_6 <- spec_full %>%
+ filter(
+ sub(".* ", "", contrast_ref) == sub(".* ", "", contrast_lvl),
+ sub(".* ", "", contrast_ref) %in% c("V2", "V3")
+ )
+
+spec_3 <- spec_full %>%
+ filter(
+ sub(" .*", "", contrast_ref) == sub(" .*", "", contrast_lvl),
+ sub(".* ", "", contrast_ref) == "V2",
+ sub(".* ", "", contrast_lvl) == "V3"
+ )
+
+cat("spec_3 (longitudinal V2→V3, same treatment): ", nrow(spec_3), "contrasts\n")#> spec_3 (longitudinal V2→V3, same treatment): 3 contrasts
+
+#> spec_6 (between-treatment, V2 & V3): 6 contrasts
+
+#> spec_full (all pairwise): 66 contrasts
+# --- geneLME ---
+res_glme <- geneLME(
+ dat = dat_50,
+ formula_str = "~ treatment * visit + age + sex + rNANgUl + (1|ptID)",
+ model_weights = TRUE,
+ run_contrast = TRUE,
+ contrast_vars = "treatment:visit",
+ contrast_spec = spec_full,
+ n_cores = 8
+)
+
+cat("geneLME model status summary:\n")#> geneLME model status summary:
+
+#>
+#> success
+#> 50
+# --- kimma ---
+res_kimma <- suppressMessages(
+ kmFit(
+ dat = dat_50,
+ run_lme = TRUE,
+ use_weights = TRUE,
+ model = "~ treatment * visit + age + sex + rNANgUl + (1|ptID)",
+ run_contrast = TRUE,
+ contrast_var = "treatment:visit",
+ patientID = "ptID",
+ libraryID = "libID",
+ processors = 8
+ )
+)
+
+cat("kimma genes in contrast output:",
+ length(unique(res_kimma$lme.contrast$gene)), "/ 50\n")#> kimma genes in contrast output: 50 / 50
+Before joining, we examine the direction columns from each method.
+Note: kimma has no contrast string column
+— direction is encoded entirely in contrast_ref and
+contrast_lvl. Also note that kimma’s statistic
+column is the negated t-statistic
+(–estimate/SE), i.e. opposite sign to
+estimate. To compare against geneLME’s
+t.ratio, we use -statistic.
# geneLME: contrast string is "lvl - ref" (set by us via the named contrast vector)
+glme_contrast_names <- res_glme$lme_contrast %>%
+ filter(contrast_order == "first_order") %>%
+ distinct(contrast, contrast_ref, contrast_lvl) %>%
+ arrange(contrast_ref, contrast_lvl)
+
+cat("geneLME — sample contrast name strings (first 10):\n")#> geneLME — sample contrast name strings (first 10):
+
+#> # A tibble: 10 × 3
+#> contrast contrast_ref contrast_lvl
+#> <chr> <chr> <chr>
+#> 1 TrtA V2 - TrtA V1 TrtA V1 TrtA V2
+#> 2 TrtA V3 - TrtA V1 TrtA V1 TrtA V3
+#> 3 TrtA V4 - TrtA V1 TrtA V1 TrtA V4
+#> 4 TrtB V1 - TrtA V1 TrtA V1 TrtB V1
+#> 5 TrtB V2 - TrtA V1 TrtA V1 TrtB V2
+#> 6 TrtB V3 - TrtA V1 TrtA V1 TrtB V3
+#> 7 TrtB V4 - TrtA V1 TrtA V1 TrtB V4
+#> 8 TrtC V1 - TrtA V1 TrtA V1 TrtC V1
+#> 9 TrtC V2 - TrtA V1 TrtA V1 TrtC V2
+#> 10 TrtC V3 - TrtA V1 TrtA V1 TrtC V3
+# kimma: no 'contrast' column — only contrast_ref and contrast_lvl
+km_contrast_names <- res_kimma$lme.contrast %>%
+ distinct(contrast_ref, contrast_lvl) %>%
+ mutate(contrast_key_km = paste(contrast_lvl, contrast_ref, sep = " - ")) %>%
+ arrange(contrast_ref, contrast_lvl)
+
+cat("\nkimma — direction pairs (ref → lvl), first 10:\n")#>
+#> kimma — direction pairs (ref → lvl), first 10:
+
+#> # A tibble: 10 × 3
+#> contrast_ref contrast_lvl contrast_key_km
+#> <chr> <chr> <chr>
+#> 1 TrtA V1 TrtA V2 TrtA V2 - TrtA V1
+#> 2 TrtA V1 TrtA V3 TrtA V3 - TrtA V1
+#> 3 TrtA V1 TrtA V4 TrtA V4 - TrtA V1
+#> 4 TrtA V1 TrtB V1 TrtB V1 - TrtA V1
+#> 5 TrtA V1 TrtB V2 TrtB V2 - TrtA V1
+#> 6 TrtA V1 TrtB V3 TrtB V3 - TrtA V1
+#> 7 TrtA V1 TrtB V4 TrtB V4 - TrtA V1
+#> 8 TrtA V1 TrtC V1 TrtC V1 - TrtA V1
+#> 9 TrtA V1 TrtC V2 TrtC V2 - TrtA V1
+#> 10 TrtA V1 TrtC V3 TrtC V3 - TrtA V1
+# For each unique (ref, lvl) pair, check if kimma has it in the same or opposite
+# orientation using a canonical sorted-pair key.
+glme_pairs <- glme_contrast_names %>%
+ select(contrast_ref, contrast_lvl) %>%
+ mutate(canonical = paste(pmin(contrast_ref, contrast_lvl),
+ pmax(contrast_ref, contrast_lvl), sep = "|||"),
+ glme_dir = paste(contrast_ref, contrast_lvl, sep = "->"))
+
+km_pairs <- km_contrast_names %>%
+ select(contrast_ref, contrast_lvl) %>%
+ mutate(canonical = paste(pmin(contrast_ref, contrast_lvl),
+ pmax(contrast_ref, contrast_lvl), sep = "|||"),
+ km_dir = paste(contrast_ref, contrast_lvl, sep = "->"))
+
+pair_compare <- inner_join(glme_pairs, km_pairs, by = "canonical") %>%
+ mutate(direction_agree = glme_dir == km_dir)
+
+n_agree <- sum(pair_compare$direction_agree)
+n_flip <- sum(!pair_compare$direction_agree)
+n_total_pairs <- nrow(pair_compare)
+
+cat(sprintf("\nUnique contrast pairs found in both methods: %d\n", n_total_pairs))#>
+#> Unique contrast pairs found in both methods: 66
+
+#> Direction agrees (same ref→lvl): 66 / 66
+
+#> Direction flipped (kimma swapped): 0 / 66
+We join on (gene, contrast_ref, contrast_lvl), then also
+attempt a flipped join
+(gene, contrast_ref = kimma$contrast_lvl, contrast_lvl = kimma$contrast_ref)
+to catch any pairs where kimma assigned the opposite direction.
+Sign-correction is applied to flipped pairs before combining.
glme_std <- res_glme$lme_contrast %>%
+ filter(contrast_order == "first_order", !is.na(estimate)) %>%
+ select(gene,
+ contrast_ref_glme = contrast_ref,
+ contrast_lvl_glme = contrast_lvl,
+ estimate_glme = estimate,
+ t_glme = t.ratio,
+ p_glme = p.value)
+
+# kimma's 'statistic' column is the NEGATED t-statistic: statistic = –(estimate / SE).
+# To get a signed t comparable to geneLME's t.ratio, negate statistic.
+km_std <- res_kimma$lme.contrast %>%
+ mutate(t_signed = -statistic) %>%
+ select(gene,
+ contrast_ref_km = contrast_ref,
+ contrast_lvl_km = contrast_lvl,
+ estimate_kimma = estimate,
+ t_kimma = t_signed,
+ p_kimma = pval)
+
+# Forward join: same direction
+joined_fwd <- inner_join(
+ glme_std, km_std,
+ by = c("gene",
+ "contrast_ref_glme" = "contrast_ref_km",
+ "contrast_lvl_glme" = "contrast_lvl_km")
+) %>% mutate(direction = "same")
+
+# Flipped join: kimma has ref/lvl swapped relative to geneLME
+joined_flip <- inner_join(
+ glme_std, km_std,
+ by = c("gene",
+ "contrast_ref_glme" = "contrast_lvl_km",
+ "contrast_lvl_glme" = "contrast_ref_km")
+) %>%
+ mutate(
+ direction = "flipped",
+ estimate_kimma = -estimate_kimma,
+ t_kimma = -t_kimma
+ )
+
+joined_all <- bind_rows(joined_fwd, joined_flip)
+
+cat("Rows joined (same direction): ", nrow(joined_fwd), "\n")#> Rows joined (same direction): 3300
+
+#> Rows joined (flipped direction): 0
+
+#> Total matched rows: 3300
+
+#> Unique genes in comparison: 50
+
+#>
+#> Direction summary across all gene × contrast pairs:
+
+#>
+#> same
+#> 3300
+cor_all <- cor(joined_all$estimate_glme, joined_all$estimate_kimma, use = "complete.obs")
+mad_all <- mean(abs(joined_all$estimate_glme - joined_all$estimate_kimma), na.rm = TRUE)
+
+cor_fwd <- cor(joined_fwd$estimate_glme, joined_fwd$estimate_kimma, use = "complete.obs")
+cor_flip <- if (nrow(joined_flip) > 0) {
+ cor(joined_flip$estimate_glme, joined_flip$estimate_kimma, use = "complete.obs")
+} else {
+ NA_real_
+}
+
+p1 <- ggplot(joined_all, aes(x = estimate_glme, y = estimate_kimma, colour = direction)) +
+ geom_point(alpha = 0.2, size = 0.8) +
+ geom_abline(slope = 1, intercept = 0, colour = "firebrick",
+ linewidth = 0.8, linetype = "dashed") +
+ scale_colour_manual(values = c("same" = "#2c7bb6", "flipped" = "#d7191c")) +
+ annotate("text", x = -Inf, y = Inf, hjust = -0.1, vjust = 1.4,
+ label = sprintf("r = %.6f\nMAD = %.2e", cor_all, mad_all),
+ size = 3.5, family = "mono") +
+ labs(
+ title = "Contrast estimates after direction-aware join",
+ subtitle = sprintf("All matched pairs (n = %d) | blue = same direction, red = flipped",
+ nrow(joined_all)),
+ x = "geneLME estimate", y = "kimma estimate (sign-corrected)",
+ colour = "Direction"
+ ) +
+ theme_bw(base_size = 12) +
+ theme(legend.position = "bottom")
+
+p2 <- ggplot(joined_all, aes(x = t_glme, y = t_kimma, colour = direction)) +
+ geom_point(alpha = 0.2, size = 0.8) +
+ geom_abline(slope = 1, intercept = 0, colour = "firebrick",
+ linewidth = 0.8, linetype = "dashed") +
+ scale_colour_manual(values = c("same" = "#1a9641", "flipped" = "#d7191c")) +
+ labs(
+ title = "t-statistics after direction-aware join",
+ subtitle = sprintf("All matched pairs (n = %d)", nrow(joined_all)),
+ x = "geneLME t.ratio", y = "kimma statistic (sign-corrected)",
+ colour = "Direction"
+ ) +
+ theme_bw(base_size = 12) +
+ theme(legend.position = "bottom")
+
+gridExtra::grid.arrange(p1, p2, ncol = 2)# For reference: naive forward-only join and same-direction subset
+cor_naive <- cor(joined_fwd$estimate_glme, joined_fwd$estimate_kimma, use = "complete.obs")
+mad_naive <- mean(abs(joined_fwd$estimate_glme - joined_fwd$estimate_kimma), na.rm = TRUE)
+
+p3 <- ggplot(joined_fwd, aes(x = estimate_glme, y = estimate_kimma)) +
+ geom_point(alpha = 0.2, size = 0.8, colour = "#2c7bb6") +
+ geom_abline(slope = 1, intercept = 0, colour = "firebrick",
+ linewidth = 0.8, linetype = "dashed") +
+ annotate("text", x = -Inf, y = Inf, hjust = -0.1, vjust = 1.4,
+ label = sprintf("r = %.6f\nMAD = %.2e\nn = %d",
+ cor_naive, mad_naive, nrow(joined_fwd)),
+ size = 3.5, family = "mono") +
+ labs(
+ title = "Same-direction pairs only (ref/lvl agree)",
+ x = "geneLME estimate", y = "kimma estimate"
+ ) +
+ theme_bw(base_size = 12)
+
+p4 <- ggplot(joined_all, aes(x = estimate_glme, y = estimate_kimma)) +
+ geom_point(alpha = 0.2, size = 0.8, colour = "#2c7bb6") +
+ geom_abline(slope = 1, intercept = 0, colour = "firebrick",
+ linewidth = 0.8, linetype = "dashed") +
+ annotate("text", x = -Inf, y = Inf, hjust = -0.1, vjust = 1.4,
+ label = sprintf("r = %.6f\nMAD = %.2e\nn = %d",
+ cor_all, mad_all, nrow(joined_all)),
+ size = 3.5, family = "mono") +
+ labs(
+ title = "All pairs after direction correction",
+ x = "geneLME estimate", y = "kimma estimate (sign-corrected)"
+ ) +
+ theme_bw(base_size = 12)
+
+gridExtra::grid.arrange(p3, p4, ncol = 2)kable(
+ data.frame(
+ Metric = c(
+ "Total gene × contrast pairs joined",
+ "Same direction (ref/lvl agree)",
+ "Flipped direction (kimma has ref/lvl swapped)",
+ "% pairs with direction agreement",
+ "Estimate r (direction-corrected)",
+ "Estimate MAD (direction-corrected)",
+ "t-statistic r (direction-corrected)"
+ ),
+ Value = c(
+ nrow(joined_all),
+ nrow(joined_fwd),
+ nrow(joined_flip),
+ sprintf("%.1f%%", 100 * nrow(joined_fwd) / nrow(joined_all)),
+ sprintf("%.8f", cor_all),
+ sprintf("%.2e", mad_all),
+ sprintf("%.8f", cor(joined_all$t_glme, joined_all$t_kimma, use = "complete.obs"))
+ )
+ ),
+ col.names = c("Metric", "Value"),
+ caption = "Sign consistency summary: geneLME vs kimma (50 genes, 66 contrasts)"
+)| Metric | +Value | +
|---|---|
| Total gene × contrast pairs joined | +3300 | +
| Same direction (ref/lvl agree) | +3300 | +
| Flipped direction (kimma has ref/lvl swapped) | +0 | +
| % pairs with direction agreement | +100.0% | +
| Estimate r (direction-corrected) | +1.00000000 | +
| Estimate MAD (direction-corrected) | +1.16e-15 | +
| t-statistic r (direction-corrected) | +1.00000000 | +
++Interpretation: When kimma and geneLME assign the +same direction to a contrast pair, estimates should be numerically +identical (both use
+lme4+emmeans). Flipped +pairs reflect a difference in which cell was designated as “reference” — +after sign-correction the numerical values should agree equally well. +The proportion of flipped pairs reveals how often kimma’s internal +direction assignment differs fromgeneLME’s +contrast_specordering.
#> Running geneLME microbenchmark (5 reps each — this will take several minutes)...
+mb_glme <- microbenchmark(
+
+ glme_3 = geneLME(
+ dat = dat_bench, formula_str = "~ treatment * visit + age + sex + rNANgUl + (1|ptID)",
+ model_weights = TRUE, run_contrast = TRUE,
+ contrast_vars = "treatment:visit", contrast_spec = spec_3, n_cores = 8
+ ),
+ glme_6 = geneLME(
+ dat = dat_bench, formula_str = "~ treatment * visit + age + sex + rNANgUl + (1|ptID)",
+ model_weights = TRUE, run_contrast = TRUE,
+ contrast_vars = "treatment:visit", contrast_spec = spec_6, n_cores = 8
+ ),
+ glme_66 = geneLME(
+ dat = dat_bench, formula_str = "~ treatment * visit + age + sex + rNANgUl + (1|ptID)",
+ model_weights = TRUE, run_contrast = TRUE,
+ contrast_vars = "treatment:visit", contrast_spec = spec_full, n_cores = 8
+ ),
+
+ times = 5, unit = "sec"
+)
+
+print(summary(mb_glme))#> expr min lq mean median uq max neval
+#> 1 glme_3 17.2888 17.34261 17.38137 17.39034 17.42627 17.45885 5
+#> 2 glme_6 18.0317 18.05609 18.09280 18.05814 18.07696 18.24111 5
+#> 3 glme_66 32.5646 32.58537 32.60624 32.60095 32.63049 32.64978 5
+
+#> Running kimma microbenchmark (5 reps, 66 contrasts)...
+mb_kimma <- microbenchmark(
+ kimma_66 = suppressMessages(kmFit(
+ dat = dat_bench, run_lme = TRUE, use_weights = TRUE,
+ model = "~ treatment * visit + age + sex + rNANgUl + (1|ptID)",
+ run_contrast = TRUE, contrast_var = "treatment:visit",
+ patientID = "ptID", libraryID = "libID", processors = 8
+ )),
+ times = 5, unit = "sec"
+)
+
+print(summary(mb_kimma))#> expr min lq mean median uq max neval
+#> 1 kimma_66 32.00641 32.04734 32.11739 32.14782 32.14901 32.23636 5
+sum_glme <- summary(mb_glme) %>% as.data.frame() %>%
+ mutate(
+ version = "geneLME.R",
+ n_contrasts = c(3, 6, 66),
+ label = paste0("geneLME\n(", n_contrasts, " contrasts)")
+ )
+
+sum_kimma <- summary(mb_kimma) %>% as.data.frame() %>%
+ mutate(
+ version = "kimma::kmFit",
+ n_contrasts = 66,
+ label = "kimma\n(66 contrasts)"
+ )
+
+bench_all <- bind_rows(sum_glme, sum_kimma) %>%
+ mutate(label = factor(label, levels = c(
+ "geneLME\n(3 contrasts)",
+ "geneLME\n(6 contrasts)",
+ "geneLME\n(66 contrasts)",
+ "kimma\n(66 contrasts)"
+ )))
+
+pal <- c("geneLME.R" = "#2c7bb6",
+ "kimma::kmFit" = "#d7191c")
+
+ggplot(bench_all, aes(x = label, y = median, fill = version, colour = version)) +
+ geom_col(alpha = 0.75, width = 0.6) +
+ geom_errorbar(aes(ymin = lq, ymax = uq), width = 0.25, linewidth = 0.7) +
+ scale_fill_manual(values = pal) +
+ scale_colour_manual(values = pal) +
+ labs(
+ title = "Runtime comparison: geneLME vs kimma (2,000 genes, 5 reps)",
+ subtitle = "Error bars = lower / upper quartile | geneLME shown at 3, 6, and 66 contrasts",
+ x = NULL, y = "Median runtime (seconds)",
+ fill = "Method", colour = "Method"
+ ) +
+ theme_bw(base_size = 12) +
+ theme(legend.position = "bottom",
+ panel.grid.major.x = element_blank())kimma_med <- sum_kimma$median
+
+ggplot(sum_glme, aes(x = n_contrasts, y = median)) +
+ geom_ribbon(aes(ymin = lq, ymax = uq), alpha = 0.15, fill = "#2c7bb6") +
+ geom_line(colour = "#2c7bb6", linewidth = 0.9) +
+ geom_point(colour = "#2c7bb6", size = 3) +
+ geom_hline(yintercept = kimma_med, colour = "#d7191c",
+ linetype = "dashed", linewidth = 0.8) +
+ annotate("text", x = 66, y = kimma_med, vjust = -0.7, hjust = 1,
+ colour = "#d7191c", size = 3.5, label = "kimma (66 contrasts)") +
+ scale_x_continuous(breaks = c(3, 6, 66)) +
+ labs(
+ title = "geneLME runtime scaling with number of contrasts",
+ subtitle = "Shaded band = lower/upper quartile | Dashed red = kimma at 66 contrasts",
+ x = "Number of contrasts", y = "Median runtime (seconds)"
+ ) +
+ theme_bw(base_size = 12)kimma_med <- sum_kimma$median
+
+kable(
+ bench_all %>%
+ mutate(
+ speedup_vs_kimma = round(kimma_med / median, 2),
+ across(c(min, lq, median, mean, uq, max), ~ round(., 1))
+ ) %>%
+ select(
+ Method = version,
+ `N contrasts` = n_contrasts,
+ Min = min,
+ Q1 = lq,
+ Median = median,
+ Q3 = uq,
+ `Speedup vs kimma (66 contrasts)` = speedup_vs_kimma
+ ),
+ caption = "Runtime summary (seconds) — 2,000 genes, 5 repetitions",
+ na_string = "—"
+)| Method | +N contrasts | +Min | +Q1 | +Median | +Q3 | +Speedup vs kimma (66 contrasts) | +
|---|---|---|---|---|---|---|
| geneLME.R | +3 | +17.3 | +17.3 | +17.4 | +17.4 | +1.85 | +
| geneLME.R | +6 | +18.0 | +18.1 | +18.1 | +18.1 | +1.78 | +
| geneLME.R | +66 | +32.6 | +32.6 | +32.6 | +32.6 | +0.99 | +
| kimma::kmFit | +66 | +32.0 | +32.0 | +32.1 | +32.1 | +1.00 | +
n_same <- nrow(joined_fwd)
+n_flip <- nrow(joined_flip)
+n_total <- nrow(joined_all)
+pct_agree <- 100 * n_same / n_total
+
+speedup_3 <- kimma_med / sum_glme$median[sum_glme$n_contrasts == 3]
+speedup_6 <- kimma_med / sum_glme$median[sum_glme$n_contrasts == 6]
+speedup_66 <- kimma_med / sum_glme$median[sum_glme$n_contrasts == 66]
+
+cat(sprintf(
+"=== SIGN CONSISTENCY (geneLME vs kimma) ===
+ %d / %d (%.1f%%) gene×contrast pairs: same ref/lvl direction
+ %d / %d (%.1f%%) pairs: kimma has direction flipped relative to contrast_spec
+ After direction correction — estimate r = %.8f, MAD = %.2e
+
+=== SPEED (2,000 genes, 8 cores, 5 reps) ===
+ geneLME 3 contrasts median: %.1f s (%.2fx %s than kimma at 66)
+ geneLME 6 contrasts median: %.1f s (%.2fx %s than kimma at 66)
+ geneLME 66 contrasts median: %.1f s (%.2fx %s than kimma at 66)
+ kimma 66 contrasts median: %.1f s",
+ n_same, n_total, pct_agree,
+ n_flip, n_total, 100 - pct_agree,
+ cor_all, mad_all,
+ sum_glme$median[sum_glme$n_contrasts == 3],
+ speedup_3, ifelse(speedup_3 > 1, "faster", "slower"),
+ sum_glme$median[sum_glme$n_contrasts == 6],
+ speedup_6, ifelse(speedup_6 > 1, "faster", "slower"),
+ sum_glme$median[sum_glme$n_contrasts == 66],
+ speedup_66, ifelse(speedup_66 > 1, "faster", "slower"),
+ kimma_med
+))#> === SIGN CONSISTENCY (geneLME vs kimma) ===
+#> 3300 / 3300 (100.0%) gene×contrast pairs: same ref/lvl direction
+#> 0 / 3300 (0.0%) pairs: kimma has direction flipped relative to contrast_spec
+#> After direction correction — estimate r = 1.00000000, MAD = 1.16e-15
+#>
+#> === SPEED (2,000 genes, 8 cores, 5 reps) ===
+#> geneLME 3 contrasts median: 17.4 s (1.85x faster than kimma at 66)
+#> geneLME 6 contrasts median: 18.1 s (1.78x faster than kimma at 66)
+#> geneLME 66 contrasts median: 32.6 s (0.99x slower than kimma at 66)
+#> kimma 66 contrasts median: 32.1 s
+#> R version 4.5.1 (2025-06-13)
+#> Platform: aarch64-apple-darwin20
+#> Running under: macOS Sequoia 15.7.4
+#>
+#> Matrix products: default
+#> BLAS: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib
+#> LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
+#>
+#> locale:
+#> [1] en_US/en_US/en_US/C/en_US/en_US
+#>
+#> time zone: America/Los_Angeles
+#> tzcode source: internal
+#>
+#> attached base packages:
+#> [1] stats graphics grDevices utils datasets methods base
+#>
+#> other attached packages:
+#> [1] gridExtra_2.3 knitr_1.51 scales_1.4.0
+#> [4] tidyr_1.3.2 ggplot2_4.0.2 microbenchmark_1.5.0
+#> [7] kimma_1.6.3 future.apply_1.20.1 future_1.69.0
+#> [10] purrr_1.2.1 tibble_3.3.1 dplyr_1.2.0
+#> [13] broom.mixed_0.2.9.6 car_3.1-3 carData_3.0-5
+#> [16] emmeans_2.0.1 lme4_1.1-38 Matrix_1.7-4
+#>
+#> loaded via a namespace (and not attached):
+#> [1] gtable_0.3.6 xfun_0.56 bslib_0.10.0
+#> [4] lattice_0.22-7 vctrs_0.7.1 tools_4.5.1
+#> [7] Rdpack_2.6.4 generics_0.1.4 parallel_4.5.1
+#> [10] pkgconfig_2.0.3 data.table_1.18.2.1 RColorBrewer_1.1-3
+#> [13] S7_0.2.1 lifecycle_1.0.5 compiler_4.5.1
+#> [16] farver_2.1.2 codetools_0.2-20 htmltools_0.5.9
+#> [19] sass_0.4.10 yaml_2.3.12 Formula_1.2-5
+#> [22] pillar_1.11.1 furrr_0.3.1 nloptr_2.2.1
+#> [25] jquerylib_0.1.4 MASS_7.3-65 cachem_1.1.0
+#> [28] reformulas_0.4.3.1 iterators_1.0.14 boot_1.3-32
+#> [31] abind_1.4-8 foreach_1.5.2 nlme_3.1-168
+#> [34] parallelly_1.46.1 tidyselect_1.2.1 digest_0.6.39
+#> [37] mvtnorm_1.3-3 listenv_0.10.0 labeling_0.4.3
+#> [40] forcats_1.0.1 splines_4.5.1 fastmap_1.2.0
+#> [43] grid_4.5.1 cli_3.6.5 magrittr_2.0.4
+#> [46] utf8_1.2.6 broom_1.0.11 withr_3.0.2
+#> [49] backports_1.5.0 estimability_1.5.1 rmarkdown_2.30
+#> [52] globals_0.18.0 otel_0.2.0 coda_0.19-4.1
+#> [55] evaluate_1.0.5 rbibutils_2.4.1 doParallel_1.0.17
+#> [58] rlang_1.1.7 Rcpp_1.1.1 xtable_1.8-4
+#> [61] glue_1.8.0 minqa_1.2.8 jsonlite_2.0.0
+#> [64] R6_2.6.1
+++Companion to:
+geneLME.RLast +updated: 2026-02-20 For session continuity notes see +CLAUDE_NOTES_geneLME.md. For a worked tutorial with output +seegeneLME_tutorial.html(knitted from +geneLME_tutorial.Rmd).
| Function | +Role | +Called by | +
|---|---|---|
geneLME_contrast_spec() |
+Pre-run helper: inspect available contrast levels and understand how +to specify arguments | +User | +
geneLME_build_contrast_args() |
+Branch A helper: pre-computes contrasts_list +
+spec_lookup once before parallel dispatch |
+geneLME() |
+
geneLME() |
+User-facing entry point: validates inputs, slices data, sets +parallel plan | +User | +
geneLME_dispatch() |
+Launches future_lapply with explicit globals; clean
+parallel scope |
+geneLME() |
+
geneLME_fit() |
+Per-gene worker: fits lmer, extracts ANOVA +
+contrasts |
+geneLME_dispatch() |
+
geneLME_compiler() |
+Binds per-gene list results into four output data frames | +geneLME() |
+
geneLME_contrast_spec(targets, contrast_vars)A pre-run helper to enumerate available levels and understand how to
+construct contrast arguments before calling geneLME(). Two
+modes:
contrast_vars is a single "var_a:var_b"
+string. Returns a data.frame(contrast_ref, contrast_lvl)
+with all pairwise combinations of interaction cells. The user filters
+this to their contrasts of interest and passes it directly as
+contrast_spec to geneLME().
spec_template <- geneLME_contrast_spec(dat$targets, contrast_vars = "treatment:visit")
+# → data.frame with 66 rows (all pairs of 12 interaction cells)
+# contrast_ref contrast_lvl
+# 1 TrtA V1 TrtB V1
+# 2 TrtA V1 TrtC V1
+# ...
+
+my_spec <- spec_template %>%
+ filter(sub(".* ", "", contrast_ref) == sub(".* ", "", contrast_lvl), # same visit
+ sub(".* ", "", contrast_ref) %in% c("V2", "V3")) # only V2, V3
+# → 6 rows: the 3 cross-treatment pairs at V2, and 3 at V3contrast_vars is a character vector of one or more plain
+variable names (no :). Returns a named
+list, one data.frame(level) per variable. A
+message for each variable explains its role in subsequent
+geneLME() arguments. This list is a reference only — not
+passed to geneLME().
ref <- geneLME_contrast_spec(dat$targets, contrast_vars = c("treatment", "visit"))
+
+# Message for treatment (position 1 — primary):
+# 'treatment' — primary contrast variable (contrast_vars[1] in geneLME).
+# 3 levels (alphabetical order = position order for contrasts_primary vectors):
+# 1. TrtA 2. TrtB 3. TrtC
+# → contrast vectors passed to contrasts_primary must have length 3
+# e.g. 'TrtC vs TrtA' = c(-1, 0, 1)
+
+# Message for visit (position 2 — secondary 'by' variable):
+# 'visit' — secondary 'by' grouping variable (contrast_vars[2] in geneLME).
+# 4 levels: V1, V2, V3, V4
+# → pass a subset to contrast_var_2_levels in geneLME() to restrict
+# which groups the primary contrasts are computed within.
+
+ref$treatment # data.frame: TrtA, TrtB, TrtC
+ref$visit # data.frame: V1, V2, V3, V4Note: mixing interaction and plain variable names in +one call raises an error — call separately.
+geneLME() — Main
+FunctiongeneLME(dat, formula_str,
+ model_weights = NULL,
+ run_contrast = NULL,
+ contrast_vars = NULL,
+ contrast_var_2_levels = NULL,
+ contrast_spec = NULL,
+ contrasts_primary = NULL,
+ contrasts_secondary = NULL,
+ fdr_method = "BH", # any method accepted by p.adjust()
+ n_cores = NULL)dat — EList-like list with three elements:
| Element | +Type | +Description | +
|---|---|---|
dat$E |
+matrix (genes × samples) | +log2 expression values | +
dat$weights |
+matrix (genes × samples), optional | +voom precision weights | +
dat$targets |
+data.frame (samples × covariates) | +sample metadata; all formula and contrast variables must be columns +here | +
Named list of five elements:
+| Element | +Contents | +
|---|---|
lme_anova |
+One row per model term per gene: term,
+statistic, df, p.value,
+p.value_adj (FDR within term), gene,
+model_status, predictor_class,
+Estimate, Estimate_SE |
+
lme_contrast |
+One row per contrast per gene: contrast,
+contrast_ref, contrast_lvl (Branch A
+first-order only; NA otherwise), estimate, SE,
+df, t.ratio, p.value,
+p.value_adj (FDR within contrast × order),
+contrast_order ("first_order" /
+"second_order"), gene. Branch B also has a
+visit column. |
+
lme_fit |
+One row per gene: gene, AIC |
+
lme_err |
+Named character vector: gene → "success",
+"singular_fit", or unexpected error message. A
+model_status column carrying the same value also appears in
+lme_anova and lme_contrast. |
+
contrast_spec |
+Indexed copy of the filtered contrast_spec (Branch A)
+or data.frame(contrast_index, contrast_name) from
+contrasts_primary names (Branch B); NULL when
+no contrasts run. On soft-fail (wrong-length
+contrasts_secondary), all other elements are NULL and only
+this is returned. |
+
dat$targetsdat$weights present and dimension-matched if
+model_weights = TRUEcontrast_vars not NULL when
+run_contrast = TRUEcontrast_vars contains : (hard error —
+emmeans would otherwise silently use additive margins,
+which is statistically misleading)contrast_spec required (not NULL) when
+contrast_vars contains :contrast_spec has correct columns
+(contrast_ref, contrast_lvl)contrast_vars component variables present in
+dat$targetscontrast_var_2_levels values valid against actual data
+levelsfdr_method validated against
+p.adjust.methodscontrasts_secondary vector lengths
+(soft-fail, not hard stop): each vector must have length ==
+nrow(contrast_spec) (Branch A) or
+length(contrasts_primary) (Branch B). On mismatch returns
+early with only $contrast_spec populated.When to use: model formula contains an interaction
+term (treatment * visit or treatment:visit)
+and you want specific pairwise contrasts across particular interaction
+cells.
spec_template <- geneLME_contrast_spec(dat$targets, "treatment:visit")
+# Returns data.frame of all 66 pairwise interaction cell combinations
+
+my_spec <- spec_template %>%
+ filter(sub(".* ", "", contrast_ref) == sub(".* ", "", contrast_lvl), # same visit only
+ sub(".* ", "", contrast_ref) %in% c("V2", "V3")) # visits V2 and V3 only
+# 6 rowsresult_A <- geneLME(
+ dat = dat,
+ formula_str = "~ treatment * visit + age + sex + rNANgUl + (1|ptID)",
+ model_weights = TRUE,
+ run_contrast = TRUE,
+ contrast_vars = "treatment:visit", # single ":" string → triggers Branch A
+ contrast_spec = my_spec, # required; each row defines one contrast
+ n_cores = 10
+)lme_contrast output (6 contrasts × n genes rows)contrast_ref = the −1 cell; contrast_lvl =
+the +1 cell. Together they make the sign of estimate
+unambiguous without parsing the contrast label string.
contrast contrast_ref contrast_lvl estimate SE df t.ratio p.value contrast_order gene
+TrtB V2 - TrtA V2 TrtA V2 TrtB V2 -0.040 0.960 93.04 -0.042 0.967 first_order gene01
+TrtC V2 - TrtA V2 TrtA V2 TrtC V2 0.092 0.942 97.37 0.098 0.922 first_order gene01
+TrtC V2 - TrtB V2 TrtB V2 TrtC V2 0.132 0.993 89.10 0.133 0.894 first_order gene01
+TrtB V3 - TrtA V3 TrtA V3 TrtB V3 0.191 0.934 98.51 0.205 0.838 first_order gene01
+TrtC V3 - TrtA V3 TrtA V3 TrtC V3 1.625 0.945 90.82 1.721 0.089 first_order gene01
+TrtC V3 - TrtB V3 TrtB V3 TrtC V3 1.434 0.958 90.33 1.497 0.138 first_order gene01
+At validation time, geneLME() prints the numbered
+first-order list (in contrast_spec row order) to guide
+vector construction. Secondary vectors must have length
+nrow(contrast_spec).
result_A2 <- geneLME(
+ dat = dat,
+ formula_str = "~ treatment * visit + age + sex + rNANgUl + (1|ptID)",
+ model_weights = TRUE,
+ run_contrast = TRUE,
+ contrast_vars = "treatment:visit",
+ contrast_spec = my_spec, # 6 rows, order defines vector positions
+ contrasts_secondary = list(
+ # my_spec rows: 1=TrtB V2-TrtA V2, 2=TrtC V2-TrtA V2, 3=TrtC V2-TrtB V2,
+ # 4=TrtB V3-TrtA V3, 5=TrtC V3-TrtA V3, 6=TrtC V3-TrtB V3
+ "TrtA vs TrtB: V3 vs V2" = c(1, 0, -1, 0, 0, 0), # row1 − row3: V2 vs V3 delta
+ "TrtA vs TrtC: V3 vs V2" = c(0, 1, 0,-1, 0, 0) # row2 − row4
+ ),
+ n_cores = 10
+)
+# lme_contrast: first_order (6 × n_genes) + second_order (2 × n_genes) rowsValidation message printed:
+contrasts_secondary will be applied to the first-order interaction contrasts
+in the order they appear in contrast_spec (6 contrasts):
+1. TrtB V2 - TrtA V2
+2. TrtC V2 - TrtA V2
+3. TrtC V2 - TrtB V2
+4. TrtB V3 - TrtA V3
+5. TrtC V3 - TrtA V3
+6. TrtC V3 - TrtB V3
+Ensure contrasts_secondary vectors have length 6, with each element
+corresponding to the contrast at that position.
+When to use: model is additive (no interaction), and +you want named contrasts on the marginal means of one variable, +optionally evaluated within specific levels of a second variable.
+ref <- geneLME_contrast_spec(dat$targets, contrast_vars = c("treatment", "visit"))
+# treatment levels (alphabetical): TrtA[1], TrtB[2], TrtC[3]
+# → contrasts_primary vectors must have length 3, positions = [TrtA, TrtB, TrtC]
+# visit levels: V1, V2, V3, V4 → pass subset to contrast_var_2_levelsresult_B <- geneLME(
+ dat = dat,
+ formula_str = "~ treatment + visit + age + sex + rNANgUl + (1|ptID)",
+ model_weights = TRUE,
+ run_contrast = TRUE,
+ contrast_vars = c("treatment", "visit"), # [1] = primary; [2] = by-variable
+ contrast_var_2_levels = c("V2", "V3"), # restrict output to these visit levels
+ contrasts_primary = list(
+ "TrtC vs TrtA" = c(-1, 0, 1), # positions: [TrtA=−1, TrtB=0, TrtC=+1]
+ "TrtB vs TrtA" = c(-1, 1, 0)
+ ),
+ contrasts_secondary = list(
+ "TrtC vs TrtB" = c(1, -1) # length = number of primary contrasts (2)
+ ),
+ n_cores = 10
+)lme_contrast output (per gene: 2 primary × 2 visits + 1
+secondary × 2 visits = 6 rows) contrast visit estimate SE df t.ratio p.value contrast_order gene
+TrtC vs TrtA V2 0.814 0.457 101.6 1.782 0.078 first_order gene01
+TrtB vs TrtA V2 0.009 0.457 102.3 0.019 0.984 first_order gene01
+TrtC vs TrtA V3 0.814 0.457 101.6 1.782 0.078 first_order gene01
+TrtB vs TrtA V3 0.009 0.457 102.3 0.019 0.984 first_order gene01
+TrtC vs TrtB V2 0.805 0.462 99.1 1.742 0.085 second_order gene01
+TrtC vs TrtB V3 0.805 0.462 99.1 1.742 0.085 second_order gene01
+++Note: In an additive model, estimates are identical +across V2 and V3 —
+contrast_var_2_levelsrestricts +which visit cellsemmeansevaluates at, but since +there is no interaction the treatment effect is constant across visits. +This filter is most useful to control the number of output rows and +limit multiple testing burden.
lme_anova Output
+StructureOne row per model term per gene. Key columns:
+| Column | +Description | +
|---|---|
term |
+Predictor name (e.g. "treatment",
+"treatment:visit", "age") |
+
statistic |
+Chi-square statistic from car::Anova() |
+
df |
+Degrees of freedom | +
p.value |
+p-value | +
gene |
+Gene identifier | +
model_status |
+"success" or error message |
+
predictor_class |
+"continuous", "two-level-categorical",
+"multi-level-categorical", or
+"interaction" |
+
Estimate_source |
+"lme_summary" (value in
+Estimate/Estimate_SE) or
+"seeContrasts" (NA — use contrast output for estimate) |
+
Estimate |
+Coefficient estimate where available; NA for multi-level or complex +interactions | +
Estimate_SE |
+Standard error of estimate where available | +
Per-gene outcomes are tracked via a model_status column
+in both lme_anova and lme_contrast, and
+summarised in lme_err (named character vector, gene →
+status). Possible values:
"success" — model converged cleanly;
+all results are reliable."singular_fit" —
+isSingular() was TRUE; random effect variance
+hit its boundary (zero). Fixed effect estimates and contrasts are still
+returned and are numerically valid. Common with small N or highly
+collinear random/fixed effects. Filter downstream if desired.tryCatch
+caught an unexpected error; rows for that gene contain NAs.Note: singular fits were previously treated as a hard
+stop() that excluded the gene entirely. They are now
+flagged and returned, giving users the choice to retain or filter.
table(result$lme_err)
+non_success <- names(result$lme_err)[result$lme_err != "success"]
+
+# Filter to clean fits only downstream:
+result$lme_contrast %>% filter(model_status == "success")geneLME fits one linear mixed effects model per gene
+across a full RNA-seq expression matrix, extracting ANOVA tables and
+user-defined emmeans-based contrasts in parallel via the
+future framework. It is designed around the
+limma EList data structure.
lmer models per gene in parallel
+(future_lapply)dat$weights)car::Anova() type II ANOVA tables with
+per-predictor coefficient estimatescontrast_spec data
+framecontrasts_secondary: returns
+$contrast_spec (indexed) for debugging without running
+modelsAll examples use a simulated EList-like object with a
+3-treatment × 4-visit paired design (10 patients, 120
+samples, 50 genes). Genes 1–10 have a simulated TrtC:V3
+upregulation of +2.5 log2 units to provide a detectable signal in Branch
+A examples.
set.seed(42)
+
+n_patients <- 10
+treatments <- c("TrtA", "TrtB", "TrtC")
+visits <- c("V1", "V2", "V3", "V4")
+n_genes <- 50
+
+# Patient-level covariates
+patient_meta <- data.frame(
+ ptID = paste0("pt", sprintf("%02d", 1:n_patients)),
+ sex = factor(sample(c("M", "F"), n_patients, replace = TRUE)),
+ age = round(rnorm(n_patients, mean = 38, sd = 10)),
+ stringsAsFactors = FALSE
+)
+
+# Sample metadata (targets)
+targets <- expand.grid(
+ ptID = paste0("pt", sprintf("%02d", 1:n_patients)),
+ treatment = treatments,
+ visit = visits,
+ stringsAsFactors = FALSE
+) %>%
+ arrange(ptID, treatment, visit) %>%
+ left_join(patient_meta, by = "ptID") %>%
+ mutate(
+ sample_id = paste(ptID, treatment, visit, sep = "_"),
+ rNANgUl = rnorm(n(), mean = 5, sd = 1),
+ percent_duplication = runif(n(), min = 0.05, max = 0.55),
+ median_cv_coverage = rnorm(n(), mean = 0.85, sd = 0.08),
+ lib.size = round(rnorm(n(), mean = 20e6, sd = 3e6)),
+ norm.factors = rnorm(n(), mean = 1, sd = 0.05)
+ )
+rownames(targets) <- targets$sample_id
+
+# Expression matrix
+n_samples <- nrow(targets)
+E_mat <- matrix(rnorm(n_genes * n_samples, mean = 8, sd = 2),
+ nrow = n_genes, ncol = n_samples,
+ dimnames = list(paste0("gene", sprintf("%02d", 1:n_genes)),
+ targets$sample_id))
+
+# Simulated TrtC:V3 effect on genes 1-10
+trtC_v3 <- which(targets$treatment == "TrtC" & targets$visit == "V3")
+E_mat[1:10, trtC_v3] <- E_mat[1:10, trtC_v3] + 2.5
+
+# Patient random effect
+for (pt in unique(targets$ptID)) {
+ idx <- which(targets$ptID == pt)
+ E_mat[, idx] <- E_mat[, idx] + rnorm(1, 0, 1)
+}
+
+# Voom-like precision weights
+W_mat <- matrix(abs(rnorm(n_genes * n_samples, mean = 1, sd = 0.1)),
+ nrow = n_genes, ncol = n_samples,
+ dimnames = dimnames(E_mat))
+
+# EList-like object
+dat <- list(E = E_mat, weights = W_mat, targets = targets)
+
+# Small subset for faster examples
+dat_sub <- list(E = dat$E[1:10, ], weights = dat$weights[1:10, ], targets = dat$targets)
+
+cat("Design:", n_patients, "patients ×", length(treatments), "treatments ×",
+ length(visits), "visits =", n_samples, "samples\n")#> Design: 10 patients × 3 treatments × 4 visits = 120 samples
+
+#> Genes: 50 (full) / 10 (subset used in examples)
+geneLME_contrast_spec() — Inspect Levels Before
+RunningBefore calling geneLME(), use
+geneLME_contrast_spec() to enumerate available contrast
+levels and understand exactly how to construct your contrast arguments.
+This function has two modes depending on whether
+contrast_vars contains ":".
Pass a single "var_a:var_b" string to get all pairwise
+combinations of interaction cells. This returns a
+data.frame that you filter down to the specific contrasts
+of interest and pass directly as contrast_spec to
+geneLME().
spec_template <- geneLME_contrast_spec(
+ targets = dat$targets,
+ contrast_vars = "treatment:visit"
+)
+
+# spec_template has two columns: contrast_ref and contrast_lvl.
+# contrast_index is NOT included here — it is added by geneLME() to its $contrast_spec
+# output element once it receives your filtered spec.
+cat("Total pairwise combinations:", nrow(spec_template), "\n\n")#> Total pairwise combinations: 66
+
+Filter to the contrasts of interest — here, same-visit +cross-treatment comparisons at V2 and V3:
+my_spec <- spec_template %>%
+ filter(
+ sub(".* ", "", contrast_ref) == sub(".* ", "", contrast_lvl), # same visit
+ sub(".* ", "", contrast_ref) %in% c("V2", "V3") # V2 and V3 only
+ )
+
+cat("Filtered to", nrow(my_spec), "contrasts:\n")#> Filtered to 6 contrasts:
+
+#> contrast_ref contrast_lvl
+#> 1 TrtA V2 TrtB V2
+#> 2 TrtA V2 TrtC V2
+#> 3 TrtB V2 TrtC V2
+#> 4 TrtA V3 TrtB V3
+#> 5 TrtA V3 TrtC V3
+#> 6 TrtB V3 TrtC V3
+Pass a character vector of plain variable names (no :).
+Returns a named list — one
+data.frame(level) per variable — and prints a message for
+each variable explaining its role in geneLME()
+arguments.
The printed messages explain:
+treatment (position 1, primary): its
+levels define the length and position order of
+contrasts_primary vectors in geneLME()visit (position 2, secondary): its
+levels can be subset via contrast_var_2_levels to restrict
+which groups the primary contrasts are computed within#> treatment levels (contrasts_primary vector positions):
+
+#> level
+#> 1 TrtA
+#> 2 TrtB
+#> 3 TrtC
+
+#>
+#> visit levels (available for contrast_var_2_levels filtering):
+
+#> level
+#> 1 V1
+#> 2 V2
+#> 3 V3
+#> 4 V4
+contrast_specUse when: the model formula contains an interaction
+term (treatment * visit or treatment:visit)
+and you want specific, explicitly-defined pairwise contrasts across
+interaction cells.
Under the hood, geneLME_fit() computes the full
+emmeans object for the interaction, then builds one named
+contrast vector per row of contrast_spec — setting the
+reference cell to −1, the comparison cell to +1, and all others to 0.
+This guarantees that only the contrasts you define are run.
result_A <- geneLME(
+ dat = dat_sub,
+ formula_str = "~ treatment * visit + age + sex + rNANgUl + percent_duplication + median_cv_coverage + (1|ptID)",
+ model_weights = TRUE,
+ run_contrast = TRUE,
+ contrast_vars = "treatment:visit", # single ":" string → Branch A
+ contrast_spec = my_spec, # the filtered data.frame from geneLME_contrast_spec()
+ n_cores = 2
+)#> Output elements: lme_anova, lme_contrast, lme_fit, lme_err, contrast_spec
+
+#> lme_fit (AIC per gene):
+
+#> gene AIC
+#> 1 gene01 532.8037
+#> 2 gene02 532.7433
+#> 3 gene03 545.1509
+#> 4 gene04 511.5349
+#> 5 gene05 531.4956
+#> 6 gene06 524.0391
+#> 7 gene07 523.6012
+#> 8 gene08 504.2733
+#> 9 gene09 495.8422
+#> 10 gene10 534.7778
+
+#>
+#> Model status summary:
+
+#>
+#> success
+#> 10
+# $contrast_spec is the indexed copy of my_spec attached by geneLME().
+# contrast_index here is 1:nrow(my_spec) — the row position within the filtered
+# spec. This is the stable index you should use when building contrasts_secondary vectors.
+cat("\n$contrast_spec (indexed by geneLME):\n")#>
+#> $contrast_spec (indexed by geneLME):
+
+#> contrast_index contrast_ref contrast_lvl
+#> 1 1 TrtA V2 TrtB V2
+#> 2 2 TrtA V2 TrtC V2
+#> 3 3 TrtB V2 TrtC V2
+#> 4 4 TrtA V3 TrtB V3
+#> 5 5 TrtA V3 TrtC V3
+#> 6 6 TrtB V3 TrtC V3
+lme_anova — ANOVA tableOne row per model term per gene. predictor_class
+classifies each term, and Estimate_source indicates whether
+a coefficient estimate is available directly
+("lme_summary") or should be read from the contrast output
+("seeContrasts"). p.value_adj is the
+FDR-adjusted p-value computed across all genes within each model
+term (Benjamini-Hochberg by default).
#> ANOVA rows: 80
+cat(" (", length(unique(result_A$lme_anova$term)), "terms ×",
+ length(unique(result_A$lme_anova$gene)), "genes)\n\n")#> ( 8 terms × 10 genes)
+# Show one gene in full
+result_A$lme_anova %>%
+ filter(gene == "gene01") %>%
+ select(term, statistic, df, p.value, p.value_adj, predictor_class, Estimate_source, Estimate, Estimate_SE)lme_contrast — contrast resultsOne row per defined contrast per gene. contrast_order is
+always "first_order" in a basic Branch A call (no
+second-order contrasts specified). p.value_adj is
+FDR-adjusted across all genes within each contrast ×
+contrast_order combination.
contrast_ref and contrast_lvl are populated
+for Branch A first-order contrasts — they record
+exactly which interaction cell is the reference (−1) and which is the
+comparison (+1), eliminating any ambiguity in how the estimate is
+signed. For second-order contrasts and all Branch B contrasts these
+columns are NA (no single ref/lvl pair applies).
#> Contrast rows: 60
+cat(" (", length(unique(result_A$lme_contrast$contrast)), "contrasts ×",
+ length(unique(result_A$lme_contrast$gene)), "genes)\n\n")#> ( 6 contrasts × 10 genes)
+result_A$lme_contrast %>%
+ filter(gene == "gene01") %>%
+ select(contrast, contrast_ref, contrast_lvl, estimate, SE, df, t.ratio, p.value, p.value_adj, contrast_order)Notice that gene01 shows large, significant effects at V3 (TrtC vs
+others) but not at V2, consistent with the simulated
+TrtC:V3 upregulation.
Second-order contrasts are contrasts-of-contrasts: each operates on +the vector of first-order contrast estimates, allowing you to ask +whether a treatment difference changes across visits.
+At validation time, geneLME() prints the numbered
+first-order list in contrast_spec row order, so you can
+verify the position indices before constructing your secondary vectors.
+Secondary vectors must have length equal to
+nrow(contrast_spec).
# At validation time, geneLME() prints the first-order contrast list in my_spec row
+# order. You can also inspect result_A$contrast_spec for the indexed row ordering.
+# With my_spec as defined above (same-visit, V2 and V3), alphabetical ordering gives:
+# row 1: TrtB V2 - TrtA V2
+# row 2: TrtC V2 - TrtA V2
+# row 3: TrtC V2 - TrtB V2
+# row 4: TrtB V3 - TrtA V3
+# row 5: TrtC V3 - TrtA V3
+# row 6: TrtC V3 - TrtB V3
+#
+# Second-order questions:
+# "Did the TrtA vs TrtB difference change from V2 to V3?" → row4 − row1
+# "Did the TrtA vs TrtC difference change from V2 to V3?" → row5 − row2
+#
+# Verify with: result_A$contrast_spec
+
+result_A2 <- geneLME(
+ dat = dat_sub,
+ formula_str = "~ treatment * visit + age + sex + rNANgUl + percent_duplication + median_cv_coverage + (1|ptID)",
+ model_weights = TRUE,
+ run_contrast = TRUE,
+ contrast_vars = "treatment:visit",
+ contrast_spec = my_spec,
+ contrasts_secondary = list(
+ "TrtA vs TrtB: V3 minus V2 effect" = c(-1, 0, 0, 1, 0, 0), # row4 − row1
+ "TrtA vs TrtC: V3 minus V2 effect" = c(0, -1, 0, 0, 1, 0) # row5 − row2
+ ),
+ n_cores = 2
+)#> Contrast order breakdown:
+
+#>
+#> first_order second_order
+#> 60 20
+
+#>
+#> Second-order results for gene01 and gene09 (strongest TrtC:V3 effect):
+# contrast_ref/contrast_lvl are NA for second-order rows (no single ref/lvl pair)
+result_A2$lme_contrast %>%
+ filter(contrast_order == "second_order", gene %in% c("gene01", "gene09")) %>%
+ select(gene, contrast, contrast_ref, contrast_lvl, estimate, SE, df, t.ratio, p.value)$contrast_specWhen there are many first-order contrasts, hand-coding the position
+of each −1/+1 in every secondary vector is
+error-prone. After a successful geneLME() call (or a
+soft-fail call — see Input Validation),
+the $contrast_spec element of the returned list contains an
+indexed copy of contrast_spec with a
+contrast_index column added by geneLME().
+contrast_index is simply 1:nrow(contrast_spec)
+— the row positions in the filtered spec you passed in. Use these as
+stable handles when building contrasts_secondary vectors
+programmatically.
Workflow: run geneLME() once (or use a
+dry-run via a wrong-length contrasts_secondary soft-fail),
+inspect result$contrast_spec to confirm the row ordering
+and indices, then build and pass the correct
+contrasts_secondary list.
Example scenario: within each treatment pair, ask
+whether the between-treatment difference changed from V2 to V3 — i.e.,
+(TrtX−TrtY at V3) − (TrtX−TrtY at V2).
# result_A$contrast_spec is the indexed version of my_spec as seen by geneLME().
+# contrast_index = 1:nrow(my_spec) (simple row positions after filtering).
+cat("Indexed contrast_spec from result_A:\n")#> Indexed contrast_spec from result_A:
+
+#> contrast_index contrast_ref contrast_lvl
+#> 1 1 TrtA V2 TrtB V2
+#> 2 2 TrtA V2 TrtC V2
+#> 3 3 TrtB V2 TrtC V2
+#> 4 4 TrtA V3 TrtB V3
+#> 5 5 TrtA V3 TrtC V3
+#> 6 6 TrtB V3 TrtC V3
+# Annotate with the visit for each contrast (extracted from the interaction label)
+indexed_spec <- result_A$contrast_spec %>%
+ mutate(visit = sub(".* ", "", contrast_ref)) # extract visit from the interaction label
+
+print(indexed_spec)#> contrast_index contrast_ref contrast_lvl visit
+#> 1 1 TrtA V2 TrtB V2 V2
+#> 2 2 TrtA V2 TrtC V2 V2
+#> 3 3 TrtB V2 TrtC V2 V2
+#> 4 4 TrtA V3 TrtB V3 V3
+#> 5 5 TrtA V3 TrtC V3 V3
+#> 6 6 TrtB V3 TrtC V3 V3
+# Build a two-column lookup: one contrast_index per sign, one row per second-order contrast.
+# group_by the treatment pair (shared across visits); within each group there are exactly
+# two rows — one for V2 and one for V3 — giving the index_neg (V2, subtracted) and
+# index_pos (V3, added).
+secondary_lookup <- indexed_spec %>%
+ # Create a treatment-pair label that is the same regardless of visit
+ mutate(trt_pair = paste(sub(" .*", "", contrast_ref), # var_a from ref
+ sub(" .*", "", contrast_lvl), # var_a from lvl
+ sep = " vs ")) %>%
+ arrange(trt_pair, visit) %>% # V2 before V3 (alphabetical)
+ group_by(trt_pair) %>%
+ summarise(
+ secondary_contrast_name = paste0("(", last(contrast_lvl), " - ", last(contrast_ref), ")",
+ " - ",
+ "(", first(contrast_lvl), " - ", first(contrast_ref), ")"),
+ index_neg = first(contrast_index), # V2 row → subtracted (−1)
+ index_pos = last(contrast_index), # V3 row → added (+1)
+ .groups = "drop"
+ )
+
+print(secondary_lookup)#> # A tibble: 3 × 4
+#> trt_pair secondary_contrast_name index_neg index_pos
+#> <chr> <chr> <int> <int>
+#> 1 TrtA vs TrtB (TrtB V3 - TrtA V3) - (TrtB V2 - TrtA V2) 1 4
+#> 2 TrtA vs TrtC (TrtC V3 - TrtA V3) - (TrtC V2 - TrtA V2) 2 5
+#> 3 TrtB vs TrtC (TrtC V3 - TrtB V3) - (TrtC V2 - TrtB V2) 3 6
+# Format as a named list of zero vectors with −1/+1 placed by contrast_index.
+# The vector length must equal nrow(my_spec) — the number of first-order contrasts.
+n_first_order <- nrow(my_spec)
+
+contrasts_secondary_prog <- setNames(
+ lapply(seq_len(nrow(secondary_lookup)), function(i) {
+ v <- rep(0, n_first_order)
+ # contrast_index IS the row position (1:nrow), so it can be used directly.
+ # Using which() for robustness in case the data.frame is reordered.
+ v[which(indexed_spec$contrast_index == secondary_lookup$index_neg[i])] <- -1
+ v[which(indexed_spec$contrast_index == secondary_lookup$index_pos[i])] <- 1
+ v
+ }),
+ secondary_lookup$secondary_contrast_name
+)
+
+# Inspect: each vector should have exactly one −1 and one +1
+print(contrasts_secondary_prog)#> $`(TrtB V3 - TrtA V3) - (TrtB V2 - TrtA V2)`
+#> [1] -1 0 0 1 0 0
+#>
+#> $`(TrtC V3 - TrtA V3) - (TrtC V2 - TrtA V2)`
+#> [1] 0 -1 0 0 1 0
+#>
+#> $`(TrtC V3 - TrtB V3) - (TrtC V2 - TrtB V2)`
+#> [1] 0 0 -1 0 0 1
+result_A_prog <- geneLME(
+ dat = dat_sub,
+ formula_str = "~ treatment * visit + age + sex + rNANgUl + percent_duplication + median_cv_coverage + (1|ptID)",
+ model_weights = TRUE,
+ run_contrast = TRUE,
+ contrast_vars = "treatment:visit",
+ contrast_spec = my_spec,
+ contrasts_secondary = contrasts_secondary_prog,
+ n_cores = 2
+)
+
+cat("Second-order results for gene01:\n")#> Second-order results for gene01:
+result_A_prog$lme_contrast %>%
+ filter(contrast_order == "second_order", gene == "gene01") %>%
+ select(gene, contrast, contrast_ref, contrast_lvl, estimate, SE, df, t.ratio, p.value)++Key design note:
+geneLME()appends +contrast_index(=1:nrow(contrast_spec)) to +its$contrast_specoutput element — not to the +contrast_specdata frame you pass in. This means +contrast_indexalways reflects the row positions in +your filtered spec, never in the full unfiltered template. The +which(indexed_spec$contrast_index == ...)lookup maps those +indices back to row positions for vector construction, and is robust to +any reordering of the spec data frame.
Use when: the model is additive (no interaction
+term), and you want named contrasts on the marginal means of one
+variable. Optionally, compute those contrasts within specific levels of
+a second grouping variable (contrast_var_2_levels), and
+optionally follow up with second-order contrasts across the primary
+contrast estimates.
Under the hood, geneLME_fit() calls
+emmeans() with a spec formula derived from
+contrast_vars, using contrast_var_2_levels as
+the at = filter. contrasts_primary vectors are
+passed directly to emmeans::contrast().
Before specifying contrasts_primary, confirm the level
+ordering emmeans will use — it is alphabetical by default:
treatment levels are TrtA[1],
+TrtB[2], TrtC[3] — so
+contrasts_primary vectors must have length 3 with positions
+corresponding to those labels:
"TrtC vs TrtA" = c(-1, 0, 1) # TrtA=−1, TrtB=0, TrtC=+1
+"TrtB vs TrtA" = c(-1, 1, 0) # TrtA=−1, TrtB=+1, TrtC=0
+result_B <- geneLME(
+ dat = dat_sub,
+ formula_str = "~ treatment + visit + age + sex + rNANgUl + percent_duplication + median_cv_coverage + (1|ptID)",
+ model_weights = TRUE,
+ run_contrast = TRUE,
+ contrast_vars = c("treatment", "visit"), # [1] primary; [2] 'by' grouping variable
+ contrast_var_2_levels = c("V2", "V3"), # restrict to these visit levels
+ contrasts_primary = list(
+ "TrtC vs TrtA" = c(-1, 0, 1),
+ "TrtB vs TrtA" = c(-1, 1, 0)
+ ),
+ n_cores = 2
+)#> Contrast rows: 40
+cat(" (", length(unique(result_B$lme_contrast$contrast)), "contrasts ×",
+ length(contrast_var_2_levels <- c("V2","V3")), "visits ×",
+ length(unique(result_B$lme_contrast$gene)), "genes)\n\n")#> ( 2 contrasts × 2 visits × 10 genes)
+# contrast_ref and contrast_lvl are NA for Branch B (named contrast vectors,
+# not a ref/lvl spec frame)
+result_B$lme_contrast %>%
+ filter(gene == "gene01") %>%
+ select(contrast, contrast_ref, contrast_lvl, visit, estimate, SE, df, t.ratio, p.value, contrast_order)++Note: In an additive model, treatment estimates are +identical across V2 and V3 — there is no interaction, so the treatment +effect does not depend on visit.
+contrast_var_2_levels+controls which visit cells are included in the output, which is +useful for limiting the number of tests and focusing results on visits +of biological interest.
A second-order contrast operates on the vector of first-order
+contrast estimates. Here, contrasts_secondary must have
+length equal to the number of primary contrasts (2 in this example). The
+result asks: “Is the TrtC vs TrtA difference significantly
+larger than the TrtB vs TrtA difference?” — i.e., is there
+a differential treatment response between TrtB and TrtC relative to
+TrtA?
result_B2 <- geneLME(
+ dat = dat_sub,
+ formula_str = "~ treatment + visit + age + sex + rNANgUl + percent_duplication + median_cv_coverage + (1|ptID)",
+ model_weights = TRUE,
+ run_contrast = TRUE,
+ contrast_vars = c("treatment", "visit"),
+ contrast_var_2_levels = c("V2", "V3"),
+ contrasts_primary = list(
+ "TrtC vs TrtA" = c(-1, 0, 1),
+ "TrtB vs TrtA" = c(-1, 1, 0)
+ ),
+ contrasts_secondary = list(
+ "TrtC vs TrtB (2nd order)" = c(1, -1) # length = n primary contrasts (2)
+ ),
+ n_cores = 2
+)#> Contrast order breakdown:
+
+#>
+#> first_order second_order
+#> 40 20
+result_B2$lme_contrast %>%
+ filter(gene == "gene01") %>%
+ select(contrast, contrast_ref, contrast_lvl, visit, estimate, SE, df, t.ratio, p.value, contrast_order)geneLME() automatically appends a
+p.value_adj column to both lme_anova and
+lme_contrast using p.adjust(). Adjustment is
+performed across all genes within each grouping unit —
+not globally across all rows — so that each test family is treated
+independently:
| Output table | +Grouping unit for adjustment | +
|---|---|
lme_anova |
+Each model term (e.g. treatment,
+treatment:visit) |
+
lme_contrast |
+Each contrast label × contrast_order |
+
The default method is Benjamini-Hochberg ("BH"). Any
+method accepted by p.adjust() can be specified via the
+fdr_method argument:
#> [1] "holm" "hochberg" "hommel" "bonferroni" "BH"
+#> [6] "BY" "fdr" "none"
+# Example: Bonferroni correction instead of BH
+result_bonf <- geneLME(
+ dat = dat_sub,
+ formula_str = "~ treatment * visit + age + sex + rNANgUl + percent_duplication + median_cv_coverage + (1|ptID)",
+ model_weights = TRUE,
+ run_contrast = TRUE,
+ contrast_vars = "treatment:visit",
+ contrast_spec = my_spec,
+ fdr_method = "bonferroni",
+ n_cores = 2
+)
+
+# Compare BH vs Bonferroni for one contrast across genes
+comparison <- result_A$lme_contrast %>%
+ filter(contrast == first(contrast), contrast_order == "first_order") %>%
+ select(gene, p.value, p.value_adj_BH = p.value_adj) %>%
+ left_join(
+ result_bonf$lme_contrast %>%
+ filter(contrast == first(contrast), contrast_order == "first_order") %>%
+ select(gene, p.value_adj_bonf = p.value_adj),
+ by = "gene"
+ )
+
+print(comparison)#> # A tibble: 10 × 4
+#> gene p.value p.value_adj_BH p.value_adj_bonf
+#> <chr> <dbl> <dbl> <dbl>
+#> 1 gene01 0.967 0.998 1
+#> 2 gene02 0.972 0.998 1
+#> 3 gene03 0.998 0.998 1
+#> 4 gene04 0.831 0.998 1
+#> 5 gene05 0.101 0.719 1
+#> 6 gene06 0.555 0.998 1
+#> 7 gene07 0.194 0.719 1
+#> 8 gene08 0.836 0.998 1
+#> 9 gene09 0.216 0.719 1
+#> 10 gene10 0.715 0.998 1
+++Note on NA values: genes whose models failed receive +
+NAforp.value, which propagates to +p.value_adj. These genes are excluded from the adjustment +set, so the effective number of tests equals the number of successfully +fitted genes.
geneLME() validates all inputs before launching any
+parallel work. Informative errors are raised for the most common
+misspecifications:
run_validation_test <- function(label, expr) {
+ tryCatch(
+ { expr; cat(label, "— (no error raised)\n") },
+ error = function(e) cat(label, "ERROR:\n ", conditionMessage(e), "\n\n"),
+ warning = function(w) cat(label, "WARNING:\n ", conditionMessage(w), "\n\n")
+ )
+}
+
+# 1. Formula variable missing from targets
+run_validation_test("Test 1 — missing formula variable:",
+ geneLME(dat_sub, formula_str = "~ NONEXISTENT_VAR + (1|ptID)", n_cores = 2)
+)#> Test 1 — missing formula variable: ERROR:
+#> The following variables in the formula are missing from dat$targets: NONEXISTENT_VAR
+# 2. Weights requested but not present
+dat_no_w <- dat_sub; dat_no_w$weights <- NULL
+run_validation_test("Test 2 — model_weights=TRUE but no dat$weights:",
+ geneLME(dat_no_w, formula_str = "~ treatment + (1|ptID)", model_weights = TRUE, n_cores = 2)
+)#> Test 2 — model_weights=TRUE but no dat$weights: ERROR:
+#> model_weights = TRUE but dat$weights is NULL.
+# 3. Interaction contrast requested but interaction not in formula
+run_validation_test("Test 3 — interaction contrast but additive formula:",
+ geneLME(dat_sub,
+ formula_str = "~ treatment + visit + age + (1|ptID)",
+ run_contrast = TRUE,
+ contrast_vars = "treatment:visit",
+ contrast_spec = my_spec,
+ n_cores = 2)
+)#> Test 3 — interaction contrast but additive formula: ERROR:
+#> contrast_vars specifies an interaction contrast for 'treatment:visit', but this interaction term is not present in formula_str.
+#> Either add the interaction to the formula (e.g. '~ treatment * visit + ...') or change contrast_vars to a non-interaction term.
+# 4. Interaction contrast but contrast_spec not provided
+run_validation_test("Test 4 — interaction contrast, contrast_spec = NULL:",
+ geneLME(dat_sub,
+ formula_str = "~ treatment * visit + age + (1|ptID)",
+ run_contrast = TRUE,
+ contrast_vars = "treatment:visit",
+ n_cores = 2)
+)#> Test 4 — interaction contrast, contrast_spec = NULL: ERROR:
+#> contrast_vars specifies an interaction contrast ('treatment:visit') but contrast_spec is NULL.
+#> Use geneLME_contrast_spec(dat$targets, contrast_vars = 'treatment:visit') to generate a template, filter it to your contrasts of interest, then pass it as contrast_spec.
+# 5. contrast_spec has wrong column names
+run_validation_test("Test 5 — contrast_spec with wrong columns:",
+ geneLME(dat_sub,
+ formula_str = "~ treatment * visit + age + (1|ptID)",
+ run_contrast = TRUE,
+ contrast_vars = "treatment:visit",
+ contrast_spec = data.frame(a = "x", b = "y"),
+ n_cores = 2)
+)#> Test 5 — contrast_spec with wrong columns: ERROR:
+#> contrast_spec must have columns 'contrast_ref' and 'contrast_lvl'.
+# 6. contrast_var_2_levels contains an invalid level
+run_validation_test("Test 6 — invalid contrast_var_2_levels:",
+ geneLME(dat_sub,
+ formula_str = "~ treatment + visit + age + (1|ptID)",
+ run_contrast = TRUE,
+ contrast_vars = c("treatment", "visit"),
+ contrast_var_2_levels = c("V2", "NOTAVISIT"),
+ contrasts_primary = list("TrtC vs TrtA" = c(-1, 0, 1)),
+ n_cores = 2)
+)#> Test 6 — invalid contrast_var_2_levels: ERROR:
+#> Levels specified for 'visit' not found in data: NOTAVISIT
+When contrasts_secondary vectors have the wrong length,
+geneLME() soft-fails: it returns early
+without running any models, but still populates
+$contrast_spec with the indexed row ordering. This lets you
+inspect the indexing and fix your vectors without a hard error.
# 7. contrasts_secondary with wrong length → soft-fail with $contrast_spec populated
+bad_result <- geneLME(
+ dat = dat_sub,
+ formula_str = "~ treatment * visit + age + sex + rNANgUl + percent_duplication + median_cv_coverage + (1|ptID)",
+ model_weights = TRUE,
+ run_contrast = TRUE,
+ contrast_vars = "treatment:visit",
+ contrast_spec = my_spec,
+ contrasts_secondary = list(
+ "wrong length vector" = rep(0, 99) # should be length 6 = nrow(my_spec)
+ ),
+ n_cores = 2
+)
+
+cat("Elements of soft-fail result:\n")#> Elements of soft-fail result:
+
+#> lme_anova lme_contrast lme_fit lme_err contrast_spec
+#> TRUE TRUE TRUE TRUE FALSE
+
+#>
+#> $contrast_spec (use these indices to fix contrasts_secondary):
+
+#> contrast_index contrast_ref contrast_lvl
+#> 1 1 TrtA V2 TrtB V2
+#> 2 2 TrtA V2 TrtC V2
+#> 3 3 TrtB V2 TrtC V2
+#> 4 4 TrtA V3 TrtB V3
+#> 5 5 TrtA V3 TrtC V3
+#> 6 6 TrtB V3 TrtC V3
+Per-gene errors are captured by tryCatch inside
+geneLME_fit() and do not abort the run. Failed genes
+produce NA-filled rows in lme_anova and
+lme_contrast, and the error message is stored in
+lme_err.
#> Model status — Branch A:
+
+#>
+#> success
+#> 10
+# With small mock data (10 patients), singular fits are expected due to
+# patient-level covariates (sex, age) being collinear with the patient random effect.
+# With real data at typical sample sizes this is less common.
+
+# Identify any failed genes
+failed <- names(result_A$lme_err)[result_A$lme_err != "success"]
+if (length(failed) > 0) {
+ cat("\nFailed genes:\n")
+ print(result_A$lme_err[failed])
+} else {
+ cat("\nAll genes fitted successfully.\n")
+}#>
+#> All genes fitted successfully.
+| Argument | +Type | +Required | +Description | +
|---|---|---|---|
dat |
+list | +Yes | +EList-like: $E, $weights (optional),
+$targets |
+
formula_str |
+character | +Yes | +RHS formula string,
+e.g. "~ treatment * visit + (1\|ptID)" |
+
model_weights |
+logical | +No | +TRUE to use dat$weights as precision
+weights |
+
run_contrast |
+logical | +No | +TRUE to run emmeans contrasts |
+
contrast_vars |
+character | +If run_contrast=TRUE |
+"var_a:var_b" (Branch A) or
+c("var1", "var2") (Branch B) |
+
contrast_var_2_levels |
+character vector | +No | +Branch B: restrict ‘by’ variable to these levels | +
contrast_spec |
+data.frame | +Required for Branch A | +Columns contrast_ref and contrast_lvl;
+from geneLME_contrast_spec(). geneLME()
+attaches an indexed copy (with contrast_index) to
+$contrast_spec in its return value. |
+
contrasts_primary |
+named list | +Branch B | +Named contrast vectors; length = levels of
+contrast_vars[1] |
+
contrasts_secondary |
+named list | +No | +Second-order contrasts; length = nrow(contrast_spec)
+(Branch A) or length(contrasts_primary) (Branch B). Wrong
+lengths trigger a soft-fail with $contrast_spec populated
+for debugging. |
+
fdr_method |
+character | +No | +P-value adjustment method passed to p.adjust(); default
+"BH" (Benjamini-Hochberg) |
+
n_cores |
+integer | +No | +Number of parallel workers; defaults to
+detectCores() − 4 |
+
#> R version 4.5.1 (2025-06-13)
+#> Platform: aarch64-apple-darwin20
+#> Running under: macOS Sequoia 15.7.3
+#>
+#> Matrix products: default
+#> BLAS: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib
+#> LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
+#>
+#> locale:
+#> [1] en_US/en_US/en_US/C/en_US/en_US
+#>
+#> time zone: America/Los_Angeles
+#> tzcode source: internal
+#>
+#> attached base packages:
+#> [1] stats graphics grDevices utils datasets methods base
+#>
+#> other attached packages:
+#> [1] future.apply_1.20.1 future_1.69.0 purrr_1.2.1
+#> [4] tibble_3.3.1 dplyr_1.2.0 broom.mixed_0.2.9.6
+#> [7] car_3.1-3 carData_3.0-5 emmeans_2.0.1
+#> [10] lme4_1.1-38 Matrix_1.7-4
+#>
+#> loaded via a namespace (and not attached):
+#> [1] utf8_1.2.6 tidyr_1.3.2 sass_0.4.10 generics_0.1.4
+#> [5] lattice_0.22-7 listenv_0.10.0 digest_0.6.39 magrittr_2.0.4
+#> [9] evaluate_1.0.5 grid_4.5.1 estimability_1.5.1 mvtnorm_1.3-3
+#> [13] fastmap_1.2.0 jsonlite_2.0.0 backports_1.5.0 Formula_1.2-5
+#> [17] codetools_0.2-20 jquerylib_0.1.4 abind_1.4-8 reformulas_0.4.3.1
+#> [21] Rdpack_2.6.4 cli_3.6.5 rlang_1.1.7 rbibutils_2.4.1
+#> [25] parallelly_1.46.1 splines_4.5.1 withr_3.0.2 cachem_1.1.0
+#> [29] yaml_2.3.12 otel_0.2.0 parallel_4.5.1 tools_4.5.1
+#> [33] nloptr_2.2.1 coda_0.19-4.1 minqa_1.2.8 forcats_1.0.1
+#> [37] globals_0.18.0 boot_1.3-32 broom_1.0.11 vctrs_0.7.1
+#> [41] R6_2.6.1 lifecycle_1.0.5 MASS_7.3-65 furrr_0.3.1
+#> [45] pkgconfig_2.0.3 bslib_0.10.0 pillar_1.11.1 glue_1.8.0
+#> [49] Rcpp_1.1.1 xfun_0.56 tidyselect_1.2.1 knitr_1.51
+#> [53] xtable_1.8-4 htmltools_0.5.9 nlme_3.1-168 rmarkdown_2.30
+#> [57] compiler_4.5.1
+