scAmbi — Mapping-ambiguity overdispersion correction for scRNA-seq

scAmbi provides tools to estimate and correct overdispersion arising from read-to-transcript mapping ambiguity (e.g., as captured by Salmon Alevin cell-level bootstraps), then evaluate the effect on variability using within- and between-sample analyses of the biological coefficient of variation (BCV). It includes helpers to build corrected Seurat objects and plots for quick diagnostics.


Why scAmbi?

Multi-mapping fragments and assignment ambiguity inflate inferential variance. Alevin can emit per-cell bootstrap replicates that capture this. scAmbi:

  • computes per-gene overdispersion (OD) from bootstraps (sparse-aware, block-wise),
  • integrates bootstrap-, moments-, and prior-based estimates for exploration,
  • constructs a corrected assay (counts scaled by 1/OD),
  • quantifies improvement via edgeR BCV (within/between sample),
  • offers plotting + summaries for rapid QC.
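
To make the bootstrap-based correction concrete, here is a minimal base-R sketch of the idea on simulated data. It is not scAmbi's implementation: the simulated `boot` array, the pooled variance-to-mean ratio used as the OD estimate, and the floor at 1 are all illustrative assumptions. Only the final step (counts scaled by 1/OD) comes from the description above.

```r
# Toy illustration only; NOT scAmbi's implementation.
set.seed(1)
n_genes <- 3; n_cells <- 5; n_boot <- 20

# Point-estimate counts (genes x cells).
counts <- matrix(c(10, 20, 30), nrow = n_genes, ncol = n_cells,
                 dimnames = list(paste0("g", 1:3), paste0("c", 1:5)))

# Hypothetical bootstrap replicates (genes x cells x replicates).
# g1 has no mapping ambiguity (zero spread); g3 is the most ambiguous.
noise_sd <- c(0, 5, 15)
boot <- array(NA_real_, dim = c(n_genes, n_cells, n_boot))
for (b in seq_len(n_boot)) {
  noise <- rnorm(n_genes * n_cells, sd = rep(noise_sd, n_cells))
  boot[, , b] <- pmax(counts + noise, 0)
}

# Per gene and cell: mean and variance across bootstrap replicates.
boot_mean <- apply(boot, c(1, 2), mean)
boot_var  <- apply(boot, c(1, 2), var)

# Assumed estimator: pooled variance-to-mean ratio per gene, floored at 1
# so that unambiguous genes are left unchanged.
od <- pmax(rowSums(boot_var) / rowSums(boot_mean), 1)
names(od) <- rownames(counts)

# Corrected counts: each gene's row is scaled by 1/OD.
counts_corr <- counts / od
```

In this sketch a gene whose bootstrap replicates never vary keeps its original counts, while genes with more bootstrap spread are down-weighted more strongly.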

Key features

  • Bootstrap OD (sparse-aware): compute_overdisp_sparse_aware()
  • Integrated OD estimator: estimate_overdispersion_integrated()
  • Seurat integration: process_and_create_seurat_corrected_improved()
  • Within-sample BCV: calculate_within_sample_bcv(), analyze_within_sample_bcv()
  • Between-sample BCV: extract_and_pseudobulk(), calculate_bcv_direct()
  • Visualization: plot_within_sample_bcv(), plot_within_sample_summary(), plot_bcv(), plot_bcv_comparison()
  • Utilities: read_eds_gc(), read_sample_data_improved(), set_feature_metadata(), extract_feature_vector()

Dependencies

Required R packages

This package requires R version 4.2 or higher.

Core dependencies

  • Seurat (>= 4.0.0): Single-cell data structures and workflows
  • Matrix (>= 1.3-0): Sparse matrix operations
  • edgeR (>= 3.34.0): Dispersion estimation and BCV calculations
  • Rcpp (>= 1.0.7): C++ integration for performance-critical operations

Data I/O and processing

  • eds (>= 1.0.0): Reading Alevin EDS sparse matrix format
  • tximport (>= 1.20.0): Transcript-level quantification import
  • rtracklayer (>= 1.52.0): GTF file parsing for gene complexity calculations
  • jsonlite (>= 1.7.0): JSON data handling

Visualization

  • ggplot2 (>= 3.3.0): Core plotting framework
  • patchwork (>= 1.1.0): Combining multiple plots
  • dplyr (>= 1.0.0): Data manipulation for summaries
  • tidyr (>= 1.1.0): Data reshaping for visualization

Parallel processing

  • parallel: Built-in R package for multi-core processing
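
The version requirements above can be checked before installing scAmbi. The following is a small sketch using only base R; the variable names (`min_version`, `ok`, `to_install`) are illustrative, and the minimum versions are copied from the lists above.

```r
# Minimum versions as listed in this README.
min_version <- c(
  Seurat = "4.0.0", Matrix = "1.3-0", edgeR = "3.34.0", Rcpp = "1.0.7",
  eds = "1.0.0", tximport = "1.20.0", rtracklayer = "1.52.0",
  jsonlite = "1.7.0", ggplot2 = "3.3.0", patchwork = "1.1.0",
  dplyr = "1.0.0", tidyr = "1.1.0"
)

# TRUE if the package is installed and meets the minimum version.
# system.file() is used so that packages are not loaded by the check.
ok <- vapply(names(min_version), function(pkg) {
  nzchar(system.file(package = pkg)) &&
    utils::packageVersion(pkg) >= package_version(min_version[[pkg]])
}, logical(1))

# Packages that still need to be installed or upgraded.
to_install <- names(ok)[!ok]
print(to_install)
```

Note that edgeR, eds, tximport, and rtracklayer are distributed through Bioconductor rather than CRAN, so they are typically installed with BiocManager::install() instead of install.packages().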

Development

Building the vignettes and running the tests additionally requires knitr, rmarkdown, and testthat. Install them with:

install.packages(c("knitr", "rmarkdown", "testthat"))

System requirements

  • C++ compiler: Required for compiling Rcpp functions (e.g., g++ >= 7.0 or clang >= 4.0)
  • OpenMP (optional): For additional parallelization in C++ code
  • Memory: At least 64 GB RAM recommended for typical datasets (4-8 samples, ~1M cells each)
  • Storage: Sufficient space for Alevin bootstrap files (can be several GB per sample)

Installation

From GitHub (recommended)

# install.packages("remotes")
remotes::install_github("sbresnahan/scAmbi")

From a local source tarball/zip

# From a source tarball (.tar.gz):
install.packages("path/to/scAmbi.tar.gz", repos = NULL, type = "source")
# From a zip of the repository: unzip it first, then install the extracted directory:
# devtools::install("path/to/scAmbi/")

Getting started

library(scAmbi)

# 1) Estimate integrated overdispersion from one Alevin sample
#    (requires bootstraps in <sample>/alevin/quants_boot_mat.gz)
alevin_dir <- "path/to/<sample>/alevin"
# counts <- ... (genes x cells dgCMatrix)
feats <- rownames(counts)
cells <- colnames(counts)
od <- estimate_overdispersion_integrated(
  counts     = counts,
  alevin_dir = alevin_dir,
  n_boot     = 20,
  n_cores    = 4
)

# 2) Build a Seurat object with corrected assay (RNA_corr = counts / OD)
seu <- process_and_create_seurat_corrected_improved(
  sample_id = "S1", counts = counts, od = od, feats = feats, cells = cells
)

# 3) Within-sample BCV comparison (raw vs corrected)
wres <- analyze_within_sample_bcv(list(S1 = seu), assay_names = c("RNA", "RNA_corr"), n_groups = 10)
p <- plot_within_sample_bcv(wres, sample_name = "S1")
print(p)
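
For readers unfamiliar with BCV: it is the square root of the negative-binomial dispersion, so a BCV of 0.4 corresponds to a dispersion of 0.16. The sketch below illustrates that relationship on simulated pseudo-replicate groups using a simple method-of-moments estimate in base R. This is an assumption made for illustration: edgeR estimates dispersions with likelihood-based empirical Bayes methods, and scAmbi's BCV functions are not reproduced here.

```r
# Toy illustration of BCV = sqrt(dispersion); not edgeR's estimator.
set.seed(42)
n_genes  <- 200
n_groups <- 10          # pseudo-replicate groups, as in n_groups above
true_bcv <- 0.4
phi      <- true_bcv^2  # negative-binomial dispersion

# Simulate counts with variance mu + phi * mu^2 (genes x groups).
mu <- rep(100, n_genes)
y  <- matrix(rnbinom(n_genes * n_groups, mu = mu, size = 1 / phi),
             nrow = n_genes)

# Method of moments: phi_hat = (var - mean) / mean^2, floored at 0.
m <- rowMeans(y)
v <- apply(y, 1, var)
phi_hat <- pmax((v - m) / m^2, 0)

# A single "common" BCV across genes.
bcv_common <- sqrt(mean(phi_hat))
print(round(bcv_common, 2))
```

A correction that removes mapping-ambiguity noise would be expected to lower this kind of estimate for affected genes, which is what the raw versus corrected comparison is designed to show.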

See the vignette for a full, reproducible walkthrough.


Alevin settings (inputs expected)

To use bootstrap-based OD, run Alevin with cell-level bootstraps enabled so that alevin/quants_boot_mat.gz is present:

salmon alevin \
  -l ISR \
  -1 example_1.fastq.gz \
  -2 example_2.fastq.gz \
  --chromiumV3 \
  -i index/transcripts \
  -p 10 \
  --whitelist index/3M-february-2018.txt \
  --numCellBootstraps 20 \
  --dumpFeatures \
  -o quants/example \
  --tgMap index/tx2g.tsv  # transcript-to-gene map (use a tx-to-tx identity table for transcript-level output)

Notes:

  • scAmbi reads the boot matrix and associated index files via eds::readEDS().
  • For transcript-centric work, provide a suitable index/mapping to Alevin.
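
Before running the bootstrap-based estimators, it can help to confirm that each sample directory actually contains the bootstrap matrix. The helper below is a hypothetical convenience function, not part of scAmbi; the file name is the one given above, and the sample path assumes the `-o quants/example` output directory from the command.

```r
# Hypothetical helper (not part of scAmbi): TRUE if the Alevin output
# directory contains the cell-level bootstrap matrix.
has_alevin_bootstraps <- function(alevin_dir) {
  file.exists(file.path(alevin_dir, "quants_boot_mat.gz"))
}

# Example: check one or more samples by name.
sample_dirs <- c(S1 = "quants/example/alevin")
boot_ok <- vapply(sample_dirs, has_alevin_bootstraps, logical(1))
print(boot_ok)
```

If a sample reports FALSE, Alevin was most likely run without --numCellBootstraps for that sample.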

Vignette

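# Vignettes are only available if they were built at install time, e.g.:
# remotes::install_github("sbresnahan/scAmbi", build_vignettes = TRUE)
browseVignettes("scAmbi")
# Or, from a local source checkout:
# devtools::install("path/to/scAmbi/", build_vignettes = TRUE)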

The vignette demonstrates OD estimation, Seurat correction, and BCV diagnostics end-to-end.


License

GPL-3.

Maintainer: Sean T. Bresnahan <stbresnahan@mdanderson.org>


Citation

If you use scAmbi, please cite this repository and the tools it builds upon (e.g., Salmon/Alevin, edgeR). A formal citation will be added once a preprint is available.