Run LM22-based immune cell deconvolution (CIBERSORT) from Python via rpy2, with an automated QA pass that validates and labels samples (“Excellent / Moderate / Poor”).
This runner executes an R pipeline under the hood, handles minimal preprocessing (counts→CPM via edgeR when needed), writes standard plots/CSVs, and performs a numeric-safety QA step.
- Input auto-detection: Treats inputs whose median column sum is ~1e6 as TPM/RPKM; otherwise applies edgeR TMM normalization and converts to CPM.
- Metadata-aware: Optional metadata alignment by sample_id.
- Artifacts:
  - CIBERSORT_results.csv (fractions + metrics)
  - Stacked barplots per sample (paged)
  - Heatmap (samples × cell types)
  - P-value histogram; Corr vs RMSE scatter
  - LM22 overlap reports
  - CIBERSORT_Quality_Assessment.csv with categorical labels
  - RUN_SUMMARY.txt with session info
- Fail-fast for missing inputs and non-numeric columns.
- Python 3.9–3.12
- R ≥ 4.0 installed and on PATH
- Python packages: rpy2
- R packages (installed automatically if missing):
  - CRAN: devtools, readr, readxl, dplyr, tibble, stringr, tools, ggplot2, pheatmap, reshape2, tidyr, ggrepel, scales, data.table
  - Bioc: edgeR
  - GitHub (if needed): Moonerss/CIBERSORT (installed via devtools::install_github)
The runner also looks for a local CIBERSORT-main folder and, if it is present, installs the package from it with install_local.
python -m venv cibersort-env
source cibersort-env/bin/activate      # On Linux/macOS
cibersort-env\Scripts\activate.bat     # On Windows
pip install --upgrade pip
pip install rpy2
Ensure that a compatible version of R (≥ 4.0) is installed and accessible via your system PATH. Check by running:
R --version
If you use a system R outside Conda, make sure it is discoverable: R --version must work in the same shell that runs the Python script.
python Cibersort.py --counts <file> --lm22 <file> --out <dir> [options]
- --counts: TSV/CSV/XLS(X). First column = gene symbol (header names are auto-detected).
- --lm22: Path to LM22 signature matrix (txt/tsv).
- --out: Output directory.
- --meta: Metadata file with columns sample_id, condition.
- --perm: CIBERSORT permutations (default: 100).
- --qn: Enable quantile normalization (true/false; default false; keep false for RNA-seq/TPM).
- --chunk-size: Samples per page for stacked bar plot (default: 60).
- --install: Pre-install base R packages before running.
- --res-path: Explicit path to CIBERSORT_results.csv for the QA step (rarely needed).
- --log-level: DEBUG|INFO|WARNING|ERROR (default: INFO).
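The option list above could map onto argparse roughly as follows. This is a minimal sketch, not the parser that ships in Cibersort.py: the helper names build_parser and str2bool are hypothetical, marking the three path options as required is an assumption, and only the flag names and defaults come from the list.

```python
import argparse


def str2bool(value: str) -> bool:
    """Parse 'true'/'false' (case-insensitive), the format used by --qn."""
    lowered = value.strip().lower()
    if lowered == "true":
        return True
    if lowered == "false":
        return False
    raise argparse.ArgumentTypeError(f"expected true/false, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical reconstruction; flag names and defaults follow the list above.
    p = argparse.ArgumentParser(prog="Cibersort.py")
    p.add_argument("--counts", required=True)   # assumed required
    p.add_argument("--lm22", required=True)     # assumed required
    p.add_argument("--out", required=True)      # assumed required
    p.add_argument("--meta", default=None)
    p.add_argument("--perm", type=int, default=100)
    p.add_argument("--qn", type=str2bool, default=False)
    p.add_argument("--chunk-size", type=int, default=60)
    p.add_argument("--install", action="store_true")
    p.add_argument("--res-path", default=None)
    p.add_argument("--log-level", default="INFO",
                   choices=["DEBUG", "INFO", "WARNING", "ERROR"])
    return p


args = build_parser().parse_args(
    ["--counts", "c.csv", "--lm22", "LM22.txt", "--out", "out", "--qn", "false"]
)
print(args.perm, args.qn, args.chunk_size)
```

Parsing --qn through a small converter keeps the true/false spelling from the command line while giving the rest of the code a real boolean.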
- Exit code 0: success
- Exit code 1: pipeline exception
- Exit code 2: missing input files
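One way to produce these three exit codes is sketched below. The function name run_with_exit_code and the constants are hypothetical, and the pipeline argument stands in for whatever callable runs the R stage; the runner's real control flow may differ.

```python
import sys
from pathlib import Path
from typing import Callable, Iterable

EXIT_OK, EXIT_PIPELINE_ERROR, EXIT_MISSING_INPUT = 0, 1, 2


def run_with_exit_code(inputs: Iterable[str], pipeline: Callable[[], object]) -> int:
    """Return 2 if any input file is missing, 1 if the pipeline raises, else 0."""
    missing = [p for p in inputs if not Path(p).is_file()]
    if missing:
        print(f"Missing input files: {missing}", file=sys.stderr)
        return EXIT_MISSING_INPUT
    try:
        pipeline()
    except Exception as exc:  # any failure raised by the pipeline stage
        print(f"Pipeline failed: {exc}", file=sys.stderr)
        return EXIT_PIPELINE_ERROR
    return EXIT_OK
```

A caller would typically finish with sys.exit(run_with_exit_code(...)) so that shell scripts and CI jobs can branch on the code.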
python Cibersort.py \
--counts 'Bulk-data/counts.csvv' \
--meta 'Bulk-data/meta.csv' \
--lm22 'inst/extdata/LM22.txt' \
--out 'CIBERSORT_outputs-v' \
--perm 100 --qn false --chunk-size 40 --install
Paths with spaces/parentheses must be quoted (or escaped) on all shells.
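When the runner is launched from another Python script, passing the arguments as a list avoids shell quoting altogether, and shlex shows what a POSIX shell would need. The path below is made up for illustration, and the quoted form applies to POSIX shells only, not to PowerShell or CMD.

```python
import shlex

# Hypothetical path containing a space and parentheses.
counts_path = "Bulk data/counts (v2).csv"

# Argument list suitable for subprocess.run(argv): no shell is involved,
# so each element is passed through unchanged and needs no quoting.
argv = ["python", "Cibersort.py", "--counts", counts_path,
        "--lm22", "inst/extdata/LM22.txt", "--out", "out dir"]

# Equivalent single command line for a POSIX shell.
print(shlex.join(argv))
```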
python Cibersort.py `
--counts "Bulk-data/counts.csvv" `
--meta "Bulk-data/meta.csv" `
--lm22 "inst/extdata/LM22.txt" `
--out "CIBERSORT_outputs-v" `
--perm 100 --qn false --chunk-size 40 --install
python Cibersort.py ^
--counts "Bulk-data/counts.csvv" ^
--meta "Bulk-data/meta.csv" ^
--lm22 "inst/extdata/LM22.txt" ^
--out "CIBERSORT_outputs-v" ^
--perm 100 --qn false --chunk-size 40 --install
- Row = gene, Column = sample.
- First column = gene symbol. Accepted headers include: Gene, gene, GeneSymbol, Symbol, ID, etc.
- The runner:
  - Keeps genes with expression > 1 in ≥10% of samples (≥1 sample minimum).
  - Auto-detects scale:
    - Median column sum ~ 1e6 → treat as TPM/RPKM (QN=FALSE recommended).
    - Otherwise → edgeR TMM → CPM.
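The gene filter and scale check described above can be sketched in plain Python. The real logic runs in the R pipeline, so treat this as an illustration only: the function names are hypothetical, the 5% tolerance around 1e6 is an assumption (the text only says "~ 1e6"), and rounding the 10% sample threshold upward is also an assumption.

```python
import math
from statistics import median


def filter_genes(matrix, min_expr=1.0, min_frac=0.10):
    """Keep genes with expression > min_expr in at least min_frac of samples.

    matrix maps gene symbol -> list of per-sample values.
    """
    n_samples = len(next(iter(matrix.values())))
    # Assumption: round the sample threshold up, after trimming float noise.
    min_samples = max(1, math.ceil(round(min_frac * n_samples, 9)))
    return {
        gene: values
        for gene, values in matrix.items()
        if sum(v > min_expr for v in values) >= min_samples
    }


def looks_like_tpm(matrix, target=1e6, rel_tol=0.05):
    """True when the median per-sample column sum is within rel_tol of target."""
    column_sums = [sum(col) for col in zip(*matrix.values())]
    return math.isclose(median(column_sums), target, rel_tol=rel_tol)
```

Inputs for which looks_like_tpm returns False would take the edgeR TMM → CPM branch.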
- Columns:
  - sample_id (must match counts’ column names)
  - condition (free text; used only in QC table join, currently no group plots)
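A quick pre-flight check that sample_id values line up with the counts columns might look like the sketch below. The function name align_metadata is hypothetical, and failing hard on unmatched samples is an assumption; the runner itself may handle mismatches differently.

```python
from typing import Dict, List


def align_metadata(sample_columns: List[str],
                   meta_rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Reorder metadata rows to match the counts columns.

    Raises ValueError when a counts column has no matching sample_id.
    """
    by_id = {row["sample_id"]: row for row in meta_rows}
    missing = [s for s in sample_columns if s not in by_id]
    if missing:
        raise ValueError(f"samples without metadata: {missing}")
    return [by_id[s] for s in sample_columns]


meta = [
    {"sample_id": "S1", "condition": "tumor"},
    {"sample_id": "S0", "condition": "normal"},
]
print(align_metadata(["S0", "S1"], meta))
```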
- Text/TSV with first column as gene and 22 cell types as columns.
- We upper-case gene symbols internally for overlap accounting.
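The case-insensitive overlap accounting can be illustrated with a few lines of Python. The function name lm22_overlap and the coverage definition (shared genes divided by signature genes) are assumptions made for this sketch.

```python
def lm22_overlap(expr_genes, signature_genes):
    """Return the shared gene symbols (upper-cased) and the signature coverage."""
    expr = {g.strip().upper() for g in expr_genes}
    sig = {g.strip().upper() for g in signature_genes}
    shared = expr & sig
    coverage = len(shared) / len(sig) if sig else 0.0
    return sorted(shared), coverage


shared, coverage = lm22_overlap(
    ["cd8a", "Gzmb", "ACTB"],            # symbols from the expression matrix
    ["CD8A", "GZMB", "MS4A1", "CD19"],   # symbols from the signature matrix
)
print(shared, coverage)
```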
- CIBERSORT_results.csv: Deconvolution results (rows = samples). Common trailing columns: P-value, Correlation, RMSE, Absolute score (if provided by the CIBERSORT implementation)
- 01_stacked_bar_fractions*.png: Horizontal stacked bars of cell fractions (paged by --chunk-size)
- 02_heatmap_all_cell_types.png: Heatmap (samples × LM22 types)
- 03_pvalue_histogram.png: Per-sample P-value distribution (if present)
- 04_scatter_correlation_vs_RMSE.png: Fit scatter (if both fields present)
- LM22_overlap_gene_values_by_sample.csv: Per-sample expression for LM22-overlap genes
- LM22_overlap_gene_summary.csv: Mean expression + cell types per gene
- LM22_overlap_report.txt: Coverage summary
- CIBERSORT_Quality_Assessment.csv: QA summary per sample:
  - Fraction sum check (0.85–1.15), discretized P/Corr/RMSE, Quality_Category
- RUN_SUMMARY.txt: Inputs, counts, session info, medians
- Python logs to stdout (--log-level DEBUG for verbose).
- R messages are surfaced via rpy2.
- Failures during R package install or CIBERSORT run are bubbled up and return exit code 1.
RUN pip install --no-cache-dir rpy2
WORKDIR /app
COPY Cibersort.py .
CMD ["python", "Cibersort.py", "--help"]
Build & run:
docker build -t cibersort-runner .
docker run --rm -v "$PWD":/work -w /work cibersort-runner \
python /app/Cibersort.py --help
- Validate path quoting on all OSes.
- Confirm R package installs on clean hosts (run with --install the first time).
- Keep the LM22 reference versioned in inst/extdata/.
- CI smoke-test: run with a tiny synthetic dataset (see below) to ensure plots/CSVs render.
# create_fake.py
import numpy as np, pandas as pd
rng = np.random.default_rng(1)
genes = [f"GENE{i}" for i in range(200)]
samples = [f"S{i}" for i in range(12)]
df = pd.DataFrame(rng.poisson(5, size=(len(genes), len(samples))), index=genes, columns=samples)
df.insert(0, "GeneSymbol", genes)
df.to_csv("fake_counts.csv", index=False)
print("fake_counts.csv written")
Then:
python create_fake.py
python Cibersort.py --counts fake_counts.csv --lm22 inst/extdata/LM22.txt --out out_fake --perm 10 --qn false --install
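Note that the symbols GENE0 to GENE199 are not real gene symbols, so they will not match any LM22 gene and the overlap reports for this dataset would be empty. A variant that borrows symbols from the signature file is sketched below. It uses only the standard library, assumes the signature is tab-delimited with a header row and gene symbols in the first column, and draws uniform random integers instead of Poisson counts; the function name write_fake_counts is hypothetical.

```python
import csv
import random


def write_fake_counts(signature_path, out_path, n_genes=200, n_samples=12, seed=1):
    """Write a counts CSV whose gene symbols are drawn from a signature matrix."""
    with open(signature_path, newline="") as fh:
        rows = list(csv.reader(fh, delimiter="\t"))
    symbols = [r[0] for r in rows[1:] if r]  # skip the header row and blank lines
    rng = random.Random(seed)
    genes = rng.sample(symbols, min(n_genes, len(symbols)))
    samples = [f"S{i}" for i in range(n_samples)]
    with open(out_path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["GeneSymbol", *samples])
        for gene in genes:
            # Uniform integers in [0, 20] stand in for real read counts.
            writer.writerow([gene, *(rng.randint(0, 20) for _ in samples)])
    return genes
```

Calling write_fake_counts("inst/extdata/LM22.txt", "fake_counts.csv") would then produce a file that can be passed to --counts in the command above.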
- Why is QN default false?
  For RNA-seq/TPM, quantile normalization is generally not recommended; we leave it opt-in.
- Where do QA thresholds come from?
  They’re pragmatic defaults: P≤0.05, Corr≥0.5, RMSE≤0.7 and fraction sum in [0.85, 1.15]. Tune in the R QA block if needed.
- Can I plug other signatures?
  Yes. Use any CIBERSORT-compatible signature matrix via --lm22 <path>; plots/QA still work.
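To make the QA thresholds concrete, the sketch below applies the four checks to one sample. The thresholds are the defaults quoted above; how the checks combine into Excellent / Moderate / Poor is not documented here, so the counting rule in this example is purely an assumption, as is the function name quality_category. The authoritative logic is the R QA block.

```python
def quality_category(p_value, correlation, rmse, fraction_sum):
    """Label one sample from its CIBERSORT fit metrics."""
    checks = [
        p_value <= 0.05,               # significant deconvolution
        correlation >= 0.5,            # acceptable fit correlation
        rmse <= 0.7,                   # acceptable fit error
        0.85 <= fraction_sum <= 1.15,  # fractions sum to roughly 1
    ]
    passed = sum(checks)
    # Assumed mapping: all four checks pass -> Excellent,
    # two or three pass -> Moderate, otherwise Poor.
    if passed == 4:
        return "Excellent"
    if passed >= 2:
        return "Moderate"
    return "Poor"


print(quality_category(0.01, 0.9, 0.3, 1.0))
```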
python Cibersort.py --counts 'Bulk-data/counts.csvv' \
--meta 'Bulk-data/meta.csv' \
--lm22 'inst/extdata/LM22.txt' \
--out 'CIBERSORT_outputs-v' --perm 100 --qn false --chunk-size 40 --install