ClusterCatcher

Single-cell RNA sequencing analysis pipeline for mutational signature detection, cell annotation, and cancer cell identification.

Overview

ClusterCatcher is a comprehensive Snakemake-based pipeline designed for end-to-end analysis of single-cell RNA sequencing data with a focus on:

Mutational signature detection at single-cell resolution
Cancer/dysregulated cell identification using dual-model consensus
Viral pathogen detection in unmapped reads
Automated cell type annotation using popV with cluster-based refinement

The pipeline integrates seven major analysis modules that can be flexibly enabled or disabled based on your research needs.

Key Features

Modular design: Enable only the modules you need
Reproducible: Snakemake workflow with conda environment management
Scalable: Supports local execution and HPC cluster submission
Comprehensive: From raw FASTQs to annotated single-cell mutations
Well-documented: Extensive logging and summary reports
Publication-ready plots: Non-overlapping UMAP labels, stacked bar plots

Pipeline Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                           ClusterCatcher Pipeline                           │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌──────────────────┐                                                       │
│  │   FASTQ Files    │                                                       │
│  └────────┬─────────┘                                                       │
│           │                                                                 │
│           ▼                                                                 │
│  ┌──────────────────┐     ┌──────────────────┐                              │
│  │  1. Cell Ranger  │───▶│   BAM + Matrix   │                              │
│  │   (Alignment)    │     │                  │                              │
│  └──────────────────┘     └────────┬─────────┘                              │
│                                    │                                        │
│           ┌────────────────────────┼────────────────────────┐               │
│           │                        │                        │               │
│           ▼                        ▼                        ▼               │
│  ┌──────────────────┐     ┌──────────────────┐     ┌──────────────────┐     │
│  │ 2. QC & Filter   │     │ 5. Viral Detect  │     │ 6. SComatic      │     │
│  │   (Scanpy)       │     │   (Kraken2)      │     │ (Mutations)      │     │
│  └────────┬─────────┘     └────────┬─────────┘     └────────┬─────────┘     │
│           │                        │                        │               │
│           ▼                        ▼                        │               │
│  ┌──────────────────┐     ┌──────────────────┐              │               │
│  │ 3. Cell Annot.   │     │ Viral Integration│              │               │
│  │   (popV)         │     │                  │              │               │
│  └────────┬─────────┘     └──────────────────┘              │               │
│           │                                                 │               │
│           ▼                                                 │               │
│  ┌──────────────────┐                                       │               │
│  │ 4. Dysregulation │                                       │               │
│  │ (CytoTRACE2+CNV) │                                       │               │
│  └────────┬─────────┘                                       │               │
│           │                                                 │               │
│           └──────────────────────────┬──────────────────────┘               │
│                                      │                                      │
│                                      ▼                                      │
│                             ┌──────────────────┐                            │
│                             │ 7. Signatures    │                            │
│                             │   (NNLS/COSMIC)  │                            │
│                             └────────┬─────────┘                            │
│                                      │                                      │
│                                      ▼                                      │
│                             ┌──────────────────┐                            │
│                             │  Final AnnData   │                            │
│                             │  + Summaries     │                            │
│                             └──────────────────┘                            │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Pipeline Modules Summary

#	Module	Script	Required	Description
1	Cell Ranger	`cellranger_count.py`	Yes	FASTQ alignment, counting, BAM generation
2	QC & Filtering	`scanpy_qc_annotation.py`	Yes	Quality control, doublet removal, filtering
3	Cell Annotation	`scanpy_qc_annotation.py`	Yes	popV annotation with cluster-based refinement
4	Dysregulation	`cancer_cell_detection.py`	Default	Cancer cell detection (CytoTRACE2 + inferCNV)
5	Viral Detection	`kraken2_viral_detection.py`, `summarize_viral_detection.py`, `viral_integration.py`	Optional	Pathogen detection in unmapped reads
6	Mutation Calling	`scomatic_mutation_calling.py`	Optional	Somatic mutation calling (SComatic)
7	Signatures	`signature_analysis.py`	Optional	Mutational signature deconvolution

Requirements

System Requirements

Linux (tested on Ubuntu 20.04+, CentOS 7+)
Python >= 3.9
Memory: Minimum 64GB RAM recommended (128GB for large datasets)
Storage: ~100GB per sample (including intermediates)

External Software

These must be installed separately and available in PATH:

Software	Version	Purpose	Installation
Cell Ranger	>=7.0	Alignment & counting	10x Genomics
Snakemake	>=7.0	Pipeline orchestration	`conda install -c bioconda snakemake`
samtools	>=1.15	BAM processing	Installed via conda
Kraken2	>=2.1	Viral detection (optional)	`conda install -c bioconda kraken2`

Optional External Dependencies

Software	Purpose	Required for
SComatic	Mutation calling	`--enable-scomatic`
CytoTRACE2	Stemness scoring	`--enable-dysregulation`

Installation

1. Clone the Repository

git clone https://github.com/JakeLehle/ClusterCatcher.git
cd ClusterCatcher

2. Create Conda Environment

# Using mamba (recommended, faster)
mamba env create -f environment.yml

# Or using conda
conda env create -f environment.yml

# Activate the environment
conda activate ClusterCatcher

3. Install ClusterCatcher Package

pip install -e .

4. Install Cell Ranger

Download from 10x Genomics:

# Extract Cell Ranger
tar -xzf cellranger-7.2.0.tar.gz

# Add to PATH
export PATH=/path/to/cellranger-7.2.0:$PATH

# Verify installation
cellranger --version

5. Download Reference Data

# Download Cell Ranger reference (GRCh38)
wget https://cf.10xgenomics.com/supp/cell-exp/refdata-gex-GRCh38-2020-A.tar.gz
tar -xzf refdata-gex-GRCh38-2020-A.tar.gz

Reference directory structure (standard 10x Genomics layout):

refdata-gex-GRCh38-2020-A/
├── fasta/
│   └── genome.fa          # Reference FASTA (auto-derived)
├── genes/
│   └── genes.gtf          # GTF annotation (auto-derived)
├── star/
│   └── ...                # STAR index
└── reference.json

Note: ClusterCatcher only requires you to specify the Cell Ranger reference directory (--cellranger-reference). The FASTA and GTF paths are automatically derived from this directory using the standard 10x layout. You can override them with --reference-fasta or --gtf-file if using a non-standard layout.

6. (Optional) Install CytoTRACE2

Required for dysregulation detection:

git clone https://github.com/digitalcytometry/cytotrace2.git
cd cytotrace2/cytotrace2_python
pip install .

7. (Optional) Clone SComatic

Required for mutation calling:

git clone https://github.com/cortes-ciriano-lab/SComatic.git

Quick Start

Step 1: Prepare Sample Information

Create a CSV file with your sample information:

sample_id,fastq_r1,fastq_r2,condition,donor
SAMPLE1,/path/to/SAMPLE1_S1_L001_R1_001.fastq.gz,/path/to/SAMPLE1_S1_L001_R2_001.fastq.gz,tumor,P001
SAMPLE2,/path/to/SAMPLE2_S1_L001_R1_001.fastq.gz,/path/to/SAMPLE2_S1_L001_R2_001.fastq.gz,normal,P001

Convert to pickle format:

ClusterCatcher sample-information \
    --input samples.csv \
    --output samples.pkl

Step 2: Generate Configuration

Simplified usage - only --cellranger-reference is required for references:

python snakemake_wrapper/create_config.py \
    --output-dir ./results \
    --sample-pickle samples.pkl \
    --cellranger-reference /path/to/refdata-gex-GRCh38-2020-A \
    --threads 16 \
    --memory-gb 64

The FASTA and GTF files are automatically derived from the Cell Ranger reference directory.

Override if needed (for non-standard reference layouts):

python snakemake_wrapper/create_config.py \
    --output-dir ./results \
    --sample-pickle samples.pkl \
    --cellranger-reference /path/to/refdata-gex-GRCh38-2020-A \
    --reference-fasta /custom/path/to/genome.fa \
    --gtf-file /custom/path/to/genes.gtf \
    --threads 16 \
    --memory-gb 64

Step 3: Run Pipeline

# Dry run first (recommended)
cd snakemake_wrapper
snakemake --configfile ../results/config.yaml --dryrun

# Full run
snakemake --configfile ../results/config.yaml \
    --cores 32 \
    --use-conda

For SRAscraper Users

If you used SRAscraper to download data:

python snakemake_wrapper/create_config.py \
    --output-dir ./results \
    --sample-pickle /path/to/SRAscraper_output/metadata/sample_dict.pkl \
    --cellranger-reference /path/to/refdata-gex-GRCh38-2020-A

Detailed Usage

Sample CSV Format

Required Columns

Column	Description	Example
`sample_id`	Unique sample identifier	`SAMPLE1`
`fastq_r1`	Path to R1 FASTQ file(s)	`/path/to/R1.fastq.gz`
`fastq_r2`	Path to R2 FASTQ file(s)	`/path/to/R2.fastq.gz`

Optional Columns

Column	Description
`sample_name`	Human-readable name
`condition`	Experimental condition (tumor, normal, etc.)
`donor`	Donor/patient ID
`tissue`	Tissue type
`chemistry`	10X chemistry version (SC3Pv2, SC3Pv3, etc.)

Multi-Lane Samples

For samples with multiple sequencing lanes, use comma-separated paths:

sample_id,fastq_r1,fastq_r2
SAMPLE1,/path/L001_R1.fq.gz,/path/L002_R1.fq.gz,/path/L001_R2.fq.gz,/path/L002_R2.fq.gz

Enable All Modules

python snakemake_wrapper/create_config.py \
    --output-dir ./results \
    --sample-pickle samples.pkl \
    --cellranger-reference /path/to/refdata-gex-GRCh38-2020-A \
    \
    # Viral detection
    --enable-viral \
    --kraken-db /path/to/kraken2_db \
    --viral-db /path/to/human_viral_db/inspect.txt \
    \
    # Dysregulation (enabled by default)
    --enable-dysregulation \
    --infercnv-reference-groups "T cells" "B cells" "Fibroblasts" \
    \
    # SComatic mutation calling
    --enable-scomatic \
    --scomatic-scripts-dir /path/to/SComatic/scripts \
    --scomatic-editing-sites /path/to/RNA_editing_sites.txt \
    --scomatic-pon-file /path/to/PoN.tsv \
    --scomatic-bed-file /path/to/mappable_regions.bed \
    \
    # Signature analysis
    --enable-signatures \
    --cosmic-file /path/to/COSMIC_v3.4_SBS_GRCh38.txt \
    --core-signatures SBS2 SBS13 SBS5 \
    --use-scree-plot \
    --hnscc-only

Cluster Execution

SLURM Example

snakemake --configfile results/config.yaml \
    --cores 100 \
    --jobs 10 \
    --use-conda \
    --latency-wait 60 \
    --cluster "sbatch --partition=normal --nodes=1 --ntasks-per-node={threads} --mem={resources.mem_mb}M --time=08:00:00"

Using Snakemake Profiles

snakemake --configfile results/config.yaml \
    --profile /path/to/slurm_profile \
    --use-conda

Pipeline Modules

Module 1: Cell Ranger Alignment

Script: scripts/cellranger_count.py

Aligns FASTQ files to the reference transcriptome and generates gene expression matrices.

Inputs

FASTQ files (R1 and R2)
Cell Ranger reference transcriptome

Outputs

cellranger/{sample}/outs/filtered_feature_bc_matrix.h5 - Gene expression matrix
cellranger/{sample}/outs/possorted_genome_bam.bam - Aligned reads
cellranger/{sample}/outs/web_summary.html - QC report

Key Parameters

Parameter	Default	Description
`chemistry`	`auto`	10X chemistry version
`expect_cells`	`10000`	Expected number of cells
`include_introns`	`true`	Include intronic reads
`localcores`	`8`	CPU cores for Cell Ranger
`localmem`	`64`	Memory (GB) for Cell Ranger

Module 2: QC and Filtering

Script: scripts/scanpy_qc_annotation.py

Performs quality control, doublet detection, and cell filtering using Scanpy.

QC Metrics Calculated

Number of genes per cell
Total UMI counts per cell
Mitochondrial gene percentage
Ribosomal gene percentage
Doublet scores (Scrublet)

Filtering Steps

Remove cells with < min_genes genes detected
Remove cells with > max_genes genes detected
Remove cells with < min_counts total counts
Remove cells with > max_counts total counts
Remove cells with > max_mito_pct mitochondrial reads
Remove predicted doublets

Key Parameters

Parameter	Default	Description
`min_genes`	`200`	Minimum genes per cell
`max_genes`	`5000`	Maximum genes per cell
`min_counts`	`500`	Minimum UMI counts
`max_counts`	`50000`	Maximum UMI counts
`max_mito_pct`	`20`	Maximum mitochondrial %
`min_cells`	`3`	Minimum cells expressing a gene
`doublet_removal`	`true`	Enable doublet detection
`doublet_rate`	`0.08`	Expected doublet rate

Outputs

qc/qc_metrics.tsv - Per-sample QC statistics
qc/multiqc_report.html - MultiQC summary
figures/qc/pre_filter_*.png - Pre-filter QC plots
figures/qc/post_filter_*.png - Post-filter QC plots

Module 3: Cell Type Annotation (popV)

Script: scripts/scanpy_qc_annotation.py (combined with QC)

Annotates cells with cell type labels using popV with cluster-based refinement.

Workflow

The annotation follows a 7-step pipeline:

┌──────────────────┐
│ 1. Load Cell     │
│    Ranger Data   │
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│ 2. QC Metrics    │
│    Calculation   │
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│ 3. Cell/Gene     │
│    Filtering     │
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│ 4. Doublet       │
│    Detection     │
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│ 5. popV Annot.   │◄── RAW COUNTS (no normalization)
│  (HubModel)      │    Pulls from Hugging Face
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│ 6. Post-Annot.   │◄── Normalize (CPM), UMAP, Leiden
│    Processing    │    Optional BBKNN batch correction
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│ 7. Cluster-Based │◄── Weighted scoring per cluster
│    Refinement    │    Assigns final_annotation
└──────────────────┘

popV Annotation Details

ClusterCatcher uses popV (Population-level Voting) with HubModel from Hugging Face for automated cell type annotation.

Key features:

Runs on raw counts (before normalization)
Downloads pre-trained models from Hugging Face
Uses batch correction during annotation
Transfers predictions back to original object

Available models (set via popv_huggingface_repo):

Model	Description
`popV/tabula_sapiens_All_Cells`	All cell types (default, recommended)
`popV/tabula_sapiens_immune`	Immune cells only
`popV/human_lung`	Lung-specific

See Hugging Face popV for more models.

Cluster-Based Annotation Refinement

After popV annotation, ClusterCatcher refines predictions using cluster-level weighted scoring:

Min-Max normalization of popv_prediction_score within each cluster
Linear weighting to sum to 1 within each cluster
Weighted aggregation of scores per cell type per cluster
Assignment of dominant cell type to all cells in cluster

This approach:

Reduces noise from low-confidence predictions
Leverages cluster structure for more robust assignments
Produces cleaner cell type boundaries

Key Parameters

Parameter	Default	Description
`popv_huggingface_repo`	`popV/tabula_sapiens_All_Cells`	Hugging Face model repository
`popv_prediction_mode`	`inference`	`inference` (full) or `fast`
`popv_gene_symbol_key`	`feature_name`	Gene symbol column in adata.var
`popv_cache_dir`	`tmp/popv_models`	Cache for downloaded models
`batch_key`	`sample_id`	Batch correction key

Post-Annotation Processing Parameters

Parameter	Default	Description
`target_sum`	`1000000`	Normalization target (1e6 = CPM)
`n_pcs`	`null`	PCs for neighbors (null = CPU count)
`n_neighbors`	`15`	Neighbors for graph construction
`leiden_resolution`	`1.0`	Clustering resolution
`run_bbknn`	`false`	Enable BBKNN batch correction
`bbknn_batch_key`	`sample_id`	Batch key for BBKNN

Outputs

annotation/adata_annotated.h5ad - Annotated AnnData with:
- popv_prediction: Raw popV cell type calls
- popv_prediction_score: Confidence scores
- clusters: Leiden cluster assignments
- final_annotation: Cluster-refined cell types
annotation/annotation_summary.tsv - Cell type counts per sample
figures/qc/UMAP_popv_prediction.pdf - UMAP colored by popV calls
figures/qc/UMAP_popv_prediction_score.pdf - UMAP colored by confidence
figures/qc/UMAP_final_annotation.pdf - UMAP with non-overlapping labels
figures/qc/UMAP_samples.pdf - UMAP colored by sample
figures/qc/UMAP_clusters.pdf - UMAP colored by cluster
figures/qc/Stacked_Bar_Cluster_Composition.pdf - Cell type composition per cluster
figures/qc/cell_type_proportions.pdf - Cell type proportions per sample

Module 4: Dysregulation Detection (Cancer Cell Identification)

Script: scripts/cancer_cell_detection.py

Identifies cancer/dysregulated cells using a dual-model consensus approach.

How It Works

┌─────────────────┐     ┌─────────────────┐
│   CytoTRACE2    │     │    InferCNV     │
│  (Stemness)     │     │  (CNV Scores)   │
└────────┬────────┘     └────────┬────────┘
         │                       │
         ▼                       ▼
    ┌─────────┐            ┌─────────┐
    │ Potency │            │   CNV   │
    │  Score  │            │  Score  │
    └────┬────┘            └────┬────┘
         │                      │
         └──────────┬───────────┘
                    ▼
           ┌──────────────┐
           │  Agreement   │
           │    Score     │
           └──────┬───────┘
                  ▼
           ┌──────────────┐
           │ Cancer/Normal│
           │Classification│
           └──────────────┘

CytoTRACE2 Component

Scores cell developmental potential (0-1)
Higher scores = more stem-like/cancer-like
Automatic bimodal threshold detection

InferCNV Component

Uses normal cell types as reference
Detects copy number variations
Scores chromosomal instability

Agreement Scoring

Combines both models using weighted correlation
alpha parameter controls rank vs. value weighting
Cells classified as cancer only if both models agree

Key Parameters

Parameter	Default	Description
`cytotrace2.enabled`	`true`	Enable CytoTRACE2
`cytotrace2.species`	`human`	Species for model
`infercnv.enabled`	`true`	Enable inferCNV
`infercnv.window_size`	`250`	CNV sliding window
`infercnv_reference_groups`	`null`	Normal cell types for reference
`agreement.alpha`	`0.5`	Rank (0) vs. value (1) weight
`agreement.min_correlation`	`0.5`	Minimum Spearman rho

Outputs

dysregulation/adata_cancer_detected.h5ad - AnnData with cancer labels
dysregulation/cancer_detection_summary.tsv - Classification summary
dysregulation/figures/ - CytoTRACE2, CNV, and agreement plots

Module 5: Viral Detection

Scripts:

scripts/kraken2_viral_detection.py - Per-sample viral detection
scripts/summarize_viral_detection.py - Cross-sample summary
scripts/viral_integration.py - Integration with expression data

Detects viral/microbial sequences in unmapped reads using Kraken2.

Workflow

┌──────────────────┐
│ Cell Ranger BAM  │
│ (unmapped reads) │
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│ Extract Unmapped │
│    (samtools)    │
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│    Kraken2       │
│ Classification   │
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│ Build SC Matrix  │
│(organism x cell) │
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│ Filter Human     │
│   Viruses Only   │
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│ Integrate with   │
│ Gene Expression  │
└──────────────────┘

Outputs

viral/{sample}/kraken2_filtered_feature_bc_matrix/ - Per-sample viral count matrices
viral/{sample}/{sample}_organism_summary.tsv - Per-sample organism summary
viral/viral_detection_summary.tsv - Cross-sample summary
viral/viral_counts.h5ad - Combined viral AnnData
viral_integration/adata_with_virus.h5ad - Expression + viral data
viral_integration/figures/ - Viral detection visualizations

Key Parameters

Parameter	Default	Description
`kraken_db`	required	Kraken2 database path
`viral_db`	optional	Human viral inspect.txt for filtering
`confidence`	`0.1`	Kraken2 confidence threshold
`include_organisms`	`null`	Organism patterns to include
`exclude_organisms`	`null`	Organism patterns to exclude

Kraken2 Database Setup

# Option A: Download pre-built viral database
wget https://genome-idx.s3.amazonaws.com/kraken/k2_viral_20231009.tar.gz
tar -xzf k2_viral_20231009.tar.gz -C ~/kraken2_db/

# Option B: Build custom human viral database
kraken2-build --download-taxonomy --db ~/kraken2_db/human_viral
datasets download virus genome taxon "human virus" --filename human_viruses.zip
unzip human_viruses.zip
kraken2-build --add-to-library ncbi_dataset/data/genomic.fna --db ~/kraken2_db/human_viral
kraken2-build --build --db ~/kraken2_db/human_viral --threads 16
kraken2-inspect --db ~/kraken2_db/human_viral > ~/kraken2_db/human_viral/inspect.txt

Module 6: SComatic Mutation Calling

Script: scripts/scomatic_mutation_calling.py

Comprehensive 10-phase somatic mutation calling pipeline using SComatic.

Workflow Phases

Phase	Description	Output
1	Filter BAM to annotated cells	Filtered BAM
2	Split BAM by cell type	Per-cell-type BAMs
3	Count bases per cell	Base count tables
4	Merge counts across samples	Combined counts
5	Variant calling (Step 1)	Raw variants
6	Filter with PoN/editing sites	Filtered variants
7	BED file filtering	Mappable variants
8	Callable sites (cell type)	Cell type callable
9	Callable sites (per cell)	Per-cell callable
10	Single-cell genotypes	Final mutations

Required Reference Files

File	Description	How to Obtain
`editing_sites`	Known RNA editing positions	SComatic repo or RADAR database
`pon_file`	Panel of Normals	Generate from matched normals
`bed_file`	Mappable genomic regions	Download or generate

Key Parameters

Parameter	Default	Description
`scripts_dir`	required	Path to SComatic/scripts
`editing_sites`	required	RNA editing sites file
`pon_file`	required	Panel of Normals
`bed_file`	required	Mappable regions BED
`min_cov`	`5`	Minimum coverage
`min_cells`	`5`	Minimum cells with variant
`min_base_quality`	`30`	Base quality threshold
`min_map_quality`	`30`	Mapping quality threshold

Outputs

mutations/all_samples.single_cell_genotype.filtered.tsv - Final mutations
mutations/CombinedCallableSites/complete_callable_sites.tsv - Callable sites
mutations/cell_annotations.tsv - Cell barcode to type mapping
mutations/trinucleotide_background.tsv - Background frequencies
mutations/{sample}/ - Per-sample intermediate files

Module 7: Mutational Signature Analysis

Script: scripts/signature_analysis.py

Deconvolves single-cell mutations into COSMIC mutational signatures using semi-supervised NNLS (Non-Negative Least Squares).

Workflow

┌──────────────────┐
│ Single-Cell      │
│ Mutations (TSV)  │
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│ Build 96-Context │
│ Mutation Matrix  │
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│ Aggregate Cell   │
│ Profiles         │
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│ NNLS Fitting to  │
│ COSMIC Signatures│
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│ Add Signature    │
│ Weights to       │
│ AnnData          │
└──────────────────┘

Signature Selection Methods

All Signatures: Fit all provided COSMIC signatures
HNSCC-Specific (--hnscc-only): Use curated set of 15 HNSCC-relevant signatures
Scree Plot (--use-scree-plot): Automatic selection using elbow detection

HNSCC Signature Set

When --hnscc-only is enabled, these signatures are used:

Signature	Proposed Etiology
SBS1	Spontaneous deamination (age)
SBS2	APOBEC activity
SBS4	Tobacco smoking
SBS5	Clock-like (unknown)
SBS7a/b	UV exposure
SBS13	APOBEC activity
SBS16	Unknown (liver-associated)
SBS17a/b	Unknown
SBS18	Reactive oxygen species
SBS29	Tobacco chewing
SBS39	Unknown
SBS40	Unknown
SBS44	Defective DNA mismatch repair

Key Parameters

Parameter	Default	Description
`cosmic_file`	required	COSMIC signatures file
`core_signatures`	`SBS2,SBS13,SBS5`	Always include these
`use_scree_plot`	`false`	Use elbow detection
`mutation_threshold`	`0`	Min mutations per cell
`max_signatures`	`15`	Max signatures to test
`hnscc_only`	`false`	Use HNSCC signature set

Outputs

signatures/signature_weights_per_cell.txt - Per-cell signature weights
signatures/adata_final.h5ad - Final AnnData with all annotations
signatures/mutation_matrix_96contexts.txt - 96-context mutation matrix
signatures/cosmic_signatures_used.txt - Signatures used
signatures/reconstruction_evaluation.txt - Quality metrics
signatures/figures/ - Visualization plots

Configuration Reference

Reference Files (Simplified)

The reference configuration has been simplified to reduce redundancy:

reference:
  # Cell Ranger reference directory - THE ONLY REQUIRED FIELD
  # Must contain fasta/, genes/, star/ subdirectories
  cellranger: "/path/to/refdata-gex-GRCh38-2020-A"
  
  # Auto-derived from cellranger: {cellranger}/fasta/genome.fa
  # Override only if using non-standard layout
  fasta: null
  
  # Auto-derived from cellranger: {cellranger}/genes/genes.gtf
  # Override only if using non-standard layout
  gtf: null
  
  # Genome build (GRCh38, GRCh37, mm10, mm39)
  genome: "GRCh38"

How it works:

You only need to specify cellranger (the Cell Ranger reference directory)
The pipeline automatically derives:
- fasta from {cellranger}/fasta/genome.fa
- gtf from {cellranger}/genes/genes.gtf
Override these only if your reference has a non-standard layout

Complete Configuration Example

# =============================================================================
# ClusterCatcher Configuration
# =============================================================================

output_dir: "/path/to/results"
threads: 8
memory_gb: 64

# Sample specification
sample_info: "/path/to/samples.pkl"

# Reference (only cellranger required)
reference:
  cellranger: "/path/to/refdata-gex-GRCh38-2020-A"
  genome: "GRCh38"

# QC parameters
qc:
  min_genes: 200
  max_genes: 5000
  min_counts: 500
  max_counts: 50000
  max_mito_pct: 20
  min_cells: 3
  doublet_removal: true
  doublet_rate: 0.08

# Preprocessing (post-annotation)
preprocessing:
  target_sum: 1000000  # CPM
  n_pcs: null          # Uses CPU count
  n_neighbors: 15
  leiden_resolution: 1.0
  run_bbknn: false
  bbknn_batch_key: "sample_id"

# Annotation (popV)
annotation:
  popv_huggingface_repo: "popV/tabula_sapiens_All_Cells"
  popv_prediction_mode: "inference"
  popv_gene_symbol_key: "feature_name"
  popv_cache_dir: "tmp/popv_models"
  batch_key: "sample_id"

# Module flags
modules:
  cellranger: true
  qc: true
  annotation: true
  viral: false
  dysregulation: true
  scomatic: false
  signatures: false

Complete `create_config.py` Options

Required Arguments

Option	Description
`--output-dir`	Output directory for results
`--cellranger-reference`	Cell Ranger reference directory

Reference Files (Optional - Auto-derived)

Option	Description
`--reference-fasta`	Override auto-derived FASTA path
`--gtf-file`	Override auto-derived GTF path
`--genome`	Genome build (default: GRCh38)

Sample Specification (one required)

Option	Description
`--sample-pickle`	Pickle file with sample dictionary
`--sample-ids`	Space-separated list of sample IDs

Resource Settings

Option	Default	Description
`--threads`	`8`	Threads per job
`--memory-gb`	`64`	Memory in GB

Cell Ranger Settings

Option	Default	Description
`--chemistry`	`auto`	10X chemistry version
`--expect-cells`	`10000`	Expected cell count
`--include-introns`	`false`	Include intronic reads

QC Settings

Option	Default	Description
`--min-genes`	`200`	Min genes per cell
`--max-genes`	`5000`	Max genes per cell
`--min-counts`	`500`	Min UMI counts
`--max-counts`	`50000`	Max UMI counts
`--max-mito-pct`	`20`	Max mitochondrial %
`--min-cells`	`3`	Min cells expressing gene
`--doublet-rate`	`0.08`	Expected doublet rate

Module Enable/Disable

Option	Default	Description
`--enable-viral`	`false`	Enable viral detection
`--enable-dysregulation`	`true`	Enable dysregulation
`--enable-scomatic`	`false`	Enable mutation calling
`--enable-signatures`	`false`	Enable signature analysis

Output Structure

results/
├── cellranger/                          # Cell Ranger outputs
│   └── {sample}/outs/
│       ├── filtered_feature_bc_matrix.h5
│       ├── possorted_genome_bam.bam
│       ├── possorted_genome_bam.bam.bai
│       └── web_summary.html
│
├── qc/                                  # Quality control
│   ├── qc_metrics.tsv
│   └── multiqc_report.html
│
├── figures/                             # Visualizations
│   └── qc/
│       ├── pre_filter_qc_violin_plots.pdf
│       ├── pre_filter_qc_scatter_plots.pdf
│       ├── post_filter_qc_violin_plots.pdf
│       ├── post_filter_qc_scatter_plots.pdf
│       ├── UMAP_popv_prediction.pdf
│       ├── UMAP_popv_prediction_score.pdf
│       ├── UMAP_final_annotation.pdf
│       ├── UMAP_samples.pdf
│       ├── UMAP_clusters.pdf
│       ├── Stacked_Bar_Cluster_Composition.pdf
│       └── cell_type_proportions.pdf
│
├── annotation/                          # Cell type annotation
│   ├── adata_annotated.h5ad
│   └── annotation_summary.tsv
│
├── dysregulation/                       # Cancer cell detection
│   ├── adata_cancer_detected.h5ad
│   ├── cancer_detection_summary.tsv
│   ├── dysregulation_summary.tsv
│   └── figures/
│
├── viral/                               # Viral detection (if enabled)
│   ├── {sample}/
│   │   ├── kraken2_filtered_feature_bc_matrix/
│   │   └── {sample}_organism_summary.tsv
│   ├── viral_detection_summary.tsv
│   └── viral_counts.h5ad
│
├── mutations/                           # SComatic output (if enabled)
│   ├── all_samples.single_cell_genotype.filtered.tsv
│   └── {sample}/
│
├── signatures/                          # Signature analysis (if enabled)
│   ├── signature_weights_per_cell.txt
│   ├── adata_final.h5ad
│   └── figures/
│
├── logs/                                # Pipeline logs
│
├── config.yaml                          # Pipeline configuration
└── master_summary.yaml                  # Pipeline summary

Troubleshooting

Common Issues

Cell Ranger not found

# Add Cell Ranger to PATH
export PATH=/path/to/cellranger-7.2.0:$PATH

# Verify
which cellranger
cellranger --version

Reference FASTA not found (auto-derivation failed)

If using a non-standard Cell Ranger reference layout:

python create_config.py ... \
    --reference-fasta /custom/path/to/genome.fa \
    --gtf-file /custom/path/to/genes.gtf

Chemistry detection fails

Specify chemistry explicitly:

python create_config.py ... --chemistry SC3Pv3

Memory issues

Increase memory allocation:

python create_config.py ... --memory-gb 128

popV model download fails

Check internet connection and try manually downloading:

# Clear cache and retry
rm -rf tmp/popv_models
# Run pipeline again

Or specify a different model:

annotation:
  popv_huggingface_repo: "popV/tabula_sapiens_immune"

SComatic script errors

Verify SComatic installation:

ls /path/to/SComatic/scripts/
# Should contain: SplitBamCellTypes.py, BaseCellCounter.py, etc.

No mutations detected

Check BAM files contain cell barcodes (CB tag)
Verify reference files are correct
Check callable sites output
Lower min_cov and min_cells thresholds

Log Files

All log files are stored in {output_dir}/logs/:

# View Cell Ranger log
cat results/logs/cellranger/SAMPLE1.log

# View QC/annotation log
cat results/logs/qc_annotation/qc_annotation.log

# View Snakemake log
snakemake ... 2>&1 | tee snakemake.log

Debugging

Run with verbose output:

snakemake --configfile config.yaml \
    --cores 8 \
    --verbose \
    --printshellcmds

Dry run to check workflow:

snakemake --configfile config.yaml --dryrun

Citation

If you use ClusterCatcher in your research, please cite:

Lehle, J. (2025). ClusterCatcher: Single-cell sequencing analysis pipeline for mutation signature detection. GitHub. https://github.com/JakeLehle/ClusterCatcher

Please also cite the tools used in each module:

Module	Citation
Cell Ranger	10x Genomics
Scanpy	Wolf et al., Genome Biology 2018
popV	https://github.com/YosefLab/popV
CytoTRACE2	Gulati et al., Science 2020
inferCNV	https://github.com/broadinstitute/infercnv
SComatic	Muyas et al., Nature Biotechnology 2024
Kraken2	Wood et al., Genome Biology 2019
COSMIC	Alexandrov et al., Nature 2020

License

This project is licensed under the GNU General Public License v3.0 - see the LICENSE file for details.

Contact

Author: Jake Lehle
Institution: Texas Biomedical Research Institute
Issues: GitHub Issues

For bug reports or feature requests, please open an issue on GitHub.

Name		Name	Last commit message	Last commit date
Latest commit History 30 Commits
cli		cli
snakemake_wrapper		snakemake_wrapper
.gitignore		.gitignore
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
README.md		README.md
environment.yml		environment.yml
pyproject.toml		pyproject.toml
setup.py		setup.py

License

JakeLehle/ClusterCatcher

Folders and files

Latest commit

History

Repository files navigation

ClusterCatcher

Table of Contents

Overview

Key Features

Pipeline Architecture

Pipeline Modules Summary

Requirements

System Requirements

External Software

Optional External Dependencies

Installation

1. Clone the Repository

2. Create Conda Environment

3. Install ClusterCatcher Package

4. Install Cell Ranger

5. Download Reference Data

6. (Optional) Install CytoTRACE2

7. (Optional) Clone SComatic

Quick Start

Step 1: Prepare Sample Information

Step 2: Generate Configuration

Step 3: Run Pipeline

For SRAscraper Users

Detailed Usage

Sample CSV Format

Required Columns

Optional Columns

Multi-Lane Samples

Enable All Modules

Cluster Execution

SLURM Example

Using Snakemake Profiles

Pipeline Modules

Module 1: Cell Ranger Alignment

Inputs

Outputs

Key Parameters

Module 2: QC and Filtering

QC Metrics Calculated

Filtering Steps

Key Parameters

Outputs

Module 3: Cell Type Annotation (popV)

Workflow

popV Annotation Details

Cluster-Based Annotation Refinement

Key Parameters

Post-Annotation Processing Parameters

Outputs

Module 4: Dysregulation Detection (Cancer Cell Identification)

How It Works

CytoTRACE2 Component

InferCNV Component

Agreement Scoring

Key Parameters

Outputs

Module 5: Viral Detection

Workflow

Outputs

Key Parameters

Kraken2 Database Setup

Module 6: SComatic Mutation Calling

Workflow Phases

Required Reference Files

Key Parameters

Outputs

Module 7: Mutational Signature Analysis

Workflow

Signature Selection Methods

HNSCC Signature Set

Key Parameters

Outputs

Configuration Reference

Reference Files (Simplified)

Complete `create_config.py` Options

Packages