Constructing a cell universe requires integrating heterogeneous single-cell RNA-seq datasets, which is hindered by diverse batch effects. Single-cell foundation models (scFMs), inspired by large language models, aim to learn universal cellular embeddings from large-scale single-cell data. However, unlike language, single-cell data are sparse, noisy, and strongly affected by batch artifacts, limiting cross-dataset transferability. Our systematic evaluation across diverse batch scenarios reveals that current scFMs fail to intrinsically remove batch effects, with batch signals persisting in pretrained embeddings. Post-hoc batch-centering partially improves alignment, highlighting the need for future scFMs to integrate explicit batch-effect correction mechanisms to achieve true universal cellular embeddings.
This repository contains the analysis and plotting code for this work.
scFM-batcheffect/
├── config.yaml # Configuration file
├── run.sh # Master pipeline (all stages, all datasets)
├── README.md # This file
|
├── figures/ # Figure plotting Jupyter files
|
└── core_pipeline/ # Evaluation pipeline
├── 0_data_preprocess/ # Step 0: Data preprocessing & standardization
├── 1_data_model_preparation/ # Step 1: Model-specific data preparation
├── 2_extract_embedding/ # Step 2: Embedding extraction
├── 3_integration/ # Step 3: Embedding integration & visualization
├── 4_benchmark/ # Step 4: scIB metrics benchmarking
├── 5_batch_correction/ # Step 5: Batch-centering correction
└── 6_probing/ # Step 6: Linear probing analysis
All the traditional integration methods can be applied using scIB.
First, create a conda environment with common dependencies:
# Create conda environment
conda create -n scfm python=3.10
conda activate scfm
# Install common dependencies
pip install scanpy pandas scipy numpy matplotlib scikit-learn pyyaml
# Install evaluation and integration methods
pip install scib scib_metrics scvi-tools harmony-pytorch scanorama

Each foundation model has different dependencies and requires a separate environment. Please refer to the respective tutorial of each model.
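The traditional integration baselines, in contrast, run in this common environment. For example, a hedged sketch of Harmony applied to a PCA embedding via harmony-pytorch (paths and the .obsm key are illustrative; Scanorama, scVI and the others can be run analogously, e.g. through scIB's wrappers):

import scanpy as sc
from harmony import harmonize   # harmony-pytorch

adata = sc.read_h5ad("/path/to/limb/raw.h5ad")   # data_path from config.yaml
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000, subset=True)
sc.pp.scale(adata, max_value=10)
sc.pp.pca(adata, n_comps=50)
# Harmony corrects the PCA coordinates using the batch labels
adata.obsm["X_harmony"] = harmonize(adata.obsm["X_pca"], adata.obs, batch_key="batch")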
Edit config.yaml to set up paths for your datasets and models:
# Model directory paths
model_paths:
geneformer:
code_path: "/path/to/geneformer/code"
model_path: "/path/to/geneformer/model"
gpu: 0
batch_size: 32
scfoundation:
code_path: "/path/to/scfoundation/code"
model_path: "/path/to/scfoundation/model"
gpu: 1
batch_size: 32
# ... other models
# Datasets
datasets:
limb:
data_path: "/path/to/limb/raw.h5ad"
output_data_dir: "/path/to/limb/processed_data"
output_res_dir: "/path/to/limb/results"
batch_key: "batch"
celltype_key: "cell_type"
liver:
data_path: "/path/to/liver/raw.h5ad"
output_data_dir: "/path/to/liver/processed_data"
output_res_dir: "/path/to/liver/results"
batch_key: "batch"
celltype_key: "cell_type"
# ... other datasets
# Probing configuration
probing:
n_splits: 5
max_workers: 4
# Batch correction configuration
batch_correction:
batch_cells:
max_cells_per_batch: 10000
random_seed: 42
normalize: true

To run the full pipeline in one go:

bash run.sh

This will execute all 6 stages sequentially for all datasets and models.
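Every stage script can also be run on its own: it takes --config (plus --dataset and, where relevant, --model) and looks its settings up in config.yaml. A minimal sketch of reading the entries shown above with PyYAML (installed via pyyaml above):

import yaml

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

ds = cfg["datasets"]["limb"]
print(ds["data_path"], ds["batch_key"], ds["celltype_key"])

gm = cfg["model_paths"]["geneformer"]
print(gm["model_path"], "GPU:", gm["gpu"], "batch size:", gm["batch_size"])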
Because deploying multiple foundation models in the same environment is challenging, it's best to run stages one by one, switching between model-specific environments as needed.
Step 0 is model-independent: run it once per dataset.
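data_preprocess.py standardizes each raw .h5ad (QC, normalization, consistent batch and cell-type keys). As a rough illustration only, not the script's actual contents, assuming scanpy-style preprocessing:

import scanpy as sc

adata = sc.read_h5ad("/path/to/limb/raw.h5ad")               # data_path from config.yaml
adata.var_names_make_unique()
sc.pp.filter_cells(adata, min_genes=200)                     # illustrative QC thresholds
sc.pp.filter_genes(adata, min_cells=3)
adata.layers["counts"] = adata.X.copy()                      # keep raw counts for models that need them
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
adata.obs["batch"] = adata.obs["batch"].astype("category")           # batch_key
adata.obs["cell_type"] = adata.obs["cell_type"].astype("category")   # celltype_key
adata.write_h5ad("/path/to/limb/processed_data/limb_processed.h5ad") # hypothetical output name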
cd core_pipeline
# Preprocess limb dataset
python 0_data_preprocess/data_preprocess.py \
--dataset limb \
--config ../config.yaml
# Preprocess liver datasets
python 0_data_preprocess/data_preprocess.py \
--dataset liver \
--config ../config.yaml
# Repeat for other datasets

Run each model for each dataset:
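"Model-specific" preparation differs per model; for instance, Geneformer-style tokenization generally expects raw counts indexed by Ensembl gene IDs, while other models take normalized expression. A hypothetical sketch of that kind of conversion (the mapping table and file names are illustrative, not what prepare_geneformer.py actually does):

import pandas as pd
import scanpy as sc

adata = sc.read_h5ad("/path/to/limb/processed_data/limb_processed.h5ad")   # hypothetical input
# hypothetical gene symbol -> Ensembl ID table
id_map = pd.read_csv("gene_symbol_to_ensembl.csv", index_col="symbol")["ensembl_id"]
adata.var["ensembl_id"] = id_map.reindex(adata.var_names).to_numpy()
adata = adata[:, adata.var["ensembl_id"].notna().values].copy()            # drop genes without a stable ID
adata.X = adata.layers["counts"]                                           # raw counts, not log-normalized values
adata.write_h5ad("/path/to/limb/processed_data/limb_geneformer_input.h5ad")  # hypothetical output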
# For limb dataset, prepare all model-specific data
python 1_data_model_preparation/prepare_geneformer.py --dataset limb --config ../config.yaml
python 1_data_model_preparation/prepare_uce.py --dataset limb --config ../config.yaml
python 1_data_model_preparation/prepare_cellplm.py --dataset limb --config ../config.yaml
bash 1_data_model_preparation/prepare_sccello.sh --dataset limb --config ../config.yaml
# Repeat for other models and datasets

Extract embeddings for each model. Note: Each model may require a different environment setup.
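Whatever the model, the output of this stage is a cells-by-dimensions embedding matrix that later stages attach to the AnnData. A purely illustrative sketch (file names and the .obsm key are hypothetical, not the scripts' actual I/O):

import numpy as np
import scanpy as sc

# hypothetical locations; the extraction scripts define the real output paths and formats
emb = np.load("/path/to/limb/results/geneformer_embedding.npy")            # shape: (n_cells, dim)
adata = sc.read_h5ad("/path/to/limb/processed_data/limb_processed.h5ad")
assert emb.shape[0] == adata.n_obs                                         # one row per cell, same order
adata.obsm["X_geneformer"] = emb                                           # downstream stages consume embeddings like this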
# === Switch to GeneFormer environment ===
conda activate geneformer-env
python 2_extract_embedding/extract_embedding_geneformer.py \
--dataset limb --config ../config.yaml
# === Switch to UCE environment ===
conda activate uce-env
bash 2_extract_embedding/extract_embedding_UCE.sh \
--dataset limb --config ../config.yaml
# Repeat for other models and datasets

Integrate the extracted embeddings and generate visualizations:

python 3_integration/integrate.py \
--dataset limb --model uce --config ../config.yaml
# === Switch to CellPLM environment ===
conda activate cellplm-env
python 3_integration/integrate.py \
--dataset limb --model cellplm --config ../config.yaml
python 3_integration/integrate.py \
--dataset limb --model harmony --config ../config.yaml
# Repeat for other models and datasets

Evaluate embedding quality using scIB metrics:
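For orientation, a hedged sketch of this kind of evaluation using the scib-metrics Benchmarker (the combined .h5ad and .obsm keys are hypothetical; benchmark.py may select and compute metrics differently):

import scanpy as sc
from scib_metrics.benchmark import Benchmarker

adata = sc.read_h5ad("limb_with_embeddings.h5ad")                # hypothetical combined file
bm = Benchmarker(
    adata,
    batch_key="batch",
    label_key="cell_type",
    embedding_obsm_keys=["X_geneformer", "X_uce", "X_harmony"],  # hypothetical keys
    n_jobs=4,
)
bm.benchmark()
print(bm.get_results(min_max_scale=False))   # per-embedding bio-conservation and batch-correction scores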
# Benchmark all models on limb dataset
python 4_benchmark/benchmark.py \
--dataset limb --model uce --config ../config.yaml
python 4_benchmark/benchmark.py \
--dataset limb --model geneformer --config ../config.yaml
# Repeat for other models and datasets

Apply batch-centering correction to embeddings:
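The idea behind batch-centering is to subtract each batch's mean vector from that batch's cell embeddings. A minimal numpy sketch of that operation (batch_normalize.py may differ in detail, e.g. the subsampling and normalize options in config.yaml):

import numpy as np

def batch_center(embedding: np.ndarray, batches: np.ndarray) -> np.ndarray:
    """Subtract the per-batch mean vector from each cell's embedding."""
    corrected = embedding.astype(np.float64).copy()
    for b in np.unique(batches):
        mask = batches == b
        corrected[mask] -= corrected[mask].mean(axis=0)
    return corrected

# usage: corrected = batch_center(adata.obsm["X_uce"], adata.obs["batch"].to_numpy())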
# Batch correct all models on limb dataset
python 5_batch_correction/batch_normalize.py \
--dataset limb --model uce --config ../config.yaml
python 5_batch_correction/batch_normalize.py \
--dataset limb --model geneformer --config ../config.yaml
# Repeat for other models and datasets

Evaluate embeddings on downstream classification tasks:
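Linear probing trains a simple classifier on the frozen embeddings to predict cell type. A minimal scikit-learn sketch using stratified cross-validation with the n_splits value from the probing config (probing_main.py may use a different classifier or metrics):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

def linear_probe(X: np.ndarray, y: np.ndarray, n_splits: int = 5, seed: int = 0) -> np.ndarray:
    """Cross-validated accuracy of a linear classifier on frozen embeddings."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X, y, cv=cv, scoring="accuracy")

# usage: scores = linear_probe(adata.obsm["X_uce"], adata.obs["cell_type"].to_numpy())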
# Original embeddings
python 6_probing/probing_main.py \
--dataset limb --model uce --config ../config.yaml
# Batch-corrected embeddings
python 6_probing/probing_main.py \
--dataset limb --model uce --batch-center --config ../config.yaml
# Repeat for other models and datasets

For questions, issues, or feature requests, please contact: wanglinting@amss.ac.cn
