zhanglabtools/scFM-batcheffect


Batch Effects Remain a Fundamental Barrier to Universal Embeddings in Single-Cell Foundation Models

Abstract

Constructing a cell universe requires integrating heterogeneous single-cell RNA-seq datasets, which is hindered by diverse batch effects. Single-cell foundation models (scFMs), inspired by large language models, aim to learn universal cellular embeddings from large-scale single-cell data. However, unlike language, single-cell data are sparse, noisy, and strongly affected by batch artifacts, limiting cross-dataset transferability. Our systematic evaluation across diverse batch scenarios reveals that current scFMs fail to intrinsically remove batch effects, with batch signals persisting in pretrained embeddings. Post-hoc batch-centering partially improves alignment, highlighting the need for future scFMs to integrate explicit batch-effect correction mechanisms to achieve true universal cellular embeddings.

About This Repository

This repository contains the analysis and plotting code for the evaluation described above.

scFM-batcheffect/
├── config.yaml                    # Configuration file
├── run.sh                         # Master pipeline (all stages, all datasets)
├── README.md                      # This file
│
├── figures/                       # Jupyter notebooks for figure plotting
│
└── core_pipeline/                 # Evaluation pipeline
    ├── 0_data_preprocess/         # Step 0: Data preprocessing & standardization
    ├── 1_data_model_preparation/  # Step 1: Model-specific data preparation
    ├── 2_extract_embedding/       # Step 2: Embedding extraction
    ├── 3_integration/             # Step 3: Embedding integration & visualization
    ├── 4_benchmark/               # Step 4: scIB metrics benchmarking
    ├── 5_batch_correction/        # Step 5: Batch-centering correction
    └── 6_probing/                 # Step 6: Linear probing analysis

Resources for Models and Datasets

Foundation Models

Integration Methods

All traditional integration methods can be applied through the scIB package.

Datasets


Usage

1. Environment Setup

Base Environment

First, create a conda environment with common dependencies:

# Create conda environment
conda create -n scfm python=3.10
conda activate scfm

# Install common dependencies
pip install scanpy pandas scipy numpy matplotlib scikit-learn pyyaml 

# Install evaluation and integration methods
pip install scib scib_metrics scvi-tools harmony-pytorch scanorama

Foundation Model-Specific Environment Setup

Each foundation model has different dependencies and requires its own environment. Please refer to each model's tutorial for setup instructions.

2. Configure config.yaml

Edit config.yaml to set up paths for your datasets and models:

# Model directory paths
model_paths:
  geneformer:
    code_path: "/path/to/geneformer/code"
    model_path: "/path/to/geneformer/model"
    gpu: 0
    batch_size: 32

  scfoundation:
    code_path: "/path/to/scfoundation/code"
    model_path: "/path/to/scfoundation/model"
    gpu: 1
    batch_size: 32

  # ... other models

# Datasets
datasets:
  limb:
    data_path: "/path/to/limb/raw.h5ad"
    output_data_dir: "/path/to/limb/processed_data"
    output_res_dir: "/path/to/limb/results"
    batch_key: "batch"
    celltype_key: "cell_type"

  liver:
    data_path: "/path/to/liver/raw.h5ad"
    output_data_dir: "/path/to/liver/processed_data"
    output_res_dir: "/path/to/liver/results"
    batch_key: "batch"
    celltype_key: "cell_type"

  # ... other datasets

# Probing configuration
probing:
  n_splits: 5
  max_workers: 4

# Batch correction configuration
batch_correction:
  batch_cells:
    max_cells_per_batch: 10000
    random_seed: 42
    normalize: true
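
The stage scripts read these settings at startup. As a minimal sketch of how one dataset's entry might be resolved with PyYAML (the `load_dataset_config` helper is hypothetical, shown only to illustrate the config layout above):

```python
import yaml

def load_dataset_config(config_path, dataset):
    """Return (data_path, batch_key, celltype_key) for one dataset entry.
    Hypothetical helper illustrating the config.yaml layout; the actual
    scripts may parse the file differently."""
    with open(config_path) as f:
        cfg = yaml.safe_load(f)
    entry = cfg["datasets"][dataset]
    return entry["data_path"], entry["batch_key"], entry["celltype_key"]
```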

3. Run the Evaluation Pipeline

Option A: Run Full Pipeline (All Stages, All Datasets)

bash run.sh

This executes all stages (0-6) sequentially for every dataset and model.

Option B: Run Pipeline Step by Step (Recommended)

Because deploying multiple foundation models in the same environment is challenging, it's best to run stages one by one, switching between model-specific environments as needed.

Stage 0: Data Preprocessing

Run once for all datasets:

cd core_pipeline

# Preprocess limb dataset
python 0_data_preprocess/data_preprocess.py \
    --dataset limb \
    --config ../config.yaml

# Preprocess liver dataset
python 0_data_preprocess/data_preprocess.py \
    --dataset liver \
    --config ../config.yaml

# Repeat for other datasets
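
Preprocessing standardizes raw counts before any model sees them. A rough sketch of the standard normalize-then-log transform common in single-cell analysis (the repository's actual `data_preprocess.py` may additionally filter cells and harmonize gene names; this is an illustration, not its implementation):

```python
import numpy as np

def normalize_log1p(counts, target_sum=1e4):
    """Library-size normalize each cell to `target_sum` counts, then log1p.
    A common scRNA-seq standardization, shown here for illustration only."""
    counts = np.asarray(counts, dtype=float)
    libsize = counts.sum(axis=1, keepdims=True)       # total counts per cell
    scaled = counts / np.clip(libsize, 1.0, None) * target_sum
    return np.log1p(scaled)
```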

Stage 1: Model-Specific Data Preparation

Run each model for each dataset:

# For limb dataset, prepare all model-specific data
python 1_data_model_preparation/prepare_geneformer.py --dataset limb --config ../config.yaml
python 1_data_model_preparation/prepare_uce.py --dataset limb --config ../config.yaml
python 1_data_model_preparation/prepare_cellplm.py --dataset limb --config ../config.yaml
bash 1_data_model_preparation/prepare_sccello.sh --dataset limb --config ../config.yaml

# Repeat for other models and datasets

Stage 2: Embedding Extraction

Extract embeddings for each model. Note: Each model may require a different environment setup.

# === Switch to GeneFormer environment ===
conda activate geneformer-env
python 2_extract_embedding/extract_embedding_geneformer.py \
    --dataset limb --config ../config.yaml

# === Switch to UCE environment ===
conda activate uce-env
bash 2_extract_embedding/extract_embedding_UCE.sh \
    --dataset limb --config ../config.yaml

# Repeat for other models and datasets

Stage 3: Integration

Integrate each model's embeddings and generate visualizations:

python 3_integration/integrate.py \
    --dataset limb --model uce --config ../config.yaml

# === Switch to CellPLM environment ===
conda activate cellplm-env
python 3_integration/integrate.py \
    --dataset limb --model cellplm --config ../config.yaml

python 3_integration/integrate.py \
    --dataset limb --model harmony --config ../config.yaml

# Repeat for other models and datasets

Stage 4: Benchmarking

Evaluate embedding quality using scIB metrics:

# Benchmark all models on limb dataset
python 4_benchmark/benchmark.py \
    --dataset limb --model uce --config ../config.yaml

python 4_benchmark/benchmark.py \
    --dataset limb --model geneformer --config ../config.yaml

# Repeat for other models and datasets
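
scIB computes a battery of batch-removal and bio-conservation metrics. To give a flavor of one of them, here is a simplified, global version of the batch ASW score; scIB itself computes it per cell type, so `batch_asw` below is an illustrative reduction, not the repository's or scIB's implementation:

```python
import numpy as np
from sklearn.metrics import silhouette_score

def batch_asw(embeddings, batch_labels):
    """1 - |silhouette over batch labels|: scores near 1 mean batches are
    well mixed; near 0 means the embedding still separates by batch.
    Simplified global form of scIB's per-cell-type batch ASW."""
    s = silhouette_score(embeddings, batch_labels)
    return 1.0 - abs(s)
```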

Stage 5: Batch Correction

Apply batch-centering correction to embeddings:

# Batch correct all models on limb dataset
python 5_batch_correction/batch_normalize.py \
    --dataset limb --model uce --config ../config.yaml

python 5_batch_correction/batch_normalize.py \
    --dataset limb --model geneformer --config ../config.yaml

# Repeat for other models and datasets
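
Batch-centering itself is conceptually simple: subtract each batch's mean vector from its cells' embeddings, optionally L2-renormalizing afterward (matching the `normalize: true` option in config.yaml). A minimal sketch, with no claim to match `batch_normalize.py` exactly:

```python
import numpy as np

def batch_center(embeddings, batch_labels, normalize=True):
    """Subtract each batch's mean embedding; optionally L2-renormalize rows.
    Illustrative sketch only, not the repository's batch_normalize.py."""
    corrected = np.asarray(embeddings, dtype=float).copy()
    batch_labels = np.asarray(batch_labels)
    for b in np.unique(batch_labels):
        mask = batch_labels == b
        corrected[mask] -= corrected[mask].mean(axis=0)   # center this batch
    if normalize:
        norms = np.linalg.norm(corrected, axis=1, keepdims=True)
        corrected = corrected / np.clip(norms, 1e-12, None)
    return corrected
```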

Stage 6: Linear Probing Analysis

Evaluate embeddings on downstream classification tasks:

# Original embeddings
python 6_probing/probing_main.py \
    --dataset limb --model uce --config ../config.yaml

# Batch-corrected embeddings
python 6_probing/probing_main.py \
    --dataset limb --model uce --batch-center --config ../config.yaml

# Repeat for other models and datasets
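
Linear probing freezes the embeddings and asks how well a linear classifier can recover cell-type labels from them. A minimal cross-validated version using scikit-learn, wired to the `n_splits` setting from config.yaml (the repository's `probing_main.py` presumably adds per-fold reporting and parallelism via `max_workers`; this sketch only shows the core idea):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

def linear_probe(embeddings, labels, n_splits=5, seed=42):
    """Cross-validated accuracy of a logistic-regression probe on frozen
    embeddings. Illustrative sketch of the linear-probing idea only."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    clf = LogisticRegression(max_iter=1000)
    scores = cross_val_score(clf, embeddings, labels, cv=cv, scoring="accuracy")
    return scores.mean()
```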

Support

For questions, issues, or feature requests, please contact: wanglinting@amss.ac.cn
