Constructing a cell universe requires integrating heterogeneous single-cell RNA-seq datasets, which is hindered by diverse batch effects. Single-cell foundation models (scFMs), inspired by large language models, aim to learn universal cellular embeddings from large-scale single-cell data. However, unlike language, single-cell data are sparse, noisy, and strongly affected by batch artifacts, limiting cross-dataset transferability. Our systematic evaluation across diverse batch scenarios reveals that current scFMs fail to intrinsically remove batch effects, with batch signals persisting in pretrained embeddings. Post-hoc batch-centering partially improves alignment, highlighting the need for future scFMs to integrate explicit batch-effect correction mechanisms to achieve true universal cellular embeddings.
This repository contains the analysis and plotting code for this work.
scFM-batcheffect/
├── config.yaml # Configuration file
├── run.sh # Master pipeline (all stages, all datasets)
├── README.md # This file
|
├── figures/ # Figure plotting Jupyter files
|
└── core_pipeline/ # Evaluation pipeline
├── 0_data_preprocess/ # Step 0: Data preprocessing & standardization
├── 1_data_model_preparation/ # Step 1: Model-specific data preparation
├── 2_extract_embedding/ # Step 2: Embedding extraction
├── 3_integration/ # Step 3: Embedding integration & visualization
├── 4_benchmark/ # Step 4: scIB metrics benchmarking
├── 5_batch_correction/ # Step 5: Batch-centering correction
└── 6_probing/ # Step 6: Linear probing analysis
All the traditional integration methods can be applied using scIB.
First, create a conda environment with common dependencies:
# Create conda environment
conda create -n scfm python=3.10
conda activate scfm
# Install common dependencies
pip install scanpy pandas scipy numpy matplotlib scikit-learn pyyaml
# Install evaluation and integration methods
pip install scib scib_metrics scvi-tools harmony-pytorch scanorama

Each foundation model has different dependencies and requires a separate environment. Please refer to the respective tutorial of each model.
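The traditional integration baselines, in contrast, run in this common environment. For example, a hedged sketch of Harmony applied to a PCA embedding via harmony-pytorch (paths and the .obsm key are illustrative; Scanorama, scVI and the others can be run analogously, e.g. through scIB's wrappers):

import scanpy as sc
from harmony import harmonize   # harmony-pytorch

adata = sc.read_h5ad("/path/to/limb/raw.h5ad")   # data_path from config.yaml
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000, subset=True)
sc.pp.scale(adata, max_value=10)
sc.pp.pca(adata, n_comps=50)
# Harmony corrects the PCA coordinates using the batch labels
adata.obsm["X_harmony"] = harmonize(adata.obsm["X_pca"], adata.obs, batch_key="batch")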
Edit config.yaml to set up paths for your datasets and models:
# Model directory paths
model_paths:
geneformer:
code_path: "/path/to/geneformer/code"
model_path: "/path/to/geneformer/model"
gpu: 0
batch_size: 32
scfoundation:
code_path: "/path/to/scfoundation/code"
model_path: "/path/to/scfoundation/model"
gpu: 1
batch_size: 32
# ... other models
# Datasets
datasets:
limb:
data_path: "/path/to/limb/raw.h5ad"
output_data_dir: "/path/to/limb/processed_data"
output_res_dir: "/path/to/limb/results"
batch_key: "batch"
celltype_key: "cell_type"
liver:
data_path: "/path/to/liver/raw.h5ad"
output_data_dir: "/path/to/liver/processed_data"
output_res_dir: "/path/to/liver/results"
batch_key: "batch"
celltype_key: "cell_type"
# ... other datasets
# Probing configuration
probing:
n_splits: 5
max_workers: 4
# Batch correction configuration
batch_correction:
batch_cells:
max_cells_per_batch: 10000
random_seed: 42
normalize: true

To run the full pipeline in one go:

bash run.sh

This will execute all 6 stages sequentially for all datasets and models.
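Every stage script can also be run on its own: it takes --config (plus --dataset and, where relevant, --model) and looks its settings up in config.yaml. A minimal sketch of reading the entries shown above with PyYAML (installed via pyyaml above):

import yaml

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

ds = cfg["datasets"]["limb"]
print(ds["data_path"], ds["batch_key"], ds["celltype_key"])

gm = cfg["model_paths"]["geneformer"]
print(gm["model_path"], "GPU:", gm["gpu"], "batch size:", gm["batch_size"])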
Because deploying multiple foundation models in the same environment is challenging, it's best to run stages one by one, switching between model-specific environments as needed.
Step 0 is model-independent: run it once per dataset.
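data_preprocess.py standardizes each raw .h5ad (QC, normalization, consistent batch and cell-type keys). As a rough illustration only, not the script's actual contents, assuming scanpy-style preprocessing:

import scanpy as sc

adata = sc.read_h5ad("/path/to/limb/raw.h5ad")               # data_path from config.yaml
adata.var_names_make_unique()
sc.pp.filter_cells(adata, min_genes=200)                     # illustrative QC thresholds
sc.pp.filter_genes(adata, min_cells=3)
adata.layers["counts"] = adata.X.copy()                      # keep raw counts for models that need them
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
adata.obs["batch"] = adata.obs["batch"].astype("category")           # batch_key
adata.obs["cell_type"] = adata.obs["cell_type"].astype("category")   # celltype_key
adata.write_h5ad("/path/to/limb/processed_data/limb_processed.h5ad") # hypothetical output name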
cd core_pipeline
# Preprocess limb dataset
python 0_data_preprocess/data_preprocess.py \
--dataset limb \
--config ../config.yaml
# Preprocess liver datasets
python 0_data_preprocess/data_preprocess.py \
--dataset liver \
--config ../config.yaml
# Repeat for other datasets

Run each model for each dataset:
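"Model-specific" preparation differs per model; for instance, Geneformer-style tokenization generally expects raw counts indexed by Ensembl gene IDs, while other models take normalized expression. A hypothetical sketch of that kind of conversion (the mapping table and file names are illustrative, not what prepare_geneformer.py actually does):

import pandas as pd
import scanpy as sc

adata = sc.read_h5ad("/path/to/limb/processed_data/limb_processed.h5ad")   # hypothetical input
# hypothetical gene symbol -> Ensembl ID table
id_map = pd.read_csv("gene_symbol_to_ensembl.csv", index_col="symbol")["ensembl_id"]
adata.var["ensembl_id"] = id_map.reindex(adata.var_names).to_numpy()
adata = adata[:, adata.var["ensembl_id"].notna().values].copy()            # drop genes without a stable ID
adata.X = adata.layers["counts"]                                           # raw counts, not log-normalized values
adata.write_h5ad("/path/to/limb/processed_data/limb_geneformer_input.h5ad")  # hypothetical output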
# For limb dataset, prepare all model-specific data
python 1_data_model_preparation/prepare_geneformer.py --dataset limb --config ../config.yaml
python 1_data_model_preparation/prepare_uce.py --dataset limb --config ../config.yaml
python 1_data_model_preparation/prepare_cellplm.py --dataset limb --config ../config.yaml
bash 1_data_model_preparation/prepare_sccello.sh --dataset limb --config ../config.yaml
# Repeat for other models and datasets

Extract embeddings for each model. Note: Each model may require a different environment setup.
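Whatever the model, the output of this stage is a cells-by-dimensions embedding matrix that later stages attach to the AnnData. A purely illustrative sketch (file names and the .obsm key are hypothetical, not the scripts' actual I/O):

import numpy as np
import scanpy as sc

# hypothetical locations; the extraction scripts define the real output paths and formats
emb = np.load("/path/to/limb/results/geneformer_embedding.npy")            # shape: (n_cells, dim)
adata = sc.read_h5ad("/path/to/limb/processed_data/limb_processed.h5ad")
assert emb.shape[0] == adata.n_obs                                         # one row per cell, same order
adata.obsm["X_geneformer"] = emb                                           # downstream stages consume embeddings like this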
# === Switch to GeneFormer environment ===
conda activate geneformer-env
python 2_extract_embedding/extract_embedding_geneformer.py \
--dataset limb --config ../config.yaml
# === Switch to UCE environment ===
conda activate uce-env
bash 2_extract_embedding/extract_embedding_UCE.sh \
--dataset limb --config ../config.yaml
# Repeat for other models and datasets

Integrate the extracted embeddings and generate visualizations:

python 3_integration/integrate.py \
--dataset limb --model uce --config ../config.yaml
# === Switch to CellPLM environment ===
conda activate cellplm-env
python 3_integration/integrate.py \
--dataset limb --model cellplm --config ../config.yaml
python 3_integration/integrate.py \
--dataset limb --model harmony --config ../config.yaml
# Repeat for other models and datasets

Evaluate embedding quality using scIB metrics:
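For orientation, a hedged sketch of this kind of evaluation using the scib-metrics Benchmarker (the combined .h5ad and .obsm keys are hypothetical; benchmark.py may select and compute metrics differently):

import scanpy as sc
from scib_metrics.benchmark import Benchmarker

adata = sc.read_h5ad("limb_with_embeddings.h5ad")                # hypothetical combined file
bm = Benchmarker(
    adata,
    batch_key="batch",
    label_key="cell_type",
    embedding_obsm_keys=["X_geneformer", "X_uce", "X_harmony"],  # hypothetical keys
    n_jobs=4,
)
bm.benchmark()
print(bm.get_results(min_max_scale=False))   # per-embedding bio-conservation and batch-correction scores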
# Benchmark all models on limb dataset
python 4_benchmark/benchmark.py \
--dataset limb --model uce --config ../config.yaml
python 4_benchmark/benchmark.py \
--dataset limb --model geneformer --config ../config.yaml
# Repeat for other models and datasets

Apply batch-centering correction to embeddings:
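The idea behind batch-centering is to subtract each batch's mean vector from that batch's cell embeddings. A minimal numpy sketch of that operation (batch_normalize.py may differ in detail, e.g. the subsampling and normalize options in config.yaml):

import numpy as np

def batch_center(embedding: np.ndarray, batches: np.ndarray) -> np.ndarray:
    """Subtract the per-batch mean vector from each cell's embedding."""
    corrected = embedding.astype(np.float64).copy()
    for b in np.unique(batches):
        mask = batches == b
        corrected[mask] -= corrected[mask].mean(axis=0)
    return corrected

# usage: corrected = batch_center(adata.obsm["X_uce"], adata.obs["batch"].to_numpy())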
# Batch correct all models on limb dataset
python 5_batch_correction/batch_normalize.py \
--dataset limb --model uce --config ../config.yaml
python 5_batch_correction/batch_normalize.py \
--dataset limb --model geneformer --config ../config.yaml
# Repeat for other models and datasets

Evaluate embeddings on downstream classification tasks:
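Linear probing trains a simple classifier on the frozen embeddings to predict cell type. A minimal scikit-learn sketch using stratified cross-validation with the n_splits value from the probing config (probing_main.py may use a different classifier or metrics):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

def linear_probe(X: np.ndarray, y: np.ndarray, n_splits: int = 5, seed: int = 0) -> np.ndarray:
    """Cross-validated accuracy of a linear classifier on frozen embeddings."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X, y, cv=cv, scoring="accuracy")

# usage: scores = linear_probe(adata.obsm["X_uce"], adata.obs["cell_type"].to_numpy())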
# Original embeddings
python 6_probing/probing_main.py \
--dataset limb --model uce --config ../config.yaml
# Batch-corrected embeddings
python 6_probing/probing_main.py \
--dataset limb --model uce --batch-center --config ../config.yaml
# Repeat for other models and datasets

For questions, issues, or feature requests, please contact: wanglinting@amss.ac.cn
