Tools to use and expand the capabilities of the original GenePT. This repository contains utilities and notebooks for working with gene embeddings and single-cell RNA sequencing data.
This project builds upon the GenePT paper and provides tools to:
- Compare different embedding approaches (GenePT vs scGPT)
- Work with large single-cell datasets like Tabula Sapiens
- Generate composable embeddings across different dimensions
- Perform cell type classification using embeddings
The following image summarizes the results so far of the zero-shot classification comparison between GenePT and scGPT (we used a Google Sheet to format the output):
See SETUP_GUIDE.md for more detail on getting up and running.
- Python 3.10 (required for scGPT compatibility)
- Standard scientific Python packages (pandas, numpy, scikit-learn)
- Special dependencies:
  - scGPT
  - AnnData
  - Hugging Face datasets/models
We use uv for package management. Install it using the instructions at https://docs.astral.sh/uv/getting-started/installation/
Then:
```bash
# Install all dependencies, including dev dependencies (default behavior)
uv sync

# Or install without dev dependencies
uv sync --no-dev

# Format code
scripts/format.sh
uv run isort --gitignore .

# Run tests
uv run pytest
```

The `scripts/train_cellxgene_mlp.py` script trains an MLP classifier on CellXGene embeddings for cell type classification. It supports both standard training and hyperparameter optimization using Optuna.
```bash
# Train with local data
python scripts/train_cellxgene_mlp.py \
    --local-data-dir data/cellxgene_embeddings/training_v1_shuffled \
    --test-data-dir data/cellxgene_embeddings/test_v1 \
    --cell-types-file cell_types_filtered.csv \
    --epochs 10 \
    --batch-size 1024 \
    --learning-rate 4.366e-5 \
    --device cuda \
    --wandb-project cellxgene-mlp

# Stream directly from S3 (requires AWS credentials)
python scripts/train_cellxgene_mlp.py \
    --s3-bucket pythiomicsdata \
    --s3-prefix cellxgene_v2/training_v1_shuffled \
    --test-data-dir data/cellxgene_embeddings/test_v1 \
    --download-if-missing \
    --epochs 10 \
    --device cuda

# Resume from checkpoint
python scripts/train_cellxgene_mlp.py \
    --resume-from checkpoints/checkpoint_epoch5_batch1000.pt \
    --local-data-dir data/cellxgene_embeddings/training_v1_shuffled \
    --test-data-dir data/cellxgene_embeddings/test_v1
```

The training script supports automatic hyperparameter optimization using Optuna. Simply provide a YAML configuration file with `--tuning-config`:
```bash
# Quick tuning with example config (10 trials)
python scripts/train_cellxgene_mlp.py \
    --tuning-config specs/examples/tuning_quick.yaml \
    --local-data-dir data/cellxgene_embeddings/training_v1_shuffled \
    --test-data-dir data/cellxgene_embeddings/test_v1 \
    --wandb-project cellxgene-tuning

# Full hyperparameter search (100 trials)
python scripts/train_cellxgene_mlp.py \
    --tuning-config specs/examples/tuning_full.yaml \
    --tuning-n-trials 100 \
    --local-data-dir data/cellxgene_embeddings/training_v1_shuffled \
    --test-data-dir data/cellxgene_embeddings/test_v1

# Resume previous tuning study
python scripts/train_cellxgene_mlp.py \
    --tuning-config specs/examples/tuning_full.yaml \
    --tuning-storage sqlite:///optuna_study.db \
    --tuning-n-trials 50  # Run 50 more trials
```

Quick Search (specs/examples/tuning_quick.yaml):
- 10-20 trials for rapid exploration
- Tests key hyperparameters: learning rate, dropout, hidden layers
- 2 epochs per trial for quick evaluation
- Automatically retries failed trials to reach target count
- Good for initial exploration
Full Search (specs/examples/tuning_full.yaml):
- Comprehensive hyperparameter space
- Includes optimizer choice, learning rate scheduling
- Warm-starts from best known configurations
- Suitable for production model optimization
Custom Tuning: Create your own YAML config following this structure:
```yaml
optuna:
  study_name: "my_study"
  direction: "minimize"
  metric_to_optimize: "val_loss"
  n_trials: 30
  n_epochs_per_trial: 2

hyperparameters:
  # Parameters to optimize
  dropout:
    type: "float"
    low: 0.0
    high: 0.5
  batch_size:
    type: "categorical"
    choices: [512, 1024, 2048]

fixed_params:
  # Parameters with fixed values (not optimized)
  learning_rate: 1e-4  # Fixed learning rate
  n_hidden_layers: 3   # Fixed architecture
  epochs: 10           # For final training
  device: "cuda"
```

Important Notes:
- Any parameter you want to keep fixed should go in `fixed_params`, not `hyperparameters`
- Parameters in `hyperparameters` will be optimized, while those in `fixed_params` remain constant across all trials
- Command-line arguments override config values: if you specify a parameter both in the config file and on the command line, the command-line value takes precedence for certain parameters (data paths, checkpoints, wandb settings)
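For intuition, here is a rough sketch (not the script's actual implementation) of how a `hyperparameters` block like the one above typically maps onto Optuna's suggest API, assuming the config is loaded with PyYAML:

```python
# Illustrative sketch only: maps a config like the example above onto Optuna's
# suggest API. The training script's real implementation may differ.
import optuna
import yaml

def suggest_from_config(trial: optuna.Trial, hyperparameters: dict) -> dict:
    """Turn each config entry into the matching trial.suggest_* call."""
    params = {}
    for name, spec in hyperparameters.items():
        if spec["type"] == "float":
            params[name] = trial.suggest_float(name, spec["low"], spec["high"])
        elif spec["type"] == "categorical":
            params[name] = trial.suggest_categorical(name, spec["choices"])
        else:
            raise ValueError(f"Unsupported parameter type: {spec['type']}")
    return params

with open("specs/examples/tuning_quick.yaml") as f:
    config = yaml.safe_load(f)

def objective(trial: optuna.Trial) -> float:
    # Fixed parameters pass through unchanged; tuned ones come from the trial.
    params = {**config.get("fixed_params", {}),
              **suggest_from_config(trial, config["hyperparameters"])}
    # Placeholder: the real objective trains the MLP with `params` and returns
    # the metric named by `metric_to_optimize` (e.g. validation loss).
    return 0.0

study = optuna.create_study(direction=config["optuna"]["direction"])
study.optimize(objective, n_trials=config["optuna"]["n_trials"])
```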
Training Parameters:
- `--local-data-dir`: Directory with pre-shuffled training parquet files
- `--test-data-dir`: Directory with validation data
- `--cell-types-file`: CSV file mapping cell type names to codes
- `--epochs`: Number of training epochs (default: 10)
- `--batch-size`: Batch size for training (default: 1024)
- `--learning-rate`: Learning rate (default: 4.366e-05)
- `--n-hidden-layers`: Number of hidden layers (default: 3)
- `--dropout`: Dropout rate (default: 0.053)
- `--checkpoint-dir`: Where to save model checkpoints
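The architecture flags above describe a plain feed-forward network. As a rough illustration only (the hidden width, embedding dimension, and class count below are made-up values, not taken from the script), such a classifier could look like this in PyTorch:

```python
# Illustrative only: a simple MLP of the kind the training flags describe.
# Hidden width (512) and the embedding/class dimensions are placeholder values.
import torch
import torch.nn as nn

class CellTypeMLP(nn.Module):
    def __init__(self, embedding_dim: int, n_classes: int,
                 n_hidden_layers: int = 3, hidden_dim: int = 512,
                 dropout: float = 0.053):
        super().__init__()
        layers: list[nn.Module] = []
        in_dim = embedding_dim
        for _ in range(n_hidden_layers):
            layers += [nn.Linear(in_dim, hidden_dim), nn.ReLU(), nn.Dropout(dropout)]
            in_dim = hidden_dim
        layers.append(nn.Linear(in_dim, n_classes))  # logits over cell types
        self.net = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# Example forward pass on a batch of cell embeddings
model = CellTypeMLP(embedding_dim=1536, n_classes=100)
logits = model(torch.randn(1024, 1536))  # shape: (batch_size, n_classes)
```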
Hyperparameter Tuning:
- `--tuning-config`: Path to YAML configuration file (enables tuning mode)
- `--tuning-n-trials`: Override number of trials from config
- `--tuning-timeout`: Maximum time in seconds for tuning
- `--tuning-storage`: Database URL for study persistence (e.g., `sqlite:///study.db`)
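The `--tuning-storage` URL is a standard Optuna storage string; reopening a persisted study to run more trials looks roughly like the following (the study name here is a placeholder):

```python
# Sketch of reopening a persisted Optuna study; the study name is a placeholder.
import optuna

study = optuna.create_study(
    study_name="cellxgene_mlp_tuning",
    storage="sqlite:///optuna_study.db",  # same URL passed via --tuning-storage
    direction="minimize",
    load_if_exists=True,  # keep previously completed trials instead of starting over
)
print(f"{len(study.trials)} trials already recorded")
```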
Monitoring:
- `--wandb-project`: Weights & Biases project for experiment tracking
- `--wandb-entity`: W&B team/organization
- `--enable-hierarchical-metrics`: Use Cell Ontology for hierarchical evaluation
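These flags feed a standard Weights & Biases run; in generic wandb terms (the project and entity names below are placeholders), that corresponds to something like:

```python
# Generic Weights & Biases usage corresponding to the --wandb-* flags; names are placeholders.
import wandb

run = wandb.init(project="cellxgene-mlp", entity="my-team",
                 config={"learning_rate": 4.366e-5, "batch_size": 1024})
wandb.log({"epoch": 1, "val_loss": 0.42})  # example of a per-evaluation log call
run.finish()
```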
For full parameter documentation:
```bash
python scripts/train_cellxgene_mlp.py --help
```

```
GenePT-tools/
├── src/        # utility functions
└── notebooks/  # analysis notebooks
```
Take a look at `generate_genept_embeddings.ipynb` to see how to generate the GenePT embeddings and dataset and upload them to the HuggingFace Hub. `create_hf_repos.ipynb` will create a new repository for the embeddings and dataset.
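For orientation, here is a generic sketch of pushing an embeddings table to the HuggingFace Hub with the `datasets` library (the file path and repo id are placeholders, not the notebook's exact code):

```python
# Generic sketch of uploading embeddings to the HuggingFace Hub; the parquet
# path and repo id below are placeholders, not the notebook's actual values.
import pandas as pd
from datasets import Dataset

embeddings_df = pd.read_parquet("data/genept_embeddings.parquet")
dataset = Dataset.from_pandas(embeddings_df)
dataset.push_to_hub("my-org/genept-embeddings")  # requires a prior `huggingface-cli login`
```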
Take a look at tabula_sapiens_*.ipynb for a comparison of cell type classification using GenePT and scGPT embeddings.
| Notebook | Description |
|---|---|
| `generate_genept_embeddings.ipynb` | Generates the GenePT embeddings and dataset for upload to HuggingFace Hub |
| `tabula_sapiens_embed_genept.ipynb` | Embeds a subset of the Tabula Sapiens dataset using GenePT embeddings and evaluates their cell type classification performance |
| `create_hf_repos.ipynb` | Creates the initial HuggingFace repositories for the GenePT embeddings and dataset |
| `tabula_sapiens_eda.ipynb` | Exploratory analysis of the Tabula Sapiens single-cell dataset |
| `tabula_sapiens_embed_scgpt.ipynb` | Embeds a subset of the Tabula Sapiens dataset using scGPT embeddings |
| `tabula_sapiens_analysis_all.ipynb` | A comparison of GenePT and scGPT embeddings for cell type classification on Tabula Sapiens |
- Support for loading and processing large sparse AnnData files
- Integration with Hugging Face datasets
- GenePT original embeddings
- scGPT embeddings
- Composable embeddings across different dimensions:
  - Associated genes
  - Aging-related information
  - Drug interactions
  - Pathways and biological processes
- Cell type classification
- Embedding comparison utilities
- Visualization tools for high-dimensional data
- Exact comparison between scGPT and GenePT embeddings
- Minimum cell count filtering per cell type
- AnnData integration
- Original GenePT embeddings support
- Prompt improvements:
  - Remove aging
  - Add cell type
  - Add tissue type
  - Add dysfunctional cell type
- scGPT with batch tokens
- scGPT with modality tokens
- scGPT with combined batch/modality tokens
- Complete Tabula Sapiens cell embedding
- Cell-document bidirectional lookups
- Cell separation analysis
This is a preliminary repository with work in progress. Code is mostly untested but being actively developed. Contributions and collaborations are welcome.
This project is subject to the terms described in the LICENSE.md file.