Tools to use and expand the capabilities of the original GenePT. This repository contains utilities and notebooks for working with gene embeddings and single-cell RNA sequencing data.
This project builds upon the GenePT paper and provides tools to:
- Compare different embedding approaches (GenePT vs scGPT)
- Work with large single-cell datasets like Tabula Sapiens
- Generate composable embeddings across different dimensions
- Perform cell type classification using embeddings
The following image summarizes the results so far of the zero-shot classification comparison between GenePT and scGPT (we used a Google Sheet to format the output):
See SETUP_GUIDE.md for more detail on getting up and running.
- Python 3.10 (required for scGPT compatibility)
- Standard scientific Python packages (pandas, numpy, scikit-learn)
- Special dependencies:
  - scGPT
  - AnnData
  - Hugging Face datasets/models
We use uv for package management. Install it using the instructions at https://docs.astral.sh/uv/getting-started/installation/
Then:
```bash
# Install all dependencies, including dev dependencies (default behavior)
uv sync

# Or install without dev dependencies
uv sync --no-dev

# Format code
scripts/format.sh
uv run isort --gitignore .

# Run tests
uv run pytest
```

The `scripts/train_cellxgene_mlp.py` script trains an MLP classifier on CellXGene embeddings for cell type classification. It supports both standard training and hyperparameter optimization using Optuna.
```bash
# Train with local data
python scripts/train_cellxgene_mlp.py \
    --local-data-dir data/cellxgene_embeddings/training_v1_shuffled \
    --test-data-dir data/cellxgene_embeddings/test_v1 \
    --cell-types-file cell_types_filtered.csv \
    --epochs 10 \
    --batch-size 1024 \
    --learning-rate 4.366e-5 \
    --device cuda \
    --wandb-project cellxgene-mlp

# Stream directly from S3 (requires AWS credentials)
python scripts/train_cellxgene_mlp.py \
    --s3-bucket pythiomicsdata \
    --s3-prefix cellxgene_v2/training_v1_shuffled \
    --test-data-dir data/cellxgene_embeddings/test_v1 \
    --download-if-missing \
    --epochs 10 \
    --device cuda

# Resume from checkpoint
python scripts/train_cellxgene_mlp.py \
    --resume-from checkpoints/checkpoint_epoch5_batch1000.pt \
    --local-data-dir data/cellxgene_embeddings/training_v1_shuffled \
    --test-data-dir data/cellxgene_embeddings/test_v1
```

The training script supports automatic hyperparameter optimization using Optuna. Simply provide a YAML configuration file with `--tuning-config`:
```bash
# Quick tuning with example config (10 trials)
python scripts/train_cellxgene_mlp.py \
    --tuning-config specs/examples/tuning_quick.yaml \
    --local-data-dir data/cellxgene_embeddings/training_v1_shuffled \
    --test-data-dir data/cellxgene_embeddings/test_v1 \
    --wandb-project cellxgene-tuning

# Full hyperparameter search (100 trials)
python scripts/train_cellxgene_mlp.py \
    --tuning-config specs/examples/tuning_full.yaml \
    --tuning-n-trials 100 \
    --local-data-dir data/cellxgene_embeddings/training_v1_shuffled \
    --test-data-dir data/cellxgene_embeddings/test_v1

# Resume previous tuning study
python scripts/train_cellxgene_mlp.py \
    --tuning-config specs/examples/tuning_full.yaml \
    --tuning-storage sqlite:///optuna_study.db \
    --tuning-n-trials 50  # Run 50 more trials
```

Quick Search (specs/examples/tuning_quick.yaml):
- 10-20 trials for rapid exploration
- Tests key hyperparameters: learning rate, dropout, hidden layers
- 2 epochs per trial for quick evaluation
- Automatically retries failed trials to reach target count
- Good for initial exploration
Full Search (specs/examples/tuning_full.yaml):
- Comprehensive hyperparameter space
- Includes optimizer choice, learning rate scheduling
- Warm-starts from best known configurations
- Suitable for production model optimization
Custom Tuning: Create your own YAML config following this structure:
```yaml
optuna:
  study_name: "my_study"
  direction: "minimize"
  metric_to_optimize: "val_loss"
  n_trials: 30
  n_epochs_per_trial: 2

hyperparameters:
  # Parameters to optimize
  dropout:
    type: "float"
    low: 0.0
    high: 0.5
  batch_size:
    type: "categorical"
    choices: [512, 1024, 2048]

fixed_params:
  # Parameters with fixed values (not optimized)
  learning_rate: 1e-4  # Fixed learning rate
  n_hidden_layers: 3   # Fixed architecture
  epochs: 10           # For final training
  device: "cuda"
```

Important Notes:
- Any parameter you want to keep fixed should go in `fixed_params`, not `hyperparameters`
- Parameters in `hyperparameters` will be optimized, while those in `fixed_params` remain constant across all trials
- Command-line arguments override config values: if you specify a parameter both in the config file and on the command line, the command-line value takes precedence for certain parameters (data paths, checkpoints, wandb settings)
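For intuition, here is a rough sketch (not the script's actual implementation) of how a `hyperparameters` block like the one above typically maps onto Optuna's suggest API, assuming the config is loaded with PyYAML:

```python
# Illustrative sketch only: maps a config like the example above onto Optuna's
# suggest API. The training script's real implementation may differ.
import optuna
import yaml

def suggest_from_config(trial: optuna.Trial, hyperparameters: dict) -> dict:
    """Turn each config entry into the matching trial.suggest_* call."""
    params = {}
    for name, spec in hyperparameters.items():
        if spec["type"] == "float":
            params[name] = trial.suggest_float(name, spec["low"], spec["high"])
        elif spec["type"] == "categorical":
            params[name] = trial.suggest_categorical(name, spec["choices"])
        else:
            raise ValueError(f"Unsupported parameter type: {spec['type']}")
    return params

with open("specs/examples/tuning_quick.yaml") as f:
    config = yaml.safe_load(f)

def objective(trial: optuna.Trial) -> float:
    # Fixed parameters pass through unchanged; tuned ones come from the trial.
    params = {**config.get("fixed_params", {}),
              **suggest_from_config(trial, config["hyperparameters"])}
    # Placeholder: the real objective trains the MLP with `params` and returns
    # the metric named by `metric_to_optimize` (e.g. validation loss).
    return 0.0

study = optuna.create_study(direction=config["optuna"]["direction"])
study.optimize(objective, n_trials=config["optuna"]["n_trials"])
```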
Training Parameters:
- `--local-data-dir`: Directory with pre-shuffled training parquet files
- `--test-data-dir`: Directory with validation data
- `--cell-types-file`: CSV file mapping cell type names to codes
- `--epochs`: Number of training epochs (default: 10)
- `--batch-size`: Batch size for training (default: 1024)
- `--learning-rate`: Learning rate (default: 4.366e-05)
- `--n-hidden-layers`: Number of hidden layers (default: 3)
- `--dropout`: Dropout rate (default: 0.053)
- `--checkpoint-dir`: Where to save model checkpoints
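The architecture flags above describe a plain feed-forward network. As a rough illustration only (the hidden width, embedding dimension, and class count below are made-up values, not taken from the script), such a classifier could look like this in PyTorch:

```python
# Illustrative only: a simple MLP of the kind the training flags describe.
# Hidden width (512) and the embedding/class dimensions are placeholder values.
import torch
import torch.nn as nn

class CellTypeMLP(nn.Module):
    def __init__(self, embedding_dim: int, n_classes: int,
                 n_hidden_layers: int = 3, hidden_dim: int = 512,
                 dropout: float = 0.053):
        super().__init__()
        layers: list[nn.Module] = []
        in_dim = embedding_dim
        for _ in range(n_hidden_layers):
            layers += [nn.Linear(in_dim, hidden_dim), nn.ReLU(), nn.Dropout(dropout)]
            in_dim = hidden_dim
        layers.append(nn.Linear(in_dim, n_classes))  # logits over cell types
        self.net = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# Example forward pass on a batch of cell embeddings
model = CellTypeMLP(embedding_dim=1536, n_classes=100)
logits = model(torch.randn(1024, 1536))  # shape: (batch_size, n_classes)
```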
Hyperparameter Tuning:
- `--tuning-config`: Path to YAML configuration file (enables tuning mode)
- `--tuning-n-trials`: Override number of trials from config
- `--tuning-timeout`: Maximum time in seconds for tuning
- `--tuning-storage`: Database URL for study persistence (e.g., `sqlite:///study.db`)
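The `--tuning-storage` URL is a standard Optuna storage string; reopening a persisted study to run more trials looks roughly like the following (the study name here is a placeholder):

```python
# Sketch of reopening a persisted Optuna study; the study name is a placeholder.
import optuna

study = optuna.create_study(
    study_name="cellxgene_mlp_tuning",
    storage="sqlite:///optuna_study.db",  # same URL passed via --tuning-storage
    direction="minimize",
    load_if_exists=True,  # keep previously completed trials instead of starting over
)
print(f"{len(study.trials)} trials already recorded")
```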
Monitoring:
- `--wandb-project`: Weights & Biases project for experiment tracking
- `--wandb-entity`: W&B team/organization
- `--enable-hierarchical-metrics`: Use Cell Ontology for hierarchical evaluation
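These flags feed a standard Weights & Biases run; in generic wandb terms (the project and entity names below are placeholders), that corresponds to something like:

```python
# Generic Weights & Biases usage corresponding to the --wandb-* flags; names are placeholders.
import wandb

run = wandb.init(project="cellxgene-mlp", entity="my-team",
                 config={"learning_rate": 4.366e-5, "batch_size": 1024})
wandb.log({"epoch": 1, "val_loss": 0.42})  # example of a per-evaluation log call
run.finish()
```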
For full parameter documentation:
```bash
python scripts/train_cellxgene_mlp.py --help
```

```
GenePT-tools/
├── src/        # utility functions
└── notebooks/  # analysis notebooks
```
Take a look at `generate_genept_embeddings.ipynb` to see how to generate the GenePT embeddings and dataset and upload them to the HuggingFace Hub. `create_hf_repos.ipynb` will create a new repository for the embeddings and dataset.
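For orientation, here is a generic sketch of pushing an embeddings table to the HuggingFace Hub with the `datasets` library (the file path and repo id are placeholders, not the notebook's exact code):

```python
# Generic sketch of uploading embeddings to the HuggingFace Hub; the parquet
# path and repo id below are placeholders, not the notebook's actual values.
import pandas as pd
from datasets import Dataset

embeddings_df = pd.read_parquet("data/genept_embeddings.parquet")
dataset = Dataset.from_pandas(embeddings_df)
dataset.push_to_hub("my-org/genept-embeddings")  # requires a prior `huggingface-cli login`
```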
Take a look at tabula_sapiens_*.ipynb for a comparison of cell type classification using GenePT and scGPT embeddings.
| Notebook | Description |
|---|---|
| `generate_genept_embeddings.ipynb` | Generates the GenePT embeddings and dataset for upload to HuggingFace Hub |
| `tabula_sapiens_embed_genept.ipynb` | Embeds a subset of the Tabula Sapiens dataset using GenePT embeddings and evaluates their cell type classification performance |
| `create_hf_repos.ipynb` | Creates the initial HuggingFace repositories for the GenePT embeddings and dataset |
| `tabula_sapiens_eda.ipynb` | Exploratory analysis of the Tabula Sapiens single-cell dataset |
| `tabula_sapiens_embed_scgpt.ipynb` | Embeds a subset of the Tabula Sapiens dataset using scGPT embeddings |
| `tabula_sapiens_analysis_all.ipynb` | A comparison of GenePT and scGPT embeddings for cell type classification on Tabula Sapiens |
- Support for loading and processing large sparse AnnData files
- Integration with Hugging Face datasets
- GenePT original embeddings
- scGPT embeddings
- Composable embeddings across different dimensions:
  - Associated genes
  - Aging-related information
  - Drug interactions
  - Pathways and biological processes
- Cell type classification
- Embedding comparison utilities
- Visualization tools for high-dimensional data
- Exact comparison between scGPT and GenePT embeddings
- Minimum cell count filtering per cell type
- AnnData integration
- Original GenePT embeddings support
- Prompt improvements:
  - Remove aging
  - Add cell type
  - Add tissue type
  - Add dysfunctional cell type
- scGPT with batch tokens
- scGPT with modality tokens
- scGPT with combined batch/modality tokens
- Complete Tabula Sapiens cell embedding
- Cell-document bidirectional lookups
- Cell separation analysis
This is a preliminary repository with work in progress. Code is mostly untested but being actively developed. Contributions and collaborations are welcome.
This project is subject to the terms described in the LICENSE.md file.