GeneVector

GeneVector is a Python library for single-cell RNA sequencing analysis that learns distributed gene representations using a neural embedding approach. It enables gene co-expression analysis, cell type annotation, and metagene discovery through vector arithmetic operations.

Key Features

Gene Embeddings: Learn distributed representations of genes based on co-expression patterns
Cell Type Annotation: Automated cell type assignment using marker gene sets
Metagene Discovery: Identify functionally related gene modules through clustering
Vector Arithmetic: Perform gene relationship analysis using vector operations
Batch Correction: Fast batch effect correction for multi-sample datasets
Pluggable Targets: Train on mutual information, Pearson, Spearman, Jaccard, cosine, or custom metrics
High Performance: Numba JIT, Rust native, and CUDA GPU backends for MI computation
Smart Caching: Computed scores cached to disk, instant reload on re-runs

Installation

From PyPI (Recommended)

pip install genevector

With Numba Acceleration (Recommended)

pip install genevector[fast]

From Source

git clone https://github.com/nceglia/genevector.git
cd genevector
pip install -e .

# with numba support
pip install -e ".[fast]"

# with rust backend (requires Rust toolchain)
pip install maturin
maturin develop --release

Dependencies

Python ≥ 3.9
PyTorch
Scanpy
NumPy, SciPy, Pandas
Matplotlib, Seaborn

Optional: numba for JIT-accelerated MI computation (strongly recommended).

Quick Start

import scanpy as sc
from genevector.data import GeneVectorDataset
from genevector.model import GeneVector
from genevector.embedding import GeneEmbedding, CellEmbedding

# Load your single-cell data
adata = sc.read_h5ad("your_data.h5ad")

# Create dataset (auto-selects fastest MI backend)
dataset = GeneVectorDataset(adata)

# Train the model
model = GeneVector(dataset, output_file="genes.vec", emb_dimension=100)
model.train(1000, threshold=1e-6)

# Load gene embeddings
gene_embed = GeneEmbedding("genes.vec", dataset, vector="average")

# Generate cell embeddings
cell_embed = CellEmbedding(dataset, gene_embed)
adata_embedded = cell_embed.get_adata()

# Analyze gene similarities
similarities = gene_embed.compute_similarities("CD8A")
gene_embed.plot_similarities("CD8A", n_genes=10)

MI Computation Backends

GeneVector computes pairwise co-expression scores between all genes as the training target. For mutual information (the default target), multiple backends are available with automatic dispatch.

Dispatch Priority

When mi_backend="auto" (the default), GeneVector selects the fastest available backend:

Priority	Backend	Speedup vs Legacy	How It Works	Availability
1	CUDA GPU	~200-1000x	PyTorch scatter-based histograms on GPU	Requires `device="cuda"` and NVIDIA GPU
2	Rust	~100-800x	Native compiled with rayon parallelism	Requires `maturin develop --release` build
3	Numba	~100-500x	JIT-compiled, parallel across CPU cores	Requires `pip install numba`
4	NumPy	~10-30x	Vectorized discretization, Python pair loop	Always available (default fallback)

Selecting a Backend

# Auto-select (recommended): picks the fastest available
dataset = GeneVectorDataset(adata, mi_backend="auto")

# Force a specific backend
dataset = GeneVectorDataset(adata, mi_backend="rust")
dataset = GeneVectorDataset(adata, mi_backend="numba")
dataset = GeneVectorDataset(adata, mi_backend="numpy")
dataset = GeneVectorDataset(adata, mi_backend="gpu", device="cuda")

Check Available Backends

from genevector.metrics import HAS_NUMBA

# Rust availability (if built)
try:
    from genevector._rust import compute_mi_pairs
    print("Rust backend: available")
except ImportError:
    print("Rust backend: not installed")

print(f"Numba backend: {'available' if HAS_NUMBA else 'not installed'}")

import torch
print(f"CUDA backend: {'available' if torch.cuda.is_available() else 'not available'}")

Practical Runtimes (Approximate)

For 2,000 genes × 5,000 cells (~2M gene pairs):

Backend	Time
Legacy (v0.2)	~26 hours
NumPy	~3 hours
Numba	~2-3 minutes
Rust	~1-2 minutes

Scores are cached to ~/.genevector/cache/ after the first run. Subsequent runs with the same data load instantly.

Co-expression Targets

GeneVector can train on different co-expression metrics beyond mutual information:

# Default: signed mutual information
dataset = GeneVectorDataset(adata, target="mi", signed_mi=True)

# Pearson correlation
dataset = GeneVectorDataset(adata, target="pearson")

# Spearman rank correlation
dataset = GeneVectorDataset(adata, target="spearman")

# Jaccard index (binarized co-detection)
dataset = GeneVectorDataset(adata, target="jaccard")

# Cosine similarity between gene expression vectors
dataset = GeneVectorDataset(adata, target="cosine")

# Custom callable
def my_metric(X, gene_names, **kwargs):
    # must return dict[str, dict[str, float]]
    ...

dataset = GeneVectorDataset(adata, target=my_metric)

The mi_backend parameter only applies when target="mi". The matrix-based targets (Pearson, Spearman, Jaccard, cosine) compute in seconds via BLAS regardless of gene count.

Tutorials

See the example/ directory for comprehensive workflows:

PBMC workflow: Identification of interferon stimulated metagene and cell type annotation
TICA workflow: Cell type assignment
SPECTRUM workflow: Vector arithmetic for site specific metagenes
FITNESS workflow: Identifying increasing metagenes in time series

Detailed Usage Guide

1. Data Loading

GeneVector uses Scanpy AnnData objects and requires raw count data in the .X matrix. It's recommended to subset genes using the seurat_v3 flavor in Scanpy or GeneVector's entropy-based quality control.

from genevector.data import GeneVectorDataset
from genevector.model import GeneVector
from genevector.embedding import GeneEmbedding, CellEmbedding
import scanpy as sc

# Load and preprocess data
adata = sc.read_h5ad("your_data.h5ad")

# Option A: Scanpy's variable gene selection
sc.pp.highly_variable_genes(adata, flavor="seurat_v3", n_top_genes=2000)
adata = adata[:, adata.var.highly_variable]

# Option B: GeneVector's entropy-based gene filtering
adata = GeneVectorDataset.quality_control(adata, entropy_threshold=1.0)

# Create dataset
dataset = GeneVectorDataset(adata)

2. Model Training

Creating a GeneVector object computes the co-expression target (MI by default) and prepares training batches. Training time varies by dataset size — a 10k PBMC dataset trains in under 5 minutes. The emb_dimension parameter controls vector size (minimum 50 recommended).

# Initialize model (triggers MI computation + batch preparation)
model = GeneVector(
    dataset,
    output_file="genes.vec",
    emb_dimension=100,
    device="cpu"
)

# Train for 1000 iterations or until convergence
model.train(1000, threshold=1e-6)

# Visualize training progress
model.plot()

3. Gene Embeddings Analysis

Training produces two vector files (input and output weights). Using the average of both weights is recommended for best results.

# Load gene embeddings
gene_embed = GeneEmbedding("genes.vec", dataset, vector="average")

# Compute gene similarities
similarities_df = gene_embed.compute_similarities("CD8A")
gene_embed.plot_similarities("CD8A", n_genes=10)

# Generate metagenes through clustering
gene_adata = gene_embed.get_adata(resolution=40)
metagenes = gene_embed.get_metagenes(gene_adata)

# Visualize specific metagenes
gene_embed.plot_metagene(gene_adata, mg=metagenes[0])

4. Cell Embeddings

Generate cell embeddings using the trained gene vectors. The embeddings are stored in AnnData format with automatic UMAP generation.

# Create cell embeddings
cell_embed = CellEmbedding(dataset, gene_embed)
adata_embedded = cell_embed.get_adata()

# Visualize with Scanpy
sc.pl.umap(adata_embedded)

# Optional: Batch correction
cell_embed.batch_correct(column="sample", reference="control")
adata_corrected = cell_embed.get_adata()

5. Cell Type Assignment

# Define marker genes for cell types
markers = {
    "T Cell": ["CD3D", "CD3G", "CD3E", "TRAC", "IL32", "CD2"],
    "B/Plasma": ["CD79A", "CD79B", "MZB1", "CD19", "BANK1"],
    "Myeloid": ["LYZ", "CST3", "AIF1", "CD68", "C1QA", "C1QB", "C1QC"]
}

# Perform automated cell type assignment
annotated_adata = cell_embed.phenotype_probability(adata_embedded, markers)

# Visualize results
prob_cols = [col for col in annotated_adata.obs.columns if "Pseudo-probability" in col]
sc.pl.umap(annotated_adata, color=prob_cols + ["genevector"], size=25)

6. Caching

Computed co-expression scores are cached automatically. Control this with:

# Enable caching (default)
dataset = GeneVectorDataset(adata, use_cache=True)

# Disable caching (always recompute)
dataset = GeneVectorDataset(adata, use_cache=False)

# Clear the cache
from genevector.cache import clear_cache
clear_cache()

Cache location: ~/.genevector/cache/. Cache keys incorporate the expression matrix, gene list, target function, and all parameters, so different configurations never collide.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Citation

If you use GeneVector in your research, please cite:

Ceglia, N., Sethna, Z., Freeman, S.S. et al. Identification of transcriptional programs using dense vector representations defined by mutual information with GeneVector. Nat Commun 14, 4400 (2023). https://doi.org/10.1038/s41467-023-39985-2

@article{ceglia2023genevector,
  title={Identification of transcriptional programs using dense vector representations defined by mutual information with GeneVector},
  author={Ceglia, Nicholas and Sethna, Zachary and Freeman, Samuel S and others},
  journal={Nature Communications},
  volume={14},
  pages={4400},
  year={2023},
  doi={10.1038/s41467-023-39985-2}
}

Documentation

For detailed documentation and examples, visit: https://genevector.readthedocs.io

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GeneVector

Key Features

Installation

From PyPI (Recommended)

With Numba Acceleration (Recommended)

From Source

Dependencies

Quick Start

MI Computation Backends

Dispatch Priority

Selecting a Backend

Check Available Backends

Practical Runtimes (Approximate)

Co-expression Targets

Tutorials

Detailed Usage Guide

1. Data Loading

2. Model Training

3. Gene Embeddings Analysis

4. Cell Embeddings

5. Cell Type Assignment

6. Caching

Contributing

License

Citation

Documentation

FilesExpand file tree

README.md

Latest commit

History

README.md

File metadata and controls

GeneVector

Key Features

Installation

From PyPI (Recommended)

With Numba Acceleration (Recommended)

From Source

Dependencies

Quick Start

MI Computation Backends

Dispatch Priority

Selecting a Backend

Check Available Backends

Practical Runtimes (Approximate)

Co-expression Targets

Tutorials

Detailed Usage Guide

1. Data Loading

2. Model Training

3. Gene Embeddings Analysis

4. Cell Embeddings

5. Cell Type Assignment

6. Caching

Contributing

License

Citation

Documentation