BSVAE is a PyTorch package for Structured Factor Variational Autoencoders (StructuredFactorVAE).
It is designed for gene expression modeling with biological priors, integrating protein–protein interaction (PPI) networks and sparsity constraints for interpretable latent representations.
- Structured VAE architecture (StructuredFactorVAE)
  - Factorized encoder/decoder with group sparsity
  - Optional Laplacian regularization from PPI networks
- Dataset utilities
  - Load gene expression matrices (GeneExpression)
  - Support for CSV full-matrix mode or pre-split train/test mode
- Biological priors
  - Fetch and cache STRING PPI networks by NCBI TaxID
  - Map gene symbols / Ensembl IDs to protein IDs using MyGene.info or BioMart
- Training and evaluation
  - Unified training loop (Trainer)
  - Evaluation with reconstruction, KL, sparsity, and Laplacian penalties (Evaluator)
- Reproducibility
  - Save/load models + metadata (modelIO)
  - Configurable hyperparameters via hyperparam.ini
- Post-training analysis
  - Gene–gene network extraction via decoder similarity, propagated covariance, Graphical Lasso, and Laplacian refinement
  - Latent export (mu, logvar) to CSV or AnnData for downstream workflows
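To make the Laplacian regularization item above concrete, here is a minimal pure-Python sketch of one common form of the penalty, tr(WᵀLW), where W holds one row of decoder loadings per gene and L = D − A is the graph Laplacian of the PPI network. Whether BSVAE uses exactly this form is an assumption, and the function names and toy values are hypothetical.

```python
def laplacian_penalty(loadings, edges):
    """Edge form of the penalty: sum of w_ij * ||W_i - W_j||^2 over undirected edges."""
    total = 0.0
    for i, j, weight in edges:
        diff_sq = sum((a - b) ** 2 for a, b in zip(loadings[i], loadings[j]))
        total += weight * diff_sq
    return total


def laplacian_matrix(n_genes, edges):
    """Dense graph Laplacian L = D - A built from weighted undirected edges."""
    lap = [[0.0] * n_genes for _ in range(n_genes)]
    for i, j, weight in edges:
        lap[i][i] += weight
        lap[j][j] += weight
        lap[i][j] -= weight
        lap[j][i] -= weight
    return lap


def trace_penalty(loadings, lap):
    """Matrix form of the same penalty: tr(W^T L W)."""
    n_genes = len(loadings)
    n_latent = len(loadings[0])
    return sum(
        loadings[i][k] * lap[i][j] * loadings[j][k]
        for k in range(n_latent)
        for i in range(n_genes)
        for j in range(n_genes)
    )


# Toy example (hypothetical values): 3 genes, 2 latent factors, 2 PPI edges.
loadings = [[1.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
edges = [(0, 1, 1.0), (1, 2, 0.5)]

edge_value = laplacian_penalty(loadings, edges)
trace_value = trace_penalty(loadings, laplacian_matrix(3, edges))
```

Both forms give the same value: connected genes with similar loadings contribute little, while connected genes with different loadings are penalized.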
Install from PyPI:
pip install bsvae

Or install from source with Poetry:
git clone https://github.com/YOUR-LAB/BSVAE.git
cd BSVAE
poetry install

Dependencies:
- Python 3.11+
- PyTorch ≥ 2.8
- pandas, numpy, scikit-learn
- networkx, scipy
- mygene (for gene annotation)
BSVAE expects genes × samples CSVs.
- Full-matrix mode: Provide expr.csv with all samples → a 10-fold CV split is created.
- Pre-split mode: Provide a directory with X_train.csv and X_test.csv.
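As an illustration of full-matrix mode, the sketch below splits the sample columns of a genes × samples CSV into cross-validation folds. It assumes the first column holds gene IDs and the header row holds sample IDs; the shuffling, the seed, and the function name are hypothetical, and BSVAE's actual split may differ.

```python
import csv
import io
import random


def kfold_sample_splits(sample_ids, n_folds=10, seed=42):
    """Yield (train_ids, test_ids) pairs; every sample is held out exactly once."""
    if n_folds < 2 or n_folds > len(sample_ids):
        raise ValueError("n_folds must be between 2 and the number of samples")
    shuffled = list(sample_ids)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled samples round-robin into n_folds groups.
    folds = [shuffled[k::n_folds] for k in range(n_folds)]
    for k in range(n_folds):
        test_ids = folds[k]
        train_ids = [s for m, fold in enumerate(folds) if m != k for s in fold]
        yield train_ids, test_ids


# Toy genes x samples matrix (hypothetical values).
csv_text = (
    "gene,s1,s2,s3,s4,s5\n"
    "TP53,1.0,2.0,3.0,4.0,5.0\n"
    "BRCA1,0.5,0.4,0.3,0.2,0.1\n"
)
header = next(csv.reader(io.StringIO(csv_text)))
sample_ids = header[1:]  # samples are columns, so the split is over columns

splits = list(kfold_sample_splits(sample_ids, n_folds=5))
```

With 5 samples and 5 folds, each fold holds out one sample; with the default of 10 folds, at least 10 samples are required.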
bsvae-train exp1 \
--gene-expression-filename data/expr.csv \
--epochs 50 \
--latent-dim 10 \
--ppi-taxid 9606

- Results (checkpoints, logs, metadata) are saved under results/exp1/.
bsvae-train exp1 \
--gene-expression-filename data/expr.csv \
--is-eval-only

bsvae-networks extract-networks \
--model-path results/exp1 \
--dataset data/expr.csv \
--output-dir results/exp1/networks
# optional: --methods latent_cov graphical_lasso laplacian
bsvae-networks export-latents \
--model-path results/exp1 \
--dataset data/expr.csv \
--output results/exp1/latents.h5ad

The extractor writes sparse NPZ adjacency matrices and Parquet edge lists by default (--sparse and --compress are on unless disabled with --no-sparse/--no-compress). When --threshold is 0 with sparse output, an adaptive threshold is computed from --target-sparsity (default 0.01, i.e. the top 1% of edges). Use --quantize (default int8) to reduce adjacency size. Legacy dense CSV/TSV/NPY outputs are available only with --no-sparse or explicit legacy formats. By default the decoder-loading cosine similarity (w_similarity) is computed; add other methods with --methods. Latent exports include per-sample mu and logvar as tidy CSV or AnnData files.
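The default method and the adaptive threshold described above can be sketched in a few lines of pure Python: compute the cosine similarity between per-gene decoder loadings, then pick the cutoff that keeps roughly the top fraction of gene pairs. Ranking by absolute similarity, the tie handling, and all function names here are assumptions for illustration, not BSVAE's actual implementation.

```python
import math


def cosine_similarity_matrix(loadings):
    """Pairwise cosine similarity between rows (one row of loadings per gene)."""
    norms = [math.sqrt(sum(v * v for v in row)) for row in loadings]
    n = len(loadings)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if norms[i] > 0 and norms[j] > 0:
                dot = sum(a * b for a, b in zip(loadings[i], loadings[j]))
                sim[i][j] = dot / (norms[i] * norms[j])
    return sim


def adaptive_threshold(sim, target_sparsity=0.01):
    """Cutoff on |similarity| that keeps about target_sparsity of all gene pairs."""
    n = len(sim)
    weights = sorted(
        (abs(sim[i][j]) for i in range(n) for j in range(i + 1, n)),
        reverse=True,
    )
    if not weights:
        return 0.0
    keep = max(1, math.ceil(target_sparsity * len(weights)))
    return weights[keep - 1]


def edges_above(sim, threshold):
    """Upper-triangle edge list (i, j, weight) with |weight| >= threshold."""
    n = len(sim)
    return [
        (i, j, sim[i][j])
        for i in range(n)
        for j in range(i + 1, n)
        if abs(sim[i][j]) >= threshold
    ]


# Toy loadings (hypothetical values): 4 genes, 2 latent factors.
loadings = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [-1.0, 0.0]]
sim = cosine_similarity_matrix(loadings)
cutoff = adaptive_threshold(sim, target_sparsity=0.5)  # keep top half of 6 pairs
edges = edges_above(sim, cutoff)
```

In this toy case the three strongest pairs are kept, including the strongly negative ones, because the ranking uses absolute similarity.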
Hyperparameters can be set via hyperparam.ini:
[Custom]
seed = 42
no_cuda = False
epochs = 100
batch_size = 64
latent_dim = 10
hidden_dims = [128, 64]
dropout = 0.1
l1_strength = 1e-3
lap_strength = 1e-4

Override from CLI if needed:
bsvae-train my_experiment --epochs 50 --latent-dim 20

BSVAE supports automatic download & caching of STRING v12.0 PPI networks.
- Supported species (via NCBI TaxID):
  - Human (9606)
  - Mouse (10090)
  - Rat (10116)
  - Fly (7227)
- Cache location defaults to ~/.bsvae/ppi (override via --ppi-cache).
Use the lightweight downloader to cache a STRING network ahead of training:
bsvae-download-ppi --taxid 9606 --cache-dir ~/.bsvae/ppi

Some clusters block HTTPS certificate resolution for outbound downloads. If bsvae-download-ppi cannot reach STRING, manually cache the file with wget (or curl) using --no-check-certificate and point --ppi-cache to the same directory:
OUTDIR="$HOME/.bsvae/ppi"
mkdir -p "${OUTDIR}"
wget --no-check-certificate \
"https://stringdb-static.org/download/protein.links.detailed.v12.0/9606.protein.links.detailed.v12.0.txt.gz" \
-O "${OUTDIR}/9606_string.txt.gz"

Use curl -k -L "<url>" -o "${OUTDIR}/9606_string.txt.gz" if wget is unavailable.
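To sanity-check a manually cached file, a short script can read it back and count confident interactions. The sketch below assumes the STRING links format: gzip-compressed, whitespace-separated, with a header row that includes a combined_score column on a 0 to 1000 scale. The function name, the cutoff of 700, and the toy protein IDs are hypothetical and not part of BSVAE.

```python
import gzip
import io


def read_string_links(path_or_file, min_score=700):
    """Parse a gzipped STRING links file into (protein1, protein2, score) tuples."""
    edges = []
    with gzip.open(path_or_file, "rt") as handle:
        header = handle.readline().split()
        score_col = header.index("combined_score")
        for line in handle:
            fields = line.split()
            if not fields:
                continue  # skip blank lines
            score = int(fields[score_col])
            if score >= min_score:
                edges.append((fields[0], fields[1], score))
    return edges


# Toy in-memory file (made-up protein IDs and scores) standing in for
# ~/.bsvae/ppi/9606_string.txt.gz.
sample = (
    "protein1 protein2 neighborhood fusion cooccurence coexpression "
    "experimental database textmining combined_score\n"
    "9606.A 9606.B 0 0 0 300 800 0 200 900\n"
    "9606.A 9606.C 0 0 0 0 100 0 80 150\n"
    "9606.B 9606.C 0 0 0 500 400 0 100 700\n"
)
buffer = io.BytesIO(gzip.compress(sample.encode("utf-8")))
edges = read_string_links(buffer)
```

For a real cached file, pass its path instead of the in-memory buffer, for example read_string_links(os.path.expanduser("~/.bsvae/ppi/9606_string.txt.gz")) after importing os.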
- The bsvae-networks workflows reuse the same gene ordering as training. When loading a standalone expression file, ensure columns correspond to the genes seen by the checkpoint.
- The CLI automatically handles CPU/GPU placement based on availability; models are loaded in evaluation mode without modifying training metadata.
- Network extraction functions are written to be test-friendly: they accept PyTorch DataLoader instances, operate without global state, and persist outputs as CSV/TSV/NPY for interoperability with graph toolchains.
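The gene-ordering note above can be checked before running extraction. The following minimal sketch reorders a standalone expression table to a training gene list and fails loudly on missing genes; the dictionary layout, the function name, and the idea of dropping extra genes are assumptions for illustration, not BSVAE's own API.

```python
def align_to_training_genes(expression, training_genes):
    """Return per-gene value lists in the checkpoint's gene order.

    expression: dict mapping gene ID -> list of per-sample values.
    training_genes: gene IDs in the order seen during training.
    """
    missing = [gene for gene in training_genes if gene not in expression]
    if missing:
        raise KeyError(f"{len(missing)} training genes missing, e.g. {missing[:3]}")
    # Genes absent from training_genes are silently dropped.
    return [expression[gene] for gene in training_genes]


# Toy data (hypothetical): the standalone file lists genes in a different order.
training_genes = ["TP53", "BRCA1", "EGFR"]
expression = {
    "EGFR": [0.1, 0.2],
    "TP53": [1.0, 2.0],
    "MYC": [9.0, 9.0],  # not seen in training, dropped
    "BRCA1": [0.5, 0.4],
}
aligned = align_to_training_genes(expression, training_genes)
```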
If you use BSVAE in your research, please cite:
@article{Benjamin2025bsvae,
title={Structured Factor Variational Autoencoder with Biological Priors},
author={Kynon J. M. Benjamin},
year={2025},
journal={N/A}
}
This project is licensed under the GNU General Public License v3.0.