
# BERGERON: Class-Conditional VAE for Synthetic H&E Tile Embedding Generation

Preprint Link: https://www.biorxiv.org/content/10.64898/2025.12.29.692457v1

BERGERON is a class-conditional Variational Autoencoder (cVAE) designed to generate realistic foundation-model embeddings from H&E tiles.
It enables data augmentation for attention-based multiple instance learning (ABMIL) model training, controlled subtype perturbation, and interpretable latent-space exploration.

## Installation

```bash
# Conda
conda env create -f environment.yml
conda activate BERGERON
```

## Data Format

Each .h5 file represents one WSI or pseudo-bag (generated using CLAM or another H&E processing tool).

Datasets inside each file:

- `features`: `[N_tiles, D]` — embedding vectors (`D` = embedding dimension of the foundation model)
- `coords` (optional): `[N_tiles, 2]` — tile `(x, y)` coordinates
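The layout above can be read with `h5py`. The sketch below round-trips a toy bag; it assumes `h5py` and `numpy` are installed, and the file/dataset names follow the format described above (the `toy_bag.h5` path is illustrative only).

```python
# Sketch: reading tile embeddings from one .h5 bag.
# Assumes h5py/numpy; dataset names "features"/"coords" follow the format above.
import h5py
import numpy as np

def load_bag(path):
    """Return (features, coords) from one WSI/pseudo-bag .h5 file."""
    with h5py.File(path, "r") as f:
        features = f["features"][:]                      # [N_tiles, D]
        coords = f["coords"][:] if "coords" in f else None
    return features, coords

# Create a toy bag to demonstrate the round trip (D=1536 as in the example run):
rng = np.random.default_rng(0)
with h5py.File("toy_bag.h5", "w") as f:
    f.create_dataset("features", data=rng.normal(size=(100, 1536)).astype("float32"))
    f.create_dataset("coords", data=rng.integers(0, 10000, size=(100, 2)))

features, coords = load_bag("toy_bag.h5")
print(features.shape)  # (100, 1536)
```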

Example directory:

```
/data/h5_files/TCGA-XX-0001.h5
/data/h5_files/TCGA-XX-0002.h5
```

### Example label CSV

See `examples/meta_labels.csv`; this can be the same file generated by CLAM.

Contains each slide label (which matches the `.h5` file name minus the file suffix) and an integer class label.
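Parsing this file into a slide → class map is a one-liner with the standard library. The column names `slide_id` and `label` below are an assumption; check the header of `examples/meta_labels.csv` for the actual names.

```python
# Sketch: loading the slide -> integer class-label map.
# Column names "slide_id"/"label" are ASSUMED -- verify against meta_labels.csv.
import csv
import io

toy_csv = "slide_id,label\nTCGA-XX-0001,0\nTCGA-XX-0002,1\n"

labels = {}
for row in csv.DictReader(io.StringIO(toy_csv)):
    # slide_id matches the .h5 file name minus the file suffix
    labels[row["slide_id"]] = int(row["label"])

print(labels)  # {'TCGA-XX-0001': 0, 'TCGA-XX-0002': 1}
```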

### Example train CSV

See `examples/splits_0.csv`; this can be the same file generated by CLAM.

Contains the train/val/test split used for ABMIL training. It must match the folds used for ABMIL training to prevent data leakage.
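Because leakage between splits silently inflates downstream ABMIL results, it is worth sanity-checking the split file. The sketch assumes a CLAM-style layout with `train`, `val`, and `test` columns of slide IDs; inspect `splits_0.csv` for the real header.

```python
# Sketch: verifying a CLAM-style split file has no train/val/test overlap.
# Column names "train"/"val"/"test" are ASSUMED -- inspect splits_0.csv.
import csv
import io

toy_split = ("train,val,test\n"
             "TCGA-XX-0001,TCGA-XX-0003,TCGA-XX-0005\n"
             "TCGA-XX-0002,,\n")

cols = {"train": set(), "val": set(), "test": set()}
for row in csv.DictReader(io.StringIO(toy_split)):
    for name in cols:
        if row[name]:                  # skip blank cells (columns differ in length)
            cols[name].add(row[name])

# No slide may appear in more than one split (prevents data leakage):
assert not (cols["train"] & cols["val"]), "train/val overlap"
assert not (cols["train"] & cols["test"]), "train/test overlap"
assert not (cols["val"] & cols["test"]), "val/test overlap"
print("no leakage detected")
```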

## Quickstart

### Train a Conditional VAE

```bash
python bergeron_train.py \
  --h5_dir /path/to/h5_files \
  --output_dir /path/to/output_directory \
  --train_csv ./examples/splits_0.csv \
  --label_csv ./examples/meta_labels.csv \
  --num_epochs 50 \
  --learning_rate 1e-4 \
  --batch_size 64 \
  --latent_dim 64 \
  --num_classes 2 \
  --beta_initial 0.01 \
  --beta_final 0.05 \
  --decoder_dropout 0.3 \
  --data_dim 1536 \
  --iteration_name iteration1_fold0 \
  --tiles_per_sample 2000 \
  --encoder_hidden_sizes 512 256 128 \
  --decoder_hidden_sizes 256 512 1024
```

### Generate Synthetic Tiles

```bash
python bergeron_sample.py \
  --vae_ckpt_path /path/to/checkpoint.pth \
  --h5_dir /path/to/h5_files \
  --train_csv ./examples/splits_0.csv \
  --label_csv ./examples/meta_labels.csv \
  --pseudo_bag_output_dir /path/to/pseudobag_output_directory \
  --num_bags 10000 \
  --num_real 1000 \
  --num_synth 1000 \
  --encoder_hidden_sizes 512 256 128 \
  --decoder_hidden_sizes 256 512 1024 \
  --latent_dim 64 \
  --num_class 2 \
  --fold 0 \
  --iteration iteration1_fold0 \
  --prefix run1
```

## Training Notes

- Begin with a low β and gradually increase it for KL stability
- Add latent dropout or Gaussian noise to prevent overfitting
- Typical latent sizes: 16–128
- Use stratified or balanced class sampling each epoch
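The first note corresponds to the `--beta_initial` / `--beta_final` flags above. A minimal sketch of such a warm-up, assuming a linear ramp (the training script may use a different schedule):

```python
# Sketch: linear KL-weight (beta) warm-up from beta_initial to beta_final.
# The LINEAR ramp is an assumption; bergeron_train.py may anneal differently.
def beta_schedule(epoch, num_epochs, beta_initial=0.01, beta_final=0.05):
    """Anneal beta linearly over the full training run."""
    frac = epoch / max(num_epochs - 1, 1)       # 0.0 at epoch 0, 1.0 at the end
    return beta_initial + frac * (beta_final - beta_initial)

betas = [round(beta_schedule(e, 5), 3) for e in range(5)]
print(betas)  # [0.01, 0.02, 0.03, 0.04, 0.05]
```

During training, the epoch's β multiplies the KL term in the ELBO: `loss = recon_loss + beta * kl_loss`, so a small initial β lets the decoder learn reconstructions before the KL term tightens the latent prior.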

## Troubleshooting

| Issue | Possible fix |
| --- | --- |
| KL loss → 0 early | Decrease β; add latent noise/dropout |
| Reconstruction realistic but poor downstream utility | Increase `latent_dim` or rebalance the conditional loss |
| OOM during PCA plotting | Reduce subsample size or disable per-epoch PCA |
| Synthetic class drift | Enforce balanced sampling across classes |

## License

Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)

## Acknowledgments

Developed in the Curtis Lab at Stanford University.
