Preprint Link: https://www.biorxiv.org/content/10.64898/2025.12.29.692457v1
BERGERON is a class-conditional Variational Autoencoder (cVAE) that generates realistic foundation-model embeddings from H&E tiles.
It enables data augmentation for training attention-based multiple instance learning (ABMIL) models, controlled subtype perturbation, and interpretable latent-space exploration.
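At a high level, BERGERON decodes a latent sample together with a class label into a synthetic embedding. Below is a minimal sketch of that class-conditional decoding step; the layer sizes and one-hot conditioning are illustrative assumptions, not the actual model architecture:

```python
import torch
import torch.nn as nn

# Hypothetical dimensions; the real model reads these from the CLI flags below
latent_dim, num_classes, data_dim = 64, 2, 1536

# Minimal class-conditional decoder: latent code concatenated with a one-hot label
decoder = nn.Sequential(
    nn.Linear(latent_dim + num_classes, 256), nn.ReLU(),
    nn.Linear(256, data_dim),
)

# Generate 8 synthetic embeddings for class 1 by sampling the prior
z = torch.randn(8, latent_dim)
y = nn.functional.one_hot(torch.full((8,), 1), num_classes).float()
synthetic = decoder(torch.cat([z, y], dim=1))
print(synthetic.shape)  # torch.Size([8, 1536])
```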
# Conda
```shell
conda env create -f environment.yml
conda activate BERGERON
```

# Input Data Format

Each .h5 file represents one WSI or pseudo-bag (generated using CLAM or another H&E processing tool).
Each file must contain:

- `features`: `[N_tiles, D]` — embedding vectors (`D` = embedding dimension of the foundation model)
- `coords` (optional): `[N_tiles, 2]` — tile `(x, y)` coordinates
```
/data/h5_files/TCGA-XX-0001.h5
/data/h5_files/TCGA-XX-0002.h5
```
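A quick way to check that your .h5 files match this layout is to write and read them with `h5py`; the file name and dimensions below are made up for illustration:

```python
import numpy as np
import h5py

# Hypothetical example: 2000 tiles with 1536-dim embeddings
features = np.random.randn(2000, 1536).astype(np.float32)
coords = np.stack(np.meshgrid(np.arange(50), np.arange(40)), -1).reshape(-1, 2)

with h5py.File("TCGA-XX-0001.h5", "w") as f:
    f.create_dataset("features", data=features)   # [N_tiles, D]
    f.create_dataset("coords", data=coords)       # [N_tiles, 2], optional

with h5py.File("TCGA-XX-0001.h5", "r") as f:
    print(f["features"].shape)  # (2000, 1536)
```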
# Label CSV

See `examples/meta_labels.csv` (it can be the same file generated by CLAM). Each row contains a slide ID, which matches the .h5 file name minus the file suffix, and an integer class label.
# Split CSV

See `examples/splits_0.csv` (it can be the same file generated by CLAM). It defines the train/val/test split used for ABMIL training and must match the folds used there to prevent data leakage.
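To guard against leakage before training, you can verify that the split columns are disjoint and fully covered by the label file. The column names (`train`/`val`/`test`, `slide_id`, `label`) follow the CLAM convention but are assumptions here:

```python
import pandas as pd

# Hypothetical CLAM-style split and label tables; column names are assumptions
splits = pd.DataFrame({
    "train": ["TCGA-XX-0001", "TCGA-XX-0002"],
    "val":   ["TCGA-XX-0003", None],
    "test":  ["TCGA-XX-0004", None],
})
labels = pd.DataFrame({
    "slide_id": [f"TCGA-XX-000{i}" for i in range(1, 5)],
    "label":    [0, 1, 0, 1],
})

# No slide may appear in more than one split (prevents leakage into ABMIL training)
sets = {c: set(splits[c].dropna()) for c in ["train", "val", "test"]}
assert not (sets["train"] & sets["val"])
assert not (sets["train"] & sets["test"])
assert not (sets["val"] & sets["test"])

# Every slide in the splits must have an integer class label
all_slides = sets["train"] | sets["val"] | sets["test"]
assert all_slides <= set(labels["slide_id"])
```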
# Train a Conditional VAE

```shell
python bergeron_train.py \
  --h5_dir /path/to/h5_files \
  --output_dir /path/to/output_directory \
  --train_csv ./examples/splits_0.csv \
  --label_csv ./examples/meta_labels.csv \
  --num_epochs 50 \
  --learning_rate 1e-4 \
  --batch_size 64 \
  --latent_dim 64 \
  --num_classes 2 \
  --beta_initial 0.01 \
  --beta_final 0.05 \
  --decoder_dropout 0.3 \
  --data_dim 1536 \
  --iteration_name iteration1_fold0 \
  --tiles_per_sample 2000 \
  --encoder_hidden_sizes 512 256 128 \
  --decoder_hidden_sizes 256 512 1024
```

# Generate Synthetic Tiles
```shell
python bergeron_sample.py \
  --vae_ckpt_path /path/to/checkpoint.pth \
  --h5_dir /path/to/h5_files \
  --train_csv ./examples/splits_0.csv \
  --label_csv ./examples/meta_labels.csv \
  --pseudo_bag_output_dir /path/to/pseudobag_output_directory \
  --num_bags 10000 \
  --num_real 1000 \
  --num_synth 1000 \
  --encoder_hidden_sizes 512 256 128 \
  --decoder_hidden_sizes 256 512 1024 \
  --latent_dim 64 \
  --num_class 2 \
  --fold 0 \
  --iteration iteration1_fold0 \
  --prefix run1
```

# Training Tips

- Begin with a low β and increase it gradually for KL stability
- Add latent dropout or Gaussian noise to prevent overfitting
- Typical latent sizes: 16–128
- Use stratified or balanced class sampling each epoch
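The β-annealing tip can be implemented as a simple schedule between `--beta_initial` and `--beta_final`; whether BERGERON's own schedule is linear is an assumption in this sketch:

```python
def beta_schedule(epoch, num_epochs, beta_initial=0.01, beta_final=0.05):
    """Linearly anneal the KL weight beta over training.

    Mirrors the --beta_initial/--beta_final flags in spirit; the real
    script's schedule shape (linear vs. cyclical) is an assumption here.
    """
    t = epoch / max(num_epochs - 1, 1)
    return beta_initial + t * (beta_final - beta_initial)

# The per-batch VAE loss would then be: recon_loss + beta_schedule(e, E) * kl_loss
```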
# Troubleshooting

| Issue | Possible Fix |
| --- | --- |
| KL loss → 0 early | Decrease β; add latent noise/dropout |
| Reconstruction realistic but poor downstream utility | Increase `latent_dim` or rebalance the conditional loss |
| OOM during PCA plotting | Reduce the subsample size or disable per-epoch PCA |
| Synthetic class drift | Enforce balanced sampling across classes |
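For the balanced-sampling fixes above, PyTorch's `WeightedRandomSampler` is one way to equalize class draws each epoch; the label tensor here is illustrative:

```python
import torch
from torch.utils.data import WeightedRandomSampler

# Hypothetical imbalanced class labels (6 of class 0, 2 of class 1)
labels = torch.tensor([0, 0, 0, 0, 0, 0, 1, 1])

class_counts = torch.bincount(labels).float()
weights = 1.0 / class_counts[labels]  # each class contributes equally in expectation
sampler = WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)

# Pass to the training DataLoader, e.g.:
# loader = DataLoader(dataset, batch_size=64, sampler=sampler)
```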
Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)
Developed in the Curtis Lab at Stanford University.