Preprint Link: https://www.biorxiv.org/content/10.64898/2025.12.29.692457v1
BERGERON is a class-conditional Variational Autoencoder (cVAE) that generates realistic foundation-model embeddings from H&E tiles.
It enables data augmentation for training attention-based multiple instance learning (ABMIL) models, controlled subtype perturbation, and interpretable latent-space exploration.
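At a high level, BERGERON decodes a latent sample together with a class label into a synthetic embedding. Below is a minimal sketch of that class-conditional decoding step; the layer sizes and one-hot conditioning are illustrative assumptions, not the actual model architecture:

```python
import torch
import torch.nn as nn

# Hypothetical dimensions; the real model reads these from the CLI flags below
latent_dim, num_classes, data_dim = 64, 2, 1536

# Minimal class-conditional decoder: latent code concatenated with a one-hot label
decoder = nn.Sequential(
    nn.Linear(latent_dim + num_classes, 256), nn.ReLU(),
    nn.Linear(256, data_dim),
)

# Generate 8 synthetic embeddings for class 1 by sampling the prior
z = torch.randn(8, latent_dim)
y = nn.functional.one_hot(torch.full((8,), 1), num_classes).float()
synthetic = decoder(torch.cat([z, y], dim=1))
print(synthetic.shape)  # torch.Size([8, 1536])
```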
# Conda
```shell
conda env create -f environment.yml
conda activate BERGERON
```

# Input Data Format

Each .h5 file represents one WSI or pseudo-bag (generated using CLAM or another H&E processing tool).
Each file must contain:

- `features`: `[N_tiles, D]` — embedding vectors (`D` = embedding dimension of the foundation model)
- `coords` (optional): `[N_tiles, 2]` — tile `(x, y)` coordinates
```
/data/h5_files/TCGA-XX-0001.h5
/data/h5_files/TCGA-XX-0002.h5
```
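A quick way to check that your .h5 files match this layout is to write and read them with `h5py`; the file name and dimensions below are made up for illustration:

```python
import numpy as np
import h5py

# Hypothetical example: 2000 tiles with 1536-dim embeddings
features = np.random.randn(2000, 1536).astype(np.float32)
coords = np.stack(np.meshgrid(np.arange(50), np.arange(40)), -1).reshape(-1, 2)

with h5py.File("TCGA-XX-0001.h5", "w") as f:
    f.create_dataset("features", data=features)   # [N_tiles, D]
    f.create_dataset("coords", data=coords)       # [N_tiles, 2], optional

with h5py.File("TCGA-XX-0001.h5", "r") as f:
    print(f["features"].shape)  # (2000, 1536)
```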
# Label CSV

See `examples/meta_labels.csv` (it can be the same file generated by CLAM). Each row contains a slide ID, which matches the .h5 file name minus the file suffix, and an integer class label.
# Split CSV

See `examples/splits_0.csv` (it can be the same file generated by CLAM). It defines the train/val/test split used for ABMIL training and must match the folds used there to prevent data leakage.
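To guard against leakage before training, you can verify that the split columns are disjoint and fully covered by the label file. The column names (`train`/`val`/`test`, `slide_id`, `label`) follow the CLAM convention but are assumptions here:

```python
import pandas as pd

# Hypothetical CLAM-style split and label tables; column names are assumptions
splits = pd.DataFrame({
    "train": ["TCGA-XX-0001", "TCGA-XX-0002"],
    "val":   ["TCGA-XX-0003", None],
    "test":  ["TCGA-XX-0004", None],
})
labels = pd.DataFrame({
    "slide_id": [f"TCGA-XX-000{i}" for i in range(1, 5)],
    "label":    [0, 1, 0, 1],
})

# No slide may appear in more than one split (prevents leakage into ABMIL training)
sets = {c: set(splits[c].dropna()) for c in ["train", "val", "test"]}
assert not (sets["train"] & sets["val"])
assert not (sets["train"] & sets["test"])
assert not (sets["val"] & sets["test"])

# Every slide in the splits must have an integer class label
all_slides = sets["train"] | sets["val"] | sets["test"]
assert all_slides <= set(labels["slide_id"])
```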
# Train a Conditional VAE

```shell
python bergeron_train.py \
  --h5_dir /path/to/h5_files \
  --output_dir /path/to/output_directory \
  --train_csv ./examples/splits_0.csv \
  --label_csv ./examples/meta_labels.csv \
  --num_epochs 50 \
  --learning_rate 1e-4 \
  --batch_size 64 \
  --latent_dim 64 \
  --num_classes 2 \
  --beta_initial 0.01 \
  --beta_final 0.05 \
  --decoder_dropout 0.3 \
  --data_dim 1536 \
  --iteration_name iteration1_fold0 \
  --tiles_per_sample 2000 \
  --encoder_hidden_sizes 512 256 128 \
  --decoder_hidden_sizes 256 512 1024
```

# Generate Synthetic Tiles
```shell
python bergeron_sample.py \
  --vae_ckpt_path /path/to/checkpoint.pth \
  --h5_dir /path/to/h5_files \
  --train_csv ./examples/splits_0.csv \
  --label_csv ./examples/meta_labels.csv \
  --pseudo_bag_output_dir /path/to/pseudobag_output_directory \
  --num_bags 10000 \
  --num_real 1000 \
  --num_synth 1000 \
  --encoder_hidden_sizes 512 256 128 \
  --decoder_hidden_sizes 256 512 1024 \
  --latent_dim 64 \
  --num_class 2 \
  --fold 0 \
  --iteration iteration1_fold0 \
  --prefix run1
```

# Training Tips

- Begin with a low β and increase it gradually for KL stability
- Add latent dropout or Gaussian noise to prevent overfitting
- Typical latent sizes: 16–128
- Use stratified or balanced class sampling each epoch
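The β-annealing tip can be implemented as a simple schedule between `--beta_initial` and `--beta_final`; whether BERGERON's own schedule is linear is an assumption in this sketch:

```python
def beta_schedule(epoch, num_epochs, beta_initial=0.01, beta_final=0.05):
    """Linearly anneal the KL weight beta over training.

    Mirrors the --beta_initial/--beta_final flags in spirit; the real
    script's schedule shape (linear vs. cyclical) is an assumption here.
    """
    t = epoch / max(num_epochs - 1, 1)
    return beta_initial + t * (beta_final - beta_initial)

# The per-batch VAE loss would then be: recon_loss + beta_schedule(e, E) * kl_loss
```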
# Troubleshooting

| Issue | Possible Fix |
| --- | --- |
| KL loss → 0 early | Decrease β; add latent noise/dropout |
| Reconstruction realistic but poor downstream utility | Increase `latent_dim` or rebalance the conditional loss |
| OOM during PCA plotting | Reduce the subsample size or disable per-epoch PCA |
| Synthetic class drift | Enforce balanced sampling across classes |
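For the balanced-sampling fixes above, PyTorch's `WeightedRandomSampler` is one way to equalize class draws each epoch; the label tensor here is illustrative:

```python
import torch
from torch.utils.data import WeightedRandomSampler

# Hypothetical imbalanced class labels (6 of class 0, 2 of class 1)
labels = torch.tensor([0, 0, 0, 0, 0, 0, 1, 1])

class_counts = torch.bincount(labels).float()
weights = 1.0 / class_counts[labels]  # each class contributes equally in expectation
sampler = WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)

# Pass to the training DataLoader, e.g.:
# loader = DataLoader(dataset, batch_size=64, sampler=sampler)
```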
Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)
Developed in the Curtis Lab at Stanford University.