Vaibhav Mishra and William Lotter. Comparing Computational Pathology Foundation Models using Representational Similarity Analysis. Proceedings of the 5th Machine Learning for Health (ML4H) Symposium. 2025.
This repository contains the code and analysis pipeline for comparing the feature representations of various pretrained computational pathology models. The primary method used is Representational Similarity Analysis (RSA), applied to whole-slide image (WSI) patches from The Cancer Genome Atlas (TCGA). The analysis covers four cancer types: Breast Cancer (BRCA), Colon Adenocarcinoma (COAD), Lung Adenocarcinoma (LUAD), and Lung Squamous Cell Carcinoma (LUSC).
The project follows a three-stage pipeline:
- Preprocessing: WSIs from the TCGA database are processed. This involves tissue segmentation, generation of 224x224 pixel patches, and Macenko stain normalization.
- Embedding Generation: A suite of seven different pretrained models is used to generate feature embeddings for the preprocessed image patches.
- Analysis: The generated embeddings are analyzed to understand and compare their representational geometries. This includes:
- Calculating Representational Dissimilarity Matrices (RDMs).
- Measuring the similarity between model RDMs using Spearman correlation and Cosine similarity.
- Evaluating slide-level and disease-level specificity using Cliff's Delta.
- Performing spectral analysis (SVD) to assess the effective dimensionality of the embeddings.
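As a minimal illustration of the first two analysis steps, the sketch below computes Euclidean RDMs for two hypothetical embedding matrices and compares their vectorized forms with Spearman correlation and cosine similarity. This is a NumPy/SciPy sketch for intuition only; the actual pipeline uses rsatoolbox, and all variable names here are illustrative.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rdm(embeddings):
    """Condensed Euclidean RDM: pairwise distances between patch embeddings."""
    return pdist(embeddings, metric='euclidean')

def compare_rdms(rdm_a, rdm_b):
    """Spearman correlation and cosine similarity between two vectorized RDMs."""
    rho, _ = spearmanr(rdm_a, rdm_b)
    cos = np.dot(rdm_a, rdm_b) / (np.linalg.norm(rdm_a) * np.linalg.norm(rdm_b))
    return rho, cos

rng = np.random.default_rng(0)
emb_model_a = rng.normal(size=(50, 128))  # 50 patches, 128-d embeddings
emb_model_b = emb_model_a + rng.normal(scale=0.1, size=(50, 128))  # a similar "model"
rho, cos = compare_rdms(rdm(emb_model_a), rdm(emb_model_b))
print(f"Spearman rho={rho:.3f}, cosine={cos:.3f}")
```

Because the second embedding matrix is a lightly perturbed copy of the first, both similarity measures come out close to 1; two genuinely different models would score lower.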
The following seven pretrained models are evaluated in this study:
- UNI2: MahmoodLab/UNI2-h
- Virchow2: paige-ai/Virchow2
- Prov-Gigapath: prov-gigapath/prov-gigapath
- CONCH: MahmoodLab/conch
- PLIP: vinid/plip
- KEEP: Astaxanthin/KEEP
- ViT-DinoV2: facebookresearch/dinov2
- `preprocessing.py`: Script to process raw TCGA WSIs, perform stain normalization, and extract patches.
- `generate_embeddings.py`: The main script to generate embeddings for all models. It accepts the model name and GPU index as command-line arguments.
- `run_analysis.ipynb`: A Jupyter Notebook for conducting the full analysis pipeline on the generated embeddings. It calculates RDMs, computes specificity scores, and generates all figures.
- `constants.py`: Defines the base directory for saving project data. Note: all paths are hardcoded and must be modified for your local environment.
You will need Python 3.x and the following libraries, which you can install using pip:

```shell
pip install torch torchvision pandas numpy scikit-image openslide-python tiatoolbox rsatoolbox seaborn timm transformers huggingface_hub cliffs_delta
```

You will also need access to the TCGA slide images (`.svs` files).
First, update the hardcoded paths in `constants.py` and the other scripts to match your local environment. Specifically, set `PROJECT_SAVE_DIR` in `constants.py` to your desired output directory:

```python
# In constants.py
PROJECT_SAVE_DIR = '/path/to/your/project/directory/'
```

You will also need to update the data paths within `preprocessing.py` and `run_analysis.ipynb`.
The `preprocessing.py` script handles the extraction and normalization of patches from WSIs.

- Place a reference image for stain normalization named `normalization_template.jpg` in the root directory.
- Update the path to the TCGA metadata file (`metadata_path`) and the base directory for slide images (`base_dir`) in `preprocessing.py`.
- Run the script:

```shell
python preprocessing.py
```

This will create subdirectories (e.g., `preprocessed_patches_BRCA/`) containing the extracted patches as `.npy` files.
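The tiling portion of this step can be sketched as follows: a tissue region is cut into non-overlapping 224x224 patches and each patch is saved as a `.npy` file. This is a simplified NumPy illustration, not the script's actual API; the function and file names are hypothetical, and the real pipeline additionally performs tissue segmentation and Macenko stain normalization.

```python
import numpy as np
from pathlib import Path

PATCH_SIZE = 224

def extract_patches(region, patch_size=PATCH_SIZE):
    """Tile an HxWx3 tissue region into non-overlapping square patches."""
    h, w = region.shape[:2]
    patches = []
    for y in range(0, h - patch_size + 1, patch_size):
        for x in range(0, w - patch_size + 1, patch_size):
            patches.append(region[y:y + patch_size, x:x + patch_size])
    return patches

# A dummy 1000x800 RGB region yields (1000 // 224) * (800 // 224) = 4 * 3 = 12 patches.
region = np.zeros((1000, 800, 3), dtype=np.uint8)
patches = extract_patches(region)

out_dir = Path('preprocessed_patches_BRCA')  # example output directory
out_dir.mkdir(exist_ok=True)
for i, patch in enumerate(patches):
    np.save(out_dir / f'patch_{i}.npy', patch)
print(len(patches))  # 12
```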
Embeddings are generated using the `generate_embeddings.py` script. This script requires a model name and a GPU index as command-line arguments.

Important: you will need to replace `YOUR_HF_TOKEN` with your actual Hugging Face access token in `generate_embeddings.py`.
To run `generate_embeddings.py`, use the following format:

```shell
# Example for generating embeddings with UNI2 on GPU 0
python generate_embeddings.py uni2 0

# Example for generating embeddings with Prov-Gigapath on GPU 1
python generate_embeddings.py prov 1
```

Replace the model name with one of `['uni2', 'virchow2', 'prov', 'conch', 'plip', 'keep', 'dinov2']` and the GPU index with the desired GPU.
Repeat this command for all models. The script will save the embeddings in the embeddings/ directory specified in your configuration.
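To avoid typing the command seven times, a small driver loop can launch the script once per model. This is a convenience sketch, not part of the repository; it assumes all runs go to GPU 0, and the launch line is left commented out so you can inspect the commands first.

```python
import subprocess

MODELS = ['uni2', 'virchow2', 'prov', 'conch', 'plip', 'keep', 'dinov2']

commands = [['python', 'generate_embeddings.py', model, '0'] for model in MODELS]
for cmd in commands:
    print(' '.join(cmd))
    # subprocess.run(cmd, check=True)  # uncomment to actually launch each run
```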
Once all embeddings are generated, open and run the run_analysis.ipynb notebook. This notebook will:
- Load the embeddings and split them into batches.
- Calculate RDMs for each model and batch.
- Generate heatmaps comparing model RDMs (Spearman correlation and Cosine similarity).
- Create hierarchical clustering dendrograms.
- Calculate and save slide and disease specificity scores.
- Perform and plot the spectral analysis.
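The specificity scoring above relies on Cliff's Delta, which measures how consistently one group of dissimilarities is smaller than another (e.g., within-slide versus between-slide distances). The notebook uses the `cliffs_delta` package, but the statistic itself is simple; the sketch below is a plain NumPy version for illustration, with made-up example values.

```python
import numpy as np

def cliffs_delta(x, y):
    """Cliff's Delta: P(x > y) - P(x < y), ranging from -1 to 1."""
    x = np.asarray(x)[:, None]
    y = np.asarray(y)[None, :]
    n_pairs = x.size * y.size
    return ((x > y).sum() - (x < y).sum()) / n_pairs

# Hypothetical dissimilarities: between-slide distances fully dominate within-slide ones.
between = [4.0, 5.0, 6.0]
within = [1.0, 2.0, 3.0]
delta = cliffs_delta(between, within)
print(delta)  # 1.0: every between-slide distance exceeds every within-slide distance
```

A delta near 1 indicates strong slide (or disease) specificity; a delta near 0 means the two distributions are indistinguishable.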
All plots and results will be saved in the `plots/` and `rdms/` directories.
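The spectral analysis step can be illustrated with a minimal sketch: take the SVD of the centered embedding matrix and summarize its singular value spectrum with an effective-dimensionality measure. The participation ratio used below is one common choice and may differ from the notebook's exact metric; the data here are synthetic.

```python
import numpy as np

def effective_dimensionality(embeddings):
    """Participation ratio of the singular value spectrum:
    (sum s_i^2)^2 / sum s_i^4, which lies between 1 and min(n, d)."""
    centered = embeddings - embeddings.mean(axis=0)
    s = np.linalg.svd(centered, compute_uv=False)
    s2 = s ** 2
    return s2.sum() ** 2 / (s2 ** 2).sum()

rng = np.random.default_rng(0)
# Roughly isotropic embeddings use most of the 64 dimensions...
iso = rng.normal(size=(500, 64))
# ...while embeddings confined to a 3-d subspace have effective dimensionality <= 3.
low = rng.normal(size=(500, 3)) @ rng.normal(size=(3, 64))
print(effective_dimensionality(iso), effective_dimensionality(low))
```

Lower effective dimensionality means the model concentrates its variance in fewer directions of the embedding space.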
To test other distance functions, change the following code snippet:

```python
rdm = calc_rdm(dataset, method='euclidean')
```

For instance, to use Pearson correlation instead of Euclidean distance, change that line to:

```python
rdm = calc_rdm(dataset, method='correlation')
```

To change the number of slides and patches per slide, change the following code snippet:

```python
num_slides_per_batch = total_slides // n_batches
num_patches_per_batch = total_patches // n_batches
```

to, for instance:

```python
num_slides_per_batch = 25    # 25 WSIs per batch
num_patches_per_batch = 100  # 100 patches per WSI
```