Vaibhav Mishra and William Lotter. Comparing Computational Pathology Foundation Models using Representational Similarity Analysis. Proceedings of the 5th Machine Learning for Health (ML4H) Symposium. 2025.
This repository contains the code and analysis pipeline for comparing the feature representations of various pretrained computational pathology models. The primary method used is Representational Similarity Analysis (RSA), applied to whole-slide image (WSI) patches from The Cancer Genome Atlas (TCGA). The analysis covers four cancer types: Breast Cancer (BRCA), Colon Adenocarcinoma (COAD), Lung Adenocarcinoma (LUAD), and Lung Squamous Cell Carcinoma (LUSC).
The project follows a three-stage pipeline:
- Preprocessing: WSIs from the TCGA database are processed. This involves tissue segmentation, generation of 224x224 pixel patches, and Macenko stain normalization.
- Embedding Generation: A suite of seven different pretrained models is used to generate feature embeddings for the preprocessed image patches.
- Analysis: The generated embeddings are analyzed to understand and compare their representational geometries. This includes:
- Calculating Representational Dissimilarity Matrices (RDMs).
- Measuring the similarity between model RDMs using Spearman correlation and Cosine similarity.
- Evaluating slide-level and disease-level specificity using Cliff's Delta.
- Performing spectral analysis (SVD) to assess the effective dimensionality of the embeddings.
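As a minimal illustration of the first two analysis steps, the sketch below computes Euclidean RDMs for two hypothetical embedding matrices and compares their vectorized forms with Spearman correlation and cosine similarity. This is a NumPy/SciPy sketch for intuition only; the actual pipeline uses rsatoolbox, and all variable names here are illustrative.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rdm(embeddings):
    """Condensed Euclidean RDM: pairwise distances between patch embeddings."""
    return pdist(embeddings, metric='euclidean')

def compare_rdms(rdm_a, rdm_b):
    """Spearman correlation and cosine similarity between two vectorized RDMs."""
    rho, _ = spearmanr(rdm_a, rdm_b)
    cos = np.dot(rdm_a, rdm_b) / (np.linalg.norm(rdm_a) * np.linalg.norm(rdm_b))
    return rho, cos

rng = np.random.default_rng(0)
emb_model_a = rng.normal(size=(50, 128))  # 50 patches, 128-d embeddings
emb_model_b = emb_model_a + rng.normal(scale=0.1, size=(50, 128))  # a similar "model"
rho, cos = compare_rdms(rdm(emb_model_a), rdm(emb_model_b))
print(f"Spearman rho={rho:.3f}, cosine={cos:.3f}")
```

Because the second embedding matrix is a lightly perturbed copy of the first, both similarity measures come out close to 1; two genuinely different models would score lower.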
The following seven pretrained models are evaluated in this study:
- UNI2: MahmoodLab/UNI2-h
- Virchow2: paige-ai/Virchow2
- Prov-Gigapath: prov-gigapath/prov-gigapath
- CONCH: MahmoodLab/conch
- PLIP: vinid/plip
- KEEP: Astaxanthin/KEEP
- ViT-DinoV2: facebookresearch/dinov2
- `preprocessing.py`: Script to process raw TCGA WSIs, perform stain normalization, and extract patches.
- `generate_embeddings.py`: The main script to generate embeddings for all models. It accepts the model name and GPU index as command-line arguments.
- `run_analysis.ipynb`: A Jupyter Notebook for conducting the full analysis pipeline on the generated embeddings. It calculates RDMs, computes specificity scores, and generates all figures.
- `constants.py`: Defines the base directory for saving project data. Note: all paths are hardcoded and must be modified for your local environment.
You will need Python 3.x and the following libraries, which you can install using pip:

```shell
pip install torch torchvision pandas numpy scikit-image openslide-python tiatoolbox rsatoolbox seaborn timm transformers huggingface_hub cliffs_delta
```

You will also need access to the TCGA slide images (`.svs` files).
First, update the hardcoded paths in `constants.py` and the other scripts to match your local environment. Specifically, set `PROJECT_SAVE_DIR` in `constants.py` to your desired output directory:

```python
# In constants.py
PROJECT_SAVE_DIR = '/path/to/your/project/directory/'
```

You will also need to update the data paths within `preprocessing.py` and `run_analysis.ipynb`.
The `preprocessing.py` script handles the extraction and normalization of patches from WSIs.

- Place a reference image for stain normalization named `normalization_template.jpg` in the root directory.
- Update the path to the TCGA metadata file (`metadata_path`) and the base directory for slide images (`base_dir`) in `preprocessing.py`.
- Run the script:

```shell
python preprocessing.py
```

This will create subdirectories (e.g., `preprocessed_patches_BRCA/`) containing the extracted patches as `.npy` files.
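The tiling portion of this step can be sketched as follows: a tissue region is cut into non-overlapping 224x224 patches and each patch is saved as a `.npy` file. This is a simplified NumPy illustration, not the script's actual API; the function and file names are hypothetical, and the real pipeline additionally performs tissue segmentation and Macenko stain normalization.

```python
import numpy as np
from pathlib import Path

PATCH_SIZE = 224

def extract_patches(region, patch_size=PATCH_SIZE):
    """Tile an HxWx3 tissue region into non-overlapping square patches."""
    h, w = region.shape[:2]
    patches = []
    for y in range(0, h - patch_size + 1, patch_size):
        for x in range(0, w - patch_size + 1, patch_size):
            patches.append(region[y:y + patch_size, x:x + patch_size])
    return patches

# A dummy 1000x800 RGB region yields (1000 // 224) * (800 // 224) = 4 * 3 = 12 patches.
region = np.zeros((1000, 800, 3), dtype=np.uint8)
patches = extract_patches(region)

out_dir = Path('preprocessed_patches_BRCA')  # example output directory
out_dir.mkdir(exist_ok=True)
for i, patch in enumerate(patches):
    np.save(out_dir / f'patch_{i}.npy', patch)
print(len(patches))  # 12
```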
Embeddings are generated using the `generate_embeddings.py` script. This script requires a model name and a GPU index as command-line arguments.

Important: you will need to replace `YOUR_HF_TOKEN` with your actual Hugging Face access token in `generate_embeddings.py`.
To run `generate_embeddings.py`, use the following format:

```shell
# Example for generating embeddings with UNI2 on GPU 0
python generate_embeddings.py uni2 0

# Example for generating embeddings with Prov-Gigapath on GPU 1
python generate_embeddings.py prov 1
```

Replace the model name with one of `['uni2', 'virchow2', 'prov', 'conch', 'plip', 'keep', 'dinov2']` and the GPU index with the desired GPU.
Repeat this command for all models. The script will save the embeddings in the embeddings/ directory specified in your configuration.
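To avoid typing the command seven times, a small driver loop can launch the script once per model. This is a convenience sketch, not part of the repository; it assumes all runs go to GPU 0, and the launch line is left commented out so you can inspect the commands first.

```python
import subprocess

MODELS = ['uni2', 'virchow2', 'prov', 'conch', 'plip', 'keep', 'dinov2']

commands = [['python', 'generate_embeddings.py', model, '0'] for model in MODELS]
for cmd in commands:
    print(' '.join(cmd))
    # subprocess.run(cmd, check=True)  # uncomment to actually launch each run
```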
Once all embeddings are generated, open and run the run_analysis.ipynb notebook. This notebook will:
- Load the embeddings and split them into batches.
- Calculate RDMs for each model and batch.
- Generate heatmaps comparing model RDMs (Spearman correlation and Cosine similarity).
- Create hierarchical clustering dendrograms.
- Calculate and save slide and disease specificity scores.
- Perform and plot the spectral analysis.
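The specificity scoring above relies on Cliff's Delta, which measures how consistently one group of dissimilarities is smaller than another (e.g., within-slide versus between-slide distances). The notebook uses the `cliffs_delta` package, but the statistic itself is simple; the sketch below is a plain NumPy version for illustration, with made-up example values.

```python
import numpy as np

def cliffs_delta(x, y):
    """Cliff's Delta: P(x > y) - P(x < y), ranging from -1 to 1."""
    x = np.asarray(x)[:, None]
    y = np.asarray(y)[None, :]
    n_pairs = x.size * y.size
    return ((x > y).sum() - (x < y).sum()) / n_pairs

# Hypothetical dissimilarities: between-slide distances fully dominate within-slide ones.
between = [4.0, 5.0, 6.0]
within = [1.0, 2.0, 3.0]
delta = cliffs_delta(between, within)
print(delta)  # 1.0: every between-slide distance exceeds every within-slide distance
```

A delta near 1 indicates strong slide (or disease) specificity; a delta near 0 means the two distributions are indistinguishable.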
All plots and results will be saved in the `plots/` and `rdms/` directories.
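The spectral analysis step can be illustrated with a minimal sketch: take the SVD of the centered embedding matrix and summarize its singular value spectrum with an effective-dimensionality measure. The participation ratio used below is one common choice and may differ from the notebook's exact metric; the data here are synthetic.

```python
import numpy as np

def effective_dimensionality(embeddings):
    """Participation ratio of the singular value spectrum:
    (sum s_i^2)^2 / sum s_i^4, which lies between 1 and min(n, d)."""
    centered = embeddings - embeddings.mean(axis=0)
    s = np.linalg.svd(centered, compute_uv=False)
    s2 = s ** 2
    return s2.sum() ** 2 / (s2 ** 2).sum()

rng = np.random.default_rng(0)
# Roughly isotropic embeddings use most of the 64 dimensions...
iso = rng.normal(size=(500, 64))
# ...while embeddings confined to a 3-d subspace have effective dimensionality <= 3.
low = rng.normal(size=(500, 3)) @ rng.normal(size=(3, 64))
print(effective_dimensionality(iso), effective_dimensionality(low))
```

Lower effective dimensionality means the model concentrates its variance in fewer directions of the embedding space.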
To test other distance functions, change the following code snippet:

```python
rdm = calc_rdm(dataset, method='euclidean')
```

For instance, to use Pearson correlation instead of Euclidean distance, change that line to:

```python
rdm = calc_rdm(dataset, method='correlation')
```

To change the number of slides and patches per slide, change the following code snippet:

```python
num_slides_per_batch = total_slides // n_batches
num_patches_per_batch = total_patches // n_batches
```

to, for instance:

```python
num_slides_per_batch = 25    # 25 WSIs per batch
num_patches_per_batch = 100  # 100 patches per WSI
```