
PathFMTools

A computational pathology toolkit for working with pathology foundation models. Supports:

  • Slide preprocessing: Segment whole-slide images (WSIs) and extract tiles
  • Tile embedding: Generate tile embeddings using built-in support for popular foundation models, or plug in your own model through the simple extension system.
  • Text embedding: Generate language embeddings for textual prompts and visual "zeroshot embeddings" for tiles that are aligned with language embeddings for zero-shot workflows.
  • Exploratory and quantitative analyses: Perform clustering and dimensionality reduction on tile embeddings from one or more slides. Run zero-shot analysis by computing similarity scores between tile vision embeddings and text embeddings of user-provided prompts using supported VLMs. Analyses expose built-in plotting utilities that generate useful figures using a single line of code.
  • Command line use: Use the pfmt command to run the built-in slide processing and tile embedding pipelines. Natively coordinates multiple parallel subprocesses and concurrent use of multiple GPUs, with a centralized command-line progress tracker. Commands persist per-WSI h5 files that can be reused by the Python interface.
  • Use as a Python package: Import the pathfmtools package in Python to perform slide preprocessing and tile embedding, run analyses, and generate visualizations.

Leveraging Foundation Models for Histological Grading in Cutaneous Squamous Cell Carcinoma using PathFMTools
Proceedings of the 5th Machine Learning for Health (ML4H) Symposium (2025)
Abdul Rahman Diab, Emily E. Karn, Renchin Wu, Emily S. Ruiz, William Lotter
arXiv link

Models

The following models have built-in support in the repo.

Model          Supports Text   Hugging Face Repo
conch          yes             link
h-optimus-0    no              link
h-optimus-1    no              link
hibou-b        no              link
hibou-l        no              link
midnight-12k   no              link
musk           yes             link
phikon         no              link
phikon2        no              link
uni2           no              link
virchow        no              link
virchow2       no              link

To use any of these models, you must:

  1. Have access to it through your HuggingFace account
  2. Have a HuggingFace access token associated with your account
  3. Be locally authenticated with the HuggingFace CLI

This allows pathfmtools to pull the necessary model weights.

Authenticate with Hugging Face via CLI or environment variable:

pip install huggingface_hub  # if CLI is not present
huggingface-cli login        # paste your token when prompted
# Alternatively (non-interactive): export HUGGINGFACE_HUB_TOKEN=...
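
If you prefer to authenticate from Python (e.g. inside a notebook), the huggingface_hub login helper can be used instead; this is a minimal sketch, not a pathfmtools requirement:

from huggingface_hub import login

# Prompts for a token interactively; alternatively pass token="hf_..." explicitly.
login()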

You can easily register your own local models using the built-in extension system. Once registered, custom models can be used exactly like built-in models.

Environment

Requirements

  • Python 3.12+
  • CUDA-capable GPU for embedding
  • PyTorch with matching CUDA build installed by the provided scripts
  • OpenSlide libraries available on the system

To set up the environment:

  • Make sure you have uv installed.
  • Run bash create_env.sh. This will install the environment and activate it.
  • Run source activate_env.sh to re-activate the environment when needed.
  • (Optional): Run bash download_demo_slides.sh to download the 4 TCGA slides that are used in the notebooks under demo/.

To generate embeddings, your system must have a CUDA-compatible GPU.

Quickstart

Embedding generation

Minimal CLI workflow:

export DATA_DIR=/data/wsis
export STORE_DIR=/stores

# Inspect available backbones and segmenters
pfmt list-tile-models
pfmt list-segmenters

# (Optional) Pre-fetch weights
pfmt download-model-weights --model conch

# Segment, tile (if needed), and embed
pfmt embed-tiles \
  --slide-path "$DATA_DIR/*.tiff" \
  --output-dir "$STORE_DIR" \
  --gpu 0 \
  --model conch \
  --batch-size 256 \
  --n-tile-workers 8 \
  --n-data-workers 2

Every dispatch run also emits slide_status.csv in the chosen --output-dir, capturing a success/error flag and any exception message for each slide so you can spot issues without scraping logs.

Minimal Python workflow:

from pathlib import Path
from pathfmtools.image.slide import Slide
from pathfmtools.tile_models.model_pool import ModelPool

# store_root is the path to the output directory where the h5 stores will be saved
slide = Slide(slide_path=Path("/data/slide1.tiff"), store_root_dir=Path("/stores"))
pool = ModelPool(model_names=["conch"], devices=["cuda:0"])  # Single GPU

embed_results = slide.embed_tiles(model_pool=pool, batch_size=256)
for result in embed_results:
    features = result.feature_embedding_matrix

Analysis

  • demo/analysis/clustering.ipynb demonstrates how embeddings from one or more slides can be clustered to capture patterns.
  • demo/analysis/dimensionality_reduction.ipynb demonstrates how embeddings can be reduced and visualized in conjunction with their associated patches.
  • demo/analysis/zeroshot_classification.ipynb demonstrates a minimal zero-shot classification workflow with VLMs using generated embeddings.
  • demo/analysis/abmil.ipynb demonstrates a simple weakly-supervised ABMIL setup for performing slide-level classification using patch-level embeddings.

Slide Stores and Data Layout

CLI commands and the Python APIs persist all intermediate artifacts into a slide-specific HDF5 store (<store_root_dir>/<slide_id>.h5), where <slide_id> is the stem of the WSI file path (e.g. the store for slide_name.svs is slide_name.h5). Each store keeps:

  • segmentation masks
  • per-tile RGB pixels keyed by tile size
  • feature and zero-shot embeddings keyed by model + scaled tile size
  • manifest metadata (microns-per-pixel, slide dimensions)

Tile coordinates are stored as top-left (x, y) pixel offsets. Downstream analysis relies on these coordinates, so never reorder tiles manually.

Slide reading is backed by OpenSlide; common formats include SVS, TIFF, NDPI, SCN. Availability varies by platform.

Example HDF5 store layout

<slide_id>.h5
├─ metadata  # Serialized JSON
├─ segmentation/
│  ├─ mask
│  ├─ method
│  └─ prop_foreground
├─ tiles/<tile_size_px>/
│  ├─ pixels  (N, H, W, 3)
│  └─ coords  (N, 2)  # x, y (top-left)
└─ embeddings/<model_name>/<model_tile_size_px>/
   ├─ feature                   # (N, D_f)
   ├─ feature_top_left_coords   # (N, 2)
   ├─ zeroshot                  # (N, D_z)
   └─ zeroshot_top_left_coords  # (N, 2)
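
If you need to inspect a store outside the Slide API, the layout above can be read directly with h5py. A minimal sketch; the tile size, model name, and dataset paths below are illustrative and should be adapted to what your store actually contains:

import json
import h5py

with h5py.File("/stores/slide1.h5", "r") as f:
    meta = json.loads(f["metadata"][()])               # manifest metadata (MPP, slide dimensions)
    coords = f["tiles/448/coords"][:]                  # (N, 2) top-left (x, y) tile offsets
    features = f["embeddings/conch/224/feature"][:]    # (N, D_f) feature embeddings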

CLI (pfmt)

The CLI is designed for coarse-grained processing where you supply WSI paths or glob patterns. It is well suited to large runs, offering built-in support for multiprocessing and multi-GPU workflows. Run pfmt --help or pfmt <command> --help for argument details.

1. Inspect Registered Assets

  • pfmt list-tile-models — prints available embedding backbones and their capabilities, including whether required weights are already cached.
  • pfmt list-segmenters — enumerates tissue/background segmentation backends.

Also see Extensions for registering custom models and segmenters.

2. Prepare Model Weights

  • pfmt download-model-weights --model conch --model musk — fetch weights eagerly if machines lack network access during runs. Missing weights can be downloaded on-demand as well.

3. Segment and Tile Slides

  • pfmt segment-and-tile --slide-path "/data/*.tiff" --output-dir /stores --model conch
    • When --model is provided the tool infers tile sizes that satisfy each model's expected microns-per-pixel. Alternatively, specify explicit --tile-size values.
    • Use --segmenter to pick a custom method; defaults to otsu.
    • Control concurrency with --n-parallel (slides) and --n-tile-workers (per slide).

4. Embed Tiles

  • pfmt embed-tiles --slide-path /data/slide1.tiff --output-dir /stores --gpu 0 --model conch.
    • Reuses cached tiles when present; otherwise triggers segmentation/tiling automatically.
    • --tile-size overrides inferred tile sizes (rarely needed).
    • Use --skip-feature-embeddings or --skip-zeroshot-embeddings to limit outputs.
    • Enable --allow-mpp-interpolation if the slide resolution is coarser than the model expects.
    • --delete-tiles/--no-delete-tiles toggles whether cached RGB tiles remain in the store after embedding (keeping them speeds up later pixel-dependent workflows at the cost of disk space).
    • Adjust --n-tile-workers and --n-data-workers for IO vs. dataloader concurrency.

Example multi-GPU batch (see demo/embed_tiles.sh):

pfmt embed-tiles \
  --slide-path "tests/data/*.tiff" \
  --output-dir tests/data \
  --gpu 0 --gpu 1 \
  --batch-size 512 \
  --model conch --model musk \
  --n-tile-workers 8 \
  --n-data-workers 2

5. Manage Tile Stores

  • pfmt delete-tiles --store-path /stores/slide1.h5 --yes removes cached tile pixels while keeping metadata and embeddings intact. Substantially reduces the amount of disk space occupied by the generated h5 files, but can be slow as the h5 files must be reconstructed to reclaim disk space.

6. Cache Text Prompts for Zero-Shot Classification

  • pfmt embed-text --model conch --prompt "tumor" --prompt "stroma" --gpu cuda:0
    • Prompts are cached in pathfmtools/data/text_store.h5.

Python package (pathfmtools)

The Python interfaces provide finer control for custom preprocessing loops and interactive analysis.

pathfmtools.image.Slide

Slide is one of three objects that the user is responsible for initializing (the other two being ModelPool and TileAnalysis). It represents a WSI along with all of its processed data, including tiles, segmentation, and embeddings. It also exposes methods that run supported workflows, such as slide preprocessing (segmentation + tiling) and tile embedding. The Slide object also exposes basic visualization utilities.

Initialization

from pathlib import Path
from pathfmtools.image.slide import Slide

# Initialize a Slide from a WSI. This is the only initialization mode when processing a
# slide for the first time.
slide = Slide(
    slide_path=Path("/data/slide1.tiff"),
    store_root_dir=Path("/stores"),
    microns_per_pixel=None,  # Can be specified to override (potentially missing) slide metadata
)

# Alternatively, initialize a Slide from a pre-existing h5 store. Per-slide h5 stores are
# created from pre-processing and embedding steps. h5 stores keep a reference to the original
# WSI file path, and contain cached tiles, segmentation results, and embedding results. This is
# the preferred method of initialization when available.
slide = Slide(
    store_path=Path("/stores/slide1.h5"),
)

Preprocessing

Slide preprocessing (segmentation + tiling) can be performed by calling the Slide.preprocess method. Note: When interested in generating embeddings, a direct call to Slide.preprocess is neither necessary nor recommended, as the Slide.embed_tiles method handles preprocessing implicitly and always extracts the tile size(s) that match what the specified model(s) expect — see the Embedding section.

segmentation, tile_readers = slide.preprocess(
    tile_sizes=[448],  # 1+ tile sizes to extract. Only foreground regions (tissue) are extracted.
    segmenter="otsu",  # The name of the segmenter that will distinguish tissue from background.
    tile_workers=4,    # Number of multiprocessing workers for extracting tiles from the WSI
)
  • segmentation is a Segmentation object with the attributes:
    • seg_mask: Boolean foreground/background (1/0) array. Dimensions are determined by the segmenter of choice, but must preserve the aspect ratio of the slide.
    • seg_method: The name of the segmentation method used to produce the segmentation.
    • prop_foreground: The proportion of the slide that was determined to be foreground/tissue.
  • tile_readers are re-usable pixel iterators bound to the persisted HDF5 datasets. They support lazy access to tile pixels using numpy-like indexing syntax, and support iteration. Tile readers must be used as context managers:
    tile_reader = slide.get_tile_reader(tile_size=448)
    with tile_reader as t:
        tile0, coords0 = t[0]
        for tile, coords in t:
            pass

The Segmentation object associated with a slide can be accessed using Slide.get_segmentation(). Similarly, the tile reader for a given tile size can be accessed using Slide.get_tile_reader(tile_size=...).
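
For example, after preprocessing, the cached segmentation can be retrieved and inspected (attribute names follow the Segmentation description above):

segmentation = slide.get_segmentation()
print(segmentation.seg_method)        # e.g. "otsu"
print(segmentation.prop_foreground)   # proportion of the slide classified as tissue
mask = segmentation.seg_mask          # boolean foreground/background array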

Visualization

The Slide object supports visualizations including thumbnail generation and reading of arbitrary regions given coordinates. See demo/slide.ipynb for examples.

Embedding with pathfmtools.tile_models.ModelPool

ModelPool is an object that coordinates calls to embedding models. It allows the user to specify the model(s) and GPU(s) to use for embedding, and is a required input to the Slide.embed_tiles method.

The object supports 2 modes of GPU assignment:

  1. Dynamic: specify multiple models and multiple GPUs without direct linkage. The pool will attempt to keep the specified GPUs as busy as possible by dynamically assigning embedding tasks to free GPUs in parallel.
  2. Mapped: specify an explicit mapping from model to GPU. Models can only run on their associated GPUs. Parallel embedding is performed given at least two models with non-overlapping GPU assignments.
import torch
from pathfmtools.tile_models.model_pool import ModelPool

# Dynamic GPU assignment. Each model may run on either GPU, provided that it is free.
pool = ModelPool(
    model_names=["conch", "musk", "virchow2"],
    devices=["cuda:0", "cuda:1"],
)

# Mapped GPU assignment. conch and musk can only run on GPU 0, virchow2 can only run on GPU 1.
# conch and musk will run sequentially on GPU 0, virchow2 will run in parallel on GPU 1.
pool = ModelPool(
    model_device_map={
      "conch": "cuda:0",
      "musk": "cuda:0",
      "virchow2": "cuda:1",
    }
)

embed_results = slide.embed_tiles(
    model_pool=pool,
    batch_size=256,                   # Batches fed into embedding models
    skip_feature_embeddings=False,    # If True, skip generating feature embeddings
    skip_zeroshot_embeddings=False,   # If True, skip generating zero-shot embeddings
    tile_workers=8,                   # Multiprocessing workers for tile extraction
    data_workers=2,                   # DataLoader workers feeding the model
)

# embed_results is a list of TileEmbeddingGroup objects. See demo/slide.ipynb for usage reference.
for result in embed_results:
    features = result.feature_embedding_matrix        # Shape (n_tiles, n_features)
    zeroshot = result.zeroshot_embedding_matrix       # May be None if model lacks text support
    pixels = result.pixel_reader                      # Context-managed lazy access to RGB tiles

The TileEmbeddingGroup objects associated with a slide can be accessed using Slide.read_tile_embeddings(model_name=..., tile_size=...). Note that tile_size is the tile size that was provided to the model, which is not necessarily the same as the size of the tiles extracted from the slide (due to inter-slide variations in microns per pixel). If you directly call Slide.embed_tiles as shown in the example above, the "model tile size" will always be 224 for built-in models.
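
For example (a sketch assuming a single TileEmbeddingGroup is returned for the given model and tile size; see demo/slide.ipynb for the actual return type):

group = slide.read_tile_embeddings(model_name="conch", tile_size=224)
features = group.feature_embedding_matrix   # (n_tiles, n_features)
zeroshot = group.zeroshot_embedding_matrix  # None if the model lacks text support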

Analysis

The TileAnalysis object provides a simple interface for running analyses using the generated embeddings.

from pathfmtools.analysis.tile_analysis import TileAnalysis

analysis = TileAnalysis.from_slides(
    data=slide,              # One Slide instance or a list of Slide instances.
    model_name="conch",      # Name of the model that generated the embeddings which will be used.
    tile_size=224,           # Tile size provided to the model. Unless overridden during embedding generation, always 224.
)

TileAnalysis has built-in support for 3 types of analyses:

  • Clustering: Cluster model embeddings and visualize tiles assigned to different clusters to detect patterns. See demo/analysis/clustering.ipynb.
  • Zero-shot classification: Compute cosine similarity scores (logits) between tiles and user-provided text prompts. Visualize the highest- and lowest-probability tiles for each prompt. Visualize prompt-specific similarity score heatmaps and tile class assignments over a slide thumbnail. See demo/analysis/zeroshot_classification.ipynb.
  • Dimensionality reduction: Perform dimensionality reduction (TSNE/UMAP/PCA) on generated embeddings and visualize the results as 2D/3D scatterplots. Supports coloring of reduced data points by clustering results or zero-shot class assignment. When running in Jupyter, the generated scatterplots support on-hover display of the corresponding tile pixels to facilitate pattern detection. See demo/analysis/dimensionality_reduction.ipynb.

Additionally, pathfmtools exposes the pathfmtools.data.torch.TileEmbeddingDataset class, which is a Torch Dataset that interfaces with the created h5 stores and exposes generated embeddings, greatly simplifying deep learning workflows. See demo/analysis/abmil.ipynb for a plug-and-play example that trains a toy ABMIL model using only a few lines of code.
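
A rough sketch of how such a dataset might plug into a standard DataLoader; the constructor arguments shown here are hypothetical placeholders, so refer to demo/analysis/abmil.ipynb for the actual interface:

from pathlib import Path
from torch.utils.data import DataLoader
from pathfmtools.data.torch import TileEmbeddingDataset

# Constructor arguments below are hypothetical; consult the demo notebook for the real signature.
dataset = TileEmbeddingDataset(
    store_paths=[Path("/stores/slide1.h5")],
    model_name="conch",
    tile_size=224,
)
loader = DataLoader(dataset, batch_size=1)  # e.g. one bag of tile embeddings per slide for ABMIL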

Extensions

The extension system lets you add user-provided embedding models and slide segmenters without modifying core code. Extensions are regular Python classes that live under the pathfmtools.extensions package and are registered by alias into a persistent manifest.

  • Location policy: classes must be importable from pathfmtools.extensions.<module>:<Class>.
  • Manifest: registrations persist to pathfmtools/extensions/manifest.yaml.
  • Autoloading: on startup, registered entries are validated and added to the core registries so you can reference them anywhere a model or segmenter name is accepted by the CLI or API.

Kinds and required interfaces

  • Segmenters: implement a static segment_slide(slide_reader: SlideReader, ...) -> np.ndarray[bool]. Example: pathfmtools.extensions.demo_segmenter:CheckerboardSegmenter.
  • Models: subclass pathfmtools.tile_models.tile_model.TileModel and define required class attributes (NAME, EXPECTED_MICRONS_PER_PIXEL, EXPECTED_TILE_SIZE, FEATURE_EMBEDDING_DIM, SUPPORTS_ZEROSHOT, ZEROSHOT_EMBEDDING_DIM, POOLING_RULE) and methods (get_preproc_callable, get_feature_embeddings). Example: pathfmtools.extensions.demo_model:MeanIntensityModel.

See demo/register_model.ipynb and demo/register_segmenter.ipynb for examples of registering classes and using them.
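
For illustration only, a toy model sketch following the documented class attributes; the method signatures and return conventions here are guesses, so mirror pathfmtools.extensions.demo_model:MeanIntensityModel rather than this snippet when writing a real extension:

import torch
from pathfmtools.tile_models.tile_model import TileModel

class MeanRGBModel(TileModel):
    NAME = "mean-rgb"
    EXPECTED_MICRONS_PER_PIXEL = 0.5
    EXPECTED_TILE_SIZE = 224
    FEATURE_EMBEDDING_DIM = 3
    SUPPORTS_ZEROSHOT = False
    ZEROSHOT_EMBEDDING_DIM = None
    POOLING_RULE = None  # placeholder; use whatever value TileModel actually expects

    def get_preproc_callable(self):
        # Hypothetical: map an HWC uint8 tile to a CHW float tensor in [0, 1].
        return lambda tile: torch.as_tensor(tile).permute(2, 0, 1).float() / 255.0

    def get_feature_embeddings(self, batch):
        # Hypothetical: mean R/G/B intensity per tile as a 3-dim "embedding".
        return batch.mean(dim=(2, 3))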

FAQ

  • Extracted tile sizes vs model tile sizes
    • Slides can have variable microns-per-pixel (MPP). Extraction happens in slide pixel space at requested sizes (e.g., 448 px). Models consume a fixed model tile size (e.g., 224) after scaling to match the expected MPP; see the short sketch after this list. The “model tile size” recorded with embeddings reflects the input to the model, not necessarily the raw extracted size. If not explicitly overridden, all built-in models consume a tile size of 224. Slide.list_available_feature_embeddings() and Slide.list_available_zeroshot_embeddings() list the available embeddings and the tile sizes that the models consumed to generate them.
  • Persisted artifacts
    • Per-slide HDF5 stores cache segmentation masks, RGB tiles per size, and embeddings per model+model-tile-size. Repeated runs reuse cached data to avoid recomputation.
  • Torch datasets
    • Use pathfmtools.data.torch.TileEmbeddingDataset to iterate embeddings directly from HDF5. Ideal for training downstream models without manual I/O.
  • Feature vs zero-shot embeddings
    • Feature embeddings are pure vision features. Zero-shot embeddings are aligned to the model’s text space and enable prompt-based similarity and classification when the backbone supports text.
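
A short sketch of the tile-size relationship described in the first FAQ item (the helper below is purely illustrative, not part of the package):

# Pixels to extract from the slide so that, after rescaling to the model's
# expected MPP, the tile matches the model's expected input size.
def extracted_tile_size(model_tile_size_px: int, model_mpp: float, slide_mpp: float) -> int:
    return round(model_tile_size_px * model_mpp / slide_mpp)

# e.g. a model expecting 224 px at 0.5 MPP, applied to a slide scanned at 0.25 MPP:
print(extracted_tile_size(224, 0.5, 0.25))  # 448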

Performance and Sizing

  • Concurrency
    • Tune CLI parameters --n-parallel (slides), --n-tile-workers (tiling), and --n-data-workers (dataloader) based on IO vs GPU utilization.
  • Disk
    • Keeping RGB tiles accelerates future pixel-dependent tasks but increases store size; remove via pfmt delete-tiles when not needed.
  • Memory
    • Batch size is bounded by GPU memory and model; start with smaller batch sizes and adjust. Monitor VRAM and dataloader RAM usage.
