This repository provides modularized Python scripts for benchmarking and validating histopathology foundation models (HFMs) on downstream tasks such as linear probing, k-nearest neighbor (kNN) evaluation, and ABMIL (classification and regression). It is designed to work seamlessly with pre-trained HFMs (e.g., Virchow, UNI, H-optimus) and custom datasets across multiple histological stains (H&E, PAS, Silver, IHC).
project_root/
│
├── pyproject.toml
├── requirements.txt
├── environment.yml
├── LICENSE
├── readme.md
└── src/
    └── kidney_hfm_eval/
        ├── __init__.py
        ├── ABMIL_classification/
        ├── ABMIL_regression/
        ├── Copy_detection/        # Copy-detection & augmentation-invariance analysis
        ├── Feature_Extraction/    # Model builders + tile/WSI feature extraction
        ├── MIL_preprocessing/     # Fold generation + embedding packing utilities
        ├── Probing/               # Linear probing + kNN probing utilities
        ├── Regression/            # Ridge regression (quantitative probing)
        └── Statistical_analysis/  # Friedman/Wilcoxon tests + Holm-Bonferroni + CLD grouping + plots
The repository provides an end-to-end framework for evaluating HFMs across tile-level, instance-level, and slide-level tasks. The workflow consists of four major components:
Scripts: kidney_hfm_eval/Feature_Extraction/
- Extracts tile-level or WSI-level embeddings using a variety of pretrained HFMs (e.g., UNI, Virchow, Hibou, SP22M, SP85M, Gigapath).
- Outputs standardized .pt embedding files and metadata for downstream tasks.
Probing assesses the quality of frozen HFM embeddings without fine-tuning.
Script: kidney_hfm_eval/Probing/linear_probing.py
- Trains a logistic regression classifier (or linear regressor) on top of frozen embeddings.
- Measures separability, embedding structure, and task-specific predictive signal.
Script: kidney_hfm_eval/Probing/kNN_probing.py
- Performs non-parametric nearest-neighbor evaluation.
- Serves as a robustness check for feature quality, insensitive to model optimization choices.
Folder: kidney_hfm_eval/Copy_detection/
- Quantifies embedding specificity and redundancy across models.
- Useful for measuring overfitting/embedding collapse.
Folders: kidney_hfm_eval/ABMIL_classification/, kidney_hfm_eval/ABMIL_regression/
- Implements Attention-based MIL (ABMIL) networks for slide-level prediction.
- Includes nested or repeated cross-validation pipelines, hyperparameter sweeps, class-balancing utilities, and model selection.
- Provides complete training + evaluation workflow for downstream slide-level tasks.
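For orientation, here is a minimal sketch of gated attention pooling (Ilse et al., 2018), the mechanism ABMIL builds on; `M` and `L` correspond to the hidden dimensions tuned later via `tune_params`. This is an illustration of the idea, not the repository's exact architecture:

```python
import torch
import torch.nn as nn

class GatedAttentionMIL(nn.Module):
    """Gated attention pooling over a bag of tile embeddings (illustrative)."""
    def __init__(self, in_dim: int, M: int = 512, L: int = 128, n_classes: int = 2):
        super().__init__()
        self.embed = nn.Sequential(nn.Linear(in_dim, M), nn.ReLU())
        self.att_V = nn.Sequential(nn.Linear(M, L), nn.Tanh())
        self.att_U = nn.Sequential(nn.Linear(M, L), nn.Sigmoid())
        self.att_w = nn.Linear(L, 1)
        self.classifier = nn.Linear(M, n_classes)

    def forward(self, bag):                            # bag: (n_tiles, in_dim)
        h = self.embed(bag)                            # (n_tiles, M)
        a = self.att_w(self.att_V(h) * self.att_U(h))  # (n_tiles, 1) attention logits
        a = torch.softmax(a, dim=0)                    # attention weights over tiles
        z = (a * h).sum(dim=0)                         # slide-level representation (M,)
        return self.classifier(z), a                   # logits + attention map
```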
Scripts: model_builders.py, vision_transformer.py
- Handles loading of HFMs (DINOv2-based and ViT-based models such as H-optimus, Virchow, Hibou, and UNI).
- Ensures consistent preprocessing pipelines across models.
- Supports custom architectures or adaptation modules for histopathology.
Folder: kidney_hfm_eval/Statistical_analysis
- Performs Friedman tests, Wilcoxon tests, Holm-Bonferroni correction, and Compact Letter Display (CLD) grouping for model comparisons.
- Generates violin/box plots, heatmaps, and summary figures used in benchmarking papers.
# Clone the repository
git clone https://github.com/<your-username>/Validation_of_Hist_FMs_new.git
cd Validation_of_Hist_FMs_new/Github\ code
# (Recommended) Create and activate a conda environment
conda create -n hist_fm python=3.10 -y
conda activate hist_fm
# Install dependencies
pip install -r requirements.txt

Alternatively, install the package directly:
pip install kidney_hfm_eval

Verify the installation in Python:
import kidney_hfm_eval

Harishwar Reddy Kasireddy
Ph.D. Student, University of Florida
Sarder Lab – Intelligent Critical Care Center
harishwarreddy.k@ufl.edu
https://github.com/K-Harishwar-Reddy
Performs sequential feature extraction from histopathology image tiles or patches using a registry of HFMs such as UNI, Virchow, Hibou, SP22M, Prov-Gigapath, H-optimus, and others.
This script serves as the first stage of the HFM benchmarking pipeline, generating .pt feature tensors for downstream linear or kNN probing.
- Supports multiple HFMs: UNI, Virchow, Hibou, H-optimus, SP22M, SP85M, Prov-Gigapath, etc.
- Sequential extraction: builds one model at a time to minimize GPU memory usage.
- Fully deterministic and reproducible: TF32 disabled, fixed cuDNN behavior, deterministic GEMMs.
- Automatic directory creation and safe skipping of existing .pt files.
- Compatible with CUDA, with CPU fallback.
- Configurable from the command line with flexible argument parsing.
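The determinism settings listed above can be reproduced with a few standard PyTorch switches. This is a hedged sketch of the usual flags, not the extraction script's exact code:

```python
import torch

# Disable TF32 so GEMMs run in full FP32 precision
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False

# Fix cuDNN behavior (no autotuned, non-deterministic kernels)
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True

# Force deterministic algorithms everywhere; on CUDA this also requires
# setting the env var CUBLAS_WORKSPACE_CONFIG=:4096:8 before launch
torch.use_deterministic_algorithms(True)
```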
A directory tree of tiles or patches:
/path/to/tiles/
├── slide_001/
│ ├── tile_0.png
│ ├── tile_1.png
│ └── ...
├── slide_002/
│ ├── tile_0.png
│ └── ...
For each selected model, a parallel directory containing per-image feature tensors:
/path/to/output/
├── UNI/
│ ├── slide_001/
│ │ ├── tile_0.pt
│ │ ├── tile_1.pt
│ └── ...
├── Virchow/
│ ├── slide_001/
│ ├── ...
Each .pt file contains a single PyTorch tensor corresponding to the feature embedding extracted by the model.
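To sanity-check an extracted file, you can load it directly (hypothetical path; the embedding dimensionality varies by model):

```python
import torch

# Load one per-tile embedding tensor on CPU
emb = torch.load("/path/to/output/UNI/slide_001/tile_0.pt", map_location="cpu")
print(emb.shape)  # e.g., a 1D feature vector; size depends on the HFM
```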
Create a Python script (e.g., run_tile_extraction.py) and insert the following:
from src.kidney_hfm_eval.Feature_Extraction.Extract_FE_tile import run_extraction_model_by_model, model_builders_tile
import torch
# Define input and output directories
INPUT_ROOT = "/path/to/your/tiles"
OUTPUT_BASE = "/path/to/save/embeddings"
# Initialize device and registry
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
builders = model_builders_tile(DEVICE)
# Specify the models you want to extract features from
selected_models = [
"UNI", "UNI2-h", "Virchow", "Virchow2",
"SP22M", "SP85M", "H-optimus-0", "H-optimus-1",
"Prov-Gigapath", "Hibou-B", "Hibou-L"
]
# Create output paths for each model
output_roots = {name: f"{OUTPUT_BASE}/{name}" for name in selected_models}
# Run feature extraction
run_extraction_model_by_model(
input_root=INPUT_ROOT,
model_builder_registry=builders,
output_roots=output_roots,
selected_models=tuple(selected_models),
skip_existing=True # Skip tiles that were already processed
)

| Argument | Description |
|---|---|
| `selected_models` | Tuple of models to run. Remove models you don't need. |
| `skip_existing=True` | Skips tiles whose .pt file is already present. |
| `DEVICE` | Automatically selects GPU if available. |
After extracting embeddings for each HFM, you can evaluate their representation quality using linear probing and k-nearest neighbor (kNN) probing. These probes quantify how well each model separates biological classes using only frozen embeddings (no fine-tuning).
Both probes support:
- Multiple HFMs
- StratifiedGroupKFold cross-validation
- Repeated evaluation across random seeds
- Bootstrapped confidence intervals
You need:
A metadata CSV with columns:
slide_id, label, group_id, etc.
- slide_id must match folder names under your embedding root.
- label is the target class.
- group_id ensures patient-level grouping for StratifiedGroupKFold cross-validation.
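For illustration, a metadata file with hypothetical slide IDs, labels, and patient groups could be created like this (column names as above; the values are placeholders):

```python
import pandas as pd

# Hypothetical metadata matching the expected columns
pd.DataFrame({
    "slide_id": ["slide_001", "slide_002", "slide_003"],
    "label":    ["class_A", "class_B", "class_A"],
    "group_id": ["patient_01", "patient_02", "patient_03"],
}).to_csv("/path/to/metadata.csv", index=False)
```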
Structure:
emb_root/
├── UNI/
│ ├── slide_001/
│ │ ├── tile_0001.pt
│ │ └── ...
├── Virchow/
└── Hibou-L/
Location where probe results (CSVs, plots, logs) will be saved.
Create a script (e.g., run_probes.py) and insert:
from src.kidney_hfm_eval.Probing.linear_probing import run_linear_probe
from src.kidney_hfm_eval.Probing.kNN_probing import run_kNN_probe
# Paths
csv_file = "/path/to/metadata.csv" # Contains slide_id, label, group info
emb_root = "/path/to/embeddings" # Root folder containing model subfolders
out_dir = "/path/to/output" # Results will be saved here
# Models to evaluate
selected_models = [
"UNI", "UNI2-h", "Virchow", "Virchow2",
"SP22M", "SP85M", "H-optimus-0", "H-optimus-1",
"Prov-Gigapath", "Hibou-B", "Hibou-L"
]
# -----------------------------
# Run Linear Probing
# -----------------------------
run_linear_probe(
csv_file=csv_file,
emb_root=emb_root,
out_dir=out_dir,
models=selected_models,
n_splits=5, # StratifiedGroupKFold = 5 folds
seeds=(0, 1, 2), # Repeated CV across 3 seeds
bootstrap=1000 # Bootstrapped confidence intervals
)
# -----------------------------
# Run kNN Probing
# -----------------------------
run_kNN_probe(
csv_file=csv_file,
emb_root=emb_root,
out_dir=out_dir,
models=selected_models,
n_splits=5, # same CV setup
seeds=(0, 1, 2),
bootstrap=1000
)

Each model gets its own subfolder:
out_dir/
├── UNI/
│ ├── bootstrap_CI_LR_UNI.csv
│ ├── bootstrap_CI_kNN_UNI.csv
│ ├── bootstrap_replicates_kNN_UNI.csv
│ ├── bootstrap_replicates_LR_UNI.csv
│ ├── cv_results_kNN_UNI.csv
│ └── cv_results_LR_UNI.csv
├── UNI2-h/
├── Virchow/
└── ...
Each probe automatically produces:
- Accuracy, balanced accuracy, F1, MCC, AUROC, AUPRC
- CV means + std
- Bootstrap confidence intervals (95%)
| Parameter | Meaning |
|---|---|
| `n_splits=5` | Number of folds in StratifiedGroupKFold |
| `seeds=(0,1,2)` | Repeat probing for stability |
| `bootstrap=1000` | Resample predictions to compute confidence intervals |
| `models=[...]` | Subset of HFMs to evaluate |
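As a rough illustration of what `bootstrap=1000` does, percentile confidence intervals can be computed by resampling per-sample scores. This is a sketch of the general technique; the packaged probes handle the resampling internally and may operate on predictions rather than scores:

```python
import numpy as np

rng = np.random.default_rng(0)

def percentile_ci(scores, n_boot=1000, alpha=0.05):
    """95% percentile-bootstrap CI over a vector of per-sample scores."""
    scores = np.asarray(scores)
    boots = [scores[rng.integers(0, len(scores), len(scores))].mean()
             for _ in range(n_boot)]
    lo, hi = np.quantile(boots, [alpha / 2, 1 - alpha / 2])
    return scores.mean(), (lo, hi)
```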
Regression probing evaluates how well frozen foundation-model embeddings encode continuous biological or clinical variables, such as:
- Cell-type proportions
- eGFR, proteinuria, or other renal function metrics
- Morphometric measurements
- Quantitative pathology features
The regression probe uses ridge regression (SVD-based solver) along with nested cross-validation and bootstrapped confidence intervals.
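To make the setup concrete, below is a minimal sketch of an SVD-solver ridge probe with grouped cross-validation in scikit-learn. The shapes and data here are synthetic placeholders; `run_regression_probe` adds the nesting, multiple seeds, and bootstrapping on top:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GroupKFold

X = np.random.randn(200, 1024)        # frozen slide-level embeddings (placeholder)
y = np.random.randn(200)              # continuous target, e.g., eGFR (placeholder)
groups = np.repeat(np.arange(40), 5)  # patient-level grouping (placeholder)

for train, test in GroupKFold(n_splits=5).split(X, y, groups):
    model = Ridge(alpha=1.0, solver="svd").fit(X[train], y[train])
    print(f"fold R^2 = {model.score(X[test], y[test]):.3f}")
```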
A directory containing .pt files of continuous labels:
label_root/
├── slide_001.pt # tensor of continuous label(s)
├── slide_002.pt
└── ...
Directory containing embeddings extracted for each foundation model:
emb_root/
├── UNI/
│ ├── slide_001/
│ │ ├── tile_0001.pt
│ │ └── ...
├── Virchow/
└── Hibou-L/
A table mapping slide_id → metadata, including:
slide_id, group_id, etc.
This ensures patient-level grouping during regression CV.
Where regression results will be saved.
Create a script named run_regression.py:
from src.kidney_hfm_eval.Regression.linear_regression import run_regression_probe
# Paths (update these)
emb_root = "/path/to/embeddings"
label_root = "/path/to/labels"
csv_file = "/path/to/metadata.csv"
out_dir = "/path/to/output"
# Foundation models to evaluate
selected_models = [
"UNI", "UNI2-h", "Virchow", "Virchow2",
"SP22M", "SP85M", "H-optimus-0", "H-optimus-1",
"Prov-Gigapath", "Hibou-B", "Hibou-L"
]
# -----------------------------
# Run Regression Probing
# -----------------------------
run_regression_probe(
emb_root=emb_root,
label_root=label_root,
csv_file=csv_file,
out_dir=out_dir,
models=selected_models, # evaluate all selected models
n_splits=5, # StratifiedGroupKFold (5 folds)
seeds=(0, 1, 2), # multiple CV seeds for robustness
bootstrap_iters=1000 # bootstrapped CIs for regression metrics
)

Run the script:
python run_regression.py

Each model will receive its own result directory:
out_dir/
├── UNI/
│ ├── regression_metrics.csv
│ ├── bootstrap_statistics.csv
│ ├── predictions_fold0.csv
│ ├── predictions_fold1.csv
│ ├── plots/
│ │ ├── scatter_pred_vs_true.png
│ │ ├── violin_r2.png
│ │ └── ...
│ └── logs/
├── Virchow/
└── Hibou-L/
You get:
- Mean ± SD of R², RMSE, MAE, Pearson r, Spearman ρ
- Bootstrapped CIs for all metrics
- Per-fold predictions
- Scatter plots + distributions
- Reproducible logs
| Parameter | Description |
|---|---|
| `n_splits=5` | CV folds (StratifiedGroupKFold) |
| `seeds=(0,1,2)` | Repeat CV for stability |
| `bootstrap_iters=1000` | Bootstrapped confidence intervals |
| `label_root` | Directory containing continuous labels |
| `emb_root` | Directory containing model embeddings |
copy_detection.py evaluates the robustness and consistency of HFM embeddings when subjected to various image augmentations (e.g., geometric, color, noise, deformation).
It does this by computing Top-K retrieval accuracy — i.e., whether each original image’s embedding correctly identifies its augmented counterpart among all others.
This task quantifies representation stability — a crucial property for models expected to produce invariant embeddings across domain shifts, stain variations, or transformations.
Your base directory should have one subfolder per model, each containing:
/path/to/Copy_detection/
├── UNI/
│ ├── images/
│ │ ├── tile_001.pt
│ │ ├── tile_002.pt
│ │ └── ...
│ └── aug_images/
│ ├── geo/
│ │ ├── tile_001.pt
│ │ ├── tile_002.pt
│ │ └── ...
│ ├── noise/
│ ├── color/
│ └── deform/
├── Virchow/
│ ├── images/
│ └── aug_images/
│ ├── geo/
│ └── color/
└── ...
Each .pt file must contain a 1D PyTorch tensor — the embedding vector corresponding to an original or augmented image.
| Argument | Type | Description |
|---|---|---|
| `--base_dir` | str | Path to the root folder (e.g., Copy_detection/) containing per-model subfolders. |
| `--out_csv` | str | Path to save the output CSV summarizing top-K results. |
For every foundation model and augmentation type, the script computes Top-K retrieval accuracy based on cosine similarity between normalized embeddings:
| Metric | Meaning |
|---|---|
| Top-1 Accuracy | Fraction of images where the exact augmented pair ranks 1st in similarity. |
| Top-K Accuracy | Fraction of images where the pair ranks within the top K (the script reports K = 1, 3, 5). |
Higher values indicate that the model produces stable, augmentation-invariant embeddings.
python copy_detection.py \
--base_dir /orange/pinaki.sarder/harishwarreddy.k/Copy_detection \
--out_csv /orange/pinaki.sarder/harishwarreddy.k/Copy_detection/results/topk_accuracy.csv

This will automatically:
- Iterate over each model subfolder (e.g., UNI, Virchow, SP22M, etc.)
- Compute Top-1/3/5 accuracies for all augmentation types under aug_images/
- Save a summary CSV report with the results.
1. **Load Embeddings.** The script loads all .pt tensors from images/ and each subfolder of aug_images/.
2. **Normalize Embeddings.** Each embedding vector is L2-normalized:

$$v_i' = \frac{v_i}{\lVert v_i \rVert}$$

3. **Compute Similarities.** A cosine similarity matrix between original and augmented embeddings is computed:

$$S = V_{\text{orig}} \cdot V_{\text{aug}}^{\top}$$

4. **Rank Matching Pairs.** For each image i, the script checks where the corresponding augmented embedding ranks among all similarities.
5. **Aggregate Metrics.** Top-1, Top-3, and Top-5 accuracies are averaged across all images for each augmentation type.
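The five steps above amount to a few tensor operations. A minimal sketch, assuming rows of `orig` and `aug` are paired embeddings (not the script's exact implementation):

```python
import torch

def topk_retrieval_accuracy(orig: torch.Tensor, aug: torch.Tensor, topk=(1, 3, 5)):
    """Top-K retrieval accuracy via cosine similarity between paired embeddings."""
    orig = torch.nn.functional.normalize(orig, dim=1)  # L2-normalize rows
    aug = torch.nn.functional.normalize(aug, dim=1)
    sims = orig @ aug.T                                # cosine similarity matrix
    # Rank of the true (diagonal) pair within each row, 0-based
    ranks = (sims > sims.diag().unsqueeze(1)).sum(dim=1)
    return {k: (ranks < k).float().mean().item() for k in topk}
```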
- Embedding filenames in images/ and each aug_images/<type>/ folder must match exactly (e.g., tile_001.pt ↔ tile_001.pt).
- The script automatically validates filename consistency and reports missing pairs.
- Designed for evaluating FM embeddings produced via Extract_FE_modularized.py.
- Useful for comparing invariance of different models (e.g., UNI vs Virchow vs Hibou).
- You can modify topk_list in the function compute_topk_accuracy() to compute custom Top-K values (e.g., [1, 5, 10]).
| Observation | Interpretation |
|---|---|
| High Top-1 accuracy | Model is strongly invariant to augmentations; embeddings are almost identical post-transform. |
| Large gap between Top-1 and Top-5 | Minor distortions or moderate sensitivity to augmentation noise. |
| Low Top-K across augmentations | Embeddings change significantly — model less robust to stain/augmentation shifts. |
Before training ABMIL models, your data must go through two preprocessing steps:
- Generate random K-fold splits for slide-level cross-validation
- Pack tile embeddings into .npy files for efficient MIL training
This section describes how to run both steps.
The first step prepares cross-validation folds using slide-level metadata, ensuring balanced folds and reproducible seeds.
Your CSV (csv_path) must include:
slide_id, group_id, label
- slide_id: unique slide name
- group_id: patient-level grouping (ensures no leakage)
- label: class label
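Under the hood this corresponds to scikit-learn's StratifiedGroupKFold. A hedged sketch of the fold logic (generate_random_folds adds multi-seed handling and JSON export on top):

```python
import pandas as pd
from sklearn.model_selection import StratifiedGroupKFold

df = pd.read_csv("/path/to/metadata.csv")
sgkf = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=0)
for i, (train_idx, val_idx) in enumerate(
        sgkf.split(df, y=df["label"], groups=df["group_id"])):
    print(f"fold_{i}: {len(train_idx)} train / {len(val_idx)} val slides")
```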
Create make_folds.py:
from src.kidney_hfm_eval.MIL_preprocessing.make_random_folds import generate_random_folds
generate_random_folds(
csv_path="/path/to/metadata.csv",
seeds=[0, 1, 2], # reproducible 3-seed split generation
n_splits=5, # 5-fold StratifiedGroupKFold
out_path="/path/to/output/folds.json"
)

A JSON file containing 3×5 folds (one set per seed):
folds.json
{
"seed_0": {
"fold_0": [...],
"fold_1": [...],
...
},
"seed_1": {...},
"seed_2": {...}
}
This is used by all ABMIL training scripts.
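A quick way to inspect the generated folds (hypothetical path):

```python
import json

# Load and sanity-check the fold definitions
with open("/path/to/output/folds.json") as f:
    folds = json.load(f)
print(list(folds))            # e.g., ['seed_0', 'seed_1', 'seed_2']
print(list(folds["seed_0"]))  # e.g., ['fold_0', ..., 'fold_4']
```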
Tile embeddings extracted earlier (one .pt file per tile) are typically slow to load individually.
To train MIL models efficiently, pack all tiles belonging to each slide into a single .npy array.
This step also writes:
- A manifest JSON describing file paths
- A summary JSON containing tile counts, shapes, and embedding stats
/root_dir/
├── UNI/
│ ├── slide_001/
│ │ ├── tile_0001.pt
│ │ ├── tile_0002.pt
├── Virchow/
└── Hibou-L/
Create a script pack_embeddings.py:
from src.kidney_hfm_eval.MIL_preprocessing.pack_embeddings import pack_all_patients
selected_models = [
"UNI", "UNI2-h", "Virchow", "Virchow2",
"SP22M", "SP85M", "H-optimus-0", "H-optimus-1",
"Hibou-B", "Hibou-L", "Prov-Gigapath"
]
# Base paths
ROOT_DIR = "/path/to/embeddings"
OUT_DIR = "/path/to/packed_embeddings"
MANIFEST_DIR = "/path/to/manifests"
SUMMARY_DIR = "/path/to/summaries"
for fm in selected_models:
pack_all_patients(
root_dir = f"{ROOT_DIR}/{fm}",
out_root = f"{OUT_DIR}/{fm}",
manifest_name= f"{MANIFEST_DIR}/{fm}_manifest.json",
summary_name = f"{SUMMARY_DIR}/{fm}_summary.json"
)

After packing, each model will have:
OUT_DIR/UNI/
├── slide_001.npy
├── slide_002.npy
└── ...
MANIFEST_DIR/UNI_manifest.json
SUMMARY_DIR/UNI_summary.json
The manifest maps each slide to its .npy file.
The summary contains:
- number of tiles
- embedding dimensionality
- min/max statistics
- shape per slide
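To verify a packed model, you can inspect one slide array and its manifest (hypothetical paths; the manifest schema is summarized above):

```python
import json
import numpy as np

# One packed slide: all tile embeddings stacked into a single array
bag = np.load("/path/to/packed_embeddings/UNI/slide_001.npy")
print(bag.shape)  # (n_tiles, embedding_dim)

# The manifest maps slides to their .npy files
with open("/path/to/manifests/UNI_manifest.json") as f:
    manifest = json.load(f)
```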
Once the preprocessing steps are complete, you can run:
- MIL classification
- MIL regression
- Nested CV tuning
- Seed-averaged model evaluation
All MIL scripts accept:
- folds.json
- packed embeddings (.npy)
- manifests
- summary files
This module trains Attention-Based Multiple Instance Learning (ABMIL) models on packed slide-level embeddings produced earlier. The pipeline performs:
- Nested cross-validation (outer folds × inner folds)
- Hyperparameter tuning (lr, M, L, etc.)
- Bootstrap-based uncertainty estimation
- Multi-seed training for robustness
- Evaluation across all selected foundation models
This is the main pipeline used for slide-level classification benchmarking.
A directory containing .npy files per slide:
root_dir/
├── UNI/
│ ├── slide_001.npy
│ ├── slide_002.npy
├── Virchow/
└── Hibou-L/
Contains:
slide_id, label, ...
Used to map slides to labels and patient groups.
The folds JSON produced by:
generate_random_folds(...)

Example:
folds.json
{
"seed_0": {
"fold_0": { "train": [...], "val": [...] },
...
}
}
Where training logs for each model will be saved.
Create a script named run_mil_classification.py:
from src.kidney_hfm_eval.ABMIL_classification.mil_classification import run_mil_pipeline
args = {
"root_dir": "/path/to/packed_embeddings",
"csv_path": "/path/to/metadata.csv",
"outer_fold": "/path/to/folds.json",
"log_file_path": "/path/to/training_logs.txt",
# Hyperparameters to tune
"tune_params": "lr,M,L",
# Training configuration
"epochs": 50,
"patience": 20,
"k_folds": 5, # outer folds
"inner_folds": 4, # nested CV inside each outer fold
"batch_size": 1,
# Reproducibility
"seed": 0,
"num_seeds": 3, # train each model with seeds {0,1,2}
# Uncertainty estimation
"bootstrap": 1000,
# Foundation models to evaluate
"models": [
"UNI", "UNI2-h", "Virchow", "Virchow2",
"SP22M", "SP85M", "H-optimus-0", "H-optimus-1",
"Hibou-B", "Hibou-L", "Prov-Gigapath"
],
}
# Run ABMIL pipeline
run_mil_pipeline(args)

Run the script:
python run_mil_classification.py

For each HFM, you will get:
results/
├── UNI/
│ ├── bootstrap_CI_MIL_UNI.csv
│ ├── bootstrap_replicates_MIL_UNI.csv
│ ├── cv_results_MIL_UNI.csv
│ ├── training_log_UNI.csv
├── Virchow/
└── Hibou-L/
Contains:
- Per-fold metrics
- Best hyperparameters for each outer fold
- Training/validation curves
- Bootstrapped MC estimates
- Classification reports
- Confusion matrices
- Final seed-averaged summary metrics
| Parameter | Description |
|---|---|
| `tune_params="lr,M,L"` | Parameters explored in inner CV |
| `epochs=50` | Maximum training epochs |
| `patience=20` | Early stopping |
| `num_seeds=3` | Seed averaging (robustness) |
| `bootstrap=1000` | Confidence intervals |
| `outer_fold` | JSON with predefined folds |
| `models` | Foundation models to train ABMIL on |
This module performs formal statistical comparison across all HFMs using the bootstrap replicate files produced from:
- Linear Probing
- kNN Probing
- ABMIL Classification
- ABMIL Regression
It implements:
- Pairwise model comparison
- Holm-corrected significance testing
- Compact Letter Display (CLD) grouping
- Violin/box plots
- Ranking heatmaps
- Summary tables for each classifier × metric
This is the final stage of the entire benchmarking pipeline.
Each probing or MIL pipeline produces bootstrap replicate CSVs like:
/path/to/results/
UNI/
bootstrap_replicates_LR_mcc.csv
bootstrap_replicates_kNN_auroc.csv
...
Virchow/
Hibou-L/
Pattern:
bootstrap_replicates_{CLASSIFIER}_{METRIC}.csv
These contain 1,000 bootstrap samples per model per metric.
Create a file run_statistical_analysis.py:
from src.kidney_hfm_eval.Statistical_analysis.model_performance_comparison import run_replicate_pipeline
classifiers = ["LR", "kNN"] # OR ["MIL"] or ["LR", "MIL"]
metrics = ["mcc", "balanced_accuracy", "f1", "auroc", "recall"]
# For regression tasks use:
# metrics = ["pearson_r", "r2", "mae", "rmse", "mape"]
# Base directory containing all model replicate CSVs
base_dir = "/path/to/probe_or_mil_results"
# List of foundation models to compare
models = [
"UNI", "UNI2-h", "Virchow", "Virchow2",
"SP22M", "SP85M", "H-optimus-0", "H-optimus-1",
"Prov-Gigapath", "Hibou-B", "Hibou-L"
]
for clf in classifiers:
    for metric in metrics:
        args = {
            "repl_glob": f"{base_dir}/*/bootstrap_replicates_{clf}_*.csv",
            "save_dir": f"{base_dir}/Visualization_results/{clf}_{metric.upper()}",
            "models": models,
            "alpha": 0.05,
            "verbose": True,
            "metric": metric,
            "tag": clf,
        }
        print(f"\n Running {clf} analysis for metric: {metric.upper()} ...")
        try:
            run_replicate_pipeline(args)
        except FileNotFoundError as e:
            print(f" Skipping {clf}-{metric}: {e}")

Run the script:
python run_statistical_analysis.py

Once the pipeline runs, results are saved in:
Visualization_results/
LR_MCC/
violin_plot.png
box_plot.png
cld_table.csv
pvalue_matrix.csv
rank_plot.png
summary_stats.csv
kNN_AUROC/
MIL_BALANCED_ACCURACY/
...
Each classifier–metric pair contains:
- CLD groupings (which models are statistically tied)
- Holm-corrected p-value matrix
- Ordered rank tables
- Violin & box plots per model
- Model-wise distributions
- Wilcoxon signed rank tests
- Holm correction for multiple comparisons
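For intuition, pairwise Wilcoxon signed-rank tests with Holm correction can be sketched with scipy and statsmodels. The replicate vectors here are synthetic placeholders; the packaged pipeline also runs the Friedman omnibus test and derives the CLD grouping:

```python
import numpy as np
from itertools import combinations
from scipy.stats import wilcoxon
from statsmodels.stats.multitest import multipletests

# Placeholder bootstrap replicates: one score vector per model
scores = {m: np.random.rand(1000) for m in ["UNI", "Virchow", "Hibou-L"]}

pairs = list(combinations(scores, 2))
pvals = [wilcoxon(scores[a], scores[b]).pvalue for a, b in pairs]

# Holm correction across all pairwise comparisons
reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="holm")
for (a, b), p, r in zip(pairs, p_adj, reject):
    print(f"{a} vs {b}: Holm-adjusted p = {p:.4f} ({'significant' if r else 'ns'})")
```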
A–B–C grouping summarizing significance relationships:
A: UNI, UNI2-h
B: Virchow, Virchow2
C: Hibou-B, Prov-Gigapath
Models are sorted based on:
- Mean bootstrap score
- CLD group
- Effect sizes
Automatically generates:
- Violin & box plots
- Model ranking bars
- CLD summary heatmaps
- Confidence intervals
| Parameter | Description |
|---|---|
| `classifiers` | LR, kNN, MIL (or combinations) |
| `metrics` | Classification or regression metrics |
| `repl_glob` | Pattern to load replicate CSVs |
| `alpha=0.05` | Significance level for testing |
| `models` | Foundation models included |
| `save_dir` | Output directory for plots & tables |