Gleason Grade Prediction from MRI-Guided Biopsy

Prostate cancer classification using T2-weighted MRI biopsy core data. Three approaches are compared: a classical ML baseline, a 3D CNN, and a Graph Neural Network (GNN), all trained on the public TCIA Prostate-MRI-US-Biopsy dataset (~16,700 biopsy cores).

Binary task: clinically significant cancer (Gleason Grade Group 3+) vs. low-grade/benign (GG0–2).

Pipeline Overview

Step 1  build_manifest.py          Build dataset manifest (labels + needle coordinates)
Step 2  02_coordinate_alignment    Validate DICOM ↔ voxel coordinate transforms
Step 3  extract_rois.py            Extract sparse cylindrical point clouds per core
Step 4  04_baseline.ipynb          Random Forest on hand-crafted features (~AUC 0.72)
Step 5a 05_dl_3dcnn.ipynb          3D CNN on resampled grid (~AUC 0.77)
Step 5b 06_dl_gnn_final.ipynb      GNN on native sparse point cloud (~AUC 0.80)

Setup

pip install -r requirements.txt

Key dependencies: pydicom, torch, torch_geometric, scikit-learn, pandas, numpy.

Data: Download the TCIA dataset and place DICOM series under data/dicom/ and the Excel spreadsheet at data/raw/TCIA-Biopsy-Data_2020-07-14.xlsx.

Running the Pipeline

# 1. Build the manifest (one row per biopsy core, with labels and needle coords)
python src/build_manifest.py

# 2. Validate coordinate system (inspect visuals in notebook)
jupyter notebook notebooks/02_coordinate_alignment.ipynb

# 3. Extract sparse ROIs from MRI volumes
python src/extract_rois.py

# 4–5. Train and evaluate models (notebooks run independently)
jupyter notebook notebooks/04_baseline.ipynb
jupyter notebook notebooks/05_dl_3dcnn.ipynb
jupyter notebook notebooks/06_dl_gnn_final.ipynb

Project Structure

gleason_prediction/
├── src/
│   ├── build_manifest.py        # Step 1: dataset manifest builder
│   ├── dicom_utils.py           # DICOM loading, LPS ↔ voxel transforms
│   └── extract_rois.py          # Step 3: cylindrical ROI extraction
├── notebooks/
│   ├── 02_coordinate_alignment.ipynb
│   ├── 04_baseline.ipynb
│   ├── 05_dl_3dcnn.ipynb
│   └── 06_dl_gnn_final.ipynb
├── data/                        # Git-ignored except metadata CSVs
│   ├── manifest.csv             # Main dataset (16,692 cores)
│   ├── dicom_index.csv          # series_uid → DICOM path cache
│   ├── extraction_report.csv    # Step 3 QA stats
│   ├── cores_to_drop_contamination.csv
│   ├── patches/                 # Sparse point clouds (.npz), one per core
│   ├── grids/                   # Precomputed 3D grids for CNN
│   └── edges/                   # Precomputed graph edges for GNN
└── requirements.txt

Methods

ROI Representation

Each biopsy core is represented as a sparse cylindrical point cloud in a needle-aligned coordinate frame (axes: w = axial along needle, u/v = radial). Voxels within 2 mm of the needle trajectory are extracted from the T2-weighted MRI volume, with intensities z-score normalized per scan. This yields compact ~9 KB .npz files and avoids including irrelevant tissue or neighboring needle signals.

Step 4 — Random Forest Baseline

~20 hand-crafted features per core (intensity statistics, spatial extent, gradient magnitude). Evaluated with 5-fold stratified group cross-validation (splits at subject level to prevent data leakage).

Step 5a — 3D CNN

Point clouds are resampled onto a fixed 3D grid (50 × 12 × 12 bins, covering 100 mm axially and ±2 mm radially). A residual 3D CNN with global average pooling classifies each grid. Augmentation: rotation around the needle axis and radial flips.

Step 5b — GNN (EdgeConv / DGCNN-style)

Nodes are voxels; edges connect nearby voxels (radius graph). Node features: normalized axial position, radial coordinates, radial distance, intensity. EdgeConv layers (MLP(cat(x_i, x_j − x_i))) with residual connections and global max+mean pooling. Graph edges are precomputed once and cached. Augmentation is the same as for the CNN and leaves edge structure invariant.

All three models share identical train/validation/test splits (15% subjects held out for testing) for fair comparison.

Gleason Grading Reference

Grade Group	Gleason Score	Clinical Significance
GG0	—	Benign
GG1	3+3 = 6	Low-grade
GG2	3+4 = 7	Intermediate
GG3	4+3 = 7	Clinically significant (positive class)
GG4	8	Clinically significant
GG5	9–10	Clinically significant

Citation

Natarajan, S., Priester, A., Margolis, D., Huang, J., & Marks, L. (2020). Prostate MRI and Ultrasound With Pathology and Coordinates of Tracked Biopsy (Prostate-MRI-US-Biopsy) (version 2) [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/TCIA.2020.A61IOC1A

Name		Name	Last commit message	Last commit date
Latest commit History 15 Commits
.vscode		.vscode
data/raw		data/raw
notebooks		notebooks
src		src
.gitignore		.gitignore
README.md		README.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Gleason Grade Prediction from MRI-Guided Biopsy

Pipeline Overview

Setup

Running the Pipeline

Project Structure

Methods

ROI Representation

Step 4 — Random Forest Baseline

Step 5a — 3D CNN

Step 5b — GNN (EdgeConv / DGCNN-style)

Gleason Grading Reference

Citation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Gleason Grade Prediction from MRI-Guided Biopsy

Pipeline Overview

Setup

Running the Pipeline

Project Structure

Methods

ROI Representation

Step 4 — Random Forest Baseline

Step 5a — 3D CNN

Step 5b — GNN (EdgeConv / DGCNN-style)

Gleason Grading Reference

Citation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages