Nuclei Cell Instance Segmentation Project for AI535 Final Project
Team Members:
- Brandon Gill
- Andy Bui
Cell segmentation is a critical step for delineating cell boundaries in microscopy images, enabling quantitative analysis of cell counts, shapes, and molecular content. By accurately identifying individual cells, it supports drug discovery, disease research, and spatial tissue analysis, ultimately helping to improve cancer diagnosis and treatment strategies.
- Architecture: U-Net
- Loss Function: Binary Cross-Entropy (BCE) + Dice Loss
- Evaluation Metric: Intersection over Union (IoU)
- Experiment Tracking: Weights & Biases (WandB)
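The loss and metric above can be sketched in PyTorch as follows; the class and function names here are illustrative, not necessarily those used in src/loss.py or src/metrics.py:

```python
import torch
import torch.nn as nn

class BCEDiceLoss(nn.Module):
    """Binary Cross-Entropy plus Dice loss, computed on raw logits."""
    def __init__(self, smooth: float = 1.0):
        super().__init__()
        self.bce = nn.BCEWithLogitsLoss()
        self.smooth = smooth  # avoids division by zero on empty masks

    def forward(self, logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        bce = self.bce(logits, targets)
        probs = torch.sigmoid(logits)
        inter = (probs * targets).sum()
        dice = (2 * inter + self.smooth) / (probs.sum() + targets.sum() + self.smooth)
        return bce + (1 - dice)  # Dice loss = 1 - Dice coefficient

def iou(pred_mask: torch.Tensor, true_mask: torch.Tensor, eps: float = 1e-7) -> float:
    """Intersection over Union for binary masks."""
    pred, true = pred_mask.bool(), true_mask.bool()
    inter = (pred & true).sum().float()
    union = (pred | true).sum().float()
    return ((inter + eps) / (union + eps)).item()
```

A perfect prediction drives the Dice term to zero and the IoU to 1.0, so both quantities move in the same direction during training.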
Cell-Segmentation-Deep-Learning/
├── app/
│ └── app.py
├── checkpoints/
│ └── unet_best_model.pth
├── data/
│ ├── augmented/
│ └── data-science-bowl-2018/
│ ├── stage1_test/
│ ├── stage1_train/
│ ├── stage2_test_final/
│ ├── stage1_sample_submission.csv
│ ├── stage1_solution.csv
│ ├── stage1_train_labels.csv
│ └── stage2_sample_submission_final.csv
├── notebooks/
│ └── 01_data_exploration.ipynb
├── outputs/
├── src/
│ ├── __init__.py
│ ├── dataset.py
│ ├── evaluate.py
│ ├── loss.py
│ ├── metrics.py
│ ├── model.py
│ ├── train.py
│ └── utils.py
├── .gitignore
├── README.md
├── requirements.txt
└── train.slurm
pip install -r requirements.txt
The data for this project is sourced from the 2018 Data Science Bowl on Kaggle: Data Science Bowl 2018 Data
This project is configured to run on the Oregon State University high-performance computing (HPC) cluster using the SLURM workload manager. Since we are performing instance segmentation, utilizing the cluster's GPU nodes is highly recommended for training.
Note: Instance segmentation carries significantly more computational overhead than semantic segmentation, especially when running watershed post-processing or models like Mask R-CNN or StarDist. The HPC cluster's GPU nodes are the right tool for this workload.
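As a concrete example of the watershed step, here is a minimal instance-splitting sketch using SciPy; the helper name and the seeding threshold are assumptions for illustration, not the project's actual post-processing:

```python
import numpy as np
from scipy import ndimage as ndi

def split_instances(binary_mask: np.ndarray) -> np.ndarray:
    """Split touching nuclei in a binary mask into labeled instances
    via a distance-transform watershed (hypothetical helper, not in src/)."""
    if binary_mask.max() == 0:
        return np.zeros(binary_mask.shape, dtype=np.int32)
    # Distance to background peaks at nucleus centers
    distance = ndi.distance_transform_edt(binary_mask)
    # Seed one marker per peak region of the distance map
    seeds, _ = ndi.label(distance > 0.5 * distance.max())
    # Flood the inverted distance map outward from the seeds
    inverted = ((distance.max() - distance) / distance.max() * 255).astype(np.uint8)
    labels = ndi.watershed_ift(inverted, seeds.astype(np.int32))
    return labels * (binary_mask > 0)  # clear background pixels
```

Two touching nuclei that form one connected component in the binary mask come out with distinct labels, which is the difference between semantic and instance segmentation in this setting.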
First, SSH into the cluster (ensure you are on the OSU VPN if off-campus):
ssh your_onid@submit.hpc.engr.oregonstate.edu

Clone the repository into your home directory or scratch space:
cd hpc-share
git clone https://github.com/BGill8/Cell-Segmentation-Deep-Learning.git
cd Cell-Segmentation-Deep-Learning

Do not install packages directly into your base environment. Load the necessary CUDA and Python modules, then create a virtual environment.
# Load necessary modules (adjust versions based on current HPC availability)
module load python/3.10
module load cuda/11.8
module load slurm
# Create and activate virtual environment
python -m venv .venv
source .venv/bin/activate
# Install dependencies
pip install --upgrade pip
pip install -r requirements.txt

Important: Do not store data or environments in your home directory (~); its strict quota fills quickly and your jobs will fail. Keep everything in your high-capacity ~/hpc-share/ directory.
Since the Kaggle CLI requires API key configuration, the easiest way to get the data onto the cluster is to download the data-science-bowl-2018.zip to your local machine, and securely copy it over:
1. Run this on your LOCAL machine (Mac/Windows), not the cluster:
scp /path/to/your/downloads/data-science-bowl-2018.zip your_onid@submit.hpc.engr.oregonstate.edu:/nfs/stak/users/your_onid/hpc-share/Cell-Segmentation-Deep-Learning/

2. Extract and organize on the cluster:
cd ~/hpc-share/Cell-Segmentation-Deep-Learning/
mkdir -p data/data-science-bowl-2018
unzip data-science-bowl-2018.zip -d data/data-science-bowl-2018/
# Extract the inner training and testing folders
cd data/data-science-bowl-2018
for f in *.zip; do unzip -d "${f%.zip}" "$f"; done
# CRITICAL: Delete all zip files to free up quota
rm *.zip
cd ../../
rm data-science-bowl-2018.zip

Before submitting a job, you must authenticate WandB on the cluster to track instance segmentation metrics (training loss, IoU, mask visualizations, etc.):
wandb login

Paste your API key when prompted.
Never run python src/train.py directly on the login node. Always submit a batch job using SLURM.
Create a file named train.slurm in the project root:
#!/bin/bash
#SBATCH --job-name=nuclei_instance_seg
#SBATCH --partition=dgx2 # Specify the GPU partition (e.g., dgx2 or gpu)
#SBATCH --gres=gpu:1 # Request 1 GPU
#SBATCH --cpus-per-task=4 # Number of CPU cores for data loading
#SBATCH --mem=32G # Memory required
#SBATCH --time=12:00:00 # Maximum time limit (hrs:min:sec)
#SBATCH --output=outputs/slurm-%j.out
# Load modules and activate environment
module load python/3.10 cuda/11.8
source ~/hpc-share/Cell-Segmentation-Deep-Learning/.venv/bin/activate
# Run the training script
python src/train.py --data_path ~/hpc-share/Cell-Segmentation-Deep-Learning/data/data-science-bowl-2018/stage1_train

Submit the job to the queue:
sbatch train.slurm

| Method | Command / Location |
|---|---|
| SLURM queue | squeue -u your_onid |
| Live logs | tail -f outputs/slurm-<JOB_ID>.out |
| Metrics dashboard | WandB project dashboard (loss, IoU, mask visualizations) |
To cleanly pass the scratch directory path (and other hyperparameters) from your SLURM script via the command line, add the following argument parser to src/train.py:
import argparse

def get_args():
    parser = argparse.ArgumentParser(description="Train nuclei instance segmentation model")

    # Data
    parser.add_argument("--data_path", type=str,
                        default="data/data-science-bowl-2018/stage1_train",
                        help="Path to training data (use scratch path on HPC)")

    # Training hyperparameters
    parser.add_argument("--epochs", type=int, default=50)
    parser.add_argument("--batch_size", type=int, default=8)
    parser.add_argument("--lr", type=float, default=1e-4)
    parser.add_argument("--num_workers", type=int, default=4,
                        help="DataLoader workers (match --cpus-per-task in the SLURM script)")

    # Experiment tracking
    parser.add_argument("--wandb_project", type=str, default="cell-segmentation")
    parser.add_argument("--run_name", type=str, default=None)

    return parser.parse_args()

if __name__ == "__main__":
    args = get_args()
    # e.g., dataset = CellDataset(root=args.data_path)

Then in your SLURM script, you can override any default at submission time:
python src/train.py \
--data_path /scratch/your_onid/data-science-bowl-2018/stage1_train \
--epochs 100 \
--batch_size 16 \
--run_name "maskrcnn_run1"

# Check available modules
module spider python
module spider cuda
# Check job queue
squeue -u your_onid
# Cancel a job
scancel <JOB_ID>
# Check cluster node availability
sinfo -p gpu
# Check your scratch storage usage
du -sh /scratch/your_onid/