A clean, modular deep learning pipeline for training UNet models on fundus vessel segmentation tasks (FIVES dataset).
```bash
# 1. Install dependencies
pip install -r requirements.txt

# 2. Prepare your data (see Data Setup below)

# 3. Train a model
./train.sh exp001_basic_unet    # UNet baseline
# OR
./train.sh exp002_roinet        # RoiNet with residuals
# OR queue multiple experiments
./queue.sh exp001_basic_unet exp002_roinet    # Runs sequentially

# 4. Test the model
./test.sh exp001_basic_unet
```

- Features
- Installation
- Data Setup
- Training
- Testing & Inference
- TensorBoard Visualization
- Configuration
- Project Structure
- Output Structure
- Memory Profiling & Debugging
- Extending the Pipeline
- Modular Configuration System: YAML-based dataset + experiment configs
- Multiple Architectures:
- UNet: Classic encoder-decoder
- RoiNet: Residual blocks with deepened bottleneck
- UTrans: UNet + Transformer for global context
- TransRoiNet: RoiNet + Transformer (best of both worlds)
- Reusable Transformer Blocks: Modular attention components for building hybrid models
- Training Loop: Complete with validation, metrics tracking, and progress bars
- TensorBoard Integration: Real-time visualization of training metrics, learning curves, and predictions
- Memory Profiling: Comprehensive VRAM usage analysis for debugging and optimization
- Early Stopping: Stops training when validation metrics stop improving (with patience)
- Metrics History: Saves all epoch metrics to YAML for easy analysis
- Checkpointing: Saves best and last model checkpoints
- Testing & Inference: Load checkpoints, run predictions, save masks and metrics
- Loss: Dice Loss (smooth, differentiable)
- Metrics: Dice Coefficient, IoU (Intersection over Union), AUC (Area Under ROC Curve)
- Per-Image Metrics: Individual metrics for each test image
- Advanced Logging: Layer activation monitoring and histograms
- Dataset: Automatic loading of images and masks
- Preprocessing: Normalization, padding to multiples of 32
- Image Format: Supports PNG images
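For illustration, padding to a multiple of 32 (so the UNet's downsampling stages divide evenly) can be sketched with a hypothetical NumPy helper; this is not the pipeline's actual preprocessing code:

```python
import numpy as np

def pad_to_multiple(image: np.ndarray, multiple: int = 32) -> np.ndarray:
    """Zero-pad height and width up to the next multiple of `multiple`.
    Illustrative sketch of the preprocessing step described above."""
    h, w = image.shape[:2]
    pad_h = (multiple - h % multiple) % multiple
    pad_w = (multiple - w % multiple) % multiple
    # Pad only spatial dims; leave any channel dimension untouched.
    pad_widths = [(0, pad_h), (0, pad_w)] + [(0, 0)] * (image.ndim - 2)
    return np.pad(image, pad_widths)
```

An image of already-valid size (e.g. 512x512) passes through unchanged.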
- Python 3.8+
- CUDA 11.8+ (for GPU support)
- ~3.5 GB disk space for dependencies
```bash
cd /home/vlv/Documents/master/deepLearning/project/codebase
pip install -r requirements.txt
python -c "import torch; print(f'PyTorch: {torch.__version__}'); print(f'CUDA available: {torch.cuda.is_available()}')"
```

For detailed installation instructions and troubleshooting, see INSTALL.md.
The pipeline supports multiple FIVES dataset variants at different resolutions and channel configurations:
| Config File | Resolution | Channels | Description |
|---|---|---|---|
| `fives_rgb.yaml` | 2048x2048 | 3 (RGB) | Original high-resolution |
| `fives_512.yaml` | 512x512 | 3 (RGB) | Legacy 512x512 RGB (backward compatible) |
| `fives512_rgb.yaml` | 512x512 | 3 (RGB) | 512x512 RGB |
| `fives512_g.yaml` | 512x512 | 1 (Green) | 512x512 green channel only |
| `fives256_rgb.yaml` | 256x256 | 3 (RGB) | 256x256 RGB |
| `fives256_g.yaml` | 256x256 | 1 (Green) | 256x256 green channel only |
All datasets follow this structure:
```
codebase/data/FIVES<VARIANT>/
├── train/
│   ├── image/    # Training images (*.png)
│   └── label/    # Training masks (*.png)
├── val/
│   ├── image/    # Validation images
│   └── label/    # Validation masks
└── test/
    ├── image/    # Test images
    └── label/    # Test masks
```
Where `<VARIANT>` is:
- `_RGB` - Original resolution RGB
- `512_RGB` - 512x512 RGB
- `512_G` - 512x512 green channel
- `256_RGB` - 256x256 RGB
- `256_G` - 256x256 green channel
In your experiment config, reference the dataset:
```yaml
# For RGB datasets
dataset: "configs/datasets/fives512_rgb.yaml"

# For green channel datasets (remember to set model in_channels: 1)
dataset: "configs/datasets/fives512_g.yaml"
```

To use a different path, edit the corresponding dataset config file:
```yaml
paths:
  root: "/your/custom/path/to/FIVES512_RGB"
  train: "/your/custom/path/to/FIVES512_RGB/train"
  val: "/your/custom/path/to/FIVES512_RGB/val"
  test: "/your/custom/path/to/FIVES512_RGB/test"
```

```bash
# Make script executable (first time only)
chmod +x train.sh

# Train with experiment config
./train.sh exp001_basic_unet
```

Run multiple experiments sequentially without manual intervention:
```bash
# Make script executable (first time only)
chmod +x queue.sh

# Queue multiple experiments
./queue.sh exp001_basic_unet exp002_roinet exp003_utrans

# Or use specific experiments
./queue.sh exp001_basic_unet exp002_roinet
```

The queue script will:
- Run each experiment sequentially
- Continue even if one fails
- Log a queue summary to `outputs/queue_logs/queue_TIMESTAMP.log`
- Save each experiment's full output to its own directory
- Show progress and summary at the end
Useful for overnight training or running multiple configurations.
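The queue behavior can be sketched in Python (illustrative only; the actual `queue.sh` is a shell script, and the `run` parameter here is a hypothetical hook for testing):

```python
import subprocess

def run_queue(experiments, run=None):
    """Run experiments one after another, continuing even if one fails.
    Sketch of queue.sh's behavior, not the script itself."""
    if run is None:
        # Default: launch each experiment via the training launcher.
        run = lambda exp: subprocess.run(["./train.sh", exp]).returncode
    results = {}
    for exp in experiments:
        try:
            results[exp] = run(exp)
        except OSError as err:  # e.g. missing script; keep going
            results[exp] = repr(err)
    return results
```

A non-zero exit code in the returned summary marks a failed experiment, mirroring the "continue even if one fails" behavior above.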
You can also run the training script directly:

```bash
python scripts/train.py --config configs/experiments/exp001_basic_unet.yaml
```

What happens during training:

- Initialization: Loads config, creates model, sets random seed
- Training Loop:
  - Trains on the training set with a progress bar
  - Validates after each epoch
  - Prints metrics (loss, dice, IoU)
  - Saves metrics history to YAML after each epoch
- Logging:
  - All console output is saved to `training_log_TIMESTAMP.txt` in the experiment directory
  - Real-time display while training
  - Useful for reviewing training details later
- Checkpointing:
  - Saves the best model when the validation metric improves
  - Saves the last checkpoint every epoch
- Early Stopping:
  - Monitors a validation metric (e.g., `val_dice`)
  - Stops training if there is no improvement for N epochs (patience)
  - Displays a countdown during no-improvement periods
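The patience logic can be sketched as follows; this is a minimal illustration, not the trainer's actual implementation in `src/training/trainer.py`:

```python
class EarlyStopping:
    """Stop when the monitored metric hasn't improved for `patience` epochs.
    Minimal sketch; the pipeline's trainer may differ in details."""
    def __init__(self, patience=5, mode="max"):
        self.patience = patience
        self.sign = 1.0 if mode == "max" else -1.0  # flip comparison for losses
        self.best = None
        self.bad_epochs = 0

    def step(self, value):
        """Record one epoch's metric; return True if training should stop."""
        if self.best is None or self.sign * (value - self.best) > 0:
            self.best = value
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

With `patience: 5` and `monitor: "val_dice"` as in the example config, training halts after five consecutive epochs without a new best Dice.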
Example training output:

```
Starting training for 20 epochs...
Epoch 1 [Train]: 100%|████████████| 150/150 [01:23<00:00]
Epoch 1 [Val]: 100%|████████████| 30/30 [00:12<00:00]
Epoch 1/20
Train - Loss: 0.3456, dice: 0.6544, iou: 0.5123
Val   - Loss: 0.3789, dice: 0.6211, iou: 0.4890
→ New best val_dice: 0.6211
Epoch 2/20
Train - Loss: 0.2987, dice: 0.7013, iou: 0.5567
Val   - Loss: 0.3234, dice: 0.6766, iou: 0.5234
→ New best val_dice: 0.6766
...
```
All epoch metrics are automatically saved to `outputs/experiments/<exp_name>/metrics_history.yaml`.
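Since the history is plain YAML, it is easy to analyze afterwards. For example (assuming PyYAML and the `metrics_history.yaml` format shown under Output Structure; the inline string stands in for the real file):

```python
import yaml

# Two epochs in the documented metrics_history.yaml format (sample values).
history_yaml = """
- epoch: 1
  train: {loss: 0.3456, dice: 0.6544, iou: 0.5123}
  val: {val_loss: 0.3789, val_dice: 0.6211, val_iou: 0.4890}
- epoch: 2
  train: {loss: 0.2987, dice: 0.7013, iou: 0.5567}
  val: {val_loss: 0.3234, val_dice: 0.6766, val_iou: 0.5234}
"""
history = yaml.safe_load(history_yaml)

# Find the epoch with the best validation Dice.
best = max(history, key=lambda e: e["val"]["val_dice"])
```

In practice you would `yaml.safe_load()` the file from the experiment directory instead of the inline string.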
TensorBoard integration is fully supported for real-time training visualization and experiment tracking.
TensorBoard is controlled via the logging section in your experiment config:
```yaml
logging:
  tensorboard: true        # Enable/disable TensorBoard logging
  log_images: false        # Enable/disable image logging (optional)
  image_log_frequency: 5   # Log images every N epochs (default: 5, set to 1 for every epoch)
```

Note: All existing experiment configs already have `tensorboard: true` by default.
During Training:
```bash
# In a separate terminal, run:
tensorboard --logdir outputs/experiments/<exp_name>/tensorboard

# For multiple experiments:
tensorboard --logdir outputs/experiments

# Then open in browser:
# http://localhost:6006
```

After Training:

```bash
# View logs for a specific experiment
tensorboard --logdir outputs/experiments/exp001_basic_unet/tensorboard

# Compare multiple experiments
tensorboard --logdir outputs/experiments
```

Training Metrics:
- `train/loss` - Training loss
- `train/dice` - Training Dice coefficient
- `train/iou` - Training IoU score
Validation Metrics:
- `val/loss` - Validation loss
- `val/dice` - Validation Dice coefficient
- `val/iou` - Validation IoU score
Learning Rate:
- `learning_rate` - Current learning rate (tracks the scheduler)
Comparison Plots:
- `comparison/dice` - Train vs Val Dice on the same plot
- `comparison/iou` - Train vs Val IoU on the same plot
When `log_images: true` in config:
- Frequency: Configurable via `image_log_frequency` (default: 5 epochs), plus whenever the best model is saved
- Content: Side-by-side comparison of:
  - Input image (denormalized)
  - Ground truth mask
  - Predicted mask
- Location: `val/predictions` tab
- Samples: Up to 4 validation samples per log
Example: To log images every epoch:

```yaml
logging:
  tensorboard: true
  log_images: true
  image_log_frequency: 1  # Log every epoch
```

Monitor neural network layer activations:
- Frequency: Configurable via `activation_log_frequency` (default: 5 epochs)
- Content: For each monitored layer:
  - Histogram of activation values
  - Statistics (mean, std, min, max)
- Location: `Histograms` tab (distributions), `Scalars` tab (statistics)
- Layer Selection:
  - `"auto"`: Model-specific defaults (recommended)
  - Custom list: Specify exact layers
  - `null`: Monitor all layers (not recommended)
Example: To log activations every epoch:

```yaml
logging:
  tensorboard: true
  log_activations: true
  activation_log_frequency: 1
  activation_layers: "auto"  # Or specify: ["encoder1", "bottleneck"]
```

The model architecture graph is automatically logged at training start:
- Shows layer connections and data flow
- Useful for debugging model structure
- View in the "Graphs" tab
At training completion, logs hyperparameters and final metrics:
- Model type, batch size, learning rate, etc.
- Final validation metrics
- Best metric achieved
- Enables comparison across experiments
Scalars Tab:
- Smooth curves (adjust smoothing slider)
- Compare runs side-by-side
- Toggle specific runs on/off
- Download data as CSV/JSON
Images Tab:
- View prediction quality over time
- Identify overfitting visually
- Track model convergence
Graphs Tab:
- Visualize model architecture
- Verify layer connections
HParams Tab:
- Compare hyperparameters across experiments
- Identify best configurations
- Parallel coordinates plot
1. Start training with TensorBoard enabled:

   ```bash
   ./train.sh exp001_basic_unet
   ```

2. In a separate terminal, start TensorBoard:

   ```bash
   tensorboard --logdir outputs/experiments/exp001_basic_unet/tensorboard
   ```

3. Open a browser:
   - Navigate to http://localhost:6006
   - Watch metrics update in real-time as training progresses

4. Compare multiple experiments:

   ```bash
   # Run multiple experiments
   ./train.sh exp001_basic_unet
   ./train.sh exp002_roinet
   ./train.sh exp003_utrans

   # View all together
   tensorboard --logdir outputs/experiments
   ```
Test metrics are also logged to TensorBoard when running tests:
```bash
./test.sh exp001_basic_unet
```

Test logs are saved to: `outputs/experiments/<exp_name>/tensorboard/test/`
To disable TensorBoard for an experiment:
```yaml
logging:
  tensorboard: false  # Disable TensorBoard
  log_images: false
```

Training will proceed normally with only console output and YAML metrics files.
- Remote Server: If training on a remote server, use SSH port forwarding:

  ```bash
  ssh -L 6006:localhost:6006 user@remote-server tensorboard --logdir /path/to/experiments
  ```

- Custom Port: Use a different port if 6006 is occupied:

  ```bash
  tensorboard --logdir outputs/experiments --port 6007
  ```

- Multiple Instances: Run multiple TensorBoard instances for different experiment groups:

  ```bash
  tensorboard --logdir outputs/experiments/baseline_models --port 6006
  tensorboard --logdir outputs/experiments/transformer_models --port 6007
  ```

- Refresh: TensorBoard auto-refreshes every 30 seconds. Click the refresh button for immediate updates.
```bash
# Test with best checkpoint
./test.sh exp001_basic_unet

# Test with last checkpoint
./test.sh exp001_basic_unet last
```

Or run the test script directly:

```bash
python scripts/test.py --config configs/experiments/exp001_basic_unet.yaml
```

The test script:
- Loads the trained model checkpoint (best.pth or last.pth)
- Runs inference on all test images with a progress bar
- Calculates metrics for each image (Dice, IoU)
- Saves predicted masks to `outputs/tests/<exp_name>/predictions/`
- Saves metrics to YAML files:
  - `test_metrics.yaml` - Average metrics across all test images
  - `per_image_metrics.yaml` - Individual metrics for each image
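For reference, the reported Dice and IoU can be computed as in this illustrative NumPy sketch; the pipeline's actual implementations live in `src/training/metrics.py` and may differ in details:

```python
import numpy as np

def dice_coefficient(pred, target, threshold=0.5, smooth=1e-6):
    """Dice = 2|A∩B| / (|A| + |B|), after thresholding to binary masks."""
    p = (np.asarray(pred) > threshold).astype(float)
    t = (np.asarray(target) > threshold).astype(float)
    inter = (p * t).sum()
    return (2 * inter + smooth) / (p.sum() + t.sum() + smooth)

def iou_score(pred, target, threshold=0.5, smooth=1e-6):
    """IoU = |A∩B| / |A∪B|, after thresholding to binary masks."""
    p = (np.asarray(pred) > threshold).astype(float)
    t = (np.asarray(target) > threshold).astype(float)
    inter = (p * t).sum()
    union = p.sum() + t.sum() - inter
    return (inter + smooth) / (union + smooth)
```

The `smooth` term mirrors the Dice loss's smoothing and keeps both metrics defined for empty masks.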
```
outputs/tests/exp001_basic_unet/
├── predictions/              # Predicted segmentation masks
│   ├── 1_A.png
│   ├── 2_A.png
│   └── ...
├── test_metrics.yaml         # Summary: average metrics
└── per_image_metrics.yaml    # Detailed: per-image metrics
```
`test_metrics.yaml`:

```yaml
experiment: exp001_basic_unet
checkpoint: outputs/experiments/exp001_basic_unet/checkpoints/best.pth
num_test_images: 200
average_metrics:
  dice: 0.7834
  iou: 0.6912
```

`per_image_metrics.yaml`:

```yaml
- image: 1_A.png
  dice: 0.7912
  iou: 0.7034
- image: 2_A.png
  dice: 0.8123
  iou: 0.7245
...
```

The pipeline uses a two-level configuration system:
File: `configs/datasets/fives_512.yaml`
Defines dataset properties that rarely change:
- Data paths
- Image dimensions
- Normalization statistics
```yaml
name: "FIVES512"

paths:
  root: "data/FIVES512"
  train: "data/FIVES512/train"
  val: "data/FIVES512/val"
  test: "data/FIVES512/test"

image_size: [512, 512]
num_channels: 3
num_classes: 1

mean: [0.485, 0.456, 0.406]
std: [0.229, 0.224, 0.225]
```

File: `configs/experiments/exp001_basic_unet.yaml`
Contains all training parameters for an experiment:
```yaml
name: "exp001_basic_unet"
dataset: "configs/datasets/fives_512.yaml"

# Data loading
data:
  batch_size: 4
  num_workers: 2
  pin_memory: true
  augmentation:
    enabled: false
  preprocessing:
    normalize: true
    pad_to_multiple: 32

# Model architecture
model:
  type: "UNet"
  in_channels: 3
  out_channels: 1
  depths: [32, 64, 128, 256, 512]
  final_activation: "sigmoid"

# Training settings
training:
  epochs: 20
  optimizer:
    type: "adam"
    learning_rate: 0.0001
    weight_decay: 0.0001
  scheduler:
    type: "cosine"
    min_lr: 0.000001
  loss:
    type: "dice"
    smooth: 0.000001
  metrics:
    - "dice"
    - "iou"
  early_stopping:
    enabled: true
    patience: 5
    monitor: "val_dice"
    mode: "max"

# Output
output:
  dir: "outputs/experiments/exp001_basic_unet"
  save_predictions: true

seed: 42
device: "cuda"
```

Change batch size (for GPU memory):
```yaml
data:
  batch_size: 8  # Increase if you have more GPU memory
```

Adjust learning rate:

```yaml
training:
  optimizer:
    learning_rate: 0.001  # Larger for faster convergence
```

Change model size:

```yaml
model:
  depths: [16, 32, 64, 128, 256]  # Smaller model for less memory
```

Adjust early stopping:

```yaml
training:
  early_stopping:
    enabled: true
    patience: 10  # Wait 10 epochs before stopping
```

```bash
# Copy existing config
cp configs/experiments/exp001_basic_unet.yaml configs/experiments/exp002_my_test.yaml

# Edit the new config
nano configs/experiments/exp002_my_test.yaml

# Train with new config
./train.sh exp002_my_test
```

```
codebase/
├── configs/
│   ├── datasets/                      # Dataset configurations
│   │   └── fives_512.yaml
│   └── experiments/                   # Experiment configurations
│       ├── exp001_basic_unet.yaml
│       ├── exp002_roinet.yaml
│       ├── exp002_roinet_batch_size.yaml
│       ├── exp003_utrans.yaml
│       └── exp004_transroinet.yaml
│
├── src/                               # Source code
│   ├── data/
│   │   ├── dataset.py                 # Dataset class
│   │   └── __init__.py                # DataLoader factory
│   ├── models/
│   │   ├── registry.py                # Model registration system
│   │   ├── architectures/
│   │   │   ├── unet.py                # UNet implementation
│   │   │   ├── roinet.py              # RoiNet implementation
│   │   │   ├── utrans.py              # UTrans implementation (UNet + Transformer)
│   │   │   └── transroinet.py         # TransRoiNet implementation (RoiNet + Transformer)
│   │   └── blocks/
│   │       ├── conv_blocks.py         # CNN blocks (DoubleConv, ResidualBlock)
│   │       └── transformer_blocks.py  # Transformer blocks (Self-Attention, FFN, etc.)
│   ├── training/
│   │   ├── trainer.py                 # Training loop
│   │   ├── losses.py                  # Loss functions
│   │   └── metrics.py                 # Metrics
│   └── utils/
│       ├── config.py                  # Config loading
│       └── helpers.py                 # Utilities
│
├── scripts/
│   ├── train.py                       # Training script
│   └── test.py                        # Testing script
│
├── data/                              # Data directory (gitignored)
│   └── FIVES512/
│
├── outputs/                           # Training outputs (gitignored)
│   ├── experiments/                   # Training results
│   ├── tests/                         # Test results
│   └── queue_logs/                    # Queue script logs
│
├── train.sh                           # Training launcher
├── test.sh                            # Testing launcher
├── queue.sh                           # Queue multiple experiments
├── requirements.txt                   # Dependencies
├── INSTALL.md                         # Installation guide
└── README.md                          # This file
```
```
outputs/experiments/exp001_basic_unet/
├── config.yaml                   # Copy of experiment config
├── training_log_TIMESTAMP.txt    # Complete console output
├── checkpoints/
│   ├── best.pth                  # Best model (highest val_dice)
│   └── last.pth                  # Latest checkpoint
├── metrics_history.yaml          # All epoch metrics
└── tensorboard/                  # TensorBoard logs (if enabled)
    ├── events.out.tfevents.*
    └── test/                     # Test results (if test.py was run)
```

`metrics_history.yaml` format:
```yaml
- epoch: 1
  train:
    loss: 0.3456
    dice: 0.6544
    iou: 0.5123
  val:
    val_loss: 0.3789
    val_dice: 0.6211
    val_iou: 0.4890
- epoch: 2
  train:
    loss: 0.2987
    dice: 0.7013
    iou: 0.5567
  val:
    val_loss: 0.3234
    val_dice: 0.6766
    val_iou: 0.5234
...
```

```
outputs/tests/exp001_basic_unet/
├── predictions/              # Predicted masks (PNG images)
│   ├── 1_A.png
│   ├── 2_A.png
│   └── ...
├── test_metrics.yaml         # Average metrics
└── per_image_metrics.yaml    # Per-image metrics
```
```yaml
# For UNet
model:
  type: "UNet"
  in_channels: 3
  out_channels: 1
  depths: [32, 64, 128, 256, 512]
  final_activation: "sigmoid"

# For TransUNet
model:
  type: "TransUNet"
  in_channels: 3
  out_channels: 1
  depths: [64, 128, 256, 512]
  transformer_embed_dim: 512    # Transformer embedding dimension
  transformer_depth: 6          # Number of transformer blocks
  transformer_heads: 8          # Attention heads
  transformer_mlp_ratio: 4.0    # FFN expansion ratio
  transformer_dropout: 0.1      # Dropout probability
  final_activation: "sigmoid"

# For RoiNet
model:
  type: "RoiNet"
  in_channels: 3
  out_channels: 1
  depths: [32, 64, 128, 128, 64, 32]
  kernel_size: 9
  final_activation: "sigmoid"

# For TransRoiNet (RoiNet + Transformer)
model:
  type: "TransRoiNet"
  in_channels: 3
  out_channels: 1
  depths: [32, 64, 128, 128, 64, 32]
  kernel_size: 9                # Large kernel for residual blocks
  transformer_depth: 2          # Number of transformer blocks (lighter)
  transformer_heads: 8          # Attention heads
  transformer_mlp_ratio: 4.0    # FFN expansion ratio
  transformer_dropout: 0.1      # Dropout probability
  final_activation: "sigmoid"
```

File: `src/training/losses.py`
```python
import torch.nn as nn

class MyCustomLoss(nn.Module):
    def __init__(self, param=1.0):
        super().__init__()
        self.param = param

    def forward(self, pred, target):
        # Your loss calculation
        return loss

# Add to create_loss function
def create_loss(loss_config):
    loss_type = loss_config['type']
    if loss_type == 'my_custom':
        return MyCustomLoss(param=loss_config.get('param', 1.0))
    # ... existing losses
```

Then use in config:
```yaml
training:
  loss:
    type: "my_custom"
    param: 2.0
```

File: `src/training/metrics.py`
```python
def my_custom_metric(pred, target, threshold=0.5):
    # Your metric calculation
    return metric_value.item()

# Add to METRICS dictionary
METRICS = {
    'dice': dice_coefficient,
    'iou': iou_score,
    'my_metric': my_custom_metric,
}
```

Then use in config:
```yaml
training:
  metrics:
    - "dice"
    - "iou"
    - "my_metric"
```

File: `src/models/architectures/my_model.py`
```python
import torch.nn as nn
from ..registry import register_model

@register_model('MyModel')
class MyModel(nn.Module):
    def __init__(self, config):
        super().__init__()
        # Your architecture

    def forward(self, x):
        # Forward pass
        return output
```

Import in `src/models/__init__.py`:
```python
from .architectures.unet import UNet
from .architectures.my_model import MyModel  # Add this
```

Use in config:
```yaml
model:
  type: "MyModel"
  # ... your model parameters
```

The pipeline includes comprehensive VRAM profiling to help debug memory issues and optimize resource usage.
Add to your experiment config:
```yaml
debug:
  profile_memory: true          # Enable memory profiling
  detailed_memory: true         # Show per-layer breakdown
  estimate_activations: true    # Estimate activation memory
  profile_training_step: false  # Profile actual training step (CUDA only)
```

Before training starts, you'll see:
```
GPU Memory (Device: cuda:0):
  Total VRAM: 23.65 GB
  Currently Used: 245.67 MB
  Available: 23.41 GB

Model Memory Breakdown:
  Parameters: 93.52 MB
  Buffers: 0.12 MB
  Total Model: 93.64 MB

Training Memory Estimates:
  Gradients: 93.52 MB
  Optimizer (ADAM): 187.04 MB
  Activations: 1.23 GB
  Total Estimated Training Memory: 1.59 GB
  (~6.7% of available VRAM)

Per-Layer Memory Breakdown (Top 15):
Layer Name                               Parameters      Memory
---------------------------------------- --------------- ------------
encoder4                                 8,388,608       32.00 MB
encoder3                                 2,097,152       8.00 MB
...
```
...