ExpFlow - HPC Experiment Manager

Lightweight experiment tracking for HPC clusters. Stop manually editing SLURM scripts.

ExpFlow auto-detects your HPC environment (username, scratch paths, SLURM accounts) and automates experiment tracking - no hardcoded paths, no manual script editing, no Excel spreadsheets.

Quick Start

# Install
pip install git+https://github.com/hurryingauto3/expflow-hpc.git

# Initialize with interactive setup
expflow init -i my-research

# Navigate to project
cd /scratch/YOUR_ID/my-research

# Create template
expflow template baseline

# Check resources and monitor experiments
expflow resources --status
expflow status
expflow logs exp001

Key Features

  • Auto-Detection: Automatically detects username, scratch directory, SLURM accounts, partition access, containers, and conda
  • Interactive Setup: Menu-based initialization with intelligent account and GPU recommendations
  • Experiment Monitoring: Built-in commands for status tracking, log viewing, and job management
  • Checkpoint Resumption: Automatic checkpoint detection and experiment resume support (v0.6.0+; see the sketch after this list)
  • Checkpoint Registry: Structured checkpoint tracking with metadata and best checkpoint selection (v0.7.0+)
  • Container Integration: Generic apptainer/singularity support with auto-detected images and bind mounts (v0.7.0+)
  • Conda Management: Auto-detected conda environments with module support (v0.7.0+)
  • GPU Monitoring: Built-in nvidia-smi monitoring with configurable intervals (v0.7.0+)
  • NCCL Optimization: GPU-specific NCCL presets for H200, A100, L40s, RTX8000 (v0.7.0+)
  • Experiment Pruning: Clean up duplicate runs and invalid experiments with safe archival
  • Resource Advisor: Real-time GPU availability and smart recommendations
  • Partition Validation: Automatic partition-account compatibility testing
  • YAML-Based Configs: No more editing SLURM scripts manually
  • Complete Tracking: Git commits, job IDs, timestamps, and results automatically logged
  • Extensible: Subclass BaseExperimentManager for custom workflows
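
As an example of checkpoint resumption, the v0.6.0 API exposes manager.resume_experiment(). A minimal sketch, assuming a custom manager like the one defined under "Creating Custom Managers" below (the constructor and argument shown are illustrative assumptions):

from my_manager import MyManager  # hypothetical module; see "Creating Custom Managers"

manager = MyManager()
# Detect the latest checkpoint for exp001 and re-submit the job from it.
# resume_experiment() is the v0.6.0 entry point; exact arguments may differ.
manager.resume_experiment("exp001")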

Installation

From GitHub (Recommended)

pip install git+https://github.com/hurryingauto3/expflow-hpc.git

With Conda

conda create -n expflow python=3.10
conda activate expflow
pip install git+https://github.com/hurryingauto3/expflow-hpc.git

From Source

git clone https://github.com/hurryingauto3/expflow-hpc.git
cd expflow-hpc
pip install -e .

Getting Started

1. Initialize Your Project

Interactive Mode (Recommended):

expflow init -i my-research

Guided setup with:

  • Account selection (prefers "general" accounts for broad access)
  • GPU/Partition selection (H200, L40s, A100, RTX8000 categories)
  • Time limit preferences (6h, 12h, 24h, 48h, 72h)
  • Automatic partition access validation

Quick Mode:

expflow init -q my-research

Uses smart defaults without prompts.

2. Check Your Environment

expflow info

Output:

======================================================================
HPC Environment Information
======================================================================
Username: ah7072
Scratch: /scratch/ah7072
Cluster: greene
Accounts: torch_pr_68_general, torch_pr_68_tandon_advanced
Partitions: l40s_public, h200_public, rtx8000

3. Create Experiment Template

cd /scratch/YOUR_ID/my-research
expflow template baseline

Edit experiment_templates/baseline.yaml:

description: "Baseline experiment"

# Your parameters
model: resnet50
dataset: imagenet
batch_size: 256
learning_rate: 0.1

# Resources (auto-detected defaults)
partition: l40s_public
account: torch_pr_68_general
num_gpus: 4
num_nodes: 1
cpus_per_task: 16
time_limit: "48:00:00"
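
Because templates are plain YAML, they are also easy to generate or tweak programmatically, for example to sweep a hyperparameter by writing variant templates. A sketch using PyYAML (an assumption about your environment, not an ExpFlow API):

import yaml  # PyYAML

with open("experiment_templates/baseline.yaml") as f:
    cfg = yaml.safe_load(f)

# Write one variant template per learning rate.
for lr in (0.1, 0.05, 0.01):
    cfg["learning_rate"] = lr
    with open(f"experiment_templates/baseline_lr{lr}.yaml", "w") as f:
        yaml.safe_dump(cfg, f)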

4. Monitor and Manage Experiments

# View all experiments and running jobs
expflow status

# List experiments
expflow list
expflow list --status running

# View logs
expflow logs exp001              # Last 50 lines
expflow logs exp001 -n 100       # Last 100 lines
expflow logs exp001 --type eval  # Evaluation logs

# Follow logs in real-time
expflow tail exp001

# Cancel jobs
expflow cancel exp001

Core Commands

Initialization

expflow init -i <project>    # Interactive setup with menus
expflow init -q <project>    # Quick setup with defaults
expflow init <project>       # Legacy auto-detect mode

Environment Info

expflow info                 # Show HPC environment details
expflow config               # Show project configuration

Resource Management

expflow resources --status                     # Check GPU availability
expflow partitions                             # Show partition-account access map
expflow partitions --json                      # Export as JSON

Experiment Monitoring

expflow status                                 # Show experiments and SLURM jobs
expflow list                                   # List all experiments
expflow list --status running                  # Filter by status
expflow logs <exp_id>                          # View experiment logs
expflow logs <exp_id> -n 100 --errors          # View last 100 lines of errors
expflow tail <exp_id>                          # Follow logs in real-time
expflow cancel <exp_id>                        # Cancel running jobs
expflow prune --dry-run                        # Preview cleanup of duplicate experiments

Templates

expflow template <name>      # Create experiment template

Why ExpFlow?

Before ExpFlow

# Manually edit SLURM scripts
vim train.slurm

# Hardcoded paths that only work for you
#SBATCH --account=my_account        # Others can't use this!
export DATA=/scratch/myuser/data    # Hardcoded!

sbatch train.slurm

# Track experiments manually in Excel
# Forget git commits
# Lose hyperparameters

With ExpFlow

# One-time setup (works for ANY user)
expflow init -i my-project

# Create experiment from template
python -m my_manager new --exp-id exp001 --template baseline

# Submit (auto-detects YOUR paths, account, GPUs)
python -m my_manager submit exp001

# Monitor in real-time
expflow status
expflow tail exp001

# Auto-harvest results
python -m my_manager harvest exp001
python -m my_manager export results.csv

Time Saved: ~80% reduction in experiment setup time

Resource Advisor

Check GPU availability before submitting:

$ expflow resources --status

======================================================================
GPU Resource Status
======================================================================

L40S_PUBLIC
   Total: 40 GPUs
   Available: 12
   In Use: 28
   Queue: 3 jobs
   Status: AVAILABLE

H200_TANDON
   Total: 10 GPUs
   Available: 0
   In Use: 10
   Queue: 8 jobs
   Wait Time: ~4 hours
   Status: BUSY

Recommendation: Use l40s_public with 4 GPUs (best availability)

Partition Management

View partition-account compatibility:

$ expflow partitions

======================================================================
Partition Access Map
======================================================================

h200_public (GPU: H200) [GPU Required]
  ✓ torch_pr_68_general
  ✓ torch_pr_68_tandon_advanced

l40s_public (GPU: L40s) [GPU Required]
  ✓ torch_pr_68_general

Account Access Summary:
  torch_pr_68_general → h200_public, l40s_public, rtx8000
  torch_pr_68_tandon_advanced → h200_public, h200_tandon

Creating Custom Managers

For project-specific workflows, create a custom manager:

from expflow import BaseExperimentManager

class MyManager(BaseExperimentManager):
    def _generate_train_script(self, config):
        """Generate SLURM training script"""
        return f'''#!/bin/bash
#SBATCH --gres=gpu:{config['num_gpus']}
#SBATCH --partition={config['partition']}
#SBATCH --account={config['account']}

python train.py --model {config['model']} ...
'''

    def _generate_eval_script(self, config):
        """Generate SLURM evaluation script"""
        return "#!/bin/bash\npython evaluate.py ..."

    def harvest_results(self, exp_id):
        """Parse experiment results"""
        return {"accuracy": 0.95, "loss": 0.12}

Use your manager:

python my_manager.py new --exp-id exp001 --template baseline
python my_manager.py submit exp001
python my_manager.py harvest exp001
python my_manager.py export results.csv

Documentation

What's New in v0.7.0

Framework-Level Helpers (Eliminates 150-200 lines of boilerplate):

  • Container Integration: Auto-detected apptainer images with configurable bind mounts
  • Conda Management: Auto-detected conda environments with module support
  • SquashFS Overlays: Generic overlay mounting helpers with automatic fallback
  • Checkpoint Registry: Structured checkpoint tracking with metadata and best selection
  • GPU Monitoring: Configurable nvidia-smi monitoring with automatic cleanup
  • NCCL Optimization: GPU-specific presets (H200, A100, L40s, RTX8000)
  • Variable Substitution: Template system for portable configs (${scratch_dir}, ${username}, etc.)
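
Conceptually, variable substitution expands placeholders from the auto-detected environment so the same config works for every user. A minimal illustration using Python's string.Template (an analogy, not necessarily ExpFlow's internal implementation):

from string import Template

# Example values ExpFlow would auto-detect for the current user.
env = {"scratch_dir": "/scratch/ah7072", "username": "ah7072"}

# A portable config value with placeholders...
raw = "${scratch_dir}/${username}-experiments/checkpoints"

# ...expands to a user-specific path at submission time.
print(Template(raw).substitute(env))  # /scratch/ah7072/ah7072-experiments/checkpoints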

Migration: Existing managers work unchanged. See MIGRATION_v0.7.md for upgrading.

Previous Releases:

  • v0.6.0: Checkpoint resumption with manager.resume_experiment()
  • v0.5.0: Experiment pruning with expflow prune

See CHANGELOG.md for complete version history.

Requirements

  • Python 3.8+
  • SLURM-based HPC cluster
  • Linux environment

Use Cases

Use Case                                Status
Image Classification (ResNet, ViT)      ✓ Ready
LLM Fine-tuning (LLaMA, GPT)            ✓ Ready
Reinforcement Learning (PPO, SAC)       ✓ Ready
Computer Vision (Object Detection)      ✓ Ready

Contributing

Contributions welcome! See CONTRIBUTING.md.

Share your experiment templates:

examples/templates/your_usecase.yaml

License

MIT License - see LICENSE

Acknowledgments

Built for the NYU HPC deep learning community. Works on any SLURM-based cluster.

Maintained by: Ali Hamza


Stop fighting SLURM. Start doing research.

For complete documentation, see USER_GUIDE.md.
