
πŸ‹ ezpz

Write once, run anywhere.

ezpz makes distributed PyTorch launches portable across NVIDIA, AMD, Intel, MPS, and CPUβ€”with zero-code changes and guardrails for HPC schedulers.

It provides:

  • 🧰 A CLI, ezpz, with utilities for launching distributed jobs

  • 🐍 A Python library, ezpz, for writing hardware-agnostic, distributed PyTorch code

  • 📝 Pre-built examples, all of which:

    • Use modern distributed PyTorch features (FSDP, TP, HF Trainer)
    • Run anywhere (e.g. NVIDIA, AMD, Intel, MPS, CPU)

Check out the 📘 Docs for more information!

🐣 Getting Started

  1. Set up a Python environment:
    To use ezpz, we first need a Python environment (preferably virtual) with torch and mpi4py installed.

    • Already have one? Skip to (2.) below!

    • Otherwise, we can use the provided src/ezpz/bin/utils.sh[1] to set up our environment:

      source <(curl -LsSf https://bit.ly/ezpz-utils) && ezpz_setup_env

      Note: This is technically optional, but recommended.
      Especially if you are running behind a job scheduler (e.g. PBS/Slurm) at any of {ALCF, OLCF, NERSC}, this will automatically load the appropriate modules and use them to bootstrap a virtual environment.

      However, if you already have a Python environment with {torch, mpi4py} installed and would prefer to use that, skip directly to installing ezpz in (2.) below.

  2. Install ezpz[2]:

    uv pip install "git+https://github.com/saforem2/ezpz"
    • Need PyTorch or mpi4py?

      If you don't already have PyTorch or mpi4py installed, you can specify these as additional dependencies:

      uv pip install --no-cache --link-mode=copy "git+https://github.com/saforem2/ezpz[torch,mpi]"
    • ... or try without installing!

      If you already have a Python environment with {torch, mpi4py} installed, you can try ezpz without installing it:

      # pip install uv first, if needed
      uv run --with "git+https://github.com/saforem2/ezpz" ezpz doctor
      
      TMPDIR=$(pwd) uv run --with "git+https://github.com/saforem2/ezpz" \
          --python=$(which python3) \
          ezpz test
      
      TMPDIR=$(pwd) uv run --with "git+https://github.com/saforem2/ezpz" \
          --python=$(which python3) \
          ezpz launch \
              python3 -m ezpz.examples.fsdp_tp
  3. Distributed Smoke Test:

    Train a simple MLP on MNIST with PyTorch + DDP:

    ezpz test

    See: [📑 ezpz test | W&B Report] for sample output and details of metric tracking.

🐍 Python Library

At its core, ezpz is a Python library designed to make writing distributed PyTorch code easy and portable across different hardware backends.

See 🐍 Python Library for more information.

✨ Features

  • See 🚀 Quickstart for a detailed walk-through of ezpz features.

  • 🪄 Automatic:

    • Accelerator detection: ezpz.get_torch_device(),
      across {cuda, xpu, mps, cpu}
    • Distributed initialization: ezpz.setup_torch(), to pick the right device + backend combo
    • Metric handling and utilities for {tracking, recording, plotting}: ezpz.History(), with Weights & Biases support
    • Integration with native job schedulers (PBS, Slurm)
      • with safe fallbacks when no scheduler is detected
    • Single-process logging with filtering for distributed runs

Note

See Examples for ready-to-go examples that can be used as templates or starting points for your own distributed PyTorch workloads.
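The metric-tracking idea behind ezpz.History() (accumulate per-step metrics, then summarize or plot them) can be sketched with a minimal pure-Python tracker. This is a hypothetical stand-in, not ezpz's actual API.

```python
from collections import defaultdict


class MetricHistory:
    """Minimal History-style tracker: append per-step metrics, then
    summarize. A hypothetical sketch, not ezpz.History()."""

    def __init__(self):
        self._data = defaultdict(list)

    def update(self, metrics):
        """Record one step's worth of {name: value} metrics."""
        for key, val in metrics.items():
            self._data[key].append(val)

    def mean(self, key):
        """Average of all recorded values for `key`."""
        vals = self._data[key]
        return sum(vals) / len(vals)


hist = MetricHistory()
hist.update({"loss": 1.0})
hist.update({"loss": 0.5})
print(hist.mean("loss"))  # -> 0.75
```

A real tracker would add plotting and Weights & Biases logging on top of this accumulate-then-summarize core.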

🧰 ezpz: CLI Toolbox

Once installed, ezpz provides a CLI with a few useful utilities to help with distributed launches and environment validation.

Specifically, these are:

ezpz doctor  # environment validation and health-check
ezpz test    # distributed smoke test
ezpz launch  # general purpose, scheduler-aware launching

To see the list of available commands, run:

ezpz --help

Note

Check out 🧰 CLI for additional information.

🩺 ezpz doctor

Health-check your environment and ensure that ezpz is installed correctly:

ezpz doctor
ezpz doctor --json   # machine-friendly output for CI

Checks MPI, scheduler detection, Torch import + accelerators, and wandb readiness, returning non-zero on errors.
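The check-and-report pattern behind a doctor-style command can be sketched in a few lines. The checks here are hypothetical stand-ins for the real MPI/scheduler/torch/wandb probes; this is not ezpz's actual implementation.

```python
import json


def run_checks(checks):
    """Run named check callables; record 'ok' or the error message."""
    results = {}
    for name, fn in checks.items():
        try:
            fn()
            results[name] = "ok"
        except Exception as exc:
            results[name] = f"error: {exc}"
    return results


def exit_code(results):
    """Non-zero when any check failed, so CI pipelines can gate on it."""
    return 0 if all(v == "ok" for v in results.values()) else 1


# Hypothetical stand-in for the real probes (MPI, scheduler detection,
# torch import + accelerators, wandb readiness):
results = run_checks({"python": lambda: None})
print(json.dumps(results))  # machine-friendly output, like --json
```

Collecting every result before exiting (rather than aborting on the first failure) is what makes the JSON output useful for CI.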

See: 🩺 Doctor for more information.

✅ ezpz test

Run the bundled test suite (great for first-time validation):

ezpz test

Or, try without installing:

TMPDIR=$(pwd) uv run \
    --python=$(which python3) \
    --with "git+https://github.com/saforem2/ezpz" \
    ezpz test

See ✅ Test for more information.

🚀 ezpz launch

Single entry point for distributed jobs.

ezpz detects PBS/Slurm automatically and falls back to mpirun, forwarding useful environment variables so your script behaves the same on laptops and clusters.
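The detect-scheduler-then-fall-back behavior can be sketched like this. It is a simplified, hypothetical illustration keyed off the standard PBS/Slurm environment variables; the real logic also inspects nodefiles, GPU counts per node, and so on.

```python
import os
import shlex


def build_launch_cmd(cmd, env=None, nprocs=None):
    """Pick a launcher from the environment and prepend it to `cmd`.

    Simplified sketch: a full implementation would also read nodefiles
    and per-node GPU counts to size the launch automatically.
    """
    env = os.environ if env is None else env
    if "PBS_NODEFILE" in env:        # running under PBS
        launcher = ["mpiexec"]
    elif "SLURM_JOB_ID" in env:      # running under Slurm
        launcher = ["srun"]
    else:                            # no scheduler: plain MPI fallback
        launcher = ["mpirun"]
    if nprocs is not None:
        launcher += ["-n", str(nprocs)]
    return launcher + list(cmd)


print(shlex.join(build_launch_cmd(
    ["python3", "-m", "ezpz.test_dist"], env={}, nprocs=4)))
# -> mpirun -n 4 python3 -m ezpz.test_dist
```

Keying the decision off environment variables is what lets the same invocation work on a laptop (no scheduler, mpirun fallback) and inside a PBS or Slurm job.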

Use the provided

ezpz launch <launch flags> -- <cmd> <cmd flags>

to automatically launch <cmd> across all available[3] accelerators.

Add your own args to any command (--config, --batch-size, etc.) and ezpz will propagate them through the detected launcher.

Use it to launch:

  • Arbitrary command(s):

    ezpz launch hostname
  • Arbitrary Python string:

    ezpz launch python3 -c 'import ezpz; ezpz.setup_torch()'
  • One of the ready-to-go examples:

    ezpz launch python3 -m ezpz.test_dist --profile
    ezpz launch -n 8 -- python3 -m ezpz.examples.fsdp_tp --tp 4
  • Your own distributed training script:

    ezpz launch -n 16 -ppn 8 -- python3 -m your_app.train --config configs/your_config.yaml

    to launch your_app.train across 16 processes, 8 per node.

See 🚀 Launch for more information.

πŸ“ Ready-to-go Examples

See πŸ“ Examples for complete example scripts covering:

  1. Train MLP with DDP on MNIST
  2. Train CNN with FSDP on MNIST
  3. Train ViT with FSDP on MNIST
  4. Train Transformer with FSDP and TP on HF Datasets
  5. Train Diffusion LLM with FSDP on HF Datasets
  6. Train or Fine-Tune an LLM with FSDP and HF Trainer on HF Datasets

βš™οΈ Environment Variables

Additional configuration can be done through environment variables, including:

  1. Colorized logging output can be toggled via the NO_COLOR environment variable, e.g. to turn off colors:

    NO_COLOR=1 ezpz launch python3 -m your_app.train
  2. Forcing a specific torch device (useful on GPU hosts when you want CPU-only):

    TORCH_DEVICE=cpu ezpz test
  3. Changing the plot marker used in the text-based plots:

    # highest resolution, may not be supported in all terminals
    EZPZ_TPLOT_MARKER="braille" ezpz launch python3 -m your_app.train
    # next-best resolution, more widely supported
    EZPZ_TPLOT_MARKER="fhd" ezpz launch python3 -m your_app.train
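A minimal sketch of how such toggles are typically read (a hypothetical helper, not ezpz's API; the defaults chosen here are assumptions for illustration):

```python
import os


def read_env_config(env=None):
    """Read the toggles described above, with safe defaults.

    Hypothetical helper: the variable names come from this README,
    but the default values are assumptions for illustration.
    """
    env = os.environ if env is None else env
    return {
        # Any non-empty NO_COLOR disables color, following the
        # informal NO_COLOR convention.
        "color": not env.get("NO_COLOR"),
        "device": env.get("TORCH_DEVICE"),  # None -> auto-detect
        "tplot_marker": env.get("EZPZ_TPLOT_MARKER", "braille"),
    }


print(read_env_config(env={"NO_COLOR": "1", "TORCH_DEVICE": "cpu"}))
```

Reading everything into one dict up front keeps the rest of the code free of scattered os.environ lookups.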

➕ More Information

Footnotes

  1. The https://bit.ly/ezpz-utils URL is just a convenience short link that points to https://raw.githubusercontent.com/saforem2/ezpz/main/src/ezpz/bin/utils.sh ↩

  2. If you don't have uv installed, you can install it via:

    pip install uv
    

    See the uv documentation for more details. ↩

  3. By default, this will detect whether we're running behind a job scheduler (e.g. PBS or Slurm). If so, it automatically determines the specifics of the currently active job; specifically:

    1. The number of available nodes
    2. How many GPUs are present on each of these nodes
    3. How many GPUs we have in total

    It will then use this information to construct the appropriate {mpiexec, srun} launch command and, finally, execute it. ↩
