Write once, run anywhere.
`ezpz` makes distributed PyTorch launches portable across NVIDIA, AMD, Intel,
MPS, and CPU, with zero code changes and guardrails for HPC schedulers.
It provides:

- 🧰 CLI: `ezpz`, with utilities for launching distributed jobs
- 🐍 Python library: `ezpz`, for writing hardware-agnostic, distributed PyTorch code
- 📦 Pre-built examples
All of which:
- Use modern distributed PyTorch features (FSDP, TP, HF Trainer)
- Can be run anywhere (e.g. NVIDIA, AMD, Intel, MPS, CPU)
Check out the 📚 Docs for more information!
1. Setup Python environment:

    To use `ezpz`, we first need a Python environment (preferably virtual) that has `torch` and `mpi4py` installed.

    > Already have one? Skip to (2.) below!

    Otherwise, we can use the provided src/ezpz/bin/utils.sh[^1] to set up our environment:

    ```bash
    source <(curl -LsSf https://bit.ly/ezpz-utils) && ezpz_setup_env
    ```
    > **Note**: This is technically optional, but recommended.
    > Especially if you happen to be running behind a job scheduler (e.g. PBS/Slurm) at any of {ALCF, OLCF, NERSC}, this will automatically load the appropriate modules and use them to bootstrap a virtual environment. However, if you already have a Python environment with {`torch`, `mpi4py`} installed and would prefer to use that, skip directly to (2.) installing `ezpz` below.
2. Install `ezpz`[^2]:

    ```bash
    uv pip install "git+https://github.com/saforem2/ezpz"
    ```

    > **Need PyTorch or `mpi4py`?**
    > If you don't already have PyTorch or `mpi4py` installed, you can specify these as additional dependencies:
    >
    > ```bash
    > uv pip install --no-cache --link-mode=copy "git+https://github.com/saforem2/ezpz[torch,mpi]"
    > ```

    > **... or try without installing!**
    > If you already have a Python environment with {`torch`, `mpi4py`} installed, you can try `ezpz` without installing it:
    >
    > ```bash
    > # pip install uv first, if needed
    > uv run --with "git+https://github.com/saforem2/ezpz" ezpz doctor
    >
    > TMPDIR=$(pwd) uv run --with "git+https://github.com/saforem2/ezpz" \
    >     --python=$(which python3) \
    >     ezpz test
    >
    > TMPDIR=$(pwd) uv run --with "git+https://github.com/saforem2/ezpz" \
    >     --python=$(which python3) \
    >     ezpz launch \
    >     python3 -m ezpz.examples.fsdp_tp
    > ```
3. Distributed Smoke Test:

    Train a simple MLP on MNIST with PyTorch + DDP:

    ```bash
    ezpz test
    ```

    See: [📊 ezpz test | W&B Report] for sample output and details of metric tracking.
At its core, ezpz is a Python library designed to make writing distributed
PyTorch code easy and portable across different hardware backends.
See 🐍 Python Library for more information.
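For example, a minimal hardware-agnostic script might look like the sketch below. It leans on the two `ezpz` calls named in this README (`ezpz.setup_torch()` and `ezpz.get_torch_device()`); the exact return values shown (an integer rank, a device string) are assumptions for illustration.

```python
# Minimal sketch of a hardware-agnostic ezpz script.
# Assumption: setup_torch() returns the global rank, and
# get_torch_device() returns a device string like "cuda" / "xpu" / "mps" / "cpu".
import torch
import ezpz

rank = ezpz.setup_torch()          # initialize torch.distributed with the right backend
device = ezpz.get_torch_device()   # detect the available accelerator

x = torch.rand((2, 2)).to(device)  # identical code on NVIDIA, AMD, Intel, MPS, or CPU
print(f"[rank {rank}] x = {x}")
```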
See 📖 Quickstart for a detailed walk-through of `ezpz` features.
🪄 Automatic:

- Accelerator detection: `ezpz.get_torch_device()`, across {`cuda`, `xpu`, `mps`, `cpu`}
- Distributed initialization: `ezpz.setup_torch()`, to pick the right device + backend combo
- Metric handling and utilities for {tracking, recording, plotting}: `ezpz.History()`, with Weights & Biases support
- Integration with native job scheduler(s) (PBS, Slurm)
  - with safe fall-backs when no scheduler is detected
- Single-process logging with filtering for distributed runs
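To make the metric-tracking piece concrete, here is a rough sketch of recording a per-step loss with `ezpz.History()`. Only the `ezpz.History` name comes from this README; the `update()` method and its dict-based signature are assumptions for illustration.

```python
# Hypothetical sketch of metric tracking with ezpz.History();
# the update() call shape is assumed, not confirmed API.
import ezpz

history = ezpz.History()
for step in range(10):
    loss = 1.0 / (step + 1)  # stand-in for a real training loss
    history.update({"train/step": step, "train/loss": loss})
```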
> **Note**
> See Examples for ready-to-go examples that can be used as templates or starting points for your own distributed PyTorch workloads.
Once installed, ezpz provides a CLI with a few useful utilities
to help with distributed launches and environment validation.
Explicitly, these are:
```bash
ezpz doctor  # environment validation and health-check
ezpz test    # distributed smoke test
ezpz launch  # general purpose, scheduler-aware launching
```

To see the list of available commands, run:

```bash
ezpz --help
```

> **Note**
> Check out 🧰 CLI for additional information.
Health-check your environment and ensure that `ezpz` is installed correctly:

```bash
ezpz doctor
ezpz doctor --json  # machine-friendly output for CI
```

Checks MPI, scheduler detection, Torch import + accelerators, and wandb readiness, returning non-zero on errors.

See: 🩺 Doctor for more information.
Run the bundled test suite (great for first-time validation):
```bash
ezpz test
```

Or, try without installing:

```bash
TMPDIR=$(pwd) uv run \
    --python=$(which python3) \
    --with "git+https://github.com/saforem2/ezpz" \
    ezpz test
```

See ✅ Test for more information.
Single entry point for distributed jobs.
ezpz detects PBS/Slurm automatically and falls back to mpirun, forwarding
useful environment variables so your script behaves the same on laptops and
clusters.
Add your own args to any command (--config, --batch-size, etc.) and ezpz
will propagate them through the detected launcher.
Use the provided

```bash
ezpz launch <launch flags> -- <cmd> <cmd flags>
```

to automatically launch `<cmd>` across all available[^3] accelerators.
Use it to launch:
- Arbitrary command(s):

  ```bash
  ezpz launch hostname
  ```

- Arbitrary Python string:

  ```bash
  ezpz launch python3 -c 'import ezpz; ezpz.setup_torch()'
  ```

- One of the ready-to-go examples:

  ```bash
  ezpz launch python3 -m ezpz.test_dist --profile
  ezpz launch -n 8 -- python3 -m ezpz.examples.fsdp_tp --tp 4
  ```

- Your own distributed training script (a minimal sketch of such a script follows this list):

  ```bash
  ezpz launch -n 16 -ppn 8 -- python3 -m your_app.train --config configs/your_config.yaml
  ```

  to launch `your_app.train` across 16 processes, 8 per node.
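As promised above, here is a minimal sketch of what such a `your_app/train.py`-style script might look like. Only `ezpz.setup_torch()` and `ezpz.get_torch_device()` come from this README; the toy model, DDP wrapping, and training loop are illustrative PyTorch, not part of `ezpz` itself.

```python
# Hypothetical your_app/train.py skeleton for use with `ezpz launch`.
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

import ezpz


def main() -> None:
    rank = ezpz.setup_torch()          # assumed to return the global rank
    device = ezpz.get_torch_device()   # "cuda" | "xpu" | "mps" | "cpu"

    model = nn.Linear(128, 128).to(device)
    if dist.is_initialized():          # wrap with DDP when running distributed
        model = DDP(model)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

    for step in range(10):
        x = torch.rand((8, 128), device=device)
        loss = model(x).square().mean()  # toy objective
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if rank == 0:
            print(f"step={step} loss={loss.item():.4f}")


if __name__ == "__main__":
    main()
```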
See 🚀 Launch for more information.
See 📝 Examples for complete example scripts covering:
- Train MLP with DDP on MNIST
- Train CNN with FSDP on MNIST
- Train ViT with FSDP on MNIST
- Train Transformer with FSDP and TP on HF Datasets
- Train Diffusion LLM with FSDP on HF Datasets
- Train or Fine-Tune an LLM with FSDP and HF Trainer on HF Datasets
Additional configuration can be done through environment variables, including:
- The colorized logging output can be toggled via the `NO_COLOR` environment var, e.g. to turn off colors:

  ```bash
  NO_COLOR=1 ezpz launch python3 -m your_app.train
  ```

- Forcing a specific torch device (useful on GPU hosts when you want CPU-only):

  ```bash
  TORCH_DEVICE=cpu ezpz test
  ```

- Changing the plot marker used in the text-based plots:

  ```bash
  # highest resolution, may not be supported in all terminals
  EZPZ_TPLOT_MARKER="braille" ezpz launch python3 -m your_app.train

  # next-best resolution, more widely supported
  EZPZ_TPLOT_MARKER="fhd" ezpz launch python3 -m your_app.train
  ```
- Examples live under `ezpz.examples.*`; copy them or extend them for your workloads.
- Stuck? Check the docs, or run `ezpz doctor` for actionable hints.
- See my recent talk on: *LLMs on Aurora: Hands On with `ezpz`* for a detailed walk-through containing examples and use cases.
Footnotes

[^1]: The https://bit.ly/ezpz-utils URL is just a short link for convenience that actually points to https://raw.githubusercontent.com/saforem2/ezpz/main/src/ezpz/bin/utils.sh

[^2]: If you don't have `uv` installed, you can install it via `pip install uv`. See the uv documentation for more details.

[^3]: By default, this will detect if we're running behind a job scheduler (e.g. PBS or Slurm). If so, we automatically determine the specifics of the currently active job; explicitly, this will determine:
    - The number of available nodes
    - How many GPUs are present on each of these nodes
    - How many GPUs we have total

    It will then use this information to automatically construct the appropriate {`mpiexec`, `srun`} command to launch, and finally, execute the launch cmd.
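    As a schematic illustration of that logic (emphatically *not* `ezpz`'s actual implementation), the construction might look something like:

    ```python
    # Schematic sketch of scheduler-aware launch-command construction.
    # The gpus_per_node constant and env-var checks are simplifications;
    # ezpz detects these values per-machine.
    import os
    import shutil


    def build_launch_cmd(cmd: str) -> str:
        if os.environ.get("PBS_NODEFILE"):  # running inside a PBS job
            with open(os.environ["PBS_NODEFILE"]) as f:
                num_nodes = len(set(f.read().split()))
            gpus_per_node = 4  # stand-in for per-node GPU detection
            total = num_nodes * gpus_per_node
            return f"mpiexec -n {total} -ppn {gpus_per_node} {cmd}"
        if os.environ.get("SLURM_JOB_ID"):  # running inside a Slurm job
            return f"srun {cmd}"
        return f"{shutil.which('mpirun') or 'mpirun'} {cmd}"  # safe fall-back
    ```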