Add LeRobot pi0-FAST DROID multi-node training test case by KeitaW · Pull Request #1003 · awslabs/awsome-distributed-training

KeitaW · 2026-02-26T22:27:41Z

Summary

Adds a new test case at 3.test_cases/pytorch/lerobot/pi0-fast-droid/ for multi-node distributed training of a Vision-Language-Action (VLA) policy
Uses LeRobot v0.4.3 with the pi0-FAST architecture (SigLIP vision + Gemma 2B language backbone) on the DROID 1.0.1 dataset (76k+ real-robot trajectories, 1.7TB)
Multi-node launch via HuggingFace Accelerate over Slurm + Enroot/Pyxis with EFA networking
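As a rough illustration of the multi-node launch described above, the sketch below assembles an `accelerate launch` command line from Slurm's per-task environment variables. It is not the contents of `slurm/run.sh`: the helper name `build_accelerate_cmd`, the `train_entrypoint` and `head_addr` parameters, and the default of 8 GPUs per node and port 29500 are all assumptions made for this example.

```python
import os


def build_accelerate_cmd(train_entrypoint, head_addr, gpus_per_node=8,
                         port=29500, env=None):
    """Build an `accelerate launch` argv list for one node of a Slurm job.

    Hypothetical helper: `train_entrypoint` and `head_addr` are supplied by
    the caller (for example, the training script path and the first host in
    the Slurm node list); nothing here is taken from the PR's run.sh.
    """
    env = os.environ if env is None else env
    # SLURM_NNODES and SLURM_NODEID are set by Slurm for each task.
    num_nodes = int(env.get("SLURM_NNODES", "1"))
    node_rank = int(env.get("SLURM_NODEID", "0"))
    total_processes = num_nodes * gpus_per_node

    cmd = ["accelerate", "launch"]
    if total_processes > 1:
        cmd.append("--multi_gpu")
    cmd += [
        "--num_machines", str(num_nodes),
        "--num_processes", str(total_processes),  # total across all nodes
        "--machine_rank", str(node_rank),
        "--main_process_ip", head_addr,
        "--main_process_port", str(port),
        train_entrypoint,
    ]
    return cmd
```

Each node would run the resulting command inside its container; note that `--num_processes` is the total across all machines, not the per-node count.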

Files

| File | Purpose |
| --- | --- |
| `Dockerfile` | NGC PyTorch 25.08 + EFA 1.47.0 + NCCL 2.28.4 from source + aws-ofi-nccl 1.14.1 + LeRobot[pi] |
| `Makefile` | Build → clean → enroot import |
| `README.md` | Full docs with architecture diagram, config tables, scaling variants, troubleshooting |
| `slurm/run.sh` | Multi-node sbatch script (Accelerate + Pyxis, HyperPod auto-resume) |
| `requirements.txt` | Accelerate + wandb |
| `env_vars.example` | Template for HF_TOKEN, output dir, node count |
| `test_pi0_fast_droid.py` | Pytest: docker build, smoke test, CLI entrypoint, EFA check |
| `src/__init__.py` | Package placeholder |

Version compliance

EFA 1.47.0 >= 1.47.0 ✓
NCCL 2.28.4 >= 2.28 ✓
CUDA 13.0 (via NGC 25.08-py3) >= 13.0 ✓
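The comparisons above can be reproduced with a small version check. This is a minimal sketch and not the logic of `/ad-helper:check-versions`; the names `parse_version`, `meets_minimum`, and `PINNED` are made up for this example, and the version pairs are copied from the list above.

```python
def parse_version(text):
    """Turn a dotted version string such as '2.28.4' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def meets_minimum(found, minimum):
    """Return True if `found` is at least `minimum`, comparing numerically."""
    f, m = parse_version(found), parse_version(minimum)
    # Pad the shorter tuple with zeros so '2.28' compares like '2.28.0'.
    width = max(len(f), len(m))
    f += (0,) * (width - len(f))
    m += (0,) * (width - len(m))
    return f >= m


# (pinned version, CI minimum) pairs as stated in this PR description.
PINNED = {
    "EFA": ("1.47.0", "1.47.0"),
    "NCCL": ("2.28.4", "2.28"),
    "CUDA": ("13.0", "13.0"),
}

results = {name: meets_minimum(found, minimum)
           for name, (found, minimum) in PINNED.items()}
```

Comparing integer tuples rather than raw strings matters here: as strings, "2.9" would sort above "2.28", while numerically it is older.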

Test plan

Verify Docker image builds successfully on a build node
Run single-node training smoke test (1 node, small batch, few steps)
Run multi-node training (2+ nodes) and verify all ranks join correctly
Verify EFA is used for NCCL communication (NCCL_DEBUG=INFO output)
Verify DROID dataset streaming/caching works from shared filesystem
Run /ad-helper:lint-contribution — all checks pass
Run /ad-helper:check-versions — all versions meet CI minimums
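The EFA verification step in the plan above could be partly automated by scanning the `NCCL_DEBUG=INFO` output captured from a run. The sketch below is an assumption-laden example, not the check in `test_pi0_fast_droid.py`: the function name `efa_selected` is hypothetical, and the pattern assumes the aws-ofi-nccl plugin reports its choice with wording like "Selected Provider is efa" on a `NET/OFI` line. The exact message may differ between plugin versions, so the pattern should be adjusted against real logs.

```python
import re

# Assumed wording of the aws-ofi-nccl provider message; matched
# case-insensitively because capitalization may vary between versions.
EFA_PROVIDER_PATTERN = re.compile(r"NET/OFI.*selected provider is efa",
                                  re.IGNORECASE)


def efa_selected(log_text):
    """Return True if any log line reports EFA as the selected OFI provider."""
    return any(EFA_PROVIDER_PATTERN.search(line)
               for line in log_text.splitlines())
```

A job log read from disk can be passed in directly, for example `efa_selected(open(path).read())`, where `path` points at the Slurm output file for the run.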


KeitaW force-pushed the scaffold/pytorch-lerobot-pi0-fast-droid branch from 071e494 to bc5ac25 on February 26, 2026 22:30

KeitaW self-assigned this Feb 26, 2026

KeitaW wants to merge 1 commit into main from scaffold/pytorch-lerobot-pi0-fast-droid
