Add LeRobot pi0-FAST DROID multi-node training test case #1003

Draft
KeitaW wants to merge 1 commit into main from scaffold/pytorch-lerobot-pi0-fast-droid

KeitaW commented Feb 26, 2026

Summary

  • Adds a new test case at 3.test_cases/pytorch/lerobot/pi0-fast-droid/ for multi-node distributed training of a Vision-Language-Action (VLA) policy
  • Uses LeRobot v0.4.3 with the pi0-FAST architecture (SigLIP vision encoder + Gemma 2B language backbone) on the DROID 1.0.1 dataset (76k+ real-robot trajectories, 1.7 TB)
  • Multi-node launch via HuggingFace Accelerate over Slurm + Enroot/Pyxis with EFA networking
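The launch pattern in the last bullet can be sketched as follows. This is a minimal illustration, not the contents of `slurm/run.sh`: it assumes each node runs one launcher process, that Slurm exports `SLURM_NNODES` and `SLURM_NODEID`, and that the head-node address is supplied by the caller. The function name, the default of 8 GPUs per node, the port, and the `train.py` script name are all hypothetical placeholders.

```python
import os
import shlex


def build_accelerate_command(env, main_ip, gpus_per_node=8,
                             main_port=29500, train_script="train.py"):
    """Build an `accelerate launch` argv for one node of a Slurm job.

    `train_script`, `gpus_per_node`, and `main_port` are illustrative
    defaults, not values taken from the PR.
    """
    num_nodes = int(env.get("SLURM_NNODES", "1"))
    node_rank = int(env.get("SLURM_NODEID", "0"))
    total_procs = num_nodes * gpus_per_node  # Accelerate expects the global count

    cmd = [
        "accelerate", "launch",
        "--num_machines", str(num_nodes),
        "--num_processes", str(total_procs),
        "--machine_rank", str(node_rank),
        "--main_process_ip", main_ip,
        "--main_process_port", str(main_port),
    ]
    if total_procs > 1:
        cmd.append("--multi_gpu")
    cmd.append(train_script)
    return cmd


if __name__ == "__main__":
    # MASTER_ADDR is assumed to be exported by the surrounding sbatch script.
    argv = build_accelerate_command(
        os.environ, main_ip=os.environ.get("MASTER_ADDR", "127.0.0.1")
    )
    print(shlex.join(argv))
```

Note that `--num_processes` is the total across all machines, so a 2-node job with 8 GPUs per node passes 16, while `--machine_rank` differs per node.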

Files

| File | Purpose |
|---|---|
| `Dockerfile` | NGC PyTorch 25.08 + EFA 1.47.0 + NCCL 2.28.4 from source + aws-ofi-nccl 1.14.1 + LeRobot[pi] |
| `Makefile` | Build → clean → enroot import |
| `README.md` | Full docs with architecture diagram, config tables, scaling variants, troubleshooting |
| `slurm/run.sh` | Multi-node sbatch script (Accelerate + Pyxis, HyperPod auto-resume) |
| `requirements.txt` | Accelerate + wandb |
| `env_vars.example` | Template for HF_TOKEN, output dir, node count |
| `test_pi0_fast_droid.py` | Pytest: docker build, smoke test, CLI entrypoint, EFA check |
| `src/__init__.py` | Package placeholder |

Version compliance

  • EFA: 1.47.0 (minimum 1.47.0) ✓
  • NCCL: 2.28.4 (minimum 2.28) ✓
  • CUDA: 13.0 via NGC 25.08-py3 (minimum 13.0) ✓
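The comparison behind these checks can be sketched as a component-wise version comparison. This is an illustration only and not the `/ad-helper:check-versions` tool; the function names are hypothetical, and it assumes purely numeric dotted versions such as those listed above.

```python
def parse_version(text):
    """Turn a dotted numeric version like '2.28.4' into (2, 28, 4)."""
    return tuple(int(part) for part in text.split("."))


def meets_minimum(found, minimum):
    """Return True if `found` is at least `minimum`.

    Shorter versions are zero-padded, so '2.28' is treated as '2.28.0'.
    """
    f, m = parse_version(found), parse_version(minimum)
    width = max(len(f), len(m))
    f += (0,) * (width - len(f))
    m += (0,) * (width - len(m))
    return f >= m


# Versions and minimums as stated in this PR description.
CHECKS = {
    "EFA": ("1.47.0", "1.47.0"),
    "NCCL": ("2.28.4", "2.28"),
    "CUDA": ("13.0", "13.0"),
}

if __name__ == "__main__":
    for name, (found, minimum) in CHECKS.items():
        status = "ok" if meets_minimum(found, minimum) else "too old"
        print(f"{name}: {found} (minimum {minimum}) {status}")
```

Comparing integer tuples rather than strings avoids the usual pitfall where "2.9" sorts after "2.28" lexicographically.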

Test plan

  • Verify Docker image builds successfully on a build node
  • Run single-node training smoke test (1 node, small batch, few steps)
  • Run multi-node training (2+ nodes) and verify all ranks join correctly
  • Verify EFA is used for NCCL communication (NCCL_DEBUG=INFO output)
  • Verify DROID dataset streaming/caching works from shared filesystem
  • Run /ad-helper:lint-contribution — all checks pass
  • Run /ad-helper:check-versions — all versions meet CI minimums
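For the EFA verification step, one possible approach is to scan the captured `NCCL_DEBUG=INFO` output for the provider that the aws-ofi-nccl plugin reports. The sketch below assumes the log contains a line of the form `NET/OFI Selected provider is <name>`; the exact wording and capitalization vary between plugin versions, so the pattern is an assumption to adjust against real logs, and the function names are hypothetical.

```python
import re

# Assumed log shape; matched case-insensitively because capitalization
# differs between aws-ofi-nccl releases.
PROVIDER_RE = re.compile(r"NET/OFI\s+Selected provider is\s+(\w+)", re.IGNORECASE)


def ofi_providers(log_text):
    """Return the set of libfabric provider names reported in the log."""
    return {match.group(1).lower() for match in PROVIDER_RE.finditer(log_text)}


def uses_efa(log_text):
    """True only if at least one rank reported a provider and all reported 'efa'."""
    return ofi_providers(log_text) == {"efa"}


if __name__ == "__main__":
    sample = (
        "node1:101:201 [0] NCCL INFO NET/OFI Selected Provider is efa\n"
        "node2:102:202 [0] NCCL INFO NET/OFI Selected Provider is efa\n"
    )
    print("EFA in use:", uses_efa(sample))
```

Requiring the set to equal `{"efa"}` means a log with no provider line at all, or with any rank falling back to another provider, is reported as a failure rather than silently passing.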

New test case under 3.test_cases/pytorch/lerobot/pi0-fast-droid/ for
training a Vision-Language-Action (VLA) policy using LeRobot's pi0-FAST
architecture on the DROID 1.0.1 dataset (76k+ trajectories, 1.7TB)
across multiple GPU nodes via HuggingFace Accelerate + Slurm/Pyxis.
KeitaW force-pushed the scaffold/pytorch-lerobot-pi0-fast-droid branch from 071e494 to bc5ac25 on February 26, 2026 22:30
KeitaW self-assigned this Feb 26, 2026