Added example training scripts for localsgd, DiLoCo, Live Checkpoint Recovery, and proactive failure detection with DDP (#198) by WarrenZhu050413 · Pull Request #200 · meta-pytorch/torchft

WarrenZhu050413 · 2025-05-22T04:46:31Z

TorchFT Examples

This PR adds a set of examples demonstrating TorchFT's fault tolerance features and distributed training approaches. Each example includes its own README with step-by-step instructions and sample outputs to help users understand these features and incorporate them into their training scripts. All the examples build on top of train_ddp.py.

The PR came from my own experience learning the different features of TorchFT. At the beginning I found it hard to run anything beyond the provided train_ddp.py, which made it difficult to get a sense of the features TorchFT offers.

@d4l3k provided useful feedback on how to structure the examples.

Examples Included:

DDP with Proactive Failure Recovery (examples/ddp_proactive)
- Demonstrates how to enable proactive detection and response to worker failures
- Includes detailed explanation of recovery mechanism with annotated logs
- Shows significant reduction in recovery time compared to timeout-based approaches
DiLoCo (Distributed Low-Communication training) (examples/diloco)
- Implements DiLoCo training methodology
- Shows how to configure and optimize local convergence parameters
- Documents performance characteristics and tradeoffs
LocalSGD (examples/localsgd)
- Demonstrates LocalSGD with periodic synchronization strategy
- Provides guidance on setting appropriate synchronization frequency
- Includes performance comparison considerations
Live Checkpoint Recovery (examples/live_checkpoint_recovery)
- Shows how to implement checkpoint-based recovery for fault tolerance
- Documents the checkpoint storage and retrieval process
- Includes recovery time analysis and optimization tips
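
To give a rough feel for how the LocalSGD and DiLoCo examples differ, here is a minimal pure-Python sketch on a toy problem. It does not use torchft's API: every function name is made up for illustration, the "allreduce" is a plain in-process average, the inner optimizer is plain SGD, and the outer optimizer is SGD with heavy-ball momentum, which may differ from what the example scripts configure.

```python
from typing import Callable, List, Tuple

Vector = List[float]
GradFn = Callable[[Vector], Vector]


def local_steps(w: Vector, grad_fn: GradFn, lr: float, steps: int) -> Vector:
    """Run `steps` plain SGD updates on one worker with no communication."""
    w = list(w)
    for _ in range(steps):
        w = [wi - lr * gi for wi, gi in zip(w, grad_fn(w))]
    return w


def average(vectors: List[Vector]) -> Vector:
    """Element-wise mean; stands in for an allreduce across workers."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def localsgd_round(
    w_global: Vector, grad_fns: List[GradFn], lr: float, sync_every: int
) -> Vector:
    """LocalSGD: train independently, then replace weights with the average."""
    return average([local_steps(w_global, g, lr, sync_every) for g in grad_fns])


def diloco_round(
    w_global: Vector,
    velocity: Vector,
    grad_fns: List[GradFn],
    inner_lr: float,
    sync_every: int,
    outer_lr: float,
    outer_momentum: float,
) -> Tuple[Vector, Vector]:
    """DiLoCo-style round: average the pseudo-gradients (global minus local
    weights) and apply them with an outer momentum-SGD step."""
    replicas = [local_steps(w_global, g, inner_lr, sync_every) for g in grad_fns]
    pseudo_grad = average(
        [[wg - wr for wg, wr in zip(w_global, r)] for r in replicas]
    )
    velocity = [outer_momentum * v + pg for v, pg in zip(velocity, pseudo_grad)]
    w_new = [wg - outer_lr * v for wg, v in zip(w_global, velocity)]
    return w_new, velocity


def quadratic_grad(target: Vector) -> GradFn:
    """Gradient of 0.5 * ||w - target||^2, a toy per-worker objective."""
    return lambda w: [wi - ti for wi, ti in zip(w, target)]


# Two hypothetical workers whose data pull the weights toward different targets.
grad_fns = [quadratic_grad([1.0, -2.0]), quadratic_grad([3.0, 4.0])]

w_local = [0.0, 0.0]
for _ in range(50):
    w_local = localsgd_round(w_local, grad_fns, lr=0.1, sync_every=4)

w_diloco, velocity = [0.0, 0.0], [0.0, 0.0]
for _ in range(200):
    w_diloco, velocity = diloco_round(
        w_diloco, velocity, grad_fns,
        inner_lr=0.1, sync_every=4, outer_lr=0.7, outer_momentum=0.5,
    )

# Both runs approach the mean of the two targets.
print(w_local, w_diloco)
```

In this sketch, setting outer_lr=1.0 and outer_momentum=0.0 makes the DiLoCo-style round reduce to the LocalSGD round, which is a handy sanity check when comparing the two examples. The `sync_every` argument plays the role of the synchronization frequency mentioned above.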


H-Huang

Wow! This is really awesome, thank you for your contributions. I will take a look more closely soon.

One thing I think would be a good idea is to run these examples as part of CI as well so we know we aren't regressing anything. We might have to create a separate CI workflow for this.

WarrenZhu050413 · 2025-05-24T02:39:40Z

@d4l3k also commented on this. Working on the CI rn!

WarrenZhu050413 · 2025-05-24T05:52:00Z

Added CI.

This is the output when I run examples/test_examples.py locally. The tests are currently CPU only, using torchx.py's default.

For the .yaml file, the environment is set up identically to torchft/.github/workflows/unittest.yaml.

> pytest examples/test_examples.py
======================================== test session starts =========================================
platform linux -- Python 3.11.11, pytest-8.3.5, pluggy-1.5.0
rootdir: /srv/apps/torchft
configfile: pyproject.toml
plugins: typeguard-2.13.3, timeout-2.3.1, anyio-4.9.0
timeout: 60.0s
timeout method: thread
timeout func_only: False
collected 12 items                                                                                   

examples/test_examples.py ............                                                         [100%]

=================================== 12 passed in 99.86s (0:01:39) ====================================
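
For readers who want to set up a similar smoke test, here is a minimal sketch of the general pattern: launch each example as a subprocess and fail on a non-zero exit code or a timeout. This is not the contents of examples/test_examples.py; `run_example`, `SMOKE_COMMANDS`, and the placeholder command are hypothetical, and the 60-second limit simply mirrors the timeout shown in the log above.

```python
import subprocess
import sys
from typing import List, Sequence


def run_example(
    cmd: Sequence[str], timeout_s: float = 60.0
) -> subprocess.CompletedProcess:
    """Run one command; raise on a non-zero exit code or on timeout."""
    return subprocess.run(
        list(cmd),
        check=True,           # raises CalledProcessError on failure
        capture_output=True,  # keep stdout/stderr for assertions
        text=True,
        timeout=timeout_s,    # raises TimeoutExpired if the run hangs
    )


# Placeholder command standing in for a real example launch (assumption).
SMOKE_COMMANDS: List[List[str]] = [
    [sys.executable, "-c", "print('example ran')"],
]


def test_examples_smoke() -> None:
    """pytest-style test: every listed command must run to completion."""
    for cmd in SMOKE_COMMANDS:
        result = run_example(cmd)
        assert "example ran" in result.stdout
```

Each entry in `SMOKE_COMMANDS` would be swapped for the actual command that launches one example on CPU.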


Added proactive heartbeat timeout failure propagation (meta-pytorch#164…

7b550aa

…) (meta-pytorch#188)

facebook-github-bot added the CLA Signed label May 22, 2025

H-Huang reviewed May 23, 2025


WarrenZhu050413 force-pushed the torchft_examples branch 2 times, most recently from 089bcd5 to 478c162, on May 24, 2025 05:47

WarrenZhu050413 force-pushed the torchft_examples branch from 478c162 to 02499b4 on May 24, 2025 05:59

Added example training scripts for localsgd, DiLoCo, Live Checkpoint …

f5ee704

…Recovery, and proactive failure detection with DDP, along with CI (meta-pytorch#198)

WarrenZhu050413 force-pushed the torchft_examples branch from 02499b4 to f5ee704 on May 25, 2025 06:55

WarrenZhu050413 wants to merge 2 commits into meta-pytorch:main from WarrenZhu050413:torchft_examples
