Added example training scripts for LocalSGD, DiLoCo, Live Checkpoint Recovery, and proactive failure detection with DDP (#198) #200
Open
WarrenZhu050413 wants to merge 2 commits into meta-pytorch:main from
Conversation
H-Huang
reviewed
May 23, 2025
Contributor
H-Huang
left a comment
Wow! This is really awesome, thank you for your contributions. I will take a look more closely soon.
One thing I think would be a good idea is to run these examples as part of CI as well so we know we aren't regressing anything. We might have to create a separate CI workflow for this.
Contributor
Author
@d4l3k also commented on this. Working on the CI rn!
089bcd5 to
478c162
Compare
Contributor
Author
Added CI. This is the output when I run `pytest examples/test_examples.py`:
======================================== test session starts =========================================
platform linux -- Python 3.11.11, pytest-8.3.5, pluggy-1.5.0
rootdir: /srv/apps/torchft
configfile: pyproject.toml
plugins: typeguard-2.13.3, timeout-2.3.1, anyio-4.9.0
timeout: 60.0s
timeout method: thread
timeout func_only: False
collected 12 items
examples/test_examples.py ............ [100%]
=================================== 12 passed in 99.86s (0:01:39) ====================================
478c162 to
02499b4
Compare
…Recovery, and proactive failure detection with DDP, along with CI (meta-pytorch#198)
02499b4 to
f5ee704
Compare
TorchFT Examples
This PR adds a set of examples demonstrating the fault tolerance features and distributed training approaches in TorchFT. Each example has its own README with step-by-step instructions and sample outputs, to help users understand these features and incorporate them into their own training scripts. All the examples build on top of `train_ddp.py`.

The PR came from my own experience learning the different features of TorchFT. At the beginning I found it hard to run anything beyond the provided `train_ddp.py`, which made it difficult to get a sense of the features TorchFT offers.

@d4l3k provided useful feedback on how to structure the examples.
Examples Included:
- DDP with Proactive Failure Recovery (`examples/ddp_proactive`)
- DiLoCo (Distributed Low-Communication training) (`examples/diloco`)
- LocalSGD (`examples/localsgd`)
- Live Checkpoint Recovery (`examples/live_checkpoint_recovery`)