Description
I propose integrating three related distributed optimization algorithms into a single, unified optimizer class. These methods extend the capabilities of infrequent communication training beyond standard Local SGD by aggregating optimizer states.
The proposed integration includes:
- Local Adam (ICLR 2025): A provably convergent algorithm that aggregates optimizer states alongside model parameters.
- DES-LOC (ICLR 2026): Builds on Local Adam by allowing optimizer states and model parameters to synchronize at independent, decoupled frequencies.
- MT-DAO (ICLR 2026): Improves performance through a modified quasi-hyperbolic inner update rule that permits a much lower decay rate for the first momentum.
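To make the inner update rule concrete, here is a minimal sketch of a quasi-hyperbolic Adam-style step in the spirit of MT-DAO. This is an illustrative simplification, not the exact published algorithm: the function name and defaults are mine, and bias correction is omitted for brevity. The `nu` parameter mixes the raw gradient with the first moment, which is what allows a very low `beta1` (slowly varying, aggregated first momentum) without destabilizing the update.

```python
import torch

def qh_adam_step(param, grad, m, v, lr=1e-3, beta1=0.1, beta2=0.999,
                 nu=0.7, eps=1e-8):
    """One quasi-hyperbolic Adam-style inner step (illustrative sketch).

    `nu` interpolates between a plain Adam-style step (nu=1) and an
    SGD-like step (nu=0). A low `beta1` keeps the first momentum slowly
    varying, so when it is aggregated across workers it acts as a
    regularizer. Bias correction is omitted to keep the sketch short.
    """
    m.mul_(beta1).add_(grad, alpha=1 - beta1)            # first moment
    v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)  # second moment
    # Quasi-hyperbolic mix of the raw gradient and the first moment.
    update = (1 - nu) * grad + nu * m
    param.addcdiv_(update, v.sqrt().add_(eps), value=-lr)
    return param, m, v
```

With `nu=1` this reduces to an (uncorrected) Adam step; with `nu=0` it is RMSProp-like, ignoring the first moment entirely.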
I currently have a working implementation of these methods and would like to contribute them to torchft as a robust alternative to standard Local SGD.
Motivation
Current infrequent communication methods often struggle with stability and lack convergence guarantees when applied to adaptive optimizers. Integrating these methods offers three specific technical advantages:
- Aggregating optimizer states (as done in Local Adam/DES-LOC) provides a provably convergent algorithm under standard assumptions, even with non-IID data/loss functions. This improves training stability compared to synchronizing model parameters alone.
- These methods solve initialization issues when the number of workers changes (e.g., pausing and restarting with diverse worker counts). By maintaining an aggregated optimizer state, we ensure a low-variance initialization for new or rejoining workers, avoiding the pitfalls of standard state checkpointing in elastic settings.
- Aggregating optimizer states can improve performance with the right hyperparameter choices. For example, MT-DAO introduces a quasi-hyperbolic momentum formulation with a low decay rate for Adam. In this setup, the aggregated first momentum changes very slowly and acts as a regularizer, improving empirical performance.
Proposed Implementation
The implementation would be a drop-in replacement for Local SGD, introducing optimizer state fragments (analogous to the model state fragments in Streaming DiLoCo) to support future extensibility.
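The decoupled-frequency idea can be sketched as follows. This is a hypothetical, in-process illustration, not the proposed torchft integration: the class name and parameters are mine, and the averaging here stands in for what would be collective communication over torchft's managed process group. The key point is that parameters and optimizer states are averaged on independent schedules.

```python
import torch

class DecoupledSyncSketch:
    """Hypothetical sketch of DES-LOC-style decoupled synchronization.

    Each worker holds a (parameter, momentum) pair. Parameters are
    averaged across workers every `param_every` steps and optimizer
    states every `state_every` steps; the two intervals are independent.
    In-process averaging stands in for an all-reduce.
    """

    def __init__(self, workers, param_every=4, state_every=16):
        self.workers = workers  # list of [param, momentum] tensor pairs
        self.param_every = param_every
        self.state_every = state_every
        self.step_count = 0

    def maybe_sync(self):
        self.step_count += 1
        if self.step_count % self.param_every == 0:
            self._average(index=0)  # sync model parameters
        if self.step_count % self.state_every == 0:
            self._average(index=1)  # sync optimizer state (momentum)

    def _average(self, index):
        avg = torch.stack([w[index] for w in self.workers]).mean(dim=0)
        for w in self.workers:
            w[index].copy_(avg)
```

Setting `param_every == state_every` recovers Local Adam-style joint synchronization, and omitting the state sync entirely recovers plain Local SGD/DiLoCo-style parameter-only averaging.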
Next Steps
I am happy to open a Pull Request with the implementation. Please let me know if this aligns with the project roadmap or if there are specific design constraints I should consider. Future extensions may include support for arbitrary outer optimizers and/or streaming variants.