Description
I propose integrating three related distributed optimization algorithms into a single, unified optimizer class. These methods extend the capabilities of infrequent communication training beyond standard Local SGD by aggregating optimizer states.
The proposed integration includes:
- Local Adam (ICLR 2025): A provably convergent algorithm that aggregates optimizer states alongside model parameters.
- DES-LOC (ICLR 2026): Builds on Local Adam by allowing optimizer states and model parameters to synchronize at independent, decoupled frequencies.
- MT-DAO (ICLR 2026): Improves performance through a modified quasi-hyperbolic inner update rule that permits a much lower decay rate for the first momentum.
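To make the inner update rule concrete, here is a minimal sketch of a quasi-hyperbolic Adam-style step in the spirit of MT-DAO. This is an illustrative simplification, not the exact published algorithm: the function name and defaults are mine, and bias correction is omitted for brevity. The `nu` parameter mixes the raw gradient with the first moment, which is what allows a very low `beta1` (slowly varying, aggregated first momentum) without destabilizing the update.

```python
import torch

def qh_adam_step(param, grad, m, v, lr=1e-3, beta1=0.1, beta2=0.999,
                 nu=0.7, eps=1e-8):
    """One quasi-hyperbolic Adam-style inner step (illustrative sketch).

    `nu` interpolates between a plain Adam-style step (nu=1) and an
    SGD-like step (nu=0). A low `beta1` keeps the first momentum slowly
    varying, so when it is aggregated across workers it acts as a
    regularizer. Bias correction is omitted to keep the sketch short.
    """
    m.mul_(beta1).add_(grad, alpha=1 - beta1)            # first moment
    v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)  # second moment
    # Quasi-hyperbolic mix of the raw gradient and the first moment.
    update = (1 - nu) * grad + nu * m
    param.addcdiv_(update, v.sqrt().add_(eps), value=-lr)
    return param, m, v
```

With `nu=1` this reduces to an (uncorrected) Adam step; with `nu=0` it is RMSProp-like, ignoring the first moment entirely.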
I currently have a working implementation of these methods and would like to contribute them to torchft as a robust alternative to standard Local SGD.
Motivation
Current infrequent communication methods often struggle with stability and lack convergence guarantees when applied to adaptive optimizers. Integrating these methods offers three specific technical advantages:
- Aggregating optimizer states (as done in Local Adam/DES-LOC) provides a provably convergent algorithm under standard assumptions, even with non-IID data/loss functions. This improves training stability compared to synchronizing model parameters alone.
- These methods solve initialization issues when the number of workers changes (e.g., pausing and restarting with diverse worker counts). By maintaining an aggregated optimizer state, we ensure a low-variance initialization for new or rejoining workers, avoiding the pitfalls of standard state checkpointing in elastic settings.
- Aggregating optimizer states can improve performance with the right hyperparameter choices. For example, MT-DAO introduces a quasi-hyperbolic momentum formulation with a low decay rate for Adam. In this setup, the aggregated first momentum changes very slowly and acts as a regularizer, improving empirical performance.
Proposed Implementation
The implementation would be a drop-in replacement for Local SGD, introducing optimizer state fragments (analogous to the model state fragments in Streaming DiLoCo) to support future extensibility.
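The decoupled-frequency idea can be sketched as follows. This is a hypothetical, in-process illustration, not the proposed torchft integration: the class name and parameters are mine, and the averaging here stands in for what would be collective communication over torchft's managed process group. The key point is that parameters and optimizer states are averaged on independent schedules.

```python
import torch

class DecoupledSyncSketch:
    """Hypothetical sketch of DES-LOC-style decoupled synchronization.

    Each worker holds a (parameter, momentum) pair. Parameters are
    averaged across workers every `param_every` steps and optimizer
    states every `state_every` steps; the two intervals are independent.
    In-process averaging stands in for an all-reduce.
    """

    def __init__(self, workers, param_every=4, state_every=16):
        self.workers = workers  # list of [param, momentum] tensor pairs
        self.param_every = param_every
        self.state_every = state_every
        self.step_count = 0

    def maybe_sync(self):
        self.step_count += 1
        if self.step_count % self.param_every == 0:
            self._average(index=0)  # sync model parameters
        if self.step_count % self.state_every == 0:
            self._average(index=1)  # sync optimizer state (momentum)

    def _average(self, index):
        avg = torch.stack([w[index] for w in self.workers]).mean(dim=0)
        for w in self.workers:
            w[index].copy_(avg)
```

Setting `param_every == state_every` recovers Local Adam-style joint synchronization, and omitting the state sync entirely recovers plain Local SGD/DiLoCo-style parameter-only averaging.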
Next Steps
I am happy to open a Pull Request with the implementation. Please let me know if this aligns with the project roadmap or if there are specific design constraints I should consider. Future extensions may include support for arbitrary outer optimizers and/or streaming variants.