[New Feature] Integrate Optional Optimizer State Aggregation into a Unified Optimizer Class #312

@Iacob-Alexandru-Andrei

Description

I propose integrating three related distributed optimization algorithms into a single, unified optimizer class. These methods extend infrequent-communication training beyond standard Local SGD by also aggregating optimizer states.

The proposed integration includes:

  1. Local Adam (ICLR 2025): A provably convergent algorithm that requires aggregating optimizer states alongside model parameters.
  2. DES-LOC (ICLR 2026): Builds on Local Adam by allowing optimizer states and model parameters to synchronize at independent, decoupled frequencies.
  3. MT-DAO (ICLR 2026): Improves performance through a modified quasi-hyperbolic inner update rule that permits a much lower decay rate for the first moment.

I currently have a working implementation of these methods and would like to contribute them to torchft as a robust alternative to standard Local SGD.
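To make the aggregation pattern concrete, here is a minimal, framework-free sketch (plain Python on scalar parameters; all names and constants are illustrative, not torchft's API or the papers' exact algorithms): each worker runs local Adam steps on its own non-IID objective, parameters are all-reduced every `H_PARAM` steps, and the first/second moments are all-reduced at their own decoupled frequencies, DES-LOC style.

```python
import math

def adam_step(w, m, v, g, t, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)          # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)          # bias-corrected second moment
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

def mean(xs):
    return sum(xs) / len(xs)

# Hypothetical DES-LOC-style schedule: parameters sync every H_PARAM
# local steps; the two moments sync at their own (lower) frequencies.
H_PARAM, H_M, H_V = 4, 8, 16

# Two workers with non-IID local quadratics f_k(w) = (w - c_k)^2.
workers = [
    {"w": 0.0, "m": 0.0, "v": 0.0, "c": 1.0},
    {"w": 0.0, "m": 0.0, "v": 0.0, "c": 3.0},
]
for t in range(1, 65):
    for wk in workers:
        g = 2 * (wk["w"] - wk["c"])    # gradient of the local quadratic
        wk["w"], wk["m"], wk["v"] = adam_step(wk["w"], wk["m"], wk["v"], g, t)
    if t % H_PARAM == 0:               # "all-reduce" parameters
        avg_w = mean([wk["w"] for wk in workers])
        for wk in workers:
            wk["w"] = avg_w
    if t % H_M == 0:                   # "all-reduce" first moments
        avg_m = mean([wk["m"] for wk in workers])
        for wk in workers:
            wk["m"] = avg_m
    if t % H_V == 0:                   # "all-reduce" second moments
        avg_v = mean([wk["v"] for wk in workers])
        for wk in workers:
            wk["v"] = avg_v

# Workers should drift toward the consensus minimizer (w = 2 for these
# quadratics) while staying exactly synchronized at sync boundaries.
print(round(workers[0]["w"], 2))
```

In a real implementation the `mean(...)` loops would be collective all-reduces over tensors, but the scheduling structure is the same.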

Motivation

Current infrequent communication methods often struggle with stability and lack convergence guarantees when applied to adaptive optimizers. Integrating these methods offers three specific technical advantages:

  1. Aggregating optimizer states (as done in Local Adam/DES-LOC) provides a provably convergent algorithm under standard assumptions, even with Non-IID data/loss functions. This improves training stability compared to synchronizing model parameters alone.

  2. These methods solve initialization issues when the number of workers changes (e.g., pausing and restarting with diverse worker counts). By maintaining an aggregated optimizer state, we ensure a low-variance initialization for new or rejoining workers, avoiding the pitfalls of standard state checkpointing in elastic settings.

  3. Aggregating optimizer states can improve performance with the right hyperparameter choices. For example, MT-DAO introduces a quasi-hyperbolic momentum formulation with a low decay rate for Adam. In this setup, the aggregated first momentum changes very slowly and acts as a regularizer, improving empirical performance.
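For point 3, a minimal sketch of a quasi-hyperbolic Adam-style inner update may help (in the spirit of the MT-DAO description above; the mixing constant `nu` and the exact rule here are assumptions for illustration, not the paper's formulation):

```python
import math

def qh_adam_step(w, m, v, g, t, lr=0.01, b1=0.99, b2=0.999,
                 nu=0.7, eps=1e-8):
    """One quasi-hyperbolic Adam-style step (illustrative constants).

    The update direction mixes the bias-corrected first moment with the
    raw gradient: nu * m_hat + (1 - nu) * g. Because the gradient still
    carries (1 - nu) of the update, a slow-moving first moment (high b1,
    i.e. a low decay rate of accumulated information) remains stable and
    acts like the regularizer described above.
    """
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    num = nu * m_hat + (1 - nu) * g    # quasi-hyperbolic numerator
    return w - lr * num / (math.sqrt(v_hat) + eps), m, v

# Minimize f(w) = (w - 5)^2 with a slowly decaying first moment.
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 3001):
    g = 2 * (w - 5.0)
    w, m, v = qh_adam_step(w, m, v, g, t)
```

Setting `nu = 1` recovers plain Adam, so the rule is a strict generalization.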

Proposed Implementation

The implementation would serve as a drop-in replacement for Local SGD. To support future extensibility, it introduces optimizer state fragments analogous to the model state fragments in Streaming DiLoCo.
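As a rough sketch of what optimizer state fragments could look like (all names hypothetical, not torchft's API): partition the per-parameter (m, v) states into contiguous fragments and synchronize one fragment per outer step, round-robin, so communication is staggered the way Streaming DiLoCo staggers model fragments.

```python
def make_fragments(num_params, num_fragments):
    """Partition parameter indices into contiguous fragments."""
    size = -(-num_params // num_fragments)   # ceiling division
    return [list(range(i, min(i + size, num_params)))
            for i in range(0, num_params, size)]

def sync_fragment(worker_states, fragment):
    """Average one fragment of (m, v) pairs across workers, in place."""
    n = len(worker_states)
    for idx in fragment:
        avg_m = sum(ws[idx][0] for ws in worker_states) / n
        avg_v = sum(ws[idx][1] for ws in worker_states) / n
        for ws in worker_states:
            ws[idx] = (avg_m, avg_v)

# Two workers, six parameters, three fragments synced round-robin.
states = [[(float(i), 1.0) for i in range(6)],
          [(float(i) + 2.0, 3.0) for i in range(6)]]
fragments = make_fragments(6, 3)
for frag in fragments:                       # one fragment per outer step
    sync_fragment(states, frag)

# After one full round-robin pass, every fragment agrees across workers.
```

In torchft the averaging loop would become a collective over the corresponding state tensors, and the fragment schedule could later be generalized to arbitrary outer optimizers or streaming variants.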

Next Steps

I am happy to open a Pull Request with the implementation. Please let me know if this aligns with the project roadmap or if there are specific design constraints I should consider. Future extensions may include support for arbitrary outer optimizers and/or streaming variants.
