Fix evaluate's overflow + distributed config #16

Merged
moskomule merged 3 commits into main from fix on Feb 18, 2026
Conversation

@moskomule (Owner)

This pull request refactors the distributed training configuration system to simplify and unify how distributed strategies (DDP and FSDP) are handled. Instead of having separate DDP and FSDP classes, a single Distributed class now supports both strategies using configuration parameters. The changes also update related logic throughout the codebase to use this new unified approach and improve evaluation loss calculation.

Distributed training configuration refactor:

  • Removed the DDP and FSDP classes and replaced them with a unified Distributed class that uses dp_replicate_degree and dp_shard_degree to select between DDP and FSDP. Additional FSDP-specific options are now part of the Distributed class, and a __post_init__ check ensures only supported configurations are used. (sarasa/config.py [1] [2] [3] [4]; sarasa/__init__.py [5]; configs/llama3-1b.py [6] [7])

  • Updated all code that previously referenced DDP or FSDP to use the new Distributed class, including configuration creation and CLI loading. (sarasa/config.py [1] [2]; configs/llama3-1b.py [3] [4])

  • Refactored distributed application logic: selection between DDP and FSDP is now based on the values of dp_replicate_degree and dp_shard_degree, and dtype handling is passed explicitly instead of being set on the config. (sarasa/utils.py [1] [2]; sarasa/train.py [3] [4])
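
The unified configuration described above might look roughly like the following. This is a hedged sketch: only the class name and the dp_replicate_degree / dp_shard_degree fields come from this PR, and the validation mirrors the __post_init__ check discussed in the review comments below; everything else is assumed.

```python
import dataclasses


@dataclasses.dataclass
class Distributed:
    """Sketch of the unified distributed config (not the actual sarasa/config.py)."""

    dp_replicate_degree: int = 1   # > 1 would mean DDP-style replication
    dp_shard_degree: int = -1      # -1 means full FSDP sharding

    def __post_init__(self) -> None:
        # Only the default combination (pure FSDP) is accepted, matching the
        # check flagged by the review below.
        if not (self.dp_replicate_degree == 1 and self.dp_shard_degree == -1):
            raise NotImplementedError(
                f"Unsupported configuration: "
                f"dp_replicate_degree={self.dp_replicate_degree}, "
                f"dp_shard_degree={self.dp_shard_degree}"
            )
```

With this shape, `Distributed()` constructs the default FSDP configuration, while any other combination fails fast at construction time rather than deep inside training setup.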

Evaluation and testing improvements:

  • Improved loss calculation in evaluation: now accumulates per-batch losses and divides by the total number of valid tokens for better accuracy. (sarasa/evaluate.py [1] [2])

  • Updated or removed tests to reflect the new configuration approach and removed tests that depended on the old FSDP class. (tests/test_config.py [1] [2])
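
The evaluation fix amounts to summing losses and valid-token counts separately and dividing once at the end. The following is a framework-agnostic sketch of that accumulation order (the actual sarasa/evaluate.py works on tensors and, per the review below, must all_reduce both sums across ranks before the final division):

```python
def mean_eval_loss(batches):
    """Token-weighted mean loss over evaluation batches.

    `batches` yields (loss_sum, num_valid_tokens) pairs, where loss_sum is the
    summed (not averaged) loss over that batch's valid tokens. Summing first
    and dividing once weights every token equally; averaging per-batch means
    would over-weight short batches. In a distributed run, both running sums
    are all_reduce'd across ranks *before* the division -- the ordering this
    PR fixes.
    """
    total_loss = 0.0
    total_tokens = 0
    for loss_sum, num_tokens in batches:
        total_loss += loss_sum
        total_tokens += num_tokens
    # Guard against an empty eval set to avoid division by zero.
    return total_loss / max(total_tokens, 1)
```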

Copilot AI review requested due to automatic review settings February 18, 2026 14:01

Copilot AI left a comment


Pull request overview

This PR refactors the distributed training configuration system to use a unified Distributed class instead of separate DDP and FSDP classes. The changes also fix an evaluation loss calculation bug where total_tokens was accumulated before the all_reduce operation in distributed settings.

Changes:

  • Unified distributed configuration using dp_replicate_degree and dp_shard_degree parameters instead of separate DDP/FSDP classes
  • Fixed evaluation loss calculation to correctly accumulate total_tokens after all_reduce operation
  • Updated imports, configuration files, and tests to use the new Distributed class

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
sarasa/config.py Replaced DDP/FSDP classes with unified Distributed class, added dp_replicate_degree and dp_shard_degree parameters, moved FSDP-specific options into Distributed class
sarasa/__init__.py Updated exports to use Distributed instead of DDP/FSDP
sarasa/utils.py Updated apply_distributed to check dp_replicate_degree/dp_shard_degree and accept explicit dtype parameters
sarasa/train.py Pass dtype parameters explicitly to apply_distributed, updated AMP context condition
sarasa/evaluate.py Fixed evaluation loss calculation to correctly handle distributed token counting
configs/llama3-1b.py Removed explicit FSDP configuration, now uses default Distributed
tests/test_config.py Removed test for old FSDP dtype inheritance, minor formatting change
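
For illustration, the strategy dispatch that apply_distributed now performs can be sketched as a small selector. Only the degree semantics (replicate > 1 with shard 1 for DDP; replicate 1 with shard -1 for FSDP) come from this PR; the function name and string returns here are hypothetical:

```python
def select_strategy(dp_replicate_degree: int, dp_shard_degree: int) -> str:
    """Pick a distributed strategy from the two degree values (sketch)."""
    if dp_replicate_degree == 1 and dp_shard_degree == -1:
        # No replication, full sharding: FSDP.
        return "fsdp"
    if dp_replicate_degree > 1 and dp_shard_degree == 1:
        # Replication without sharding: DDP.
        return "ddp"
    raise NotImplementedError(
        f"Unsupported: dp_replicate_degree={dp_replicate_degree}, "
        f"dp_shard_degree={dp_shard_degree}"
    )
```

In the real code path the model wrapping and the explicitly passed param/reduce dtypes would happen inside each branch rather than returning a label.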


Comment on lines +191 to +192
if not (self.dp_replicate_degree == 1 and self.dp_shard_degree == -1):
raise NotImplementedError()

Copilot AI Feb 18, 2026


The __post_init__ validation logic is incorrect. It only allows the default configuration where dp_replicate_degree=1 and dp_shard_degree=-1 (FSDP mode), but rejects any other configuration, including DDP mode. This means users cannot actually use DDP with this new unified class. The validation should allow either DDP (dp_replicate_degree != 1, dp_shard_degree == 1) or FSDP (dp_replicate_degree == 1, dp_shard_degree == -1), not only FSDP.

Suggested change
if not (self.dp_replicate_degree == 1 and self.dp_shard_degree == -1):
raise NotImplementedError()
# Allow either:
# - FSDP mode: no replication, full sharding
# - DDP mode: replication, no sharding
is_fsdp = self.dp_replicate_degree == 1 and self.dp_shard_degree == -1
is_ddp = self.dp_replicate_degree != 1 and self.dp_shard_degree == 1
if not (is_fsdp or is_ddp):
raise NotImplementedError(
f"Unsupported distributed configuration: "
f"dp_replicate_degree={self.dp_replicate_degree}, "
f"dp_shard_degree={self.dp_shard_degree}. "
"Supported modes are: "
"FSDP (dp_replicate_degree=1, dp_shard_degree=-1) or "
"DDP (dp_replicate_degree!=1, dp_shard_degree=1)."
)

@dataclasses.dataclass
class Distributed:
dp_replicate_degree: int = 1
"""Degree of DDP. 1 means no replication"""

Copilot AI Feb 18, 2026


The docstring is misleading. "Degree of DDP" suggests this controls the degree of data parallelism when using DDP, but a value of 1 means "no replication" which implies no DDP at all. Consider clarifying this to something like "Data parallelism replication factor. Set to 1 to disable DDP replication (use FSDP or single-device training), or > 1 to enable DDP with that replication factor."

Suggested change
"""Degree of DDP. 1 means no replication"""
"""Data parallelism replication factor. Set to 1 to disable DDP replication
(use FSDP or single-device training), or > 1 to enable DDP with that
replication factor."""

"""Degree of DDP. 1 means no replication"""

dp_shard_degree: int = -1
"""Degree of FSDP. -1 means full sharding"""

Copilot AI Feb 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docstring is misleading. "Degree of FSDP" suggests this controls the degree of sharding when using FSDP, but the value -1 has a special meaning for "full sharding" which is not a degree but a mode indicator. Consider clarifying this to something like "Sharding degree for FSDP. Set to -1 for full sharding, 1 for no sharding (use DDP or single-device training), or > 1 for partial sharding with that degree."

Suggested change
"""Degree of FSDP. -1 means full sharding"""
"""Sharding degree for FSDP. Set to -1 for full sharding (the only supported mode)."""

moskomule merged commit a41752b into main on Feb 18, 2026
2 checks passed
moskomule deleted the fix branch on February 18, 2026 at 15:22