Conversation
Pull request overview
This PR refactors the distributed training configuration system to use a unified Distributed class instead of separate DDP and FSDP classes. The changes also fix an evaluation loss calculation bug where total_tokens was accumulated before the all_reduce operation in distributed settings.
Changes:
- Unified distributed configuration using `dp_replicate_degree` and `dp_shard_degree` parameters instead of separate DDP/FSDP classes
- Fixed evaluation loss calculation to correctly accumulate `total_tokens` after the `all_reduce` operation
- Updated imports, configuration files, and tests to use the new `Distributed` class
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| sarasa/config.py | Replaced DDP/FSDP classes with unified Distributed class, added dp_replicate_degree and dp_shard_degree parameters, moved FSDP-specific options into Distributed class |
| sarasa/__init__.py | Updated exports to use Distributed instead of DDP/FSDP |
| sarasa/utils.py | Updated apply_distributed to check dp_replicate_degree/dp_shard_degree and accept explicit dtype parameters |
| sarasa/train.py | Pass dtype parameters explicitly to apply_distributed, updated AMP context condition |
| sarasa/evaluate.py | Fixed evaluation loss calculation to correctly handle distributed token counting |
| configs/llama3-1b.py | Removed explicit FSDP configuration, now uses default Distributed |
| tests/test_config.py | Removed test for old FSDP dtype inheritance, minor formatting change |
```python
if not (self.dp_replicate_degree == 1 and self.dp_shard_degree == -1):
    raise NotImplementedError()
```
The `__post_init__` validation logic is incorrect. It only allows the default configuration where `dp_replicate_degree=1` and `dp_shard_degree=-1` (FSDP mode), and rejects any other configuration, including DDP mode. This means users cannot actually use DDP with the new unified class. The validation should allow either DDP (`dp_replicate_degree != 1, dp_shard_degree == 1`) or FSDP (`dp_replicate_degree == 1, dp_shard_degree == -1`), not only FSDP.
```suggestion
# Allow either:
# - FSDP mode: no replication, full sharding
# - DDP mode: replication, no sharding
is_fsdp = self.dp_replicate_degree == 1 and self.dp_shard_degree == -1
is_ddp = self.dp_replicate_degree != 1 and self.dp_shard_degree == 1
if not (is_fsdp or is_ddp):
    raise NotImplementedError(
        f"Unsupported distributed configuration: "
        f"dp_replicate_degree={self.dp_replicate_degree}, "
        f"dp_shard_degree={self.dp_shard_degree}. "
        "Supported modes are: "
        "FSDP (dp_replicate_degree=1, dp_shard_degree=-1) or "
        "DDP (dp_replicate_degree!=1, dp_shard_degree=1)."
    )
```
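To illustrate, here is a minimal standalone sketch of a `Distributed` dataclass with the corrected two-mode validation. The field names and defaults are taken from the diff, but everything else (no other fields, message text) is assumed for illustration, not the actual `sarasa/config.py` implementation:

```python
import dataclasses


@dataclasses.dataclass
class Distributed:
    dp_replicate_degree: int = 1  # DDP replication factor; 1 disables replication
    dp_shard_degree: int = -1     # FSDP sharding degree; -1 means full sharding

    def __post_init__(self) -> None:
        # FSDP mode: no replication, full sharding
        is_fsdp = self.dp_replicate_degree == 1 and self.dp_shard_degree == -1
        # DDP mode: replication, no sharding
        is_ddp = self.dp_replicate_degree != 1 and self.dp_shard_degree == 1
        if not (is_fsdp or is_ddp):
            raise NotImplementedError(
                f"Unsupported distributed configuration: "
                f"dp_replicate_degree={self.dp_replicate_degree}, "
                f"dp_shard_degree={self.dp_shard_degree}."
            )
```

With this version, `Distributed()` (default FSDP) and `Distributed(dp_replicate_degree=8, dp_shard_degree=1)` (DDP) both construct successfully, while mixed configurations such as `Distributed(dp_replicate_degree=2, dp_shard_degree=-1)` still raise.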
```python
@dataclasses.dataclass
class Distributed:
    dp_replicate_degree: int = 1
    """Degree of DDP. 1 means no replication"""
```
The docstring is misleading. "Degree of DDP" suggests this controls the degree of data parallelism when using DDP, but a value of 1 means "no replication" which implies no DDP at all. Consider clarifying this to something like "Data parallelism replication factor. Set to 1 to disable DDP replication (use FSDP or single-device training), or > 1 to enable DDP with that replication factor."
| """Degree of DDP. 1 means no replication""" | |
| """Data parallelism replication factor. Set to 1 to disable DDP replication | |
| (use FSDP or single-device training), or > 1 to enable DDP with that | |
| replication factor.""" |
| """Degree of DDP. 1 means no replication""" | ||
|
|
||
| dp_shard_degree: int = -1 | ||
| """Degree of FSDP. -1 means full sharding""" |
The docstring is misleading. "Degree of FSDP" suggests this controls the degree of sharding when using FSDP, but the value -1 has a special meaning for "full sharding" which is not a degree but a mode indicator. Consider clarifying this to something like "Sharding degree for FSDP. Set to -1 for full sharding, 1 for no sharding (use DDP or single-device training), or > 1 for partial sharding with that degree."
| """Degree of FSDP. -1 means full sharding""" | |
| """Sharding degree for FSDP. Set to -1 for full sharding (the only supported mode).""" |
This pull request refactors the distributed training configuration system to simplify and unify how distributed strategies (DDP and FSDP) are handled. Instead of separate `DDP` and `FSDP` classes, a single `Distributed` class now supports both strategies through configuration parameters. The changes also update related logic throughout the codebase to use the new unified approach and improve the evaluation loss calculation.

Distributed training configuration refactor:
- Removed the `DDP` and `FSDP` classes and replaced them with a unified `Distributed` class that uses `dp_replicate_degree` and `dp_shard_degree` to select between DDP and FSDP. Additional FSDP-specific options are now part of the `Distributed` class, and a `__post_init__` check ensures only supported configurations are used. (`sarasa/config.py` [1] [2] [3] [4]; `sarasa/__init__.py` [5]; `configs/llama3-1b.py` [6] [7])
- Updated all code that previously referenced `DDP` or `FSDP` to use the new `Distributed` class, including configuration creation and CLI loading. (`sarasa/config.py` [1] [2]; `configs/llama3-1b.py` [3] [4])
- Refactored distributed application logic: selection between DDP and FSDP is now based on the values of `dp_replicate_degree` and `dp_shard_degree`, and dtype handling is passed explicitly instead of being set on the config. (`sarasa/utils.py` [1] [2]; `sarasa/train.py` [3] [4])
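The strategy selection described above can be sketched as a small standalone function. This is a hypothetical illustration of the dispatch rule, not the actual `apply_distributed` from `sarasa/utils.py`; the function name `select_strategy` and the string return values are assumptions:

```python
def select_strategy(dp_replicate_degree: int, dp_shard_degree: int) -> str:
    """Pick the wrapping strategy from the two degree parameters (hypothetical sketch)."""
    if dp_replicate_degree == 1 and dp_shard_degree == -1:
        # No replication, full sharding -> FSDP
        return "fsdp"
    if dp_replicate_degree != 1 and dp_shard_degree == 1:
        # Replication, no sharding -> DDP
        return "ddp"
    raise NotImplementedError(
        f"Unsupported combination: dp_replicate_degree={dp_replicate_degree}, "
        f"dp_shard_degree={dp_shard_degree}"
    )
```

The point of the design is that one pair of integers encodes the mode, so the rest of the codebase never has to branch on a class type.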
Evaluation and testing improvements:

- Improved loss calculation in evaluation: it now accumulates per-batch losses and divides by the total number of valid tokens for better accuracy. (`sarasa/evaluate.py` [1] [2])
- Updated or removed tests to reflect the new configuration approach, and removed tests that depended on the old `FSDP` class. (`tests/test_config.py` [1] [2])
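The evaluation fix is about ordering: the reduced (global) per-batch values must be accumulated, rather than accumulating local values before the reduction. A pure-Python simulation of that corrected ordering is sketched below; the `all_reduce_sum` helper stands in for `torch.distributed.all_reduce` with a SUM op, and the data layout and function names are assumptions for illustration, not the actual `sarasa/evaluate.py` code:

```python
def all_reduce_sum(values):
    # Stand-in for torch.distributed.all_reduce(SUM): every rank sees the global sum.
    return sum(values)

def eval_loss(per_rank_batches):
    """per_rank_batches: list over ranks; each rank holds a list of
    (loss_sum, n_valid_tokens) pairs, one per evaluation batch."""
    total_loss = 0.0
    total_tokens = 0
    n_batches = len(per_rank_batches[0])
    for b in range(n_batches):
        batch = [rank[b] for rank in per_rank_batches]
        # The fix: all_reduce first, then accumulate the reduced values,
        # so total_tokens counts every rank's valid tokens exactly once.
        total_loss += all_reduce_sum(loss for loss, _ in batch)
        total_tokens += all_reduce_sum(tokens for _, tokens in batch)
    return total_loss / total_tokens
```

Dividing the globally summed loss by the globally summed token count weights each valid token equally, which is why per-batch accumulation after the reduction gives the more accurate figure.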