Rebase #514
Conversation
Co-authored-by: zhaochenyang20 <zhaochenyang20@gmail.com>
Co-authored-by: PopSoda2002 <zhouhp.me@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: zijiexia <zijie_xia@icloud.com>
Co-authored-by: zhaochenyang20 <zhaochen20@outlook.com>
Co-authored-by: Ratish1 <formula733@gmail.com>
…tron & FSDP alignment (radixark#412)
…4 training, bug fix, etc. (radixark#426)
Co-authored-by: Xinyu Jiang <xinyuj2@andrew.cmu.edu>
Co-authored-by: Jiajun Li <guapisolo@gmail.com>
Co-authored-by: Yusheng Su <radixark@ac-h200-user-3.tail134ba0.ts.net>
Co-authored-by: Ethan (Yusheng) Su <yushengsu.thu@gmail.com>
Summary of Changes

Hello @xiuhu17, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces several key updates and new features to the repository. It focuses on improving build reproducibility, enhancing attention mechanisms, adding support for FSDP training, and providing more customization options. The changes include updates to build scripts, Dockerfiles, code implementations, and documentation, all aimed at improving the performance, stability, and flexibility of the system.
Code Review
This pull request appears to be a major rebase, incorporating a wide range of new features, refactoring, and dependency updates. The changes are extensive, touching many parts of the codebase.
Key improvements include the introduction of a unified training loop structure, better support for FSDP, and new features like INT4 training, on-policy distillation, and integrations with Terminal Bench and SWE-agent. The refactoring efforts, such as creating a ParallelState and centralizing data processing logic, are commendable and should improve maintainability.
However, there is a significant concern regarding the monkey-patching of torch.distributed._shard.sharded_tensor.ShardedTensor in miles/backends/megatron_utils/checkpoint.py to disable validation. While this might be a necessary performance optimization for now, it is a risky hack that could lead to subtle bugs and become a maintenance burden. It would be preferable to find a solution through official APIs or contribute an improvement upstream to PyTorch.
I've also identified a potential regression where speculative decoding metrics might no longer be logged. Please see the specific comment for details.
```python
log_dict = {}
log_dict |= dict_add_prefix(compute_statistics(response_lengths), "response_len/")
log_dict |= _compute_zero_std_metrics(args, samples)
log_dict |= _compute_spec_metrics(args, samples)
```
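For context on this pattern: each `compute_*` helper returns a flat dict of metrics, and `|=` (the in-place dict-union operator, Python 3.9+) merges them into `log_dict`. Below is a minimal sketch of what a helper like `dict_add_prefix` presumably does; this is hypothetical, as the actual implementation in the miles codebase is not shown in this diff.

```python
def dict_add_prefix(d: dict, prefix: str) -> dict:
    """Return a copy of `d` with `prefix` prepended to every key.

    Hypothetical sketch of the helper used above; the real miles
    implementation may differ.
    """
    return {f"{prefix}{key}": value for key, value in d.items()}


# Example: namespacing response-length statistics before merging them
# into the shared log_dict with the in-place dict-union operator.
log_dict = {}
log_dict |= dict_add_prefix({"mean": 128.0, "max": 512.0}, "response_len/")
assert log_dict == {"response_len/mean": 128.0, "response_len/max": 512.0}
```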
```dockerfile
# mv /usr/local/lib/python3.12/dist-packages/nvidia/nccl/lib/libnccl.so.2 /usr/local/lib/python3.12/dist-packages/nvidia/nccl/lib/libnccl.so.2.bak && \
# cp -r third_party/nccl/build/lib/* /usr/local/lib/python3.12/dist-packages/nvidia/nccl/lib/

RUN [ ! -f /root/.tmux.conf ] || rm /root/.tmux.conf
```
Code Review
This pull request introduces a significant and wide-ranging set of changes, primarily focused on refactoring the training backends for better modularity and adding extensive new features. Key improvements include the introduction of a ParallelState abstraction to unify parallelism logic, major refactoring of the FSDP and Megatron actors, and the addition of fault tolerance for rollout engines. The PR also adds support for new models like DeepSeek-V3.2, new agent frameworks such as strands-sglang and swe-agent, and enhances low-precision training capabilities. The documentation has been substantially updated to reflect these new features. My review identifies one area of concern regarding a monkey-patch for performance optimization, which could pose a future maintenance risk.
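The `ParallelState` abstraction is only described at a high level in this review. As a rough, hypothetical illustration of what unifying parallelism logic behind a single object can look like, consider the sketch below; all names, fields, and the rank layout are assumptions, not the PR's actual API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelState:
    """Hypothetical sketch of a parallelism-state object; the PR's real
    ParallelState is not shown in this review and may differ."""

    dp_size: int = 1  # data-parallel group size
    tp_size: int = 1  # tensor-parallel group size
    pp_size: int = 1  # pipeline-parallel group size

    @property
    def world_size(self) -> int:
        # Total ranks when the three parallel dimensions are composed.
        return self.dp_size * self.tp_size * self.pp_size

    def dp_rank(self, global_rank: int) -> int:
        # Assumes ranks vary fastest over TP, then PP, then DP,
        # which is one common (but not universal) rank layout.
        return global_rank // (self.tp_size * self.pp_size)


# Usage: both the FSDP and Megatron backends can consult one shared object
# instead of each re-deriving group sizes from environment variables.
state = ParallelState(dp_size=4, tp_size=2, pp_size=2)
assert state.world_size == 16
assert state.dp_rank(global_rank=5) == 1
```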
```python
from miles.utils import megatron_bridge_utils

try:
    # Here we patch out `validate_non_overlapping_shards_metadata` in both
    # functions because it is really slow for large models with many shards.
    # TODO: find a less hacky way to do this.
    import torch.distributed as dist
    import torch.distributed._shard.sharding_spec as shard_spec
    from torch.distributed._shard.sharded_tensor import ShardedTensor
    from torch.distributed._shard.sharded_tensor.metadata import ShardedTensorMetadata
    from torch.distributed._shard.sharded_tensor.shard import Shard
    from torch.distributed._shard.sharded_tensor.utils import _parse_and_validate_remote_device
    from torch.distributed._shard.sharding_spec.api import EnumerableShardingSpec

    # Replace the dataclass post-init hook so that constructing an
    # EnumerableShardingSpec skips the expensive shard-overlap validation.
    def __post_init__(self):
        pass

    EnumerableShardingSpec.__post_init__ = __post_init__

    @classmethod
    def _init_from_local_shards_and_global_metadata(  # type: ignore[override]
        cls,
        local_shards: list[Shard],
        sharded_tensor_metadata: ShardedTensorMetadata,
        process_group=None,
        init_rrefs=False,
        sharding_spec=None,
    ) -> ShardedTensor:
        """
        Initialize a ShardedTensor with local shards and a global
        ShardedTensorMetadata built on each rank.

        Warning: This API is experimental and subject to change. It does
        not do cross-rank validations, and fully relies on the user
        for the correctness of sharded_tensor_metadata on each rank.
        """
        process_group = cls._normalize_pg(process_group)
        current_rank = dist.get_rank()  # intentional to get global rank

        shards_metadata = sharded_tensor_metadata.shards_metadata

        # Kept from the upstream implementation; unused now that the
        # cross-rank validation step has been removed.
        local_shard_metadatas = []

        # collect local shard metadatas from the global sharded_tensor_metadata
        for shard_metadata in shards_metadata:  # type: ignore[attr-defined]
            rank, local_device = _parse_and_validate_remote_device(process_group, shard_metadata.placement)

            if current_rank == rank:
                local_shard_metadatas.append(shard_metadata)

        tensor_properties = sharded_tensor_metadata.tensor_properties

        if sharding_spec is None:
            spec = shard_spec._infer_sharding_spec_from_shards_metadata(shards_metadata)
        else:
            spec = sharding_spec

        sharded_tensor = ShardedTensor.__new__(
            ShardedTensor,
            spec,
            sharded_tensor_metadata.size,
            dtype=tensor_properties.dtype,
            layout=tensor_properties.layout,
            pin_memory=tensor_properties.pin_memory,
            requires_grad=tensor_properties.requires_grad,
        )

        # validation skipped; attach local_shards directly
        sharded_tensor._local_shards = local_shards
        sharded_tensor._prepare_init(process_group=process_group, init_rrefs=init_rrefs)

        # run post initialization, i.e. map registration, rpc initialization
        sharded_tensor._post_init()
        return sharded_tensor

    ShardedTensor._init_from_local_shards_and_global_metadata = _init_from_local_shards_and_global_metadata

except ImportError:
    pass
```
This large try...except block monkey-patches PyTorch's ShardedTensor and EnumerableShardingSpec to bypass a performance-intensive validation step. While the performance gain might be necessary for large models, this approach is brittle and poses a significant maintenance risk: it is likely to break with future PyTorch updates.

To mitigate this risk, consider the following:

- Add PyTorch version checks: gate this patch to specific PyTorch versions that are known to be compatible. This will prevent silent failures or unexpected behavior when the library is updated.
- Improve error handling: instead of a silent `except ImportError: pass`, log a warning when the patching fails. This would make it clear that the performance optimization is not being applied.
- Upstream the issue: if this is a general performance problem in PyTorch's distributed checkpointing, report it to the PyTorch team. They might provide a proper API to disable this validation or offer a more efficient implementation in the future.

The TODO comment indicates awareness of the issue, but strengthening the implementation with version checks and better error handling would make this less risky. A sketch of the first two suggestions follows.
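As a concrete illustration of the first two suggestions, the patch could be gated on a tested PyTorch version range and emit a warning when it is skipped. This is a sketch, not the PR's code; the version bounds below are assumptions and would need to be verified against the releases the patch was actually tested on.

```python
import logging

import torch
from packaging.version import Version

logger = logging.getLogger(__name__)


def _torch_version_supported() -> bool:
    # Gate the patch to a range it was verified against (bounds assumed here).
    version = Version(torch.__version__.split("+")[0])
    return Version("2.1") <= version < Version("2.6")


if _torch_version_supported():
    try:
        from torch.distributed._shard.sharding_spec.api import EnumerableShardingSpec

        # Skip the slow shard-overlap validation, as in the original patch.
        EnumerableShardingSpec.__post_init__ = lambda self: None
    except ImportError as exc:
        logger.warning(
            "ShardedTensor validation patch not applied (%s); "
            "checkpoint loading may be slow for large models.",
            exc,
        )
else:
    logger.warning(
        "torch %s is outside the tested range for the ShardedTensor "
        "validation patch; skipping it.",
        torch.__version__,
    )
```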