
Conversation

@xiuhu17 xiuhu17 commented Jan 24, 2026

No description provided.

Ratish1 and others added 30 commits December 30, 2025 19:40
Co-authored-by: zhaochenyang20 <zhaochenyang20@gmail.com>
Co-authored-by: PopSoda2002 <zhouhp.me@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: zijiexia <zijie_xia@icloud.com>
Co-authored-by: zhaochenyang20 <zhaochen20@outlook.com>
Co-authored-by: zhaochenyang20 <zhaochen20@outlook.com>
Co-authored-by: Ratish1 <formula733@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: zhaochenyang20 <zhaochen20@outlook.com>
Co-authored-by: Xinyu Jiang <xinyuj2@andrew.cmu.edu>
Co-authored-by: Jiajun Li <guapisolo@gmail.com>
Co-authored-by: Yusheng Su <radixark@ac-h200-user-3.tail134ba0.ts.net>
@gemini-code-assist
Contributor

Summary of Changes

Hello @xiuhu17, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces several key updates and new features to the repository. It focuses on improving build reproducibility, enhancing attention mechanisms, adding support for FSDP training, and providing more customization options. The changes include updates to build scripts, Dockerfiles, code implementations, and documentation, all aimed at improving the performance, stability, and flexibility of the system.

Highlights

  • New File: .github/CODEOWNERS: Adds a CODEOWNERS file to define responsible parties for different parts of the repository, improving code review workflow.
  • New File: .gitmodules: Introduces a .gitmodules file, incorporating submodules for nemo-gym and mini-swe-agent, facilitating better organization of external dependencies.
  • Build Script Modifications: Updates the build_conda.sh script to specify SGLANG and MEGATRON commit hashes, ensuring reproducible builds and compatibility.
  • Dockerfile Updates: Modifies the main Dockerfile to align with the updated SGLANG and MEGATRON commit hashes, and adds tilelang installation.
  • Dockerfile ROCm Modifications: Updates the ROCm Dockerfile to align with the latest changes and patches, ensuring compatibility with the ROCm platform.
  • Dockerfile DeepseekV32 Modifications: Adds a new Dockerfile for DeepseekV32, including necessary dependencies, patches, and configurations.
  • Documentation Updates: Updates the docker/README.md to reflect the current stable versions and adds new documentation for PD Disaggregation and miles Router Middleware.
  • DSA Implementation: Implements DSA (DeepSeek Sparse Attention) and integrates it with context parallelism, enhancing attention mechanisms.
  • MoE Routing Replay: Adds routing replay to stabilize MoE RL training.
  • TileLang Kernel Integration: Integrates TileLang kernels for sparse MLA (Multi-head Latent Attention), improving performance.
  • FSDP Training Support: Adds support for FSDP (Fully Sharded Data Parallel) as a training backend, enabling direct loading of HuggingFace format weights.
  • Customization Options: Provides extensive customization options through function path arguments, allowing injection of custom logic at various stages (see the sketch after this list).
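
To make the last highlight concrete, here is a minimal, illustrative sketch of how a function-path argument can be resolved at runtime. The helper name load_function, the path syntax, and the flag shown in the usage comment are hypothetical and not taken from this PR.

```python
import importlib


def load_function(path: str):
    """Resolve a dotted function path such as "my_pkg.rewards:compute_reward".

    Accepts both "module:attr" and "module.attr" forms and returns the callable.
    """
    if ":" in path:
        module_name, _, attr = path.partition(":")
    else:
        module_name, _, attr = path.rpartition(".")
    func = getattr(importlib.import_module(module_name), attr)
    if not callable(func):
        raise TypeError(f"{path} does not resolve to a callable")
    return func


# Hypothetical usage: a flag like --custom-reward-fn my_pkg.rewards:compute_reward
# would be resolved once at startup and the callable injected into the training loop.
# custom_reward = load_function("my_pkg.rewards:compute_reward")
```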
Ignored Files
  • Ignored by pattern: .github/workflows/** (4)
    • .github/workflows/conda-ci.yml
    • .github/workflows/pr-test.yml
    • .github/workflows/pr-test.yml.j2
    • .github/workflows/release-docs.yaml
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in pull request comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request appears to be a major rebase, incorporating a wide range of new features, refactoring, and dependency updates. The changes are extensive, touching many parts of the codebase.

Key improvements include the introduction of a unified training loop structure, better support for FSDP, and new features like INT4 training, on-policy distillation, and integrations with Terminal Bench and SWE-agent. The refactoring efforts, such as creating a ParallelState and centralizing data processing logic, are commendable and should improve maintainability.

However, there is a significant concern regarding the monkey-patching of torch.distributed._shard.sharded_tensor.ShardedTensor in miles/backends/megatron_utils/checkpoint.py to disable validation. While this might be a necessary performance optimization for now, it is a risky hack that could lead to subtle bugs and become a maintenance burden. It would be preferable to find a solution through official APIs or contribute an improvement upstream to PyTorch.

I've also identified a potential regression where speculative decoding metrics might no longer be logged. Please see the specific comment for details.

log_dict = {}
log_dict |= dict_add_prefix(compute_statistics(response_lengths), "response_len/")
log_dict |= _compute_zero_std_metrics(args, samples)
log_dict |= _compute_spec_metrics(args, samples)

Severity: high

The call to _compute_spec_metrics seems to have been accidentally removed. This will prevent speculative decoding metrics (like acceptance rate and length) from being logged, which is a regression. This line should be restored to ensure these important metrics are tracked.
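
As illustration only, a hedged sketch of the kind of aggregation such a speculative-decoding metrics helper performs. The sample fields spec_accepted_tokens and spec_draft_tokens and the args flag are assumptions made for this sketch, not the repository's actual schema.

```python
def compute_spec_metrics_sketch(args, samples) -> dict:
    """Aggregate speculative-decoding statistics across rollout samples.

    Assumes each sample is a dict carrying accepted/draft token counters and
    returns an empty dict when speculative decoding is disabled.
    """
    if not getattr(args, "speculative_decoding", False):
        return {}
    accepted = sum(s.get("spec_accepted_tokens", 0) for s in samples)
    drafted = sum(s.get("spec_draft_tokens", 0) for s in samples)
    if drafted == 0:
        return {}
    return {
        "spec/acceptance_rate": accepted / drafted,
        "spec/accepted_tokens_per_sample": accepted / max(len(samples), 1),
    }
```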

# mv /usr/local/lib/python3.12/dist-packages/nvidia/nccl/lib/libnccl.so.2 /usr/local/lib/python3.12/dist-packages/nvidia/nccl/lib/libnccl.so.2.bak && \
# cp -r third_party/nccl/build/lib/* /usr/local/lib/python3.12/dist-packages/nvidia/nccl/lib/

RUN [ ! -f /root/.tmux.conf ] || rm /root/.tmux.conf

Severity: medium

This command to remove .tmux.conf is a bit unconventional. Using rm -f is more idiomatic and readable for this purpose, as it will not error if the file does not exist and is simpler to understand.

RUN rm -f /root/.tmux.conf

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a significant and wide-ranging set of changes, primarily focused on refactoring the training backends for better modularity and adding extensive new features. Key improvements include the introduction of a ParallelState abstraction to unify parallelism logic, major refactoring of the FSDP and Megatron actors, and the addition of fault tolerance for rollout engines. The PR also adds support for new models like DeepSeek-V3.2, new agent frameworks such as strands-sglang and swe-agent, and enhances low-precision training capabilities. The documentation has been substantially updated to reflect these new features. My review identifies one area of concern regarding a monkey-patch for performance optimization, which could pose a future maintenance risk.

Comment on lines 11 to +91
from miles.utils import megatron_bridge_utils

try:
    # Here we patch out the `validate_non_overlapping_shards_metadata` in both functions
    # because it is really slow for large models with many shards.
    # TODO: find a less hacky way to do this.
    import torch.distributed as dist
    import torch.distributed._shard.sharding_spec as shard_spec
    from torch.distributed._shard.sharded_tensor import ShardedTensor
    from torch.distributed._shard.sharded_tensor.metadata import ShardedTensorMetadata
    from torch.distributed._shard.sharded_tensor.shard import Shard
    from torch.distributed._shard.sharded_tensor.utils import _parse_and_validate_remote_device
    from torch.distributed._shard.sharding_spec.api import EnumerableShardingSpec

    def __post_init__(self):
        pass

    EnumerableShardingSpec.__post_init__ = __post_init__

    @classmethod
    def _init_from_local_shards_and_global_metadata(  # type: ignore[override]
        cls,
        local_shards: list[Shard],
        sharded_tensor_metadata: ShardedTensorMetadata,
        process_group=None,
        init_rrefs=False,
        sharding_spec=None,
    ) -> ShardedTensor:
        """
        Initialize a ShardedTensor with local shards and a global
        ShardedTensorMetadata built on each rank.
        Warning: This API is experimental and subject to change. It does
        not do cross rank validations, and fully rely on the user
        for the correctness of sharded_tensor_metadata on each rank
        """
        process_group = cls._normalize_pg(process_group)
        current_rank = dist.get_rank()  # intentional to get global rank

        shards_metadata = sharded_tensor_metadata.shards_metadata

        local_shard_metadatas = []

        # collect local shard metadatas from the global sharded_tensor_metadata
        for shard_metadata in shards_metadata:  # type: ignore[attr-defined]
            rank, local_device = _parse_and_validate_remote_device(process_group, shard_metadata.placement)

            if current_rank == rank:
                local_shard_metadatas.append(shard_metadata)

        shards_metadata = sharded_tensor_metadata.shards_metadata
        tensor_properties = sharded_tensor_metadata.tensor_properties

        if sharding_spec is None:
            spec = shard_spec._infer_sharding_spec_from_shards_metadata(shards_metadata)
        else:
            spec = sharding_spec

        sharded_tensor = ShardedTensor.__new__(
            ShardedTensor,
            spec,
            sharded_tensor_metadata.size,
            dtype=tensor_properties.dtype,
            layout=tensor_properties.layout,
            pin_memory=tensor_properties.pin_memory,
            requires_grad=tensor_properties.requires_grad,
        )

        # done validation, add local_shards
        sharded_tensor._local_shards = local_shards
        sharded_tensor._prepare_init(process_group=process_group, init_rrefs=init_rrefs)

        # run post initialization, i.e. map registration, rpc initialization
        sharded_tensor._post_init()
        return sharded_tensor

    ShardedTensor._init_from_local_shards_and_global_metadata = _init_from_local_shards_and_global_metadata

except ImportError:
    pass


Severity: medium

This large try...except block monkey-patches PyTorch's ShardedTensor and EnumerableShardingSpec to bypass a performance-intensive validation step. While the performance gain might be necessary for large models, this approach is brittle and poses a significant maintenance risk. It's likely to break with future PyTorch updates.

To mitigate this risk, consider the following:

  1. Add PyTorch version checks: Gate this patch to specific PyTorch versions that are known to be compatible. This will prevent silent failures or unexpected behavior when the library is updated.
  2. Improve error handling: Instead of a silent except ImportError: pass, log a warning if the patching fails. This would make it clear that the performance optimization is not being applied.
  3. Upstream the issue: If this is a general performance problem in PyTorch's distributed checkpointing, it would be best to report it to the PyTorch team. They might provide a proper API to disable this validation or offer a more efficient implementation in the future.

The TODO comment indicates awareness of the issue, but strengthening the implementation with version checks and better error handling would make this less risky.
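
A minimal sketch of the gating pattern suggested in points 1 and 2 above, assuming the patch body stays as written in the diff; the PyTorch version bounds used here are placeholders, not tested constraints.

```python
import logging

from packaging import version

logger = logging.getLogger(__name__)

# Placeholder bounds: only apply the monkey-patch on PyTorch releases that it
# has actually been verified against.
_MIN_TORCH = version.parse("2.4")
_MAX_TORCH = version.parse("2.7")

try:
    import torch
    from torch.distributed._shard.sharded_tensor import ShardedTensor  # patch target

    torch_version = version.parse(torch.__version__)
    if _MIN_TORCH <= torch_version < _MAX_TORCH:
        # ... apply the ShardedTensor / EnumerableShardingSpec patch here ...
        pass
    else:
        logger.warning(
            "Skipping ShardedTensor validation patch: PyTorch %s is outside the "
            "tested range [%s, %s); checkpoint loading may be slower.",
            torch.__version__,
            _MIN_TORCH,
            _MAX_TORCH,
        )
except (ImportError, AttributeError) as exc:
    # Surface the failure instead of silently dropping the optimization.
    logger.warning("ShardedTensor validation patch not applied: %s", exc)
```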

@xiuhu17 xiuhu17 closed this Jan 24, 2026