
stand-alone rl #42

Merged
hjh0119 merged 4 commits into dev from rl-loss-3 on Feb 7, 2026

Conversation

@hjh0119
Collaborator

@hjh0119 commented on Feb 6, 2026

No description provided.

@gemini-code-assist
Contributor

Summary of Changes

Hello @hjh0119, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request overhauls the distributed RL training infrastructure by shifting from a tightly coupled 'hybrid mode' to a more flexible 'standalone mode'. This fundamental change enables greater scalability and resource utilization by allowing training and inference components to run independently. The core of this enhancement is a new CheckpointEngineManager that facilitates efficient weight synchronization across different GPUs or nodes using collective communication primitives like NCCL and HCCL, ensuring that the inference sampler always has the latest model weights from the trainer.

Highlights

  • Standalone Mode for Distributed RL: Introduced a new 'standalone mode' architecture for distributed Reinforcement Learning (RL) training, allowing training models and inference samplers to operate on separate GPUs or nodes. This replaces the previous 'hybrid mode' where they were colocated.
  • New Checkpoint Engine for Weight Synchronization: Implemented a CheckpointEngineManager with NCCL and HCCL backends to efficiently synchronize model weights between distributed training models and inference samplers. This manager orchestrates a multi-step process for robust and scalable weight transfer.
  • Enhanced GRPO Loss Function: The GRPOLoss function has been significantly improved to handle packed sequences and align auxiliary log-probabilities from samplers or reference models, ensuring accurate loss computation in complex training scenarios.
  • MegatronModel Integration: Added support for MegatronModel in standalone GRPO training, including a new cookbook example (megatron_lora.py) and a custom SimpleWeightSync mechanism to manage weight transfer due to specific NCCL compatibility requirements with Megatron's distributed setup.
  • Refactored VLLMSampler and IPCWeightLoader: The VLLMSampler and IPCWeightLoader have been refactored to integrate with the new CheckpointEngineManager. The IPCWeightLoader now delegates its core weight transfer logic to the VLLMEngine's updated update_weights method, which uses ZMQ and CUDA IPC/shared memory for streaming transfers.
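The orchestration described above can be sketched with a toy in-process stand-in for the collective group. Only the name CheckpointEngineManager comes from the PR; FakeCollective, the sync signature, and the method bodies are illustrative assumptions, not the actual implementation (the real engines build NCCL/HCCL groups via ray.util.collective and a multi-step handshake):

```python
class FakeCollective:
    """In-process stand-in for a NCCL/HCCL broadcast group, used only to
    illustrate the data flow between trainer and sampler."""

    def __init__(self):
        self._buf = {}

    def broadcast(self, name, tensor):
        # Trainer side: publish one named weight to the group.
        self._buf[name] = tensor

    def receive(self, name):
        # Sampler side: pull the published weight by name.
        return self._buf[name]


class ToyCheckpointEngineManager:
    """Illustrative orchestration: the trainer pushes each named weight
    through the collective and the sampler pulls it, so the sampler ends
    up with the trainer's latest weights."""

    def __init__(self, collective):
        self.collective = collective

    def sync(self, trainer_weights, sampler_weights):
        for name, w in trainer_weights.items():
            self.collective.broadcast(name, w)
        for name in trainer_weights:
            sampler_weights[name] = self.collective.receive(name)


trainer = {"lm_head.weight": [0.1, 0.2], "embed.weight": [0.3]}
sampler = {}
ToyCheckpointEngineManager(FakeCollective()).sync(trainer, sampler)
```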


Changelog
  • cookbook/grpo/lora.py
    • Updated to reflect 'Standalone Mode' training and replaced IPCWeightLoader with CheckpointEngineManager.
    • Refactored sample processing to directly build InputFeature and handle old_logps and advantages explicitly.
    • Removed HybridModelSamplerActor and related hybrid mode configurations.
  • cookbook/grpo/lora_gpu.py
    • Removed the entire file, indicating a consolidation or deprecation of its functionality.
  • cookbook/grpo/megatron_lora.py
    • Added a new cookbook for GRPO training using MegatronModel in standalone mode.
    • Introduced SimpleWeightSync to manage weight transfer for MegatronModel due to NCCL compatibility considerations.
  • src/twinkle/checkpoint_engine/__init__.py
    • Added new module for checkpoint engine, registering various backend implementations.
  • src/twinkle/checkpoint_engine/base.py
    • Added base classes for CheckpointEngine and its registry, including a 'naive' colocated implementation.
  • src/twinkle/checkpoint_engine/hccl_checkpoint_engine.py
    • Added HCCL-based checkpoint engine for Ascend NPU, enabling distributed weight sync.
  • src/twinkle/checkpoint_engine/manager.py
    • Added CheckpointEngineManager to orchestrate the multi-step weight synchronization process between models and samplers.
  • src/twinkle/checkpoint_engine/mixin.py
    • Added CheckpointEngineMixin to provide common checkpoint engine lifecycle methods to model and sampler classes.
  • src/twinkle/checkpoint_engine/nccl_checkpoint_engine.py
    • Added NCCL-based checkpoint engine for CUDA GPUs, leveraging ray.util.collective for distributed weight sync.
  • src/twinkle/loss/grpo.py
    • Enhanced GRPOLoss to handle packed sequences and align auxiliary log-probabilities from samplers/reference models.
    • Updated __call__ signature to accept advantages directly.
  • src/twinkle/model/megatron/megatron.py
    • Integrated CheckpointEngineMixin and implemented send_weights_via_checkpoint_engine and get_peft_config_dict for MegatronModel.
    • Adjusted forward_backward and set_processor for compatibility with new loss and processor handling.
  • src/twinkle/model/transformers/transformers.py
    • Integrated CheckpointEngineMixin and implemented send_weights_via_checkpoint_engine and get_peft_config_dict for TransformersModel.
    • Enabled gradient checkpointing and ensured training mode propagation.
  • src/twinkle/preprocessor/__init__.py
    • Exported CountdownProcessor.
  • src/twinkle/preprocessor/llm.py
    • Added CountdownProcessor class for the Countdown Game task.
  • src/twinkle/processor/grpo.py
    • Minor cleanup (removed TODO comment).
  • src/twinkle/sampler/base.py
    • Made _get_template a remote function.
  • src/twinkle/sampler/vllm_engine.py
    • Updated sample method to support lora_request and removed reload_weights from wake_up.
    • Introduced reset_prefix_cache and a refactored update_weights for streaming IPC/SHM transfer.
    • Added shutdown method for proper engine cleanup.
  • src/twinkle/sampler/vllm_sampler.py
    • Integrated CheckpointEngineMixin and implemented receive_weights_via_checkpoint_engine.
    • Added shutdown method and logic to use checkpoint-synced LoRA during sampling.
    • Added import_weights_dict for compatibility with SimpleWeightSync.
  • src/twinkle/sampler/vllm_worker_extension.py
    • Refactored weight loading logic, including PEFT prefix stripping and vLLM stacked parameter mapping.
    • Added set_death_signal for robust process management.
  • src/twinkle/utils/network.py
    • Added is_valid_ipv6_address and stateless_init_process_group for distributed communication setup.
  • src/twinkle/weight_loader/ipc_loader.py
    • Refactored to delegate weight transfer to VLLMEngine.update_weights, simplifying its role.
  • tests/sampler/test_megatron_weight_sync.py
    • Added new test for MegatronModel standalone weight synchronization.
  • tests/sampler/test_weight_sync.py
    • Updated to test standalone weight synchronization with TransformersModel and CheckpointEngineManager.
Activity
  • The pull request introduces a new, more robust architecture for weight synchronization in distributed RL training, moving from a 'hybrid' (colocated) to a 'standalone' (disaggregated) mode.
  • This involves adding new core components like CheckpointEngineManager and specific backend implementations (NCCLCheckpointEngine, HCCLCheckpointEngine).
  • Existing models (TransformersModel, MegatronModel) and samplers (VLLMSampler) have been updated to integrate with this new system via a CheckpointEngineMixin.
  • The GRPOLoss has been made more flexible to handle various log-probability inputs and packed sequences.
  • A new example for MegatronModel training and a corresponding test have been added, while an older lora_gpu.py script was removed.
  • The IPCWeightLoader has been simplified, delegating its core functionality to the new VLLMEngine.update_weights method.
  • The changes aim to improve scalability and flexibility for distributed RL training setups.


@gemini-code-assist bot left a comment


Code Review

This pull request introduces a major and well-executed refactoring to support standalone reinforcement learning, where the training model and inference sampler are decoupled. The new CheckpointEngine module provides a robust mechanism for weight synchronization using NCCL/HCCL, which is a significant improvement for distributed training setups. The cookbooks for both TransformersModel and MegatronModel are clear and effectively demonstrate the new architecture.

I've identified two main issues:

  1. A critical bug in the GRPOLoss implementation where the label shifting was removed, which will likely lead to incorrect loss calculation.
  2. A high-severity memory regression in the refactored IPCWeightLoader for hybrid mode, which could cause out-of-memory errors with large models.

Apart from these points, the code quality is high, with good documentation and clean APIs. The bug fix for FSDP in TransformersModel and the robust handling of packed sequences in GRPOLoss are particularly noteworthy improvements. Once the identified issues are addressed, this will be an excellent contribution.

Comment on lines 345 to 348
loss_mask = (labels != self.ignore_index).bool()
masked_labels = labels.clone()
masked_labels[~loss_mask] = 0
logps = selective_log_softmax(logits, masked_labels)

critical

The logic for calculating logps seems to have a bug. For autoregressive models, the logits at a given position t are used to predict the token at position t+1. Therefore, to calculate the log probability of the sequence, you need to align logits[:, t, :] with labels[:, t+1]. The previous implementation used torch.roll(labels, shifts=-1, dims=1) to achieve this, but it has been removed.

Without this shift, the loss is calculated using logits[:, t, :] and labels[:, t], which is incorrect and will likely lead to poor training performance. Please reintroduce the label shifting.

Suggested change — replace:

loss_mask = (labels != self.ignore_index).bool()
masked_labels = labels.clone()
masked_labels[~loss_mask] = 0
logps = selective_log_softmax(logits, masked_labels)

with:

labels = torch.roll(labels, shifts=-1, dims=1)
loss_mask = (labels != self.ignore_index).bool()
masked_labels = labels.clone()
masked_labels[~loss_mask] = 0
logps = selective_log_softmax(logits, masked_labels)
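To see why the shift matters, here is a minimal pure-Python illustration of the alignment. selective_logps and shift_labels are toy stand-ins (assumptions, not the repo's code) for selective_log_softmax and torch.roll: rolling the labels left by one makes the logits at position t score the token observed at t+1, and the final position, which has no next token, is masked out:

```python
import math

IGNORE_INDEX = -100


def selective_logps(logits, labels):
    """Log-prob of each label token under a softmax over its logits row.

    logits: list of per-position rows of vocab scores
    labels: list of token ids; IGNORE_INDEX positions contribute 0.0
    """
    out = []
    for row, tok in zip(logits, labels):
        if tok == IGNORE_INDEX:
            out.append(0.0)
            continue
        log_z = math.log(sum(math.exp(x) for x in row))
        out.append(row[tok] - log_z)
    return out


def shift_labels(labels):
    # Roll left by one so logits[t] score labels[t+1]; the last position
    # has no next token and is masked with IGNORE_INDEX.
    return labels[1:] + [IGNORE_INDEX]


# Toy sequence: 3 positions, vocab of 4.
logits = [[2.0, 0.0, 0.0, 0.0],
          [0.0, 2.0, 0.0, 0.0],
          [0.0, 0.0, 2.0, 0.0]]
labels = [3, 0, 1]  # token observed at each position

shifted = shift_labels(labels)           # align next-token targets
logps = selective_logps(logits, shifted)
```

With the shift, position 0's logits (which peak at token 0) are scored against the token that actually follows, token 0; without it they would be scored against token 3, giving an incorrect log-probability.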

Comment on lines +73 to +76
weights = {}
for name, tensor in self._get_weights_iterator(adapter_name):
tensor = Torch.to_local_tensor(tensor)
weights[name] = tensor.to(self.dtype, non_blocking=True)

high

This implementation collects all model weights into a new weights dictionary in memory before passing them to the sampler engine. This can lead to an out-of-memory error for large models, as it effectively doubles the memory required to hold the model weights.

The previous implementation of IPCWeightLoader streamed weights in buckets to avoid this issue. This change appears to be a regression in terms of memory efficiency. Please consider re-implementing a streaming mechanism to avoid materializing the entire state dictionary in memory.
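One way to restore streaming, sketched here with a hypothetical helper and plain byte counts rather than real tensors, is to group weights into size-capped buckets and materialize/send only one bucket at a time:

```python
def iter_weight_buckets(named_sizes, max_bucket_bytes):
    """Yield lists of weight names whose combined size stays under the cap.

    named_sizes: iterable of (name, nbytes) pairs
    A single oversized entry still gets a bucket of its own, so every
    weight is eventually yielded. Illustrative sketch only; the real
    loader would carry tensors and hand each bucket to the IPC transport.
    """
    bucket, used = [], 0
    for name, nbytes in named_sizes:
        if bucket and used + nbytes > max_bucket_bytes:
            yield bucket          # flush the full bucket before growing
            bucket, used = [], 0
        bucket.append(name)
        used += nbytes
    if bucket:
        yield bucket              # flush the trailing partial bucket


weights = [("a", 40), ("b", 70), ("c", 30), ("d", 100)]
buckets = list(iter_weight_buckets(weights, max_bucket_bytes=100))
```

Peak memory is then bounded by one bucket (plus the model itself) instead of a full duplicate state dict.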

@hjh0119 merged commit 2c3a20e into dev on Feb 7, 2026
0 of 4 checks passed
@tastelikefeet tastelikefeet deleted the rl-loss-3 branch February 13, 2026 09:46