
stand-alone rl #42

Merged
hjh0119 merged 4 commits into dev from rl-loss-3 on Feb 7, 2026

Conversation

@hjh0119
Collaborator

@hjh0119 commented on Feb 6, 2026

No description provided.

@gemini-code-assist
Contributor

Summary of Changes

Hello @hjh0119, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request overhauls the distributed RL training infrastructure by shifting from a tightly coupled 'hybrid mode' to a more flexible 'standalone mode'. This fundamental change enables greater scalability and resource utilization by allowing training and inference components to run independently. The core of this enhancement is a new CheckpointEngineManager that facilitates efficient weight synchronization across different GPUs or nodes using collective communication primitives like NCCL and HCCL, ensuring that the inference sampler always has the latest model weights from the trainer.

Highlights

  • Standalone Mode for Distributed RL: Introduced a new 'standalone mode' architecture for distributed Reinforcement Learning (RL) training, allowing training models and inference samplers to operate on separate GPUs or nodes. This replaces the previous 'hybrid mode' where they were colocated.
  • New Checkpoint Engine for Weight Synchronization: Implemented a CheckpointEngineManager with NCCL and HCCL backends to efficiently synchronize model weights between distributed training models and inference samplers. This manager orchestrates a multi-step process for robust and scalable weight transfer.
  • Enhanced GRPO Loss Function: The GRPOLoss function has been significantly improved to handle packed sequences and align auxiliary log-probabilities from samplers or reference models, ensuring accurate loss computation in complex training scenarios.
  • MegatronModel Integration: Added support for MegatronModel in standalone GRPO training, including a new cookbook example (megatron_lora.py) and a custom SimpleWeightSync mechanism to manage weight transfer due to specific NCCL compatibility requirements with Megatron's distributed setup.
  • Refactored VLLMSampler and IPCWeightLoader: The VLLMSampler and IPCWeightLoader have been refactored to integrate with the new CheckpointEngineManager. The IPCWeightLoader now delegates its core weight transfer logic to the VLLMEngine's updated update_weights method, which uses ZMQ and CUDA IPC/shared memory for streaming transfers.
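The orchestration described above can be sketched with a toy in-process stand-in for the collective group. Only the name CheckpointEngineManager comes from the PR; FakeCollective, the sync signature, and the method bodies are illustrative assumptions, not the actual implementation (the real engines build NCCL/HCCL groups via ray.util.collective and a multi-step handshake):

```python
class FakeCollective:
    """In-process stand-in for a NCCL/HCCL broadcast group, used only to
    illustrate the data flow between trainer and sampler."""

    def __init__(self):
        self._buf = {}

    def broadcast(self, name, tensor):
        # Trainer side: publish one named weight to the group.
        self._buf[name] = tensor

    def receive(self, name):
        # Sampler side: pull the published weight by name.
        return self._buf[name]


class ToyCheckpointEngineManager:
    """Illustrative orchestration: the trainer pushes each named weight
    through the collective and the sampler pulls it, so the sampler ends
    up with the trainer's latest weights."""

    def __init__(self, collective):
        self.collective = collective

    def sync(self, trainer_weights, sampler_weights):
        for name, w in trainer_weights.items():
            self.collective.broadcast(name, w)
        for name in trainer_weights:
            sampler_weights[name] = self.collective.receive(name)


trainer = {"lm_head.weight": [0.1, 0.2], "embed.weight": [0.3]}
sampler = {}
ToyCheckpointEngineManager(FakeCollective()).sync(trainer, sampler)
```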


Changelog
  • cookbook/grpo/lora.py
    • Updated to reflect 'Standalone Mode' training and replaced IPCWeightLoader with CheckpointEngineManager.
    • Refactored sample processing to directly build InputFeature and handle old_logps and advantages explicitly.
    • Removed HybridModelSamplerActor and related hybrid mode configurations.
  • cookbook/grpo/lora_gpu.py
    • Removed the entire file, indicating a consolidation or deprecation of its functionality.
  • cookbook/grpo/megatron_lora.py
    • Added a new cookbook for GRPO training using MegatronModel in standalone mode.
    • Introduced SimpleWeightSync to manage weight transfer for MegatronModel due to NCCL compatibility considerations.
  • src/twinkle/checkpoint_engine/__init__.py
    • Added new module for checkpoint engine, registering various backend implementations.
  • src/twinkle/checkpoint_engine/base.py
    • Added base classes for CheckpointEngine and its registry, including a 'naive' colocated implementation.
  • src/twinkle/checkpoint_engine/hccl_checkpoint_engine.py
    • Added HCCL-based checkpoint engine for Ascend NPU, enabling distributed weight sync.
  • src/twinkle/checkpoint_engine/manager.py
    • Added CheckpointEngineManager to orchestrate the multi-step weight synchronization process between models and samplers.
  • src/twinkle/checkpoint_engine/mixin.py
    • Added CheckpointEngineMixin to provide common checkpoint engine lifecycle methods to model and sampler classes.
  • src/twinkle/checkpoint_engine/nccl_checkpoint_engine.py
    • Added NCCL-based checkpoint engine for CUDA GPUs, leveraging ray.util.collective for distributed weight sync.
  • src/twinkle/loss/grpo.py
    • Enhanced GRPOLoss to handle packed sequences and align auxiliary log-probabilities from samplers/reference models.
    • Updated __call__ signature to accept advantages directly.
  • src/twinkle/model/megatron/megatron.py
    • Integrated CheckpointEngineMixin and implemented send_weights_via_checkpoint_engine and get_peft_config_dict for MegatronModel.
    • Adjusted forward_backward and set_processor for compatibility with new loss and processor handling.
  • src/twinkle/model/transformers/transformers.py
    • Integrated CheckpointEngineMixin and implemented send_weights_via_checkpoint_engine and get_peft_config_dict for TransformersModel.
    • Enabled gradient checkpointing and ensured training mode propagation.
  • src/twinkle/preprocessor/__init__.py
    • Exported CountdownProcessor.
  • src/twinkle/preprocessor/llm.py
    • Added CountdownProcessor class for the Countdown Game task.
  • src/twinkle/processor/grpo.py
    • Minor cleanup (removed TODO comment).
  • src/twinkle/sampler/base.py
    • Made _get_template a remote function.
  • src/twinkle/sampler/vllm_engine.py
    • Updated sample method to support lora_request and removed reload_weights from wake_up.
    • Introduced reset_prefix_cache and a refactored update_weights for streaming IPC/SHM transfer.
    • Added shutdown method for proper engine cleanup.
  • src/twinkle/sampler/vllm_sampler.py
    • Integrated CheckpointEngineMixin and implemented receive_weights_via_checkpoint_engine.
    • Added shutdown method and logic to use checkpoint-synced LoRA during sampling.
    • Added import_weights_dict for compatibility with SimpleWeightSync.
  • src/twinkle/sampler/vllm_worker_extension.py
    • Refactored weight loading logic, including PEFT prefix stripping and vLLM stacked parameter mapping.
    • Added set_death_signal for robust process management.
  • src/twinkle/utils/network.py
    • Added is_valid_ipv6_address and stateless_init_process_group for distributed communication setup.
  • src/twinkle/weight_loader/ipc_loader.py
    • Refactored to delegate weight transfer to VLLMEngine.update_weights, simplifying its role.
  • tests/sampler/test_megatron_weight_sync.py
    • Added new test for MegatronModel standalone weight synchronization.
  • tests/sampler/test_weight_sync.py
    • Updated to test standalone weight synchronization with TransformersModel and CheckpointEngineManager.
Activity
  • The pull request introduces a new, more robust architecture for weight synchronization in distributed RL training, moving from a 'hybrid' (colocated) to a 'standalone' (disaggregated) mode.
  • This involves adding new core components like CheckpointEngineManager and specific backend implementations (NCCLCheckpointEngine, HCCLCheckpointEngine).
  • Existing models (TransformersModel, MegatronModel) and samplers (VLLMSampler) have been updated to integrate with this new system via a CheckpointEngineMixin.
  • The GRPOLoss has been made more flexible to handle various log-probability inputs and packed sequences.
  • A new example for MegatronModel training and a corresponding test have been added, while an older lora_gpu.py script was removed.
  • The IPCWeightLoader has been simplified, delegating its core functionality to the new VLLMEngine.update_weights method.
  • The changes aim to improve scalability and flexibility for distributed RL training setups.


@gemini-code-assist bot left a comment


Code Review

This pull request introduces a major and well-executed refactoring to support standalone reinforcement learning, where the training model and inference sampler are decoupled. The new CheckpointEngine module provides a robust mechanism for weight synchronization using NCCL/HCCL, which is a significant improvement for distributed training setups. The cookbooks for both TransformersModel and MegatronModel are clear and effectively demonstrate the new architecture.

I've identified two main issues:

  1. A critical bug in the GRPOLoss implementation where the label shifting was removed, which will likely lead to incorrect loss calculation.
  2. A high-severity memory regression in the refactored IPCWeightLoader for hybrid mode, which could cause out-of-memory errors with large models.

Apart from these points, the code quality is high, with good documentation and clean APIs. The bug fix for FSDP in TransformersModel and the robust handling of packed sequences in GRPOLoss are particularly noteworthy improvements. Once the identified issues are addressed, this will be an excellent contribution.

Comment on lines 345 to 348
loss_mask = (labels != self.ignore_index).bool()
masked_labels = labels.clone()
masked_labels[~loss_mask] = 0
logps = selective_log_softmax(logits, masked_labels)

critical

The logic for calculating logps seems to have a bug. For autoregressive models, the logits at a given position t are used to predict the token at position t+1. Therefore, to calculate the log probability of the sequence, you need to align logits[:, t, :] with labels[:, t+1]. The previous implementation used torch.roll(labels, shifts=-1, dims=1) to achieve this, but it has been removed.

Without this shift, the loss is calculated using logits[:, t, :] and labels[:, t], which is incorrect and will likely lead to poor training performance. Please reintroduce the label shifting.

Suggested change — replace:

loss_mask = (labels != self.ignore_index).bool()
masked_labels = labels.clone()
masked_labels[~loss_mask] = 0
logps = selective_log_softmax(logits, masked_labels)

with:

labels = torch.roll(labels, shifts=-1, dims=1)
loss_mask = (labels != self.ignore_index).bool()
masked_labels = labels.clone()
masked_labels[~loss_mask] = 0
logps = selective_log_softmax(logits, masked_labels)
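To see why the shift matters, here is a minimal pure-Python illustration of the alignment. selective_logps and shift_labels are toy stand-ins (assumptions, not the repo's code) for selective_log_softmax and torch.roll: rolling the labels left by one makes the logits at position t score the token observed at t+1, and the final position, which has no next token, is masked out:

```python
import math

IGNORE_INDEX = -100


def selective_logps(logits, labels):
    """Log-prob of each label token under a softmax over its logits row.

    logits: list of per-position rows of vocab scores
    labels: list of token ids; IGNORE_INDEX positions contribute 0.0
    """
    out = []
    for row, tok in zip(logits, labels):
        if tok == IGNORE_INDEX:
            out.append(0.0)
            continue
        log_z = math.log(sum(math.exp(x) for x in row))
        out.append(row[tok] - log_z)
    return out


def shift_labels(labels):
    # Roll left by one so logits[t] score labels[t+1]; the last position
    # has no next token and is masked with IGNORE_INDEX.
    return labels[1:] + [IGNORE_INDEX]


# Toy sequence: 3 positions, vocab of 4.
logits = [[2.0, 0.0, 0.0, 0.0],
          [0.0, 2.0, 0.0, 0.0],
          [0.0, 0.0, 2.0, 0.0]]
labels = [3, 0, 1]  # token observed at each position

shifted = shift_labels(labels)           # align next-token targets
logps = selective_logps(logits, shifted)
```

With the shift, position 0's logits (which peak at token 0) are scored against the token that actually follows, token 0; without it they would be scored against token 3, giving an incorrect log-probability.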

Comment on lines +73 to +76
weights = {}
for name, tensor in self._get_weights_iterator(adapter_name):
tensor = Torch.to_local_tensor(tensor)
weights[name] = tensor.to(self.dtype, non_blocking=True)

high

This implementation collects all model weights into a new weights dictionary in memory before passing them to the sampler engine. This can lead to an out-of-memory error for large models, as it effectively doubles the memory required to hold the model weights.

The previous implementation of IPCWeightLoader streamed weights in buckets to avoid this issue. This change appears to be a regression in terms of memory efficiency. Please consider re-implementing a streaming mechanism to avoid materializing the entire state dictionary in memory.
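One way to restore streaming, sketched here with a hypothetical helper and plain byte counts rather than real tensors, is to group weights into size-capped buckets and materialize/send only one bucket at a time:

```python
def iter_weight_buckets(named_sizes, max_bucket_bytes):
    """Yield lists of weight names whose combined size stays under the cap.

    named_sizes: iterable of (name, nbytes) pairs
    A single oversized entry still gets a bucket of its own, so every
    weight is eventually yielded. Illustrative sketch only; the real
    loader would carry tensors and hand each bucket to the IPC transport.
    """
    bucket, used = [], 0
    for name, nbytes in named_sizes:
        if bucket and used + nbytes > max_bucket_bytes:
            yield bucket          # flush the full bucket before growing
            bucket, used = [], 0
        bucket.append(name)
        used += nbytes
    if bucket:
        yield bucket              # flush the trailing partial bucket


weights = [("a", 40), ("b", 70), ("c", 30), ("d", 100)]
buckets = list(iter_weight_buckets(weights, max_bucket_bytes=100))
```

Peak memory is then bounded by one bucket (plus the model itself) instead of a full duplicate state dict.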

@hjh0119 merged commit 2c3a20e into dev on Feb 7, 2026
0 of 4 checks passed
@tastelikefeet tastelikefeet deleted the rl-loss-3 branch February 13, 2026 09:46