Conversation

@GuanxingLu
Contributor

@GuanxingLu GuanxingLu commented Dec 31, 2025

Description

  1. LoRA support for the FSDP backend
  2. Disk-based LoRA weight synchronization
  3. Update LoRA weights via tensor
    (Update LoRA weights in the SGLang rollout engine via tensors, which is faster than the previous disk-sync approach.)
    Waiting for this SGLang PR to be merged: Update LoRA Weights via Tensor sgl-project/sglang#16226

Code Style Compliance

  • Performance: Minimized synchronization calls (.item(), .cpu(), .tolist()) in inference paths
  • Architecture: No duplicate code > 5 lines; files < 2,000 lines
  • Function Purity: Avoided in-place modification of input arguments (unless explicitly documented for memory optimization)
  • Pythonic: Lean constructors, minimal dynamic attributes, proper type hints on public APIs
  • Testing: Provided a test script that reviewers can copy & paste to run immediately

Changes Made

  1. Add a new argument, --lora-sync-from-tensor, to turn on this feature.
  2. Miles calls load_lora_adapter_from_tensors to push the serialized LoRA weights to the rollout engine (a rough sketch of this flow follows).
  3. Implement load_lora_adapter_from_tensors on the SGLang side.
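
A rough sketch of this tensor-based sync path: the method name load_lora_adapter_from_tensors and its parameters come from this PR, but the helper name sync_lora_via_tensor, the base64 serialization, and the peft_config["default"] lookup are illustrative assumptions rather than the actual implementation.

import base64
import io

import torch


def sync_lora_via_tensor(peft_model, engine, lora_name: str = "lora_adapter"):
    # Collect only the LoRA parameters from the PEFT-wrapped training model.
    lora_weights = {
        name: param.detach().cpu()
        for name, param in peft_model.state_dict().items()
        if "lora_" in name
    }
    config_dict = peft_model.peft_config["default"].to_dict()

    # Serialize the tensors into a string so they can travel in the request body
    # (the encoding actually used by the PR may differ).
    buffer = io.BytesIO()
    torch.save(lora_weights, buffer)
    serialized_tensors = base64.b64encode(buffer.getvalue()).decode("utf-8")

    # Push the adapter straight into the SGLang rollout engine, skipping disk I/O.
    engine.load_lora_adapter_from_tensors(
        lora_name=lora_name,
        serialized_tensors=serialized_tensors,
        config_dict=config_dict,
    )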

Testing

To run the code, please first clone SGLang and check out the feature/lora-from-tensor branch: https://github.com/GuanxingLu/sglang/tree/feature/lora-from-tensor.

WANDB_API_KEY=<your_key> ENABLE_LORA=1 python tests/ci/gpu_lock_exec.py --count 2 -- python tests/test_qwen3_0.6B_fsdp_colocated_2xGPU.py


Related Issues

#326, #352, sgl-project/sglang#15703

GuanxingLu and others added 2 commits December 25, 2025 10:47
Co-authored-by: PopSoda2002 <zhouhp.me@gmail.com>
Co-authored-by: PopSoda2002 <zhouhp.me@gmail.com>
@gemini-code-assist
Contributor

Summary of Changes

Hello @GuanxingLu, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the system's capability to handle LoRA (Low-Rank Adaptation) by introducing a more efficient tensor-based synchronization method for LoRA weights, moving away from slower disk I/O. It integrates LoRA deeply into the model training and serving pipeline, providing extensive configuration options and improved checkpointing. The changes aim to boost performance and flexibility for users leveraging LoRA in their models.

Highlights

  • LoRA Weight Synchronization via Tensors: Implemented a new, faster method for synchronizing LoRA weights to the SGLang rollout engine by directly passing tensors, replacing the previous disk-based synchronization approach. This is enabled by a new --lora-sync-from-tensor argument.
  • LoRA Integration and Configuration: Introduced comprehensive LoRA support, including new command-line arguments for configuring LoRA rank, alpha, target modules, excluded modules, and pre-trained adapter paths. The system now dynamically applies LoRA to models and handles LoRA-specific gradient checkpointing and optimizer parameter filtering.
  • Enhanced Checkpointing for LoRA: Modified the checkpointing mechanism to intelligently save and load LoRA adapters separately from the base model weights, utilizing a dedicated 'adapter' directory. This allows for more efficient management of LoRA-specific checkpoints (a sketch of this adapter-only save/load path follows this list).
  • SGLang Engine Updates: The SGLang engine has been updated to support loading and unloading LoRA adapters directly from serialized tensors, as well as from disk paths, providing flexibility in how LoRA weights are managed during runtime.
  • Granular Offloading Control: Added a new --offload-rollout-level argument, allowing users to specify whether to offload KV cache, model weights, or both during rollout, providing more fine-grained memory management.
  • Dependency Update: The peft library has been added to requirements.txt, indicating its new role as a core dependency for LoRA functionality.
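
For the checkpointing highlight above, the adapter-only save/load path can be sketched with standard PEFT calls; apart from the "adapter" subdirectory mentioned in the summary, the helper names and layout below are assumptions, not the PR's actual code.

import os

from peft import PeftModel


def save_lora_adapter_checkpoint(peft_model: PeftModel, ckpt_dir: str) -> str:
    # Write only the adapter weights and adapter_config.json into <ckpt_dir>/adapter,
    # leaving the much larger base-model checkpoint untouched.
    adapter_dir = os.path.join(ckpt_dir, "adapter")
    peft_model.save_pretrained(adapter_dir)
    return adapter_dir


def load_lora_adapter_checkpoint(base_model, ckpt_dir: str) -> PeftModel:
    # Re-attach a previously saved adapter to an already-loaded base model.
    adapter_dir = os.path.join(ckpt_dir, "adapter")
    return PeftModel.from_pretrained(base_model, adapter_dir)
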
Ignored Files
  • Ignored by pattern: .github/workflows/** (2)
    • .github/workflows/pr-test.yml
    • .github/workflows/pr-test.yml.j2
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in pull request comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@GuanxingLu GuanxingLu closed this Dec 31, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant feature: LoRA support with an efficient tensor-based weight synchronization mechanism for the SGLang rollout engine. The changes are comprehensive, touching argument parsing, model initialization, checkpointing, weight update logic, and testing. The implementation of two LoRA weight sync methods (file-based and tensor-based) is a great addition. My review focuses on improving the robustness and style of the new code. I've identified a potential issue with hardcoded module names for LoRA targeting and a minor redundancy in the weight update logic. Overall, this is a solid contribution that enhances the framework's capabilities.

Comment on lines +1481 to +1593
if args.target_modules == "all-linear":
modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
Contributor


high

Hardcoding the list of target_modules for all-linear is not robust and might fail for models with different layer names. This approach is not portable across different model architectures.

A better approach would be to dynamically identify all linear layers from the model instance. This would involve:

  1. Keeping args.target_modules as the string "all-linear" in miles_validate_args.
  2. Moving the logic to resolve target_modules (including all-linear, comma-separated lists, and exclude_modules) into miles/backends/fsdp_utils/lora_utils.py::apply_lora_to_model where the model object is available for inspection.

Here's an example of how you could dynamically find linear layers inside apply_lora_to_model:

import torch.nn as nn

if isinstance(args.target_modules, str) and args.target_modules == "all-linear":
    linear_modules = set()
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            # You might want to add more filtering logic here
            module_name = name.split('.')[-1]
            linear_modules.add(module_name)
    
    # Exclude layers that are typically not targeted
    if 'lm_head' in linear_modules:
        linear_modules.remove('lm_head')

    target_modules = list(linear_modules)
else:
    # Handle comma-separated string
    if isinstance(args.target_modules, str):
        target_modules = [m.strip() for m in args.target_modules.split(',')]
    else:
        target_modules = args.target_modules

if args.exclude_modules:
    # Handle comma-separated string for exclude_modules
    if isinstance(args.exclude_modules, str):
        exclude_set = {m.strip() for m in args.exclude_modules.split(',')}
    else:
        exclude_set = set(args.exclude_modules)
    target_modules = [m for m in target_modules if m not in exclude_set]

# Update args for later use if needed. This is optional.
# args.target_modules = target_modules 

@GuanxingLu GuanxingLu reopened this Dec 31, 2025
@GuanxingLu GuanxingLu marked this pull request as draft December 31, 2025 12:21
@GuanxingLu GuanxingLu force-pushed the feature/fsdp_lora_from_tensor branch from a803681 to 65fcc08 on December 31, 2025 12:38
@GuanxingLu GuanxingLu changed the title from "Update LoRA Weights to Rollout Engine via Tensor" to "[WIP] Update LoRA Weights to Rollout Engine via Tensor" on Dec 31, 2025
@yushengsu-thu yushengsu-thu marked this pull request as ready for review January 1, 2026 23:24
Copilot AI review requested due to automatic review settings January 1, 2026 23:24
@yushengsu-thu
Collaborator

yushengsu-thu commented Jan 1, 2026

run-ci-short


Copilot AI left a comment


Pull request overview

This pull request implements LoRA (Low-Rank Adaptation) weight synchronization via tensors for the SGLang rollout engine, which provides a faster alternative to the previous disk-based synchronization approach. The PR introduces a new --lora-sync-from-tensor flag to enable this feature.

Key Changes:

  • Added comprehensive LoRA support with new command-line arguments (--lora-rank, --lora-alpha, --target-modules, etc.)
  • Implemented tensor-based and file-based LoRA weight synchronization methods for the rollout engine
  • Enhanced offload/onload logic to support granular control over what gets offloaded (kv_cache, weights) via --offload-rollout-level (a sketch of the tag mapping follows this list)
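
As a rough illustration of the offload control, the --offload-rollout-level values might map onto SGLang memory-release tags along these lines; only GPU_MEMORY_TYPE_CUDA_GRAPH and the kv_cache/weight level names appear in the PR, so the other constant values and the helper name are assumptions.

# Assumed constants mirroring SGLang's memory-tag naming; only
# GPU_MEMORY_TYPE_CUDA_GRAPH is referenced explicitly in this PR.
GPU_MEMORY_TYPE_KV_CACHE = "kv_cache"
GPU_MEMORY_TYPE_WEIGHTS = "weights"
GPU_MEMORY_TYPE_CUDA_GRAPH = "cuda_graph"


def build_offload_tags(offload_rollout_level: list[str]) -> list[str]:
    # CUDA graphs are always released when the rollout engine is offloaded.
    tags = [GPU_MEMORY_TYPE_CUDA_GRAPH]
    if "kv_cache" in offload_rollout_level:
        tags.append(GPU_MEMORY_TYPE_KV_CACHE)
    if "weight" in offload_rollout_level:
        tags.append(GPU_MEMORY_TYPE_WEIGHTS)
    return tags


# e.g. build_offload_tags(["kv_cache", "weight"]) releases all three memory types.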

Reviewed changes

Copilot reviewed 17 out of 18 changed files in this pull request and generated 11 comments.

Summary per file:

  • miles/backends/fsdp_utils/lora_utils.py: New utility module providing LoRA model application, disk persistence, and tensor extraction functionality
  • miles/backends/fsdp_utils/update_weight_utils.py: Enhanced weight update logic to support LoRA weights via both tensor and file-based synchronization
  • miles/backends/fsdp_utils/checkpoint.py: Modified checkpoint save/load to handle LoRA-only checkpoints separately from full model checkpoints
  • miles/backends/fsdp_utils/actor.py: Integrated LoRA model wrapping and gradient checkpointing configuration for LoRA models
  • miles/backends/sglang_utils/sglang_engine.py: Added LoRA adapter management methods and SGLang server configuration for LoRA support
  • miles/rollout/sglang_rollout.py: Modified generation payload to include LoRA adapter when enabled
  • miles/utils/arguments.py: Added LoRA-related CLI arguments and validation logic, plus --offload-rollout-level for granular offload control
  • train.py: Updated offload/onload logic to support selective offloading of weights, kv_cache, and CUDA graphs
  • miles/ray/rollout.py: Enhanced rollout manager's offload method to accept tags parameter for selective memory release
  • miles/ray/placement_group.py: Added TODO comment about weight offloading optimization
  • tests/test_qwen3_0.6B_fsdp_distributed.py: Added LoRA test configuration with environment variable control
  • tests/test_qwen3_0.6B_fsdp_colocated_2xGPU.py: Added LoRA test configuration with environment variable control
  • tests/test_qwen2.5_0.5B_gsm8k_async.py: Updated Hugging Face CLI command from huggingface-cli download to hf download
  • tests/test_qwen2.5_0.5B_gsm8k.py: Updated Hugging Face CLI command from huggingface-cli download to hf download
  • tests/test_external_rollout.py: Updated Hugging Face CLI command from huggingface-cli download to hf download
  • requirements.txt: Added peft library dependency for LoRA support
  • .github/workflows/pr-test.yml.j2: Added LoRA test configurations to CI workflow template
  • .github/workflows/pr-test.yml: Added LoRA test configurations to CI workflow


Comment on lines 414 to 521
if num_new_engines == 0:
return num_new_engines, None
return num_new_engines

Copilot AI Jan 1, 2026


Return value inconsistency: This function now returns only num_new_engines (a single integer), but other code locations expect it to return a tuple. At line 70 and 107 of the same file, the return value is assigned to self.num_new_engines (expecting a single value), but at line 335, when debug_train_only is True, it still returns 0, None (a tuple). This inconsistency will cause the function to return different types depending on the condition, which is likely a bug.
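
A minimal, self-contained illustration of the mismatch (the function and variable names below are placeholders, not the PR's actual code):

# Before: one branch returns a tuple, the other branches return an int.
def create_engines(debug_train_only: bool, num_new_engines: int):
    if debug_train_only:
        return 0, None  # tuple
    if num_new_engines == 0:
        return num_new_engines  # int
    return num_new_engines  # int


# After: every branch returns the same type, so callers that assign the result
# directly (e.g. self.num_new_engines = ...) always receive an int.
def create_engines_fixed(debug_train_only: bool, num_new_engines: int) -> int:
    if debug_train_only:
        return 0
    return num_new_engines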


if args.offload_rollout:
ray.get(rollout_manager.offload.remote())
offload_tags = [GPU_MEMORY_TYPE_CUDA_GRAPH]

Copilot AI Jan 1, 2026


Potential runtime error: GPU_MEMORY_TYPE_CUDA_GRAPH is added to offload_tags unconditionally, but it may be None if the import fails (as seen in lines 5-7 of this file). When GPU_MEMORY_TYPE_CUDA_GRAPH is None, it will be appended to the list and passed to the rollout manager's offload method, which could cause unexpected behavior. Consider checking if it's not None before appending, similar to the check at line 99.

Suggested change:
-offload_tags = [GPU_MEMORY_TYPE_CUDA_GRAPH]
+offload_tags = []
+if GPU_MEMORY_TYPE_CUDA_GRAPH is not None:
+    offload_tags.append(GPU_MEMORY_TYPE_CUDA_GRAPH)


logger = logging.getLogger(__name__)

LORA_READY_MARKER = ".lora_ready"

Copilot AI Jan 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused constant: LORA_READY_MARKER is defined but never used in this file or imported elsewhere in the changed files. Consider removing it if it's not needed.

Suggested change:
-LORA_READY_MARKER = ".lora_ready"

Comment on lines 134 to 136
lora_weights, config_dict = get_lora_weights_and_config(self.model)
dist.barrier()

if dist.get_rank() == 0:

Copilot AI Jan 1, 2026


Inefficient synchronization pattern: The method calls get_lora_weights_and_config for all ranks, which gathers the full state dict on all ranks, but only rank 0 uses the result. This causes unnecessary memory and compute on non-zero ranks. Consider moving line 134 inside the if dist.get_rank() == 0: block to avoid this overhead.

Suggested change:
-lora_weights, config_dict = get_lora_weights_and_config(self.model)
-dist.barrier()
-if dist.get_rank() == 0:
+dist.barrier()
+if dist.get_rank() == 0:
+    lora_weights, config_dict = get_lora_weights_and_config(self.model)


dist.barrier()

save_lora_to_disk(self.model, self._lora_save_dir)

Copilot AI Jan 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inefficient synchronization pattern: The method calls save_lora_to_disk for all ranks (line 107), which internally gathers the full state dict on all ranks even though only rank 0 saves it to disk. This causes unnecessary memory and compute on non-zero ranks, similar to the tensor-based approach.

Suggested change:
-save_lora_to_disk(self.model, self._lora_save_dir)
+if dist.get_rank() == 0:
+    save_lora_to_disk(self.model, self._lora_save_dir)

ray.get(rollout_manager.check_weights.remote(action="reset_tensors"))

if args.offload_rollout:
# TODO: Optimization in the future: offload model weights to cpu to make more space for training?

Copilot AI Jan 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Outdated TODO comment: The TODO comment suggests "offload model weights to cpu to make more space for training" as a future optimization. However, this PR actually implements weight offloading functionality via the --offload-rollout-level argument which includes a 'weight' option. Consider updating this comment to reflect the current implementation or removing it if the optimization is already in place.

Suggested change:
-# TODO: Optimization in the future: offload model weights to cpu to make more space for training?
+# Offload model weights/state to CPU to free up GPU memory for training.

logger.info(f"Deleted LoRA adapter from {save_path}")


def get_lora_weights_and_config(module: nn.Module) -> tuple[dict[str, any], dict[str, any]]:

Copilot AI Jan 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Type hint uses lowercase 'any' instead of 'Any' from typing module. This should be 'Any' (capitalized) which requires importing from the typing module.
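
For reference, the corrected signature would look like the following; only the annotation and the import change:

from typing import Any

import torch.nn as nn


def get_lora_weights_and_config(module: nn.Module) -> tuple[dict[str, Any], dict[str, Any]]:
    ...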

kwargs["dtype"] = "float16"
if args.lora_rank > 0 or args.lora_adapter_path is not None:
kwargs["enable_lora"] = True
kwargs["max_lora_rank"] = args.lora_rank

Copilot AI Jan 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential AttributeError: When LoRA is enabled via --lora-adapter-path but --lora-rank is 0, this will set max_lora_rank to 0, which may not be valid for the SGLang engine. The condition should also check if lora_adapter_path is provided and set an appropriate rank value, or the validation logic should ensure lora_rank is always set when using a LoRA adapter.

Suggested change:
-kwargs["max_lora_rank"] = args.lora_rank
+# Ensure a valid positive LoRA rank is passed to the SGLang engine.
+# If LoRA is enabled via adapter path but lora_rank is not set to a
+# positive value, default to rank 1 to avoid an invalid configuration.
+if getattr(args, "lora_rank", None) and args.lora_rank > 0:
+    max_lora_rank = args.lora_rank
+else:
+    max_lora_rank = 1
+kwargs["max_lora_rank"] = max_lora_rank

bias="none",
)

model = get_peft_model(model, lora_config) # autocast_adapter_dtype=False)

Copilot AI Jan 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commented-out code: There is commented-out code # autocast_adapter_dtype=False that should either be removed or uncommented with a clear explanation of why it's needed. Leaving commented code without explanation reduces code clarity.

Suggested change:
-model = get_peft_model(model, lora_config) # autocast_adapter_dtype=False)
+model = get_peft_model(model, lora_config)

Comment on lines +345 to +416
def load_lora_adapter(self, lora_name: str, lora_path: str):
    return self._make_request(
        "load_lora_adapter",
        {"lora_name": lora_name, "lora_path": lora_path},
    )

def load_lora_adapter_from_tensors(self, lora_name: str, serialized_tensors: str, config_dict: dict):
    return self._make_request(
        "load_lora_adapter_from_tensors",
        {"lora_name": lora_name, "serialized_tensors": serialized_tensors, "config_dict": config_dict},
    )

def unload_lora_adapter(self, lora_name: str):
    return self._make_request(
        "unload_lora_adapter",
        {"lora_name": lora_name},
    )

Copilot AI Jan 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing documentation: The newly added methods load_lora_adapter, load_lora_adapter_from_tensors, and unload_lora_adapter lack docstrings. Public API methods should have docstrings explaining their parameters, return values, and purpose, especially for important functionality like LoRA adapter management.
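
A sketch of the kind of docstring the comment asks for, using load_lora_adapter_from_tensors as the example; the parameter descriptions are inferred from this PR rather than taken from SGLang documentation:

def load_lora_adapter_from_tensors(self, lora_name: str, serialized_tensors: str, config_dict: dict):
    """Load a LoRA adapter into the rollout engine directly from serialized tensors.

    Args:
        lora_name: Name under which the adapter is registered and later referenced
            in generation requests.
        serialized_tensors: LoRA weights serialized for transport to the SGLang
            server, avoiding the disk round-trip of the file-based path.
        config_dict: The adapter's configuration (rank, alpha, target modules, ...).

    Returns:
        The SGLang server's response to the load request.
    """
    return self._make_request(
        "load_lora_adapter_from_tensors",
        {"lora_name": lora_name, "serialized_tensors": serialized_tensors, "config_dict": config_dict},
    )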

@yushengsu-thu yushengsu-thu self-assigned this Jan 2, 2026
@yushengsu-thu yushengsu-thu changed the title from "[WIP] Update LoRA Weights to Rollout Engine via Tensor" to "[WIP] Support Lora training - fsdp backend" on Jan 5, 2026
"--actor-num-nodes 1 "
"--actor-num-gpus-per-node 2 "
"--colocate "
"--offload-rollout-level kv_cache weight "
Contributor


Do we need to distinguish between LoRA and full-parameter training for --offload-rollout-level?

@GuanxingLu GuanxingLu force-pushed the feature/fsdp_lora_from_tensor branch from 65fcc08 to 3ef536d on January 10, 2026 15:45
Co-authored-by: PopSoda2002 <zhouhp.me@gmail.com>
@GuanxingLu GuanxingLu force-pushed the feature/fsdp_lora_from_tensor branch 2 times, most recently from 3dcb4ee to ed7c6e5 on January 10, 2026 16:14
@GuanxingLu GuanxingLu force-pushed the feature/fsdp_lora_from_tensor branch 2 times, most recently from ac33c57 to 3ea51ec on January 10, 2026 17:39
Co-authored-by: PopSoda2002 <zhouhp.me@gmail.com>
@GuanxingLu GuanxingLu force-pushed the feature/fsdp_lora_from_tensor branch from 3ea51ec to 3d6484c on January 11, 2026 12:42
GuanxingLu and others added 2 commits January 12, 2026 06:44
Co-authored-by: PopSoda2002 <zhouhp.me@gmail.com>
Co-authored-by: PopSoda2002 <zhouhp.me@gmail.com>
@GuanxingLu GuanxingLu force-pushed the feature/fsdp_lora_from_tensor branch from 2fb20a7 to 3b975df on January 12, 2026 18:22
Co-authored-by: PopSoda2002 <zhouhp.me@gmail.com>
@GuanxingLu GuanxingLu force-pushed the feature/fsdp_lora_from_tensor branch from 3b975df to b56d80d on January 13, 2026 01:29
@yushengsu-thu yushengsu-thu changed the title from "[WIP] Support Lora training - fsdp backend" to "Support Lora training - fsdp backend" on Jan 20, 2026
@yushengsu-thu yushengsu-thu force-pushed the feature/fsdp_lora_from_tensor branch from c8e00a7 to 1e3117c on January 20, 2026 00:41
Co-authored-by: PopSoda2002 <zhouhp.me@gmail.com>