Support Lora training - fsdp backend #377
Conversation
Summary of Changes

Hello @GuanxingLu, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly enhances the system's capability to handle LoRA (Low-Rank Adaptation) by introducing a more efficient tensor-based synchronization method for LoRA weights, moving away from slower disk I/O. It integrates LoRA deeply into the model training and serving pipeline, providing extensive configuration options and improved checkpointing. The changes aim to boost performance and flexibility for users leveraging LoRA in their models.
Code Review
This pull request introduces a significant feature: LoRA support with an efficient tensor-based weight synchronization mechanism for the SGLang rollout engine. The changes are comprehensive, touching argument parsing, model initialization, checkpointing, weight update logic, and testing. The implementation of two LoRA weight sync methods (file-based and tensor-based) is a great addition. My review focuses on improving the robustness and style of the new code. I've identified a potential issue with hardcoded module names for LoRA targeting and a minor redundancy in the weight update logic. Overall, this is a solid contribution that enhances the framework's capabilities.
```python
if args.target_modules == "all-linear":
    modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
```
Hardcoding the list of target_modules for all-linear is not robust and might fail for models with different layer names. This approach is not portable across different model architectures.
A better approach would be to dynamically identify all linear layers from the model instance. This would involve:
- Keeping `args.target_modules` as the string `"all-linear"` in `miles_validate_args`.
- Moving the logic to resolve `target_modules` (including `all-linear`, comma-separated lists, and `exclude_modules`) into `miles/backends/fsdp_utils/lora_utils.py::apply_lora_to_model`, where the `model` object is available for inspection.

Here's an example of how you could dynamically find linear layers inside `apply_lora_to_model`:
```python
import torch.nn as nn

if isinstance(args.target_modules, str) and args.target_modules == "all-linear":
    linear_modules = set()
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            # You might want to add more filtering logic here
            module_name = name.split('.')[-1]
            linear_modules.add(module_name)
    # Exclude layers that are typically not targeted
    if 'lm_head' in linear_modules:
        linear_modules.remove('lm_head')
    target_modules = list(linear_modules)
else:
    # Handle comma-separated string
    if isinstance(args.target_modules, str):
        target_modules = [m.strip() for m in args.target_modules.split(',')]
    else:
        target_modules = args.target_modules

if args.exclude_modules:
    # Handle comma-separated string for exclude_modules
    if isinstance(args.exclude_modules, str):
        exclude_set = {m.strip() for m in args.exclude_modules.split(',')}
    else:
        exclude_set = set(args.exclude_modules)
    target_modules = [m for m in target_modules if m not in exclude_set]

# Update args for later use if needed. This is optional.
# args.target_modules = target_modules
```
run-ci-short
Pull request overview
This pull request implements LoRA (Low-Rank Adaptation) weight synchronization via tensors for the SGLang rollout engine, which provides a faster alternative to the previous disk-based synchronization approach. The PR introduces a new --lora-sync-from-tensor flag to enable this feature.
Key Changes:
- Added comprehensive LoRA support with new command-line arguments (`--lora-rank`, `--lora-alpha`, `--target-modules`, etc.)
- Implemented tensor-based and file-based LoRA weight synchronization methods for the rollout engine
- Enhanced offload/onload logic to support granular control over what gets offloaded (kv_cache, weights) via `--offload-rollout-level`
Reviewed changes
Copilot reviewed 17 out of 18 changed files in this pull request and generated 11 comments.
Show a summary per file
| File | Description |
|---|---|
| `miles/backends/fsdp_utils/lora_utils.py` | New utility module providing LoRA model application, disk persistence, and tensor extraction functionality |
| `miles/backends/fsdp_utils/update_weight_utils.py` | Enhanced weight update logic to support LoRA weights via both tensor and file-based synchronization |
| `miles/backends/fsdp_utils/checkpoint.py` | Modified checkpoint save/load to handle LoRA-only checkpoints separately from full model checkpoints |
| `miles/backends/fsdp_utils/actor.py` | Integrated LoRA model wrapping and gradient checkpointing configuration for LoRA models |
| `miles/backends/sglang_utils/sglang_engine.py` | Added LoRA adapter management methods and SGLang server configuration for LoRA support |
| `miles/rollout/sglang_rollout.py` | Modified generation payload to include LoRA adapter when enabled |
| `miles/utils/arguments.py` | Added LoRA-related CLI arguments and validation logic, plus `--offload-rollout-level` for granular offload control |
| `train.py` | Updated offload/onload logic to support selective offloading of weights, kv_cache, and CUDA graphs |
| `miles/ray/rollout.py` | Enhanced rollout manager's offload method to accept a tags parameter for selective memory release |
| `miles/ray/placement_group.py` | Added TODO comment about weight offloading optimization |
| `tests/test_qwen3_0.6B_fsdp_distributed.py` | Added LoRA test configuration with environment variable control |
| `tests/test_qwen3_0.6B_fsdp_colocated_2xGPU.py` | Added LoRA test configuration with environment variable control |
| `tests/test_qwen2.5_0.5B_gsm8k_async.py` | Updated Hugging Face CLI command from `huggingface-cli download` to `hf download` |
| `tests/test_qwen2.5_0.5B_gsm8k.py` | Updated Hugging Face CLI command from `huggingface-cli download` to `hf download` |
| `tests/test_external_rollout.py` | Updated Hugging Face CLI command from `huggingface-cli download` to `hf download` |
| `requirements.txt` | Added `peft` library dependency for LoRA support |
| `.github/workflows/pr-test.yml.j2` | Added LoRA test configurations to CI workflow template |
| `.github/workflows/pr-test.yml` | Added LoRA test configurations to CI workflow |
`miles/ray/rollout.py` (outdated)
```diff
 if num_new_engines == 0:
-    return num_new_engines, None
+    return num_new_engines
```
Copilot AI (Jan 1, 2026):
Return value inconsistency: This function now returns only `num_new_engines` (a single integer), but other code locations expect it to return a tuple. At lines 70 and 107 of the same file, the return value is assigned to `self.num_new_engines` (expecting a single value), but at line 335, when `debug_train_only` is True, it still returns `0, None` (a tuple). This inconsistency will cause the function to return different types depending on the condition, which is likely a bug.
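One way to make the return type consistent, sketched with hypothetical names (the real function and its arguments live in `miles/ray/rollout.py`; this is not the PR's code):

```python
# Hypothetical sketch: every exit path returns a single int, matching callers
# that assign the result to self.num_new_engines.
def count_new_engines(debug_train_only: bool, num_new_engines: int) -> int:
    if debug_train_only:
        return 0  # previously `return 0, None`, a tuple, which broke the contract
    if num_new_engines == 0:
        return num_new_engines
    # ... create the new engines here ...
    return num_new_engines
```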
```python
if args.offload_rollout:
    ray.get(rollout_manager.offload.remote())
    offload_tags = [GPU_MEMORY_TYPE_CUDA_GRAPH]
```
Copilot AI (Jan 1, 2026):
Potential runtime error: GPU_MEMORY_TYPE_CUDA_GRAPH is added to offload_tags unconditionally, but it may be None if the import fails (as seen in lines 5-7 of this file). When GPU_MEMORY_TYPE_CUDA_GRAPH is None, it will be appended to the list and passed to the rollout manager's offload method, which could cause unexpected behavior. Consider checking if it's not None before appending, similar to the check at line 99.
Suggested change:
```diff
-offload_tags = [GPU_MEMORY_TYPE_CUDA_GRAPH]
+offload_tags = []
+if GPU_MEMORY_TYPE_CUDA_GRAPH is not None:
+    offload_tags.append(GPU_MEMORY_TYPE_CUDA_GRAPH)
```
```python
logger = logging.getLogger(__name__)

LORA_READY_MARKER = ".lora_ready"
```
Copilot AI (Jan 1, 2026):
Unused constant: LORA_READY_MARKER is defined but never used in this file or imported elsewhere in the changed files. Consider removing it if it's not needed.
Suggested change:
```diff
-LORA_READY_MARKER = ".lora_ready"
```
```python
lora_weights, config_dict = get_lora_weights_and_config(self.model)
dist.barrier()

if dist.get_rank() == 0:
```
Copilot AI (Jan 1, 2026):
Inefficient synchronization pattern: The method calls get_lora_weights_and_config for all ranks, which gathers the full state dict on all ranks, but only rank 0 uses the result. This causes unnecessary memory and compute on non-zero ranks. Consider moving line 134 inside the if dist.get_rank() == 0: block to avoid this overhead.
Suggested change:
```diff
-lora_weights, config_dict = get_lora_weights_and_config(self.model)
-dist.barrier()
-if dist.get_rank() == 0:
+dist.barrier()
+if dist.get_rank() == 0:
+    lora_weights, config_dict = get_lora_weights_and_config(self.model)
```
```python
dist.barrier()

save_lora_to_disk(self.model, self._lora_save_dir)
```
Copilot AI (Jan 1, 2026):
Inefficient synchronization pattern: The method calls save_lora_to_disk for all ranks (line 107), which internally gathers the full state dict on all ranks even though only rank 0 saves it to disk. This causes unnecessary memory and compute on non-zero ranks, similar to the tensor-based approach.
Suggested change:
```diff
-save_lora_to_disk(self.model, self._lora_save_dir)
+if dist.get_rank() == 0:
+    save_lora_to_disk(self.model, self._lora_save_dir)
```
```python
ray.get(rollout_manager.check_weights.remote(action="reset_tensors"))

if args.offload_rollout:
    # TODO: Optimization in the future: offload model weights to cpu to make more space for training?
```
Copilot AI (Jan 1, 2026):
Outdated TODO comment: The TODO comment suggests "offload model weights to cpu to make more space for training" as a future optimization. However, this PR actually implements weight offloading functionality via the --offload-rollout-level argument which includes a 'weight' option. Consider updating this comment to reflect the current implementation or removing it if the optimization is already in place.
Suggested change:
```diff
-# TODO: Optimization in the future: offload model weights to cpu to make more space for training?
+# Offload model weights/state to CPU to free up GPU memory for training.
```
```python
logger.info(f"Deleted LoRA adapter from {save_path}")


def get_lora_weights_and_config(module: nn.Module) -> tuple[dict[str, any], dict[str, any]]:
```
Copilot AI (Jan 1, 2026):
Type hint uses lowercase 'any' instead of 'Any' from typing module. This should be 'Any' (capitalized) which requires importing from the typing module.
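A corrected signature might look like this minimal sketch (the body is elided; only the `Any` import and annotation matter here):

```python
from typing import Any

import torch.nn as nn


def get_lora_weights_and_config(module: nn.Module) -> tuple[dict[str, Any], dict[str, Any]]:
    """Return the LoRA weights and the adapter config for a PEFT-wrapped model."""
    ...
```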
| kwargs["dtype"] = "float16" | ||
| if args.lora_rank > 0 or args.lora_adapter_path is not None: | ||
| kwargs["enable_lora"] = True | ||
| kwargs["max_lora_rank"] = args.lora_rank |
Copilot AI (Jan 1, 2026):
Potential AttributeError: When LoRA is enabled via --lora-adapter-path but --lora-rank is 0, this will set max_lora_rank to 0, which may not be valid for the SGLang engine. The condition should also check if lora_adapter_path is provided and set an appropriate rank value, or the validation logic should ensure lora_rank is always set when using a LoRA adapter.
| kwargs["max_lora_rank"] = args.lora_rank | |
| # Ensure a valid positive LoRA rank is passed to the SGLang engine. | |
| # If LoRA is enabled via adapter path but lora_rank is not set to a | |
| # positive value, default to rank 1 to avoid an invalid configuration. | |
| if getattr(args, "lora_rank", None) and args.lora_rank > 0: | |
| max_lora_rank = args.lora_rank | |
| else: | |
| max_lora_rank = 1 | |
| kwargs["max_lora_rank"] = max_lora_rank |
| bias="none", | ||
| ) | ||
|
|
||
| model = get_peft_model(model, lora_config) # autocast_adapter_dtype=False) |
Copilot AI (Jan 1, 2026):
Commented-out code: There is commented-out code # autocast_adapter_dtype=False that should either be removed or uncommented with a clear explanation of why it's needed. Leaving commented code without explanation reduces code clarity.
Suggested change:
```diff
-model = get_peft_model(model, lora_config)  # autocast_adapter_dtype=False)
+model = get_peft_model(model, lora_config)
```
```python
def load_lora_adapter(self, lora_name: str, lora_path: str):
    return self._make_request(
        "load_lora_adapter",
        {"lora_name": lora_name, "lora_path": lora_path},
    )

def load_lora_adapter_from_tensors(self, lora_name: str, serialized_tensors: str, config_dict: dict):
    return self._make_request(
        "load_lora_adapter_from_tensors",
        {"lora_name": lora_name, "serialized_tensors": serialized_tensors, "config_dict": config_dict},
    )

def unload_lora_adapter(self, lora_name: str):
    return self._make_request(
        "unload_lora_adapter",
        {"lora_name": lora_name},
    )
```
Copilot AI (Jan 1, 2026):
Missing documentation: The newly added methods load_lora_adapter, load_lora_adapter_from_tensors, and unload_lora_adapter lack docstrings. Public API methods should have docstrings explaining their parameters, return values, and purpose, especially for important functionality like LoRA adapter management.
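For example, a docstring for the tensor-based method might look like this (a sketch; the wording is ours, not part of the PR):

```python
def load_lora_adapter_from_tensors(self, lora_name: str, serialized_tensors: str, config_dict: dict):
    """Load a LoRA adapter into the SGLang server directly from serialized tensors.

    Args:
        lora_name: Name under which the adapter is registered on the server.
        serialized_tensors: Serialized LoRA weight tensors (string payload).
        config_dict: The adapter's LoRA configuration (rank, alpha, target modules, ...).

    Returns:
        The server's response to the `load_lora_adapter_from_tensors` request.
    """
    return self._make_request(
        "load_lora_adapter_from_tensors",
        {"lora_name": lora_name, "serialized_tensors": serialized_tensors, "config_dict": config_dict},
    )
```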
| "--actor-num-nodes 1 " | ||
| "--actor-num-gpus-per-node 2 " | ||
| "--colocate " | ||
| "--offload-rollout-level kv_cache weight " |
Do we need to distinguish between LoRA and full params for `--offload-rollout-level`?
Description
Update LoRA weights in the SGLang rollout engine via tensors, which is faster than the previous disk-sync approach.
Waiting for this sglang PR to be merged: Update LoRA Weights via Tensor sgl-project/sglang#16226
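Conceptually, the tensor path looks roughly like the following sketch. The `engine` handle, the adapter name, and the serialization format are assumptions; `get_lora_weights_and_config` and `load_lora_adapter_from_tensors` come from this PR's diffs.

```python
import base64
import io

import torch
import torch.distributed as dist

# `model` is the PEFT-wrapped training model; `engine` is the SGLang engine wrapper.
lora_weights, config_dict = get_lora_weights_and_config(model)

if dist.get_rank() == 0:
    # Serialize the LoRA tensors into an in-memory string payload instead of
    # writing adapter files to disk (base64 + torch.save is an assumption here;
    # the actual wire format is defined by the SGLang-side endpoint).
    buffer = io.BytesIO()
    torch.save(lora_weights, buffer)
    serialized_tensors = base64.b64encode(buffer.getvalue()).decode("utf-8")

    # Push the adapter to the SGLang rollout engine in one request.
    engine.load_lora_adapter_from_tensors(
        lora_name="policy_lora",  # hypothetical adapter name
        serialized_tensors=serialized_tensors,
        config_dict=config_dict,
    )
```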
Code Style Compliance
- No blocking operations (`.item()`, `.cpu()`, `.tolist()`) in inference paths

Changes Made
- Added `--lora-sync-from-tensor` to turn on this feature.
- Added `load_lora_adapter_from_tensors`, which calls `load_lora_adapter_from_tensors` in SGLang code.

Testing
To run the code, please clone and check out https://github.com/GuanxingLu/sglang/tree/feature/lora-from-tensor first.
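An illustrative LoRA run might combine the new flags like this, written in the style of the test configs in this PR (the flag names come from this PR; the values are assumptions):

```python
# Hypothetical argument string; plug into your usual launch command.
train_args = (
    "--actor-num-nodes 1 "
    "--actor-num-gpus-per-node 2 "
    "--colocate "
    "--lora-rank 16 "        # assumed value
    "--lora-alpha 32 "       # assumed value
    "--target-modules all-linear "
    "--lora-sync-from-tensor "
    "--offload-rollout-level kv_cache weight "
)
```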
Related Issues
#326, #352, sgl-project/sglang#15703