Conversation
Summary of Changes

Hello @hjh0119, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly expands the framework's capabilities by integrating Megatron-Core, a library for large-scale distributed model training. The core purpose is to let users leverage advanced parallelism techniques (TP, PP, CP, EP, DP) for training and fine-tuning large language models, particularly with LoRA. This integration provides a robust foundation for scaling model development and deployment, offering both local and Ray-based execution options.
Code Review
This is a substantial and well-executed pull request that introduces Megatron-Core support into the twinkle framework. The changes demonstrate a deep understanding of distributed training concepts and the intricacies of integrating with complex libraries like Megatron-Core and PEFT. The new MegatronModel wrapper, the MegatronStrategy, and the weight conversion TwinkleBridge are all well-designed components. The addition of multi-tenancy support is particularly impressive. I've pointed out a critical merge conflict that needs to be resolved and a few areas for improvement, mainly around production readiness of examples and more specific exception handling to improve robustness. Overall, this is a high-quality contribution.
cookbook/sft/streaming_dataset.py
```
<<<<<<< HEAD
#device_mesh = DeviceMesh(
#    device_type='cuda',
#    mesh=np.array([0,1,2,3]),
#    mesh_dim_names=('dp',)
#)

twinkle.initialize(mode='ray', nproc_per_node=4, groups=device_group, global_device_mesh=device_mesh, lazy_collect=False)
=======
twinkle.initialize(mode='ray', groups=device_group, global_device_mesh=device_mesh)
>>>>>>> origin/dev
```
cookbook/megatron/lora.py
```python
except Exception:
    pass
```
The except Exception: pass is too broad and will silently ignore any and all errors during the cleanup of distributed resources. This can make debugging difficult if an unexpected error occurs. It's better to catch more specific exceptions (e.g., AttributeError, RuntimeError) or at least log the exception that was caught.
Suggested change:

```diff
-except Exception:
-    pass
+except Exception as e:
+    logger.warning(f'Error during cleanup, ignoring: {e}')
```
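The suggested pattern in isolation, as a runnable sketch; `cleanup_distributed` and the `destroy_process_group` call are illustrative stand-ins for the cookbook's cleanup logic, not its actual code:

```python
import logging

logger = logging.getLogger(__name__)

def cleanup_distributed(dist_module):
    """Tear down distributed state, logging (not hiding) cleanup errors."""
    try:
        dist_module.destroy_process_group()
    except (AttributeError, RuntimeError) as e:
        # Cleanup failures are non-fatal at shutdown, but leave a trace
        # so unexpected errors are still visible when debugging.
        logger.warning('Error during cleanup, ignoring: %s', e)
```

Passing an object that lacks `destroy_process_group` logs a warning instead of raising, while anything outside the listed exception types still propagates.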
cookbook/megatron/server.py
```python
    return jsonify(result)


logger.info(f'Starting server on {args.host}:{args.port}')
app.run(host=args.host, port=args.port, threaded=False)
```
The server is started using app.run(), which is Flask's built-in development server. This server is not suitable for production environments as it's not designed to be efficient, stable, or secure. For deployment, you should use a production-grade WSGI server like Gunicorn or uWSGI.
For example:

```shell
gunicorn --workers 4 --bind 0.0.0.0:8000 cookbook.megatron.server:app
```
src/twinkle/megatron/model/bridge.py
```python
self.ep_group = mpu.get_expert_model_parallel_group()
self.etp_rank = mpu.get_expert_tensor_parallel_rank()
self.etp_group = mpu.get_expert_tensor_parallel_group()
except:
```
The bare except: clause is dangerous as it catches all exceptions, including system-exiting ones like SystemExit and KeyboardInterrupt. This can hide critical errors and make the program difficult to terminate. Please specify the exceptions you intend to catch, for example except (AttributeError, ImportError):.
Suggested change:

```diff
-except:
+except (AttributeError, ImportError):
```
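The distinction matters in practice: a bare `except` also swallows `BaseException` subclasses such as `SystemExit` and `KeyboardInterrupt`, which `except Exception` does not. A minimal, self-contained illustration:

```python
def swallow_bare(fn):
    try:
        fn()
    except:  # noqa: E722 - deliberately bare for demonstration
        return 'caught'
    return 'ok'

def swallow_exception(fn):
    try:
        fn()
    except Exception:
        return 'caught'
    return 'ok'

def exit_now():
    raise SystemExit(1)

# A bare except traps the interpreter-exit request...
assert swallow_bare(exit_now) == 'caught'
# ...while except Exception lets SystemExit propagate, as intended.
```

This is why a program with bare `except` clauses in its hot path can become impossible to stop with Ctrl-C.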
src/twinkle/megatron/model/bridge.py
```python
    moe_grouped_gemm=False,
    qk_layernorm=mg_config_dict.get('qk_layernorm', False),
)
except Exception:
```
Using except Exception: is too broad. It can catch unexpected errors and make debugging harder. It's better to catch a more specific exception that you expect might be raised by get_gpt_layer_with_transformer_engine_spec, for instance ImportError if transformer_engine is not available. If you must catch a broad exception, consider logging it to aid in debugging.
Suggested change:

```diff
-except Exception:
+except (ImportError, AttributeError) as e:
```
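The fallback-with-logging shape the review asks for can be sketched as follows; both builder callables are hypothetical stand-ins for the real spec factories in `bridge.py`:

```python
import logging

logger = logging.getLogger(__name__)

def load_layer_spec(build_te_spec, build_local_spec):
    """Prefer the transformer_engine layer spec, falling back with a logged reason.

    build_te_spec / build_local_spec are illustrative stand-ins, not the
    actual Megatron-Core factory functions.
    """
    try:
        return build_te_spec()
    except (ImportError, AttributeError) as e:
        # Record why the preferred spec was unavailable instead of
        # silently switching implementations.
        logger.warning('transformer_engine spec unavailable, using local spec: %s', e)
        return build_local_spec()
```

An unexpected error (say, a `ValueError` from bad configuration) now surfaces immediately rather than being masked by the fallback path.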
Multi-tenant LoRA training code has not been fully tested. Files removed from git tracking (kept locally):
- src/twinkle/megatron/distributed/
- src/twinkle/megatron/model/multi_tenant_megatron.py
- cookbook/megatron/megatron_multi_tenant/
- tests/megatron/test_multi_tenant_*.py
The modified forward_backward() was causing hangs and GEMM errors. Reverted to the clean version that works correctly with TP/PP.
Key fixes:
1. Simplified input processing to match the working version (70ff0ba): process inputs once at the beginning, not per microbatch, and properly handle labels by storing them separately before deletion.
2. Fixed sequence_parallel padding for MoE models: detect the actual sequence_parallel setting from model.config (the bridge auto-enables sequence_parallel for MoE with TP > 1) and pad the sequence length to be divisible by the TP size.
3. Reverted loss_func to return 2 values, compatible with the Megatron scheduler. Old format: (loss, {'loss': loss}); it was incorrectly returning 3 values, causing compatibility issues.

Tested:
- Dense model (Qwen2.5-7B) with TP=2, PP=2: step 0 loss 1.168556
- MoE model (Qwen3-30B-A3B) with TP=2, EP=2: step 0 loss 1.474237
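The padding rule in fix 2 reduces to rounding the sequence length up to the next multiple of the TP size, so the sequence-parallel scatter splits evenly across TP ranks. A minimal sketch (the helper name is an assumption, not the PR's actual function):

```python
def pad_to_tp_multiple(seq_len: int, tp_size: int) -> int:
    """Round seq_len up to the nearest multiple of tp_size."""
    remainder = seq_len % tp_size
    if remainder == 0:
        return seq_len
    return seq_len + (tp_size - remainder)

# With TP=2, an odd sequence length gains one pad token:
# pad_to_tp_multiple(1023, 2) -> 1024
```

The pad positions would then be masked out of the loss so they do not affect training.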
For DP > 1, added a barrier at the end of forward_backward to ensure all DP replicas complete the same training step before moving to the next batch. This prevents P2P communication deadlocks in subsequent training iterations.
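The barrier pattern can be sketched as below; the function name and placement are illustrative, and the barrier is skipped when torch.distributed is not initialized (e.g. single-process runs):

```python
import torch.distributed as dist

def sync_dp_replicas() -> bool:
    """Block until every replica reaches this point in the current step.

    Calling this at the end of forward_backward keeps one DP replica from
    starting the next batch's P2P exchanges while others are still inside
    the previous step, which can deadlock. Returns True if a barrier ran.
    """
    if dist.is_available() and dist.is_initialized():
        dist.barrier()
        return True
    return False
```

`dist.barrier()` synchronizes the default process group; in a real multi-group setup you would pass the DP group explicitly via the `group` argument.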
/gemini review
Code Review
This pull request introduces significant new functionality by adding initial support for Megatron-Core, including training examples for LoRA and MoE, a bridge for weight conversion, and necessary integrations into the twinkle framework. The changes are extensive and well-structured. My review focuses on improving robustness, maintainability, and fixing a few critical bugs. Key areas for improvement include replacing broad exception handling with more specific catches to prevent silent failures, fixing a critical bug in the Ray integration logic, and addressing potential maintainability issues related to internal library usage and hardcoded paths in test scripts.
```python
    return  # No gradients to sync

# Coalesced all-reduce for efficiency
from torch._utils import _flatten_dense_tensors, _unflatten_dense_tensors
```
Using internal functions from torch._utils like _flatten_dense_tensors and _unflatten_dense_tensors is risky as they are not part of the public API and can be changed or removed without notice in future PyTorch versions. This could break your code unexpectedly. Consider finding an alternative from the public API or vendoring the function if it's critical for performance.
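One public-API alternative is to flatten with `torch.cat` and restore shapes with `torch.split`. A sketch under that assumption (not the PR's code; `all_reduce_fn` stands in for e.g. `torch.distributed.all_reduce`):

```python
import torch

def coalesced_all_reduce(tensors, all_reduce_fn):
    """Flatten grads into one buffer, reduce once, then copy results back.

    Uses only public PyTorch APIs instead of torch._utils helpers.
    """
    # One contiguous buffer -> a single collective call instead of N.
    flat = torch.cat([t.reshape(-1) for t in tensors])
    all_reduce_fn(flat)
    # Split the reduced buffer back into the original shapes.
    sizes = [t.numel() for t in tensors]
    for t, chunk in zip(tensors, flat.split(sizes)):
        t.copy_(chunk.view_as(t))
```

The extra `torch.cat` copy is the price of staying on the public API; for very large gradient sets, bucketing the tensors would bound the temporary buffer size.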
```python
env["MEGATRON_LM_PATH"] = "/mnt/nas2/hujinghan.hjh/Megatron-LM"
env["PYTHONPATH"] = "/mnt/nas2/hujinghan.hjh/Megatron-LM:/mnt/nas2/hujinghan.hjh/twinkle/src:" + env.get("PYTHONPATH", "")
These absolute paths are machine-specific and will break for anyone else running the test script. Consider reading MEGATRON_LM_PATH from the environment or a configuration file instead of hardcoding it.
```python
def cleanup():
    """Kill any lingering processes."""
    os.system("pkill -9 -f 'lora.py|MegatronModel|ray' 2>/dev/null")
```
The pkill command pkill -9 -f 'lora.py|MegatronModel|ray' is quite broad and could unintentionally terminate other unrelated processes that match these patterns. To make it safer, consider making the pattern more specific, for example by including the full path to the script, or by managing process IDs more directly if possible.
/gemini review
Code Review
This pull request introduces extensive initial support for Megatron-Core, enabling advanced model parallelism capabilities within the twinkle framework. The changes include a sophisticated bridge for converting HuggingFace model weights to the Megatron format, new training scripts for LoRA and MoE fine-tuning, and a high-level MegatronModel wrapper that abstracts away much of the complexity of distributed training. The implementation is impressive, demonstrating a deep understanding of both Megatron-Core and PEFT. My review identifies a couple of areas for improvement, mainly related to error handling in an example script and ensuring correctness in the MoE weight loading logic under specific parallelism configurations.
Initial Megatron support