
[feat] Add initial Megatron support [1/N]#2

Merged
hjh0119 merged 24 commits into dev from megatron on Jan 19, 2026
Conversation

hjh0119 (Collaborator) commented Jan 13, 2026

Initial Megatron support

  • Implemented a Megatron bridge for HuggingFace-to-Megatron-Core (hf2mcore) weight conversion.
  • Verified 5D parallelism (DP/TP/PP/CP/EP) in local mode with loss alignment.
  • Tested on Qwen2.5-dense models.

gemini-code-assist (Contributor)

Summary of Changes

Hello @hjh0119, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands the framework's capabilities by integrating Megatron-Core, a powerful library for large-scale distributed model training. The core purpose is to allow users to leverage advanced parallelism techniques (TP, PP, CP, EP, DP) for training and fine-tuning large language models, particularly with LoRA. This integration provides a robust foundation for scaling model development and deployment, offering both local and Ray-based execution options.

Highlights

  • Megatron-Core Integration: Introduced initial support for Megatron-Core, enabling distributed training with Tensor, Pipeline, Context, Expert, and Data Parallelism.
  • HuggingFace to Megatron Bridge: Implemented a bridge (TwinkleGPTBridge) to seamlessly convert HuggingFace model weights to Megatron-Core format, supporting Qwen2.5/Qwen3 models.
  • LoRA Training Client and Server: Added a client-server architecture (cookbook/megatron/client.py and cookbook/megatron/server.py) for remote LoRA training with Megatron models.
  • PEFT Compatibility: Integrated PEFT (Parameter-Efficient Fine-Tuning) with Megatron models, including a custom LoraParallelLinear layer and a deepcopy patch to handle Megatron's tp_group.
  • Distributed Training Enhancements: Introduced MegatronStrategy for managing Megatron's parallel state and TenantProcessGroupManager for multi-tenant process group isolation, allowing flexible distributed configurations.
  • Vocab-Parallel Cross Entropy Loss: Added MegatronCrossEntropyLoss to correctly compute cross entropy when the vocabulary is sharded across Tensor Parallel ranks.
  • Ray Mode Execution: Enabled basic execution in Ray mode, with specific handling for synchronous operations required by Megatron's collective communications.
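The vocab-parallel loss mentioned in the highlights can be sketched as follows. This is a single-process simulation, not the actual MegatronCrossEntropyLoss code: a Python list of arrays stands in for the TP ranks' logit shards, and plain max/sum stand in for the all-reduce(MAX) and all-reduce(SUM) over the TP group.

```python
import numpy as np

def vocab_parallel_cross_entropy(logit_shards, target_idx):
    """Cross entropy when the vocabulary dimension is sharded across TP ranks.

    Each entry of `logit_shards` plays the role of one rank's slice of the
    logits; the two reductions below simulate the two TP all-reduces.
    """
    # all-reduce(MAX): global max for numerical stability
    global_max = max(shard.max() for shard in logit_shards)
    # all-reduce(SUM): softmax denominator accumulated across shards
    sum_exp = sum(np.exp(shard - global_max).sum() for shard in logit_shards)
    # Only the rank that owns the target token has its logit locally.
    offset = 0
    for shard in logit_shards:
        if offset <= target_idx < offset + shard.shape[0]:
            target_logit = shard[target_idx - offset]
        offset += shard.shape[0]
    # loss = log(sum(exp(x - max))) - (x_target - max)
    return np.log(sum_exp) - (target_logit - global_max)
```

The result matches an unsharded cross entropy over the concatenated logits, which is exactly the loss-alignment property the PR description says was verified.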


gemini-code-assist bot left a comment

Code Review

This is a substantial and well-executed pull request that introduces Megatron-Core support into the twinkle framework. The changes demonstrate a deep understanding of distributed training concepts and the intricacies of integrating with complex libraries like Megatron-Core and PEFT. The new MegatronModel wrapper, the MegatronStrategy, and the weight conversion TwinkleBridge are all well-designed components. The addition of multi-tenancy support is particularly impressive. I've pointed out a critical merge conflict that needs to be resolved and a few areas for improvement, mainly around production readiness of examples and more specific exception handling to improve robustness. Overall, this is a high-quality contribution.

Comment on lines +32 to +42
<<<<<<< HEAD
#device_mesh = DeviceMesh(
# device_type='cuda',
# mesh=np.array([0,1,2,3]),
# mesh_dim_names=('dp',)
#)

twinkle.initialize(mode='ray', nproc_per_node=4, groups=device_group, global_device_mesh=device_mesh, lazy_collect=False)
=======
twinkle.initialize(mode='ray', groups=device_group, global_device_mesh=device_mesh)
>>>>>>> origin/dev

critical

This file contains unresolved merge conflicts. Please resolve the conflicts between HEAD and origin/dev to ensure the code is in a runnable state.

Comment on lines +169 to +170
except Exception:
pass

medium

The except Exception: pass is too broad and will silently ignore any and all errors during the cleanup of distributed resources. This can make debugging difficult if an unexpected error occurs. It's better to catch more specific exceptions (e.g., AttributeError, RuntimeError) or at least log the exception that was caught.

Suggested change:
-    except Exception:
-        pass
+    except Exception as e:
+        logger.warning(f'Error during cleanup, ignoring: {e}')

return jsonify(result)

logger.info(f'Starting server on {args.host}:{args.port}')
app.run(host=args.host, port=args.port, threaded=False)

medium

The server is started using app.run(), which is Flask's built-in development server. This server is not suitable for production environments as it's not designed to be efficient, stable, or secure. For deployment, you should use a production-grade WSGI server like Gunicorn or uWSGI.

For example:

gunicorn --workers 4 --bind 0.0.0.0:8000 cookbook.megatron.server:app

self.ep_group = mpu.get_expert_model_parallel_group()
self.etp_rank = mpu.get_expert_tensor_parallel_rank()
self.etp_group = mpu.get_expert_tensor_parallel_group()
except:

medium

The bare except: clause is dangerous as it catches all exceptions, including system-exiting ones like SystemExit and KeyboardInterrupt. This can hide critical errors and make the program difficult to terminate. Please specify the exceptions you intend to catch, for example except (AttributeError, ImportError):.

Suggested change:
-    except:
+    except (AttributeError, ImportError):

moe_grouped_gemm=False,
qk_layernorm=mg_config_dict.get('qk_layernorm', False),
)
except Exception:

medium

Using except Exception: is too broad. It can catch unexpected errors and make debugging harder. It's better to catch a more specific exception that you expect might be raised by get_gpt_layer_with_transformer_engine_spec, for instance ImportError if transformer_engine is not available. If you must catch a broad exception, consider logging it to aid in debugging.

Suggested change:
-    except Exception:
+    except (ImportError, AttributeError) as e:

Multi-tenant LoRA training code has not been fully tested.
Files removed from git tracking (kept locally):
- src/twinkle/megatron/distributed/
- src/twinkle/megatron/model/multi_tenant_megatron.py
- cookbook/megatron/megatron_multi_tenant/
- tests/megatron/test_multi_tenant_*.py

The modified forward_backward() was causing hangs and GEMM errors.
Reverted to the clean version that works correctly with TP/PP.

Key fixes:
1. Simplified input processing to match working version (70ff0ba)
   - Process inputs once at the beginning, not per microbatch
   - Properly handle labels by storing separately before deletion

2. Fixed sequence_parallel padding for MoE models
   - Detect actual sequence_parallel setting from model.config
   - Bridge auto-enables sequence_parallel for MoE with TP > 1
   - Pad sequence length to be divisible by TP size

3. Reverted loss_func to return 2 values (compatible with Megatron scheduler)
   - Old format: (loss, {'loss': loss})
   - Was incorrectly returning 3 values causing compatibility issues

Tested:
- Dense model (Qwen2.5-7B) with TP=2, PP=2: Step 0 loss 1.168556
- MoE model (Qwen3-30B-A3B) with TP=2, EP=2: Step 0 loss 1.474237

For DP > 1, added a barrier at the end of forward_backward to ensure
all DP replicas complete the same training step before moving to the
next batch. This prevents P2P communication deadlocks in subsequent
training iterations.
hjh0119 (Collaborator, Author) commented Jan 19, 2026

/gemini review

gemini-code-assist bot left a comment

Code Review

This pull request introduces significant new functionality by adding initial support for Megatron-Core, including training examples for LoRA and MoE, a bridge for weight conversion, and necessary integrations into the twinkle framework. The changes are extensive and well-structured. My review focuses on improving robustness, maintainability, and fixing a few critical bugs. Key areas for improvement include replacing broad exception handling with more specific catches to prevent silent failures, fixing a critical bug in the Ray integration logic, and addressing potential maintainability issues related to internal library usage and hardcoded paths in test scripts.

return # No gradients to sync

# Coalesced all-reduce for efficiency
from torch._utils import _flatten_dense_tensors, _unflatten_dense_tensors

medium

Using internal functions from torch._utils like _flatten_dense_tensors and _unflatten_dense_tensors is risky as they are not part of the public API and can be changed or removed without notice in future PyTorch versions. This could break your code unexpectedly. Consider finding an alternative from the public API or vendoring the function if it's critical for performance.
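The vendoring option the comment mentions is straightforward: the flatten/unflatten pair is a small amount of code. A numpy sketch of the same idea (the real code would operate on torch tensors, and these names are illustrative):

```python
import numpy as np

def flatten_dense(arrays):
    """Concatenate arrays into one flat buffer, so a coalesced all-reduce
    can run once on the buffer instead of once per gradient."""
    return np.concatenate([a.ravel() for a in arrays])

def unflatten_dense(flat, like):
    """Split a flat buffer back into pieces shaped like the originals."""
    out, offset = [], 0
    for a in like:
        n = a.size
        out.append(flat[offset:offset + n].reshape(a.shape))
        offset += n
    return out
```

In torch, the equivalent public-API route is `torch.cat` over flattened views followed by `torch.split` with the original sizes.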

Comment on lines +27 to +28
env["MEGATRON_LM_PATH"] = "/mnt/nas2/hujinghan.hjh/Megatron-LM"
env["PYTHONPATH"] = "/mnt/nas2/hujinghan.hjh/Megatron-LM:/mnt/nas2/hujinghan.hjh/twinkle/src:" + env.get("PYTHONPATH", "")

medium

The test script contains hardcoded paths for MEGATRON_LM_PATH and PYTHONPATH. This makes the script not portable and hard to run in different environments. It would be better to make these configurable, for example, by reading them from environment variables with sensible defaults.
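A sketch of the suggested fix: read the paths from environment variables with a fallback default. The function name and default path here are illustrative, not part of the PR.

```python
import os

def build_test_env(base_env=None, default_megatron_path="~/Megatron-LM"):
    """Build the subprocess environment from overridable variables rather
    than hardcoded absolute paths."""
    env = dict(os.environ if base_env is None else base_env)
    megatron_path = env.get(
        "MEGATRON_LM_PATH", os.path.expanduser(default_megatron_path))
    env["MEGATRON_LM_PATH"] = megatron_path
    # Prepend Megatron-LM to PYTHONPATH, preserving anything already set.
    env["PYTHONPATH"] = os.pathsep.join(
        p for p in (megatron_path, env.get("PYTHONPATH", "")) if p)
    return env
```

Running the test then only requires exporting MEGATRON_LM_PATH once per machine instead of editing the script.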


def cleanup():
"""Kill any lingering processes."""
os.system("pkill -9 -f 'lora.py|MegatronModel|ray' 2>/dev/null")

medium

The pkill command pkill -9 -f 'lora.py|MegatronModel|ray' is quite broad and could unintentionally terminate other unrelated processes that match these patterns. To make it safer, consider making the pattern more specific, for example by including the full path to the script, or by managing process IDs more directly if possible.
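One way to manage process IDs directly, as the comment suggests, is to record every child the script spawns and terminate exactly those on cleanup. A minimal sketch (the ProcessTracker name is illustrative, not from the PR):

```python
import subprocess
import sys

class ProcessTracker:
    """Track child processes so cleanup targets exactly what we spawned,
    instead of pattern-matching every process on the machine with pkill."""

    def __init__(self):
        self._procs = []

    def spawn(self, *argv):
        proc = subprocess.Popen(argv)
        self._procs.append(proc)
        return proc

    def cleanup(self, timeout=5.0):
        # SIGTERM first so children can shut down cleanly...
        for proc in self._procs:
            if proc.poll() is None:
                proc.terminate()
        # ...then SIGKILL anything that ignores it.
        for proc in self._procs:
            try:
                proc.wait(timeout=timeout)
            except subprocess.TimeoutExpired:
                proc.kill()
                proc.wait()
```

A test script would call `tracker.spawn(sys.executable, "lora.py", ...)` for each worker and invoke `tracker.cleanup()` in a finally block.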

hjh0119 (Collaborator, Author) commented Jan 19, 2026

/gemini review

gemini-code-assist bot left a comment

Code Review

This pull request introduces extensive initial support for Megatron-Core, enabling advanced model parallelism capabilities within the twinkle framework. The changes include a sophisticated bridge for converting HuggingFace model weights to the Megatron format, new training scripts for LoRA and MoE fine-tuning, and a high-level MegatronModel wrapper that abstracts away much of the complexity of distributed training. The implementation is impressive, demonstrating a deep understanding of both Megatron-Core and PEFT. My review identifies a couple of areas for improvement, mainly related to error handling in an example script and ensuring correctness in the MoE weight loading logic under specific parallelism configurations.

@hjh0119 hjh0119 merged commit 266cf83 into dev Jan 19, 2026
@tastelikefeet tastelikefeet deleted the megatron branch February 13, 2026 09:44