
[feat] Add initial Megatron support [1/N]#2

Merged
hjh0119 merged 24 commits into dev from megatron on Jan 19, 2026
Conversation

hjh0119 (Collaborator) commented Jan 13, 2026

Initial Megatron support

  • Implemented a Megatron bridge for HuggingFace-to-Megatron-Core (hf2mcore) weight conversion.
  • Verified 5D parallelism (DP/TP/PP/CP/EP) in local mode with loss alignment.
  • Tested on Qwen2.5-dense models.

gemini-code-assist (Contributor)

Summary of Changes

Hello @hjh0119, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands the framework's capabilities by integrating Megatron-Core, a powerful library for large-scale distributed model training. The core purpose is to allow users to leverage advanced parallelism techniques (TP, PP, CP, EP, DP) for training and fine-tuning large language models, particularly with LoRA. This integration provides a robust foundation for scaling model development and deployment, offering both local and Ray-based execution options.

Highlights

  • Megatron-Core Integration: Introduced initial support for Megatron-Core, enabling distributed training with Tensor, Pipeline, Context, Expert, and Data Parallelism.
  • HuggingFace to Megatron Bridge: Implemented a bridge (TwinkleGPTBridge) to seamlessly convert HuggingFace model weights to Megatron-Core format, supporting Qwen2.5/Qwen3 models.
  • LoRA Training Client and Server: Added a client-server architecture (cookbook/megatron/client.py and cookbook/megatron/server.py) for remote LoRA training with Megatron models.
  • PEFT Compatibility: Integrated PEFT (Parameter-Efficient Fine-Tuning) with Megatron models, including a custom LoraParallelLinear layer and a deepcopy patch to handle Megatron's tp_group.
  • Distributed Training Enhancements: Introduced MegatronStrategy for managing Megatron's parallel state and TenantProcessGroupManager for multi-tenant process group isolation, allowing flexible distributed configurations.
  • Vocab-Parallel Cross Entropy Loss: Added MegatronCrossEntropyLoss to correctly compute cross entropy when the vocabulary is sharded across Tensor Parallel ranks.
  • Ray Mode Execution: Enabled basic execution in Ray mode, with specific handling for synchronous operations required by Megatron's collective communications.
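The vocab-parallel loss mentioned in the highlights can be sketched as follows. This is a single-process simulation, not the actual MegatronCrossEntropyLoss code: a Python list of arrays stands in for the TP ranks' logit shards, and plain max/sum stand in for the all-reduce(MAX) and all-reduce(SUM) over the TP group.

```python
import numpy as np

def vocab_parallel_cross_entropy(logit_shards, target_idx):
    """Cross entropy when the vocabulary dimension is sharded across TP ranks.

    Each entry of `logit_shards` plays the role of one rank's slice of the
    logits; the two reductions below simulate the two TP all-reduces.
    """
    # all-reduce(MAX): global max for numerical stability
    global_max = max(shard.max() for shard in logit_shards)
    # all-reduce(SUM): softmax denominator accumulated across shards
    sum_exp = sum(np.exp(shard - global_max).sum() for shard in logit_shards)
    # Only the rank that owns the target token has its logit locally.
    offset = 0
    for shard in logit_shards:
        if offset <= target_idx < offset + shard.shape[0]:
            target_logit = shard[target_idx - offset]
        offset += shard.shape[0]
    # loss = log(sum(exp(x - max))) - (x_target - max)
    return np.log(sum_exp) - (target_logit - global_max)
```

The result matches an unsharded cross entropy over the concatenated logits, which is exactly the loss-alignment property the PR description says was verified.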


gemini-code-assist bot left a comment

Code Review

This is a substantial and well-executed pull request that introduces Megatron-Core support into the twinkle framework. The changes demonstrate a deep understanding of distributed training concepts and the intricacies of integrating with complex libraries like Megatron-Core and PEFT. The new MegatronModel wrapper, the MegatronStrategy, and the weight conversion TwinkleBridge are all well-designed components. The addition of multi-tenancy support is particularly impressive. I've pointed out a critical merge conflict that needs to be resolved and a few areas for improvement, mainly around production readiness of examples and more specific exception handling to improve robustness. Overall, this is a high-quality contribution.

Comment on lines +32 to +42
<<<<<<< HEAD
#device_mesh = DeviceMesh(
# device_type='cuda',
# mesh=np.array([0,1,2,3]),
# mesh_dim_names=('dp',)
#)

twinkle.initialize(mode='ray', nproc_per_node=4, groups=device_group, global_device_mesh=device_mesh, lazy_collect=False)
=======
twinkle.initialize(mode='ray', groups=device_group, global_device_mesh=device_mesh)
>>>>>>> origin/dev

critical

This file contains unresolved merge conflicts. Please resolve the conflicts between HEAD and origin/dev to ensure the code is in a runnable state.

Comment on lines +169 to +170
except Exception:
pass

medium

The except Exception: pass is too broad and will silently ignore any and all errors during the cleanup of distributed resources. This can make debugging difficult if an unexpected error occurs. It's better to catch more specific exceptions (e.g., AttributeError, RuntimeError) or at least log the exception that was caught.

Suggested change:
-    except Exception:
-        pass
+    except Exception as e:
+        logger.warning(f'Error during cleanup, ignoring: {e}')

return jsonify(result)

logger.info(f'Starting server on {args.host}:{args.port}')
app.run(host=args.host, port=args.port, threaded=False)

medium

The server is started using app.run(), which is Flask's built-in development server. This server is not suitable for production environments as it's not designed to be efficient, stable, or secure. For deployment, you should use a production-grade WSGI server like Gunicorn or uWSGI.

For example:

gunicorn --workers 4 --bind 0.0.0.0:8000 cookbook.megatron.server:app

self.ep_group = mpu.get_expert_model_parallel_group()
self.etp_rank = mpu.get_expert_tensor_parallel_rank()
self.etp_group = mpu.get_expert_tensor_parallel_group()
except:

medium

The bare except: clause is dangerous as it catches all exceptions, including system-exiting ones like SystemExit and KeyboardInterrupt. This can hide critical errors and make the program difficult to terminate. Please specify the exceptions you intend to catch, for example except (AttributeError, ImportError):.

Suggested change:
-    except:
+    except (AttributeError, ImportError):

moe_grouped_gemm=False,
qk_layernorm=mg_config_dict.get('qk_layernorm', False),
)
except Exception:

medium

Using except Exception: is too broad. It can catch unexpected errors and make debugging harder. It's better to catch a more specific exception that you expect might be raised by get_gpt_layer_with_transformer_engine_spec, for instance ImportError if transformer_engine is not available. If you must catch a broad exception, consider logging it to aid in debugging.

Suggested change:
-    except Exception:
+    except (ImportError, AttributeError) as e:

Multi-tenant LoRA training code has not been fully tested.
Files removed from git tracking (kept locally):
- src/twinkle/megatron/distributed/
- src/twinkle/megatron/model/multi_tenant_megatron.py
- cookbook/megatron/megatron_multi_tenant/
- tests/megatron/test_multi_tenant_*.py

The modified forward_backward() was causing hangs and GEMM errors.
Reverted to the clean version that works correctly with TP/PP.

Key fixes:
1. Simplified input processing to match working version (70ff0ba)
   - Process inputs once at the beginning, not per microbatch
   - Properly handle labels by storing separately before deletion

2. Fixed sequence_parallel padding for MoE models
   - Detect actual sequence_parallel setting from model.config
   - Bridge auto-enables sequence_parallel for MoE with TP > 1
   - Pad sequence length to be divisible by TP size

3. Reverted loss_func to return 2 values (compatible with Megatron scheduler)
   - Old format: (loss, {'loss': loss})
   - Was incorrectly returning 3 values causing compatibility issues

Tested:
- Dense model (Qwen2.5-7B) with TP=2, PP=2: Step 0 loss 1.168556
- MoE model (Qwen3-30B-A3B) with TP=2, EP=2: Step 0 loss 1.474237

For DP > 1, added a barrier at the end of forward_backward to ensure
all DP replicas complete the same training step before moving to the
next batch. This prevents P2P communication deadlocks in subsequent
training iterations.
hjh0119 (Collaborator, Author) commented Jan 19, 2026

/gemini review

gemini-code-assist bot left a comment

Code Review

This pull request introduces significant new functionality by adding initial support for Megatron-Core, including training examples for LoRA and MoE, a bridge for weight conversion, and necessary integrations into the twinkle framework. The changes are extensive and well-structured. My review focuses on improving robustness, maintainability, and fixing a few critical bugs. Key areas for improvement include replacing broad exception handling with more specific catches to prevent silent failures, fixing a critical bug in the Ray integration logic, and addressing potential maintainability issues related to internal library usage and hardcoded paths in test scripts.

return # No gradients to sync

# Coalesced all-reduce for efficiency
from torch._utils import _flatten_dense_tensors, _unflatten_dense_tensors

medium

Using internal functions from torch._utils like _flatten_dense_tensors and _unflatten_dense_tensors is risky as they are not part of the public API and can be changed or removed without notice in future PyTorch versions. This could break your code unexpectedly. Consider finding an alternative from the public API or vendoring the function if it's critical for performance.
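The vendoring option the comment mentions is straightforward: the flatten/unflatten pair is a small amount of code. A numpy sketch of the same idea (the real code would operate on torch tensors, and these names are illustrative):

```python
import numpy as np

def flatten_dense(arrays):
    """Concatenate arrays into one flat buffer, so a coalesced all-reduce
    can run once on the buffer instead of once per gradient."""
    return np.concatenate([a.ravel() for a in arrays])

def unflatten_dense(flat, like):
    """Split a flat buffer back into pieces shaped like the originals."""
    out, offset = [], 0
    for a in like:
        n = a.size
        out.append(flat[offset:offset + n].reshape(a.shape))
        offset += n
    return out
```

In torch, the equivalent public-API route is `torch.cat` over flattened views followed by `torch.split` with the original sizes.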

Comment on lines +27 to +28
env["MEGATRON_LM_PATH"] = "/mnt/nas2/hujinghan.hjh/Megatron-LM"
env["PYTHONPATH"] = "/mnt/nas2/hujinghan.hjh/Megatron-LM:/mnt/nas2/hujinghan.hjh/twinkle/src:" + env.get("PYTHONPATH", "")

medium

The test script contains hardcoded paths for MEGATRON_LM_PATH and PYTHONPATH. This makes the script not portable and hard to run in different environments. It would be better to make these configurable, for example, by reading them from environment variables with sensible defaults.
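A sketch of the suggested fix: read the paths from environment variables with a fallback default. The function name and default path here are illustrative, not part of the PR.

```python
import os

def build_test_env(base_env=None, default_megatron_path="~/Megatron-LM"):
    """Build the subprocess environment from overridable variables rather
    than hardcoded absolute paths."""
    env = dict(os.environ if base_env is None else base_env)
    megatron_path = env.get(
        "MEGATRON_LM_PATH", os.path.expanduser(default_megatron_path))
    env["MEGATRON_LM_PATH"] = megatron_path
    # Prepend Megatron-LM to PYTHONPATH, preserving anything already set.
    env["PYTHONPATH"] = os.pathsep.join(
        p for p in (megatron_path, env.get("PYTHONPATH", "")) if p)
    return env
```

Running the test then only requires exporting MEGATRON_LM_PATH once per machine instead of editing the script.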


def cleanup():
"""Kill any lingering processes."""
os.system("pkill -9 -f 'lora.py|MegatronModel|ray' 2>/dev/null")

medium

The pkill command pkill -9 -f 'lora.py|MegatronModel|ray' is quite broad and could unintentionally terminate other unrelated processes that match these patterns. To make it safer, consider making the pattern more specific, for example by including the full path to the script, or by managing process IDs more directly if possible.
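One way to manage process IDs directly, as the comment suggests, is to record every child the script spawns and terminate exactly those on cleanup. A minimal sketch (the ProcessTracker name is illustrative, not from the PR):

```python
import subprocess
import sys

class ProcessTracker:
    """Track child processes so cleanup targets exactly what we spawned,
    instead of pattern-matching every process on the machine with pkill."""

    def __init__(self):
        self._procs = []

    def spawn(self, *argv):
        proc = subprocess.Popen(argv)
        self._procs.append(proc)
        return proc

    def cleanup(self, timeout=5.0):
        # SIGTERM first so children can shut down cleanly...
        for proc in self._procs:
            if proc.poll() is None:
                proc.terminate()
        # ...then SIGKILL anything that ignores it.
        for proc in self._procs:
            try:
                proc.wait(timeout=timeout)
            except subprocess.TimeoutExpired:
                proc.kill()
                proc.wait()
```

A test script would call `tracker.spawn(sys.executable, "lora.py", ...)` for each worker and invoke `tracker.cleanup()` in a finally block.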

hjh0119 (Collaborator, Author) commented Jan 19, 2026

/gemini review

gemini-code-assist bot left a comment

Code Review

This pull request introduces extensive initial support for Megatron-Core, enabling advanced model parallelism capabilities within the twinkle framework. The changes include a sophisticated bridge for converting HuggingFace model weights to the Megatron format, new training scripts for LoRA and MoE fine-tuning, and a high-level MegatronModel wrapper that abstracts away much of the complexity of distributed training. The implementation is impressive, demonstrating a deep understanding of both Megatron-Core and PEFT. My review identifies a couple of areas for improvement, mainly related to error handling in an example script and ensuring correctness in the MoE weight loading logic under specific parallelism configurations.

@hjh0119 hjh0119 merged commit 266cf83 into dev Jan 19, 2026
@tastelikefeet tastelikefeet deleted the megatron branch February 13, 2026 09:44