We provide a new repository, Megatron-LM-MagiAttention, forked from Megatron-LM v0.11.0, as an example of training the LLaMA-3-1B model with Megatron-LM + MagiAttention.
For more details on data preparation, checkpoint setup, integration, and experiments, please refer to the README, and to this PR for the code modifications.
We compare the loss convergence curves of TE Ring Attention and MagiAttention by training the LLaMA-3-1B model from scratch.
| Environment | Version |
|---|---|
| Docker | ngc25.02-py3 |
| MagiAttention | commit 4a10ea3 |
| Megatron | tag core_v0.11.0 |

| Configuration | Value |
|---|---|
| Dataset | OpenWebText |
| Model Size | LLaMA-1B |
| Number of Layers | 16 |
| Hidden Size | 2048 |
| Number of Attention Heads | 32 |
| Group Query Attention | Enabled |
| Number of Query Groups | 8 |
| Sequence Length | 8192 |
| Context Parallel Size | 1 / 2 / 4 / 8 (MagiAttention vs. TE Ring Attention) |
| Global Batch Size | 16 |
| Training Iterations | 100,000 |
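Under these settings, a launch command would look roughly like the following. This is a minimal sketch using standard Megatron-LM pretraining arguments that match the table above; the data path is a placeholder, and the exact mechanism for selecting MagiAttention as the attention backend is an assumption of this sketch, so consult the repository README and PR for the authoritative script.

```shell
# Sketch of a pretraining launch matching the configuration table
# (flag names follow Megatron-LM core_v0.11.0; paths are placeholders).
torchrun --nproc_per_node=8 pretrain_gpt.py \
    --num-layers 16 \
    --hidden-size 2048 \
    --num-attention-heads 32 \
    --group-query-attention \
    --num-query-groups 8 \
    --seq-length 8192 \
    --max-position-embeddings 8192 \
    --context-parallel-size 8 \
    --global-batch-size 16 \
    --train-iters 100000 \
    --data-path <path-to-preprocessed-openwebtext>
    # How MagiAttention is enabled (e.g. an extra backend flag added by
    # the fork's PR) is not shown here; see the repository README.
```

Varying `--context-parallel-size` over 1/2/4/8 while holding the global batch size at 16 reproduces the comparison described above.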
The loss convergence curves of MagiAttention align closely with those of TE Ring Attention across all tested context-parallel sizes.
Feel free to open an issue in the Megatron-LM-MagiAttention repository if you have any questions.
