We provide a new repository, Megatron-LM-MagiAttention, forked from Megatron-LM v0.11.0, as an example of training the LLaMA-3-1B model with Megatron-LM + MagiAttention.
For more details on data preparation, checkpoint setup, integration, and experiments, please refer to the README, and to this PR for the code modifications.
We compare the loss convergence curves of TE Ring Attention and MagiAttention by training the LLaMA-3-1B model from scratch.
| Environment | Version |
|---|---|
| Docker | ngc25.02-py3 |
| MagiAttention | commit 4a10ea3 |
| Megatron | tag core_v0.11.0 |

| Configuration | Value |
|---|---|
| Dataset | OpenWebText |
| Model Size | LLaMA-1B |
| Number of Layers | 16 |
| Hidden Size | 2048 |
| Number of Attention Heads | 32 |
| Group Query Attention | Enabled |
| Number of Query Groups | 8 |
| Sequence Length | 8192 |
| Context Parallel Size | 1 / 2 / 4 / 8 (MagiAttention vs. TE Ring Attention) |
| Global Batch Size | 16 |
| Training Iterations | 100,000 |
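Under these settings, a launch command would look roughly like the following. This is a minimal sketch using standard Megatron-LM pretraining arguments that match the table above; the data path is a placeholder, and the exact mechanism for selecting MagiAttention as the attention backend is an assumption of this sketch, so consult the repository README and PR for the authoritative script.

```shell
# Sketch of a pretraining launch matching the configuration table
# (flag names follow Megatron-LM core_v0.11.0; paths are placeholders).
torchrun --nproc_per_node=8 pretrain_gpt.py \
    --num-layers 16 \
    --hidden-size 2048 \
    --num-attention-heads 32 \
    --group-query-attention \
    --num-query-groups 8 \
    --seq-length 8192 \
    --max-position-embeddings 8192 \
    --context-parallel-size 8 \
    --global-batch-size 16 \
    --train-iters 100000 \
    --data-path <path-to-preprocessed-openwebtext>
    # How MagiAttention is enabled (e.g. an extra backend flag added by
    # the fork's PR) is not shown here; see the repository README.
```

Varying `--context-parallel-size` over 1/2/4/8 while holding the global batch size at 16 reproduces the comparison described above.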
The loss convergence curves of MagiAttention align closely with those of TE Ring Attention across all tested context-parallel sizes.
Feel free to open an issue in the Megatron-LM-MagiAttention repository if you have any questions.
