This repository contains scripts, configurations, and experiment results for reproducing LLM (Large Language Model) workflows — Pre-training, Fine-tuning, and Inference — on AMD MI210 GPUs, based on the paper:
AMD-LLM-Reproduce/
├── Pretrain/ # Pre-training experiments with DeepSpeed ZeRO strategies
├── Finetune/ # Fine-tuning experiments with LoRA and DeepSpeed
└── Inference/ # Inference benchmarks using vLLM and TGI
Reproduces pre-training performance using DeepSpeed.
Four strategies are benchmarked:
| Strategy | Throughput | Peak Memory |
|---|---|---|
| RQ (Quantization) | 1675.38 tokens/s | 9.23 GB |
| ZeRO-2 + Offload | 123.22 tokens/s | 15.06 GB |
| ZeRO-3 | 976.53 tokens/s | 40.34 GB |
| ZeRO-3 + Offload | 66.05 tokens/s | 4.58 GB |
configs/— DeepSpeed configuration files (ZeRO-2/3 variants)scripts/— Launch scripts for normal and quantized trainingrun/— Core Python scripts (pretrain.py,quantize.py,download.py,utils.py)logs/— Experiment log filessetup/— Environment setup script for AMD platform (amd_setup.sh)
Fine-tunes LLMs using LoRA (PEFT) and DeepSpeed on AMD GPUs.
| Method | Throughputs (token/s) | Peak Memory (GB) |
|---|---|---|
| LoRA | 1020.01 | 17.49 |
| LoRA + ZeRO-2 | 1120.19 | 19.22 |
| LoRA + ZeRO-2 + Offloading | 710.715 | 17.44 |
| QLoRA | 738.845 | 8.475 |
script/train.py— Main training script using HuggingFace Transformers + TRLscript/ds_*.json— DeepSpeed configs: naive, ZeRO-2, ZeRO-2 + Offload, ZeRO-3data/result.csv— Experiment results
Benchmarks LLM inference throughput on a single AMD MI210 using vLLM and TGI (Text Generation Inference) with Llama-3 8B.
| Framework | Throughput |
|---|---|
| vLLM | 371 tokens/s ± 12.4 |
| TGI | 216 tokens/s ± 8.28 |
scripts/— Server/client launch scripts and benchmark scripts for both vLLM and TGIexperiments_data/— Raw result CSVs (vLLM_results.csv,TGI_results.csv)
| Part | Description | Numbers |
|---|---|---|
| CPU | AMD EPYC 9654 2.4GHz | 2 |
| GPU | AMD Instinct MI210 64GB | 3 |
| Memory | Micron DDR5 4800 64GB | 12 |
| Storage | Samsung PM9A3 4TB | 2 |
| Ethernet | Intel 82599ES 10GbE SFP+ Intel I350 1GbE |
2 2 |
| OS | Proxmox Virtual Environment 8.0 | - |