
# AMD LLM Reproduce

This repository contains scripts, configurations, and experiment results for reproducing LLM (Large Language Model) workflows — Pre-training, Fine-tuning, and Inference — on AMD MI210 GPUs, based on the paper:

"Dissecting the Runtime Performance of the Training, Fine-tuning, and Inference of Large Language Models"


## Repository Structure

```
AMD-LLM-Reproduce/
├── Pretrain/        # Pre-training experiments with DeepSpeed ZeRO strategies
├── Finetune/        # Fine-tuning experiments with LoRA and DeepSpeed
└── Inference/       # Inference benchmarks using vLLM and TGI
```

## Modules

### Pretrain

Reproduces pre-training performance using DeepSpeed.

Four strategies are benchmarked:

| Strategy | Throughput (tokens/s) | Peak Memory (GB) |
|---|---|---|
| RQ (Quantization) | 1675.38 | 9.23 |
| ZeRO-2 + Offload | 123.22 | 15.06 |
| ZeRO-3 | 976.53 | 40.34 |
| ZeRO-3 + Offload | 66.05 | 4.58 |
- `configs/` — DeepSpeed configuration files (ZeRO-2/3 variants)
- `scripts/` — Launch scripts for normal and quantized training
- `run/` — Core Python scripts (`pretrain.py`, `quantize.py`, `download.py`, `utils.py`)
- `logs/` — Experiment log files
- `setup/` — Environment setup script for the AMD platform (`amd_setup.sh`)
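For orientation, a ZeRO-2 + optimizer-offload configuration of the kind kept in `configs/` might look like the sketch below. The field names are standard DeepSpeed options; the specific values are illustrative, not the repo's actual settings.

```json
{
  "train_micro_batch_size_per_gpu": 4,
  "gradient_accumulation_steps": 1,
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    }
  }
}
```

Offloading optimizer state to CPU memory trades throughput for GPU memory headroom, which matches the table above: ZeRO-2 + Offload runs far slower but fits in less device memory than ZeRO-3.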

### Finetune

Fine-tunes LLMs using LoRA (PEFT) and DeepSpeed on AMD GPUs.

| Method | Throughput (tokens/s) | Peak Memory (GB) |
|---|---|---|
| LoRA | 1020.01 | 17.49 |
| LoRA + ZeRO-2 | 1120.19 | 19.22 |
| LoRA + ZeRO-2 + Offloading | 710.715 | 17.44 |
| QLoRA | 738.845 | 8.475 |
- `script/train.py` — Main training script using HuggingFace Transformers + TRL
- `script/ds_*.json` — DeepSpeed configs: naive, ZeRO-2, ZeRO-2 + Offload, ZeRO-3
- `data/result.csv` — Experiment results
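The core idea behind LoRA, which the PEFT library applies inside `train.py`, can be sketched in a few lines of plain Python (illustrative only, not the repo's code): the frozen base weight `W` (shape `d_out × d_in`) is augmented with two small trainable matrices `A` (`d_out × r`) and `B` (`r × d_in`), so only `r·(d_out + d_in)` parameters are trained, and the effective weight is `W + (alpha / r) · A @ B`.

```python
def matmul(X, Y):
    """Plain-Python matrix multiply, to keep the sketch dependency-free."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def lora_weight(W, A, B, alpha, r):
    """Merge a rank-r LoRA adapter (A @ B, scaled by alpha/r) into W."""
    delta = matmul(A, B)  # low-rank update of the same shape as W
    scale = alpha / r
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

# Toy example: 2x2 identity base weight, rank-1 adapter
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0], [2.0]]           # d_out x r
B = [[0.5, 0.5]]             # r x d_in
print(lora_weight(W, A, B, alpha=1.0, r=1))  # → [[1.5, 0.5], [1.0, 2.0]]
```

QLoRA adds 4-bit quantization of the frozen base weights on top of this, which is why it shows roughly half the peak memory of plain LoRA in the table above.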

### Inference

Benchmarks LLM inference throughput on a single AMD MI210 using vLLM and TGI (Text Generation Inference) with Llama-3 8B.

| Framework | Throughput (tokens/s) |
|---|---|
| vLLM | 371 ± 12.4 |
| TGI | 216 ± 8.28 |
- `scripts/` — Server/client launch scripts and benchmark scripts for both vLLM and TGI
- `experiments_data/` — Raw result CSVs (`vLLM_results.csv`, `TGI_results.csv`)
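The "tokens/s ± x" figures above are a mean and spread over repeated benchmark runs. A minimal sketch of how such numbers can be computed from raw per-run data (function name and sample values are hypothetical, not taken from the repo's CSVs):

```python
import statistics

def throughput_stats(token_counts, elapsed_s):
    """Mean and sample stdev of tokens/s across benchmark runs.

    token_counts: generated tokens per run
    elapsed_s:    wall-clock seconds per run
    """
    per_run = [n / t for n, t in zip(token_counts, elapsed_s)]
    return statistics.mean(per_run), statistics.stdev(per_run)

# Three hypothetical runs of a generation benchmark
mean, sd = throughput_stats([3700, 3600, 3800], [10.0, 10.0, 10.0])
print(f"{mean:.0f} tokens/s ± {sd:.1f}")  # → 370 tokens/s ± 10.0
```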

## Server Specification

| Part | Description | Count |
|---|---|---|
| CPU | AMD EPYC 9654 2.4 GHz | 2 |
| GPU | AMD Instinct MI210 64 GB | 3 |
| Memory | Micron DDR5-4800 64 GB | 12 |
| Storage | Samsung PM9A3 4 TB | 2 |
| Ethernet | Intel 82599ES 10 GbE SFP+ | 2 |
| Ethernet | Intel I350 1 GbE | 2 |
| OS | Proxmox Virtual Environment 8.0 | - |
