
LLM Inference Benchmarks on AMD ROCm and vLLM


This repo provides standalone latency benchmarking pipelines and data visualization scripts for Large Language Models (LLMs) running on AMD Instinct GPUs using the vLLM inference engine and AMD's Model Automation and Dashboarding (MAD) framework.

Benchmark Pipelines

This repo currently consolidates two ROCm + vLLM latency benchmarking pipelines that I used in 2025 and 2026 while optimizing production LLM inference. Each pipeline is fully reproducible and includes raw MAD outputs, processed CSV data, and visualization scripts.
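Each pipeline turns raw per-iteration latency measurements into summarized CSV data before plotting. As a minimal sketch of that processing step (the column names and values below are placeholders, not the repo's actual MAD output format):

```python
import csv
import statistics
from io import StringIO

# Hypothetical raw per-iteration latencies; the real pipelines in this
# repo parse actual MAD result files instead of an inline string.
RAW = """model,input_len,output_len,latency_s
aya-expanse-32b,128,128,9.81
aya-expanse-32b,128,128,9.92
aya-expanse-32b,2048,128,12.34
aya-expanse-32b,2048,128,12.51
"""

def summarize(raw_csv: str) -> list[dict]:
    """Group per-iteration latencies by (input_len, output_len) and average them."""
    rows = list(csv.DictReader(StringIO(raw_csv)))
    groups: dict[tuple[str, str], list[float]] = {}
    for r in rows:
        groups.setdefault((r["input_len"], r["output_len"]), []).append(
            float(r["latency_s"])
        )
    return [
        {"input_len": i, "output_len": o,
         "mean_latency_s": round(statistics.mean(v), 3)}
        for (i, o), v in sorted(
            groups.items(), key=lambda kv: (int(kv[0][0]), int(kv[0][1]))
        )
    ]

summary = summarize(RAW)
```

The summarized rows are what the visualization scripts consume to draw one bar per input/output length configuration.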

2025 Pipeline: Aya Expanse 32B

Model: CohereForAI/aya-expanse-32b
Stack: ROCm 6.3 + vLLM 0.8.5
GPU: AMD Instinct MI210

[Figure: aya-expanse-32b inference latency bar chart]

๐Ÿ“ Explore: benchmarks/aya-expanse-32b/

2026 Pipeline: GPT OSS Safeguard 20B

Model: openai/gpt-oss-safeguard-20b
Stack: ROCm 7.0 + vLLM 0.14
GPU: AMD Instinct MI210

[Figure: gpt-oss-safeguard-20b inference latency bar chart]

๐Ÿ“ Explore: benchmarks/gpt-oss-safeguard-20b/

Final Note

I originally developed these pipelines while working as an MLE at the Wikimedia Foundation, where optimizing LLM inference in production required a clear understanding of how input (prompt) and output (generation) token lengths impact latency.
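To a first approximation, the relationship between token lengths and latency splits into a prefill cost that grows with input length and a per-token decode cost that grows with output length. A toy model of that intuition (the coefficients below are assumptions for illustration, not numbers measured in these benchmarks; real values depend on the model, GPU, and serving configuration):

```python
# First-order latency model: end-to-end latency ~= prefill time
# (scales with input tokens) + output tokens * per-token decode time.
PREFILL_S_PER_1K_INPUT_TOKENS = 0.25  # assumed coefficient, varies by setup
DECODE_S_PER_OUTPUT_TOKEN = 0.03      # assumed coefficient, varies by setup

def estimated_latency_s(input_tokens: int, output_tokens: int) -> float:
    """Rough end-to-end latency estimate under the toy model above."""
    prefill = PREFILL_S_PER_1K_INPUT_TOKENS * (input_tokens / 1000)
    decode = DECODE_S_PER_OUTPUT_TOKEN * output_tokens
    return prefill + decode
```

Under this model, doubling the output length roughly doubles the decode term, while doubling the input length only grows the (usually smaller) prefill term, which matches the general shape of the benchmark charts above.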

I'm sharing this repo in the hope that it saves others time when evaluating ROCm + vLLM performance or building their own benchmarking workflows.

If you find it useful, feel free to adapt the pipelines to your environment. Happy benchmarking ๐Ÿš€
