This repo provides standalone latency benchmarking pipelines and data visualization scripts for Large Language Models (LLMs) running on AMD Instinct GPUs using the vLLM inference engine and AMD's Model Automation and Dashboarding (MAD) framework.
So far, this repo consolidates two ROCm + vLLM latency benchmarking pipelines that I used in 2025 and 2026 while optimizing production LLM inference. Each pipeline is fully reproducible and includes raw MAD outputs, processed CSV data, and visualization scripts.
- **Model:** `CohereForAI/aya-expanse-32b`
- **Stack:** ROCm 6.3 + vLLM 0.8.5
- **GPU:** AMD Instinct MI210
- **Explore:** `benchmarks/aya-expanse-32b/`
- **Model:** `openai/gpt-oss-safeguard-20b`
- **Stack:** ROCm 7.0 + vLLM 0.14
- **GPU:** AMD Instinct MI210
- **Explore:** `benchmarks/gpt-oss-safeguard-20b/`
I originally developed these pipelines while working as an MLE at the Wikimedia Foundation, where optimizing LLM inference in production required a clear understanding of how input (prompt) and output (generation) token lengths impact latency.
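As a rough illustration of the kind of post-processing these pipelines perform, the sketch below aggregates mean latency per (input, output) token-length pair from a benchmark CSV. The column names (`input_tokens`, `output_tokens`, `latency_ms`) and the helper name are assumptions for illustration, not the actual MAD output schema.

```python
# Sketch: group benchmark rows by (input, output) token lengths and
# average the measured latency. Column names are assumed, not MAD's.
import csv
import io
from collections import defaultdict
from statistics import mean

def summarize_latency(csv_text: str) -> dict:
    """Return {(input_tokens, output_tokens): mean latency_ms}."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = (int(row["input_tokens"]), int(row["output_tokens"]))
        groups[key].append(float(row["latency_ms"]))
    return {k: mean(v) for k, v in groups.items()}

# Tiny synthetic sample, shaped like a processed latency CSV.
sample = """input_tokens,output_tokens,latency_ms
128,128,410.0
128,128,430.0
2048,128,980.0
"""
summary = summarize_latency(sample)
print(summary)
```

A table like this is what the visualization scripts then plot, e.g. latency as a function of prompt length at a fixed generation length.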
I'm sharing this repo in the hope that it saves others time when evaluating ROCm + vLLM performance or building their own benchmarking workflows.
If you find it useful, feel free to adapt the pipelines to your environment. Happy benchmarking!

