kevinbazira/vllm-rocm-debian-images

vLLM ROCm Debian Docker Images

License: MIT

The ecosystem currently provides Ubuntu-based Docker images for AMD GPUs from both AMD (rocm/vllm:v0.14.0_amd_dev) and upstream vLLM (vllm/vllm-openai-rocm:v0.15.1-rocm). However, for enterprise MLOps environments that prioritize extreme stability, predictability, and minimal OS footprints, Debian is often the preferred choice.

While AMD-maintained images offer deep hardware-specific tuning and patches, the official vLLM images typically track the latest upstream features and bug fixes. By building custom images using the workflow in this repository, you achieve the best of both worlds: pairing the exact ROCm versions and AMD-specific optimizations you need with the newest upstream vLLM commits.

This repo provides optimized, production-ready Dockerfiles to build Debian-based (Bookworm) images for running vLLM on AMD Instinct GPUs.

Key Features

These Dockerfiles are generalized from production configurations used in enterprise ML environments, featuring:

  • Bypassing Docker Registry Limits: Compiling massive AI libraries (like PyTorch with ROCm) creates Docker layers that often exceed the compressed layer limits of enterprise registries (like Harbor or AWS ECR). These images utilize a torch-libs-chunker multi-stage build technique to physically split hipblaslt and rocblas into smaller, manageable layers during compilation. These large ROCm packages are a known unresolved issue upstream: ROCm/ROCm#4224
  • Hardware-Specific Tuning: Includes manual compilation of AMD inference optimizations (like MoRi, FlashAttention, and aiter) natively from source targeting both MI210 (gfx90a) and MI300X (gfx942) AMD GPUs.
  • Minimal Runtime: Uses multi-stage builds to strip out unnecessary build dependencies and static libraries (.a files), resulting in a lean final runtime image.
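
The layer-splitting idea can be sketched outside Docker as a greedy grouping of library files into chunks that each fit within a size budget. This is a minimal illustration only, not the repository's torch-libs-chunker stage: the function name `plan_chunks`, the file names, and the 2 GB budget are all hypothetical.

```python
from typing import Dict, List


def plan_chunks(file_sizes: Dict[str, int], max_chunk_bytes: int) -> List[List[str]]:
    """Greedily group files into chunks whose total size fits a budget.

    Files are placed largest-first into the first chunk with room
    (first-fit decreasing). A file larger than the budget ends up in a
    chunk of its own, because copying alone cannot split a single file.
    """
    chunks: List[List[str]] = []
    totals: List[int] = []
    # Sort by size descending, then by name so the result is deterministic.
    for name, size in sorted(file_sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        for i, total in enumerate(totals):
            if total + size <= max_chunk_bytes:
                chunks[i].append(name)
                totals[i] += size
                break
        else:
            # No existing chunk has room: open a new one.
            chunks.append([name])
            totals.append(size)
    return chunks


if __name__ == "__main__":
    MB = 1000**2
    # Hypothetical file names and sizes, for illustration only.
    sizes = {
        "librocblas.so": 1500 * MB,
        "libhipblaslt.so": 1200 * MB,
        "kernels_gfx90a.dat": 700 * MB,
        "kernels_gfx942.dat": 400 * MB,
    }
    for index, chunk in enumerate(plan_chunks(sizes, max_chunk_bytes=2000 * MB)):
        print(f"chunk {index}: {chunk}")
```

Each resulting chunk would then map to its own COPY instruction in the final stage, so that no single layer carries every large library at once.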

Supported Versions

| Directory | vLLM Version | ROCm Version | PyTorch Version | Base OS | Ported Dockerfiles |
| --- | --- | --- | --- | --- | --- |
| vllm0.14-rocm7.0.0/ | 0.14.0 | 7.0.0 | 2.10.0+rocm7.0 | Bookworm | Dockerfile.rocm_base, Dockerfile.rocm |
| vllm0.8.5-rocm6.3.0/ | 0.8.5 | 6.3.1 | 2.8.0+rocm6.3 | Bookworm | Dockerfile.rocm_base, Dockerfile.rocm |

Usage Guide

1. Build the image

To build the vllm-rocm-debian image locally, point the Docker build context at the directory of the version you want to use. Building from source is slow because it compiles native ROCm kernels; the example build below took roughly 228 minutes of wall-clock time.

# Example: Building the 2026 stack (vLLM 0.14 / ROCm 7.0)
$ time docker build --network=host \
  -t vllm-rocm-debian:vllm0.14-rocm7.0.0 \
  ./vllm0.14-rocm7.0.0

...
Removing intermediate container ac3898f0ad3a
 ---> c7340bc54fb5
Successfully built c7340bc54fb5
Successfully tagged vllm-rocm-debian:vllm0.14-rocm7.0.0

real    227m53.934s
user    0m2.780s
sys     0m2.875s

(If you are behind a corporate firewall, you can pass --build-arg http_proxy="http://your-proxy:8080" to the build command).

2. Check uncompressed layer sizes

You can verify the total image size and see how the multi-stage chunking spread the large libraries across separate layers (docker history reports uncompressed sizes):

$ docker images
REPOSITORY               TAG                         IMAGE ID       CREATED         SIZE
vllm-rocm-debian         vllm0.14-rocm7.0.0          c7340bc54fb5   13 hours ago    24.4GB

$ docker history vllm-rocm-debian:vllm0.14-rocm7.0.0
IMAGE          CREATED        CREATED BY                                      SIZE
c7340bc54fb5   13 hours ago   /bin/sh -c #(nop)  ENV RAY_EXPERIMENTAL_NOSE…   0B
0363f7387ac7   13 hours ago   |3 APT_PREF=Package: *\nPin: release o=repo.…   7GB
9e59135af039   13 hours ago   /bin/sh -c #(nop) COPY dir:3b9c1b68912a743d1…   71.1MB
ce5526ab55c2   13 hours ago   /bin/sh -c #(nop) COPY dir:241d1db286a481629…   479MB
d94c2dad5f54   13 hours ago   /bin/sh -c #(nop) COPY dir:291fdb658481ca532…   96.9MB
c632bc3ab4cd   13 hours ago   /bin/sh -c #(nop) COPY dir:d2fbb17f52099596c…   1.24MB
8eac695f6f64   13 hours ago   /bin/sh -c #(nop) COPY dir:438b0fe56fe1a2baa…   677MB
e279115289dd   13 hours ago   /bin/sh -c #(nop) COPY dir:4d23a33af539e2c52…   3.97GB
6be8f5dac5bc   13 hours ago   /bin/sh -c #(nop)  ARG TORCH_LIB_PATH=/srv/v…   0B
39795c13cce8   13 hours ago   /bin/sh -c #(nop) COPY dir:f286e95568f2b137e…   8.54GB
45bfc1a9ca03   13 hours ago   |2 APT_PREF=Package: *\nPin: release o=repo.…   0B
a8c05920e8b2   13 hours ago   /bin/sh -c #(nop)  ENV ROCM_PATH=/opt/rocm-7…   0B
69337e8ea420   13 hours ago   |2 APT_PREF=Package: *\nPin: release o=repo.…   3.43GB
e9d1df9bfc0c   16 hours ago   /bin/sh -c #(nop) WORKDIR /srv                  0B
83b067033467   16 hours ago   |2 APT_PREF=Package: *\nPin: release o=repo.…   1.25kB
a8a7bfe66c2c   16 hours ago   /bin/sh -c #(nop)  ARG APT_PREF=Package: *\n…   0B
40fd403f7a8c   16 hours ago   /bin/sh -c #(nop)  ARG ROCM_VERSION=7.0         0B
630a45a35d11   7 weeks ago    # debian.sh --arch 'amd64' out/ 'bookworm' '…   117MB     debuerreotype 0.17
...
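
To check layer sizes against a registry limit automatically, the size column can be parsed and compared with a threshold. The sketch below assumes input produced by `docker history --format "{{.ID}} {{.Size}}" <image>`. The helper names and the 5 GB limit are hypothetical. Because these sizes are uncompressed, they are typically larger than what the registry actually stores, so treat the result as a conservative check.

```python
import re
from typing import List, Tuple

# Docker prints human-readable sizes with decimal (base-1000) units.
_UNITS = {"B": 1, "kB": 1000, "MB": 1000**2, "GB": 1000**3, "TB": 1000**4}
_SIZE_RE = re.compile(r"^(\d+(?:\.\d+)?)(B|kB|MB|GB|TB)$")


def parse_size(text: str) -> int:
    """Convert a size string such as '8.54GB' or '1.25kB' to bytes."""
    match = _SIZE_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    return round(float(match.group(1)) * _UNITS[match.group(2)])


def oversized_layers(history_text: str, limit_bytes: int) -> List[Tuple[str, int]]:
    """Return (layer_id, size_in_bytes) for each layer above the limit.

    Expects one '<id> <size>' pair per line; blank lines are skipped.
    """
    flagged: List[Tuple[str, int]] = []
    for line in history_text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue
        layer_id, size_text = parts
        size = parse_size(size_text)
        if size > limit_bytes:
            flagged.append((layer_id, size))
    return flagged


if __name__ == "__main__":
    # A few ID/size pairs taken from the docker history output above.
    sample = """\
0363f7387ac7 7GB
8eac695f6f64 677MB
e279115289dd 3.97GB
39795c13cce8 8.54GB
"""
    # Hypothetical limit; substitute your registry's actual value.
    for layer_id, size in oversized_layers(sample, limit_bytes=5 * 1000**3):
        print(f"{layer_id}: {size / 1000**3:.2f} GB exceeds the limit")
```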

3. Run inference

You can spin up the new Debian image and test it against a lightweight model like facebook/opt-125m to ensure the ROCm drivers and vLLM engine are initialized correctly:

$ docker run --rm --network=host -it \
  --device=/dev/kfd --device=/dev/dri \
  --group-add=$(getent group video | cut -d: -f3) \
  --group-add=$(getent group render | cut -d: -f3) \
  --ipc=host \
  --security-opt seccomp=unconfined \
  -v $HOME/.cache/huggingface:/root/.cache/huggingface \
  vllm-rocm-debian:vllm0.14-rocm7.0.0 /srv/venv/bin/python -c "
from vllm import LLM, SamplingParams; \
llm = LLM('facebook/opt-125m'); \
print(llm.generate('Hello, world!', SamplingParams(max_tokens=5))[0].outputs[0].text)"

Expected Output:

INFO 03-02 06:33:41 [model.py:541] Resolved architecture: OPTForCausalLM
INFO 03-02 06:33:41 [model.py:1561] Using max model len 2048
INFO 03-02 06:33:41 [scheduler.py:226] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 03-02 06:33:41 [vllm.py:624] Asynchronous scheduling is enabled.
... [Engine Initialization Logs Omitted] ...
Processed prompts: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  3.10it/s, est. speed input: 15.52 toks/s, output: 15.52 toks/s]
 That is my dad.

4. Bonus Tip: Optimize inference

Congratulations on building and running the model server. That is only the first step; maximizing throughput requires further tuning. As a bonus, below is a generalized, production-tested example configuration that achieved large speedups on an MI300X (gfx942) GPU for prefill-heavy workloads, such as generating embeddings with a Qwen3-Embedding model for semantic search, document similarity, and large-scale vector indexing.

Environment Variables (Container & Build level):

  • VLLM_ROCM_USE_AITER=1: Enables the AI Tensor Engine for ROCm (AITER), which provides massive speedups via MI300X-specific matrix core optimizations.
  • VLLM_USE_TRITON_FLASH_ATTN=0: Disables Triton-based attention to ensure the engine fully utilizes the AITER backend.
  • MAX_JOBS=1: Restricts the ninja build system to a single compilation thread. JIT-compiling AITER kernels is highly memory-intensive; restricting concurrency prevents Kubernetes pods (e.g., those with 16Gi RAM limits) from being OOM-killed during the initial startup build.

vLLM Engine Arguments (vllm serve ...):

  • --max-model-len=32768: Matches the model's sequence length limit, enabling the indexing of entire articles/documents without losing semantic information near the end of the text.
  • --max-num-batched-tokens=32768: Matches the model length to ensure the engine can process at least one full-length article/document in a single pass, or efficiently pack hundreds of shorter search queries together. Leaving this at the lower default can bottleneck throughput.
  • --enable-prefix-caching=False: Disabled because embedding/search workloads typically process highly unique documents with little to no shared prompt overlap. Disabling it avoids unnecessary memory tracking overhead and frees up KV cache space for larger batches.
  • --trust-remote-code=False: A strict security best practice for production environments. Models should be loaded from secure internal object storage (e.g., Ceph/Swift) rather than downloading and executing arbitrary code from Hugging Face repositories at runtime.
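
The settings above can be gathered into one launch configuration. The sketch below only assembles the environment and command line and runs nothing. The variables and flags are copied verbatim from the two lists above, while the helper name `build_serve_command` and the model path `/models/qwen3-embedding` are hypothetical placeholders. Any extra arguments your vLLM version needs for embedding models are deliberately left out.

```python
import shlex
from typing import Dict, List

# Environment variables from the list above.
AITER_ENV: Dict[str, str] = {
    "VLLM_ROCM_USE_AITER": "1",
    "VLLM_USE_TRITON_FLASH_ATTN": "0",
    "MAX_JOBS": "1",
}


def build_serve_command(model: str, max_len: int = 32768) -> List[str]:
    """Return a `vllm serve` argv using the engine arguments listed above."""
    return [
        "vllm",
        "serve",
        model,
        f"--max-model-len={max_len}",
        # Matches the model length so one full document fits in a single pass.
        f"--max-num-batched-tokens={max_len}",
        "--enable-prefix-caching=False",
        "--trust-remote-code=False",
    ]


if __name__ == "__main__":
    # Hypothetical local model path, e.g. synced from internal object storage.
    command = build_serve_command("/models/qwen3-embedding")
    for key, value in AITER_ENV.items():
        print(f"export {key}={shlex.quote(value)}")
    print(shlex.join(command))
```

Printing the command instead of executing it makes the configuration easy to review, or to paste into a container entrypoint or a Kubernetes manifest.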

Final Note

I originally developed these Dockerfiles while working as a Machine Learning Engineer at the Wikimedia Foundation. Moving LLM inference workloads into production required resolving strict constraints around container OS standards (Debian) and registry compressed-layer limits, challenges that the default upstream Ubuntu images didn't address natively.

I'm sharing this repo in the hope that it saves time for other MLOps engineers who face similar enterprise constraints when scaling AMD ROCm and vLLM infrastructure.

If these generalized Dockerfiles help you optimize your deployment pipelines, feel free to adapt and build upon them. Happy LLM inference deploying! 🚀
