From f37c30a284afe28a4b6ce4edd6e1da19c2550b00 Mon Sep 17 00:00:00 2001
From: haic0 <149741444+haic0@users.noreply.github.com>
Date: Wed, 10 Dec 2025 16:11:39 +0800
Subject: [PATCH 1/2] Update Kimi-Linear.md for AMD GPU

Signed-off-by: haic0 <149741444+haic0@users.noreply.github.com>

Update moonshotai/Kimi-Linear.md

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: haic0 <149741444+haic0@users.noreply.github.com>

Update moonshotai/Kimi-Linear.md

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: haic0 <149741444+haic0@users.noreply.github.com>

Update vLLM ROCm Docker image and run commands

Signed-off-by: jiacao-amd

add uv launch support

Signed-off-by: jiacao-amd
---
 moonshotai/Kimi-Linear.md | 76 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 76 insertions(+)

diff --git a/moonshotai/Kimi-Linear.md b/moonshotai/Kimi-Linear.md
index 67ddd521..09c09b52 100644
--- a/moonshotai/Kimi-Linear.md
+++ b/moonshotai/Kimi-Linear.md
@@ -41,3 +41,79 @@ curl http://localhost:8000/v1/chat/completions \
   -H "Content-Type: application/json" \
   -d '{"model":"moonshotai/Kimi-Linear-48B-A3B-Instruct","messages":[{"role":"user","content":"Hello!"}]}'
 ```
+
+## AMD GPU Support
+
+Please follow the steps here to install and run Kimi-Linear models on AMD MI300X, MI325X, and MI355X GPUs.
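+
+Before choosing an option, you can verify that the GPUs are visible to the host ROCm stack (this assumes the AMD driver and ROCm utilities are already installed):
+```shell
+rocm-smi
+```
+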
+You can choose either Option A (install with uv) or Option B (Docker).
+
+### Option A: Run on Host with uv
+ > Note: The vLLM wheel for ROCm requires Python 3.12, ROCm 7.0, and glibc >= 2.35. If your environment does not meet these requirements, please use the Docker-based setup as described in the [documentation](https://docs.vllm.ai/en/latest/getting_started/installation/gpu/#pre-built-images).
+ ```bash
+ uv venv
+ source .venv/bin/activate
+ uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/
+ ```
+
+### Option B: Run with Docker
+Pull the latest vLLM Docker image:
+```shell
+docker pull vllm/vllm-openai-rocm:latest
+```
+Launch the ROCm vLLM Docker container:
+```shell
+docker run -d -it \
+  --ipc=host \
+  --entrypoint /bin/bash \
+  --network=host \
+  --privileged \
+  --cap-add=CAP_SYS_ADMIN \
+  --device=/dev/kfd \
+  --device=/dev/dri \
+  --device=/dev/mem \
+  --group-add video \
+  --cap-add=SYS_PTRACE \
+  --security-opt seccomp=unconfined \
+  -v /:/work \
+  -e SHELL=/bin/bash \
+  -p 8000:8000 \
+  --name Kimi-Linear-48B-A3B-Instruct \
+  vllm/vllm-openai-rocm:latest
+```
+### Log in to Hugging Face
+Log in to your Hugging Face account:
+```shell
+huggingface-cli login
+```
+
+### Start the vLLM server
+
+Run vLLM online serving with this sample command:
+```shell
+SAFETENSORS_FAST_GPU=1 \
+VLLM_USE_V1=1 \
+VLLM_USE_TRITON_FLASH_ATTN=0 \
+vllm serve moonshotai/Kimi-Linear-48B-A3B-Instruct \
+  --tensor-parallel-size 8 \
+  --max-model-len 1048576 \
+  --no-enable-prefix-caching \
+  --trust-remote-code
+```
+
+### Run Benchmark
+Open a new terminal and use the following command to benchmark the server from inside the container:
+```shell
+docker exec -it Kimi-Linear-48B-A3B-Instruct vllm bench serve \
+  --model "moonshotai/Kimi-Linear-48B-A3B-Instruct" \
+  --dataset-name random \
+  --random-input-len 8192 \
+  --random-output-len 1024 \
+  --request-rate 10000 \
+  --num-prompts 16 \
+  --ignore-eos \
+  --trust-remote-code
+```

From 9af505599cadc4d7c724493082734ade2a78886a Mon Sep 17 00:00:00 2001
From: jiacao-amd
Date: Wed, 25 Feb 2026 17:27:39 -0800
Subject: [PATCH 2/2] Restructure Kimi-Linear.md to integrate ROCm support with CUDA sections

Merged ROCm installation and running instructions from separate AMD GPU Support section into main content with CUDA/ROCm subheaders for better organization and consistency.

Signed-off-by: jiacao-amd
---
 moonshotai/Kimi-Linear.md | 113 ++++++++++++++++----------------------
 1 file changed, 48 insertions(+), 65 deletions(-)

diff --git a/moonshotai/Kimi-Linear.md b/moonshotai/Kimi-Linear.md
index 09c09b52..3797e2d9 100644
--- a/moonshotai/Kimi-Linear.md
+++ b/moonshotai/Kimi-Linear.md
@@ -4,63 +4,32 @@ This guide describes how to run moonshotai/Kimi-Linear-48B-A3B-Instruct.
 
 ## Installing vLLM
 
+### CUDA
+
 ```bash
 uv venv
 source .venv/bin/activate
 uv pip install -U vllm --extra-index-url https://wheels.vllm.ai/nightly --prerelease=allow
 ```
 
-## Running Kimi-Linear
-
-It's easy to run Kimi-Linear.
-The following snippets assume you have 4 or 8 GPUs on a single node.
+### ROCm
 
-### 4-GPU tensor parallel
-```bash
-vllm serve moonshotai/Kimi-Linear-48B-A3B-Instruct \
-  --port 8000 \
-  --tensor-parallel-size 4 \
-  --max-model-len 1048576 \
-  --trust-remote-code
-```
-
-### 8-GPU tensor parallel
-```bash
-vllm serve moonshotai/Kimi-Linear-48B-A3B-Instruct \
-  --port 8000 \
-  --tensor-parallel-size 8 \
-  --max-model-len 1048576 \
-  --trust-remote-code
-```
-
-> If you see OOM, reduce `--max-model-len` (e.g. 65536) or increase `--gpu-memory-utilization` (≤ 0.95).
+You can choose either Option A (install with uv) or Option B (Docker).
 
-Once the server is up, test it with:
+#### Option A: Run on Host with uv
+> Note: The vLLM wheel for ROCm requires Python 3.12, ROCm 7.0, and glibc >= 2.35. If your environment does not meet these requirements, please use the Docker-based setup as described in the [documentation](https://docs.vllm.ai/en/latest/getting_started/installation/gpu/#pre-built-images).
 ```bash
-curl http://localhost:8000/v1/chat/completions \
-  -H "Content-Type: application/json" \
-  -d '{"model":"moonshotai/Kimi-Linear-48B-A3B-Instruct","messages":[{"role":"user","content":"Hello!"}]}'
+uv venv
+source .venv/bin/activate
+uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/
 ```
 
-## AMD GPU Support
-
-Please follow the steps here to install and run Kimi-Linear models on AMD MI300X, MI325X, and MI355X GPUs.
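-
-Before choosing an option, you can verify that the GPUs are visible to the host ROCm stack (this assumes the AMD driver and ROCm utilities are already installed):
-```shell
-rocm-smi
-```
-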
-You can choose either Option A (install with uv) or Option B (Docker).
-
-### Option A: Run on Host with uv
- > Note: The vLLM wheel for ROCm requires Python 3.12, ROCm 7.0, and glibc >= 2.35. If your environment does not meet these requirements, please use the Docker-based setup as described in the [documentation](https://docs.vllm.ai/en/latest/getting_started/installation/gpu/#pre-built-images).
- ```bash
- uv venv
- source .venv/bin/activate
- uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/
- ```
-
-### Option B: Run with Docker
+#### Option B: Run with Docker
 Pull the latest vLLM Docker image:
 ```shell
 docker pull vllm/vllm-openai-rocm:latest
 ```
 Launch the ROCm vLLM Docker container:
 ```shell
 docker run -d -it \
   --ipc=host \
@@ -80,40 +49,54 @@ docker run -d -it \
   --name Kimi-Linear-48B-A3B-Instruct \
   vllm/vllm-openai-rocm:latest
 ```
-### Log in to Hugging Face
+
 Log in to your Hugging Face account:
 ```shell
 huggingface-cli login
 ```
 
-### Start the vLLM server
+## Running Kimi-Linear
 
-Run vLLM online serving with this sample command:
-```shell
-SAFETENSORS_FAST_GPU=1 \
-VLLM_USE_V1=1 \
-VLLM_USE_TRITON_FLASH_ATTN=0 \
+### CUDA
+
+It's easy to run Kimi-Linear.
+The following snippets assume you have 4 or 8 GPUs on a single node.
+
+#### 4-GPU tensor parallel
+```bash
 vllm serve moonshotai/Kimi-Linear-48B-A3B-Instruct \
+  --port 8000 \
+  --tensor-parallel-size 4 \
+  --max-model-len 1048576 \
+  --trust-remote-code
+```
+
+#### 8-GPU tensor parallel
+```bash
+vllm serve moonshotai/Kimi-Linear-48B-A3B-Instruct \
+  --port 8000 \
   --tensor-parallel-size 8 \
   --max-model-len 1048576 \
-  --no-enable-prefix-caching \
   --trust-remote-code
 ```
 
-### Run Benchmark
-Open a new terminal and use the following command to benchmark the server from inside the container:
-```shell
-docker exec -it Kimi-Linear-48B-A3B-Instruct vllm bench serve \
-  --model "moonshotai/Kimi-Linear-48B-A3B-Instruct" \
-  --dataset-name random \
-  --random-input-len 8192 \
-  --random-output-len 1024 \
-  --request-rate 10000 \
-  --num-prompts 16 \
-  --ignore-eos \
-  --trust-remote-code
+> If you see OOM, reduce `--max-model-len` (e.g. 65536) or increase `--gpu-memory-utilization` (≤ 0.95).
+
+Once the server is up, test it with:
+```bash
+curl http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{"model":"moonshotai/Kimi-Linear-48B-A3B-Instruct","messages":[{"role":"user","content":"Hello!"}]}'
 ```
+
+### ROCm
+
+Run vLLM online serving with this sample command:
+```shell
+SAFETENSORS_FAST_GPU=1 \
+vllm serve moonshotai/Kimi-Linear-48B-A3B-Instruct \
+  --tensor-parallel-size 8 \
+  --max-model-len 1048576 \
+  --no-enable-prefix-caching \
+  --trust-remote-code
+```
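+
+Once the server is up, you can reuse the curl request from the CUDA section above; a minimal liveness check (assuming the default port 8000) is:
+```bash
+curl http://localhost:8000/v1/models
+```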