diff --git a/Seed/Seed-OSS-36B.md b/Seed/Seed-OSS-36B.md
index 5a165d7a..3d46331d 100644
--- a/Seed/Seed-OSS-36B.md
+++ b/Seed/Seed-OSS-36B.md
@@ -4,6 +4,7 @@ This guide describes how to run Seed-OSS-36B models with vLLM and native BF16 pr
 
 ## Installing vLLM
 
+### CUDA
 Seed-OSS support was recently added to vLLM main branch and is not yet available in any official release:
 
 ```bash
@@ -20,8 +21,17 @@ You may need to download the latest version of the transformer for compatibility
 uv pip install git+https://github.com/huggingface/transformers.git@56d68c6706ee052b445e1e476056ed92ac5eb383
 ```
 
+### ROCm
+> Note: The vLLM wheel for ROCm requires Python 3.12, ROCm 7.0, and glibc >= 2.35. If your environment does not meet these requirements, please use the Docker-based setup as described in the [documentation](https://docs.vllm.ai/en/latest/getting_started/installation/gpu/#pre-built-images). Tested hardware: MI300X, MI325X, MI355X
+```bash
+uv venv
+source .venv/bin/activate
+uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/
+```
+
 ## Running Seed-OSS-36B with BF16
 
+### CUDA
 There are two ways to parallelize the model over multiple GPUs: (1) Tensor-parallel or (2) Data-parallel. Each one has its own advantages, where tensor-parallel is usually more beneficial for low-latency / low-load scenarios and data-parallel works better for cases where there is a lot of data with heavy-loads.
 
 Run tensor-parallel like this:
@@ -40,6 +50,21 @@ vllm serve ByteDance-Seed/Seed-OSS-36B-Instruct \
 * vLLM conservatively use 90% of GPU memory, you can set `--gpu-memory-utilization=0.95` to maximize KVCache.
 * Make sure to follow the command-line instructions to ensure the tool-calling functionality is properly enabled.
 
+### ROCm
+
+```shell
+export SAFETENSORS_FAST_GPU=1
+export VLLM_USE_V1=1
+export VLLM_USE_TRITON_FLASH_ATTN=0
+export VLLM_ROCM_USE_AITER=1
+vllm serve ByteDance-Seed/Seed-OSS-36B-Instruct \
+  --tensor-parallel-size 8 \
+  --enable-auto-tool-choice \
+  --tool-call-parser seed_oss \
+  --no-enable-prefix-caching \
+  --trust-remote-code
+```
+
 ## Thinking Budget Feature
 
 Users can flexibly specify the model's thinking budget. For simpler tasks (such as IFEval), the model's chain of thought (CoT) is shorter, and the score exhibits fluctuations as the thinking budget increases. For more challenging tasks (such as AIME and LiveCodeBench), the model's CoT is longer, and the score improves with an increase in the thinking budget.