Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
25 changes: 25 additions & 0 deletions Seed/Seed-OSS-36B.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,7 @@ This guide describes how to run Seed-OSS-36B models with vLLM and native BF16 pr

## Installing vLLM

### CUDA
Seed-OSS support was recently added to vLLM main branch and is not yet available in any official release:

```bash
Expand All @@ -20,8 +21,17 @@ You may need to download the latest version of the transformer for compatibility
uv pip install git+https://github.com/huggingface/transformers.git@56d68c6706ee052b445e1e476056ed92ac5eb383
```

### ROCm
> Note: The vLLM wheel for ROCm requires Python 3.12, ROCm 7.0, and glibc >= 2.35. If your environment does not meet these requirements, please use the Docker-based setup as described in the [documentation](https://docs.vllm.ai/en/latest/getting_started/installation/gpu/#pre-built-images). Tested hardware: MI300X, MI325X, MI355X
```bash
uv venv
source .venv/bin/activate
uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/
```

## Running Seed-OSS-36B with BF16

### CUDA
There are two ways to parallelize the model over multiple GPUs: (1) Tensor-parallel or (2) Data-parallel. Each one has its own advantages, where tensor-parallel is usually more beneficial for low-latency / low-load scenarios and data-parallel works better for cases where there is a lot of data with heavy-loads.

Run tensor-parallel like this:
Expand All @@ -40,6 +50,21 @@ vllm serve ByteDance-Seed/Seed-OSS-36B-Instruct \
* vLLM conservatively use 90% of GPU memory, you can set `--gpu-memory-utilization=0.95` to maximize KVCache.
* Make sure to follow the command-line instructions to ensure the tool-calling functionality is properly enabled.

### ROCm

```shell
export SAFETENSORS_FAST_GPU=1
export VLLM_USE_V1=1
export VLLM_USE_TRITON_FLASH_ATTN=0
export VLLM_ROCM_USE_AITER=1
vllm serve ByteDance-Seed/Seed-OSS-36B-Instruct \
--tensor-parallel-size 8 \
--enable-auto-tool-choice \
--tool-call-parser seed_oss \
--no-enable-prefix-caching \
--trust-remote-code
```

## Thinking Budget Feature

Users can flexibly specify the model's thinking budget. For simpler tasks (such as IFEval), the model's chain of thought (CoT) is shorter, and the score exhibits fluctuations as the thinking budget increases. For more challenging tasks (such as AIME and LiveCodeBench), the model's CoT is longer, and the score improves with an increase in the thinking budget.
Expand Down