
vLLM Host — NVIDIA DGX Spark / ASUS Ascent GX10 (NVIDIA GB10)

Self-hosted vLLM OpenAI-compatible inference server for the NVIDIA DGX Spark and ASUS Ascent GX10, both built on the NVIDIA GB10 Grace Blackwell Superchip (ARM64, 128 GB unified memory, SM 12.1).


Hardware

| Spec | Value |
|---|---|
| Machine | ASUS Ascent GX10 (same platform as NVIDIA DGX Spark) |
| Chip | NVIDIA GB10 Grace Blackwell Superchip |
| GPU architecture | Blackwell SM 12.1 — requires CUDA 13.x (CUDA 12.x incompatible) |
| Memory | 128 GB LPDDR5X unified — CPU and GPU share the same pool |
| Memory bandwidth | ~273 GB/s (LPDDR5X, vs 3.3 TB/s HBM3 on H100) |
| CPU | 20-core ARM (10× Cortex-X925 + 10× Cortex-A725) |
| OS | NVIDIA DGX OS (Ubuntu-based) |

Current model

| Spec | Value |
|---|---|
| Base model | Qwen/Qwen3-Coder-Next |
| Architecture | MoE — 512 experts total, 10+1 active per token |
| Total parameters | 80B |
| Active parameters per token | ~3B |
| Context window | 256K tokens |
| Quantization | AWQ 4-bit (compressed-tensors / Marlin MoE backend) |
| Disk size | ~40 GB |
| License | Apache 2.0 |
| vLLM requirement | ≥ 0.15.0 |
| Optimized for | Agentic coding, tool calling, Claude Code / Cline / Qwen Code |

Benchmark results

All benchmarks: ASUS Ascent GX10, vllm/vllm-openai:v0.18.0-cu130, FLASH_ATTN backend, 512 max output tokens.

Single-request throughput

| Model | Architecture | Quant | tok/s | TTFT p50 |
|---|---|---|---|---|
| Qwen2.5-Coder-7B-Instruct | Dense, 7B | BF16 | 13 | 81 ms |
| Qwen2.5-Coder-7B-Instruct-AWQ | Dense, 7B | AWQ 4-bit | 46 | 34 ms |
| Qwen3-Coder-Next-AWQ-4bit | MoE, 80B / 3B active | AWQ 4-bit | 33.7 | 110 ms |

The 7B AWQ has higher raw throughput because LPDDR5X bandwidth is the bottleneck — smaller model = faster decode. The 80B MoE model is slower per token but delivers dramatically higher quality; it activates only 3B params per token so it's competitive with models many times smaller.
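That bandwidth argument can be put into a back-of-envelope calculation. The sketch below assumes decode is purely memory-bandwidth-bound (every active weight byte is streamed once per generated token) and uses the ~273 GB/s figure from the hardware table; real throughput lands below these ceilings because of kernel overhead, KV-cache reads, and MoE routing.

```python
# Back-of-envelope: memory-bandwidth-bound decode ceiling.
# Assumption: each generated token reads every active parameter from memory once.
BANDWIDTH_GBS = 273  # GB10 LPDDR5X, approximate

def decode_tok_s(active_params_b: float, bytes_per_param: float) -> float:
    """Upper bound on decode tokens/sec for a bandwidth-bound model."""
    gb_per_token = active_params_b * bytes_per_param  # GB streamed per token
    return BANDWIDTH_GBS / gb_per_token

print(decode_tok_s(7, 2))    # 7B BF16            -> 19.5 tok/s ceiling
print(decode_tok_s(7, 0.5))  # 7B AWQ 4-bit       -> 78.0 tok/s ceiling
print(decode_tok_s(3, 0.5))  # ~3B active, AWQ 4-bit -> 182.0 tok/s ceiling
```

The measured 46 tok/s (7B AWQ) and 33.7 tok/s (80B MoE) sit well under these ceilings, consistent with decode being bandwidth-limited plus per-token overheads.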

Concurrent throughput — Qwen2.5-Coder-7B-Instruct-AWQ

| Concurrent users | Aggregate tok/s | P50 latency |
|---|---|---|
| 1 | 46 | 10.7 s |
| 4 | 192.8 | 10.1 s |
| 8 | 376.2 | 10.4 s |
| 16 | 700.4 | 11.2 s |
| 32 | 1,179.7 | 13.3 s |

Concurrent throughput — Qwen3-Coder-Next-AWQ-4bit ← current

| Concurrent users | Aggregate tok/s | P50 latency |
|---|---|---|
| 1 | 33.7 | 15.2 s |
| 4 | 126.3 | 16.2 s |
| 8 | 210.0 | 19.5 s |
| 16 | 370.9 | 22.1 s |
| 32 | 542.0 | 30.2 s |

vLLM batches all concurrent requests together. --max-num-seqs is set to 128 in docker-compose.yml.
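A minimal sketch of how these aggregate numbers arise: fire several requests at once with plain threads and the openai SDK, and divide total completion tokens by wall time. It assumes this repo's defaults (server on localhost:8000, model served as coding-model); adjust both if you changed docker-compose.yml.

```python
# Drive N concurrent requests and report aggregate decode throughput.
import time
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-required")

def one_request(i: int) -> int:
    resp = client.chat.completions.create(
        model="coding-model",
        messages=[{"role": "user", "content": "Write bubble sort in Python."}],
        max_tokens=512,
    )
    return resp.usage.completion_tokens  # tokens generated for this request

concurrency = 8
start = time.time()
with ThreadPoolExecutor(max_workers=concurrency) as pool:
    total_tokens = sum(pool.map(one_request, range(concurrency)))
elapsed = time.time() - start
print(f"aggregate: {total_tokens / elapsed:.1f} tok/s over {concurrency} users")
```

Because vLLM batches the in-flight sequences together, aggregate tok/s grows with concurrency (up to --max-num-seqs) while per-request latency rises only modestly, matching the tables above.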


Quick start

# 1. Clone
git clone https://github.com/shamily/vllmhost.git
cd vllmhost

# 2. Add HuggingFace token (only needed for gated models — Qwen3-Coder-Next is open)
cp .env.example .env
# Edit .env if you need HF_TOKEN

# 3. Start (first run downloads ~40 GB of model weights)
chmod +x start.sh stop.sh
./start.sh

# 4. Test
cd test && pip install -r requirements.txt
pytest test_vllm.py -v   # 13/13 tests pass

# 5. Benchmark
cd benchmark && pip install -r requirements.txt
python3 benchmark.py --concurrency 8

The server exposes an OpenAI-compatible API on port 8000:

  • Local: http://localhost:8000/v1
  • LAN: http://<your-ip>:8000/v1

Calling from other machines

The container uses network_mode: host — vLLM binds directly to your machine's real IP. No port forwarding needed.

from openai import OpenAI

client = OpenAI(
    base_url="http://<ascent-gx10-ip>:8000/v1",
    api_key="not-required",
)

response = client.chat.completions.create(
    model="coding-model",
    messages=[{"role": "user", "content": "Write bubble sort in Python."}],
    max_tokens=2048,
)
print(response.choices[0].message.content)

Compatible with any OpenAI-compatible client: openai Python SDK, curl, Continue.dev, Open WebUI, Cline, Claude Code, etc.


Configuration

All non-secret configuration is in docker-compose.yml — edit it directly. Only HF_TOKEN goes in .env (gitignored).

Key flags in the command: section:

| Flag | Current value | Notes |
|---|---|---|
| model | cyankiwi/Qwen3-Coder-Next-AWQ-4bit | HuggingFace model ID |
| --served-model-name | coding-model | Name used in API calls |
| --gpu-memory-utilization | 0.75 | Keep ≤ 0.80 — unified memory OOM freezes the whole system |
| --max-num-seqs | 128 | Max concurrent sequences in the scheduler |
| --max-model-len | 32768 | Max context length (prompt + output). Model supports 256K — raise if needed |
| --port | 8000 | API port |
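Put together, the relevant part of docker-compose.yml looks roughly like this (an illustrative excerpt, not the authoritative file; note the command passes only the model ID and flags, since the image's entrypoint already supplies vllm serve):

```yaml
services:
  vllm:
    image: vllm/vllm-openai:v0.18.0-cu130
    network_mode: host
    command: >
      cyankiwi/Qwen3-Coder-Next-AWQ-4bit
      --served-model-name coding-model
      --gpu-memory-utilization 0.75
      --max-num-seqs 128
      --max-model-len 32768
      --port 8000
```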

Changing the model

Edit the model name in docker-compose.yml and restart:

./stop.sh && ./start.sh

Other models that fit in 128 GB

| Model | Architecture | Quant | HuggingFace |
|---|---|---|---|
| Qwen2.5-Coder-7B-Instruct-AWQ | Dense, 7B | AWQ 4-bit | link |
| Qwen2.5-Coder-14B-Instruct-AWQ | Dense, 14B | AWQ 4-bit | link |
| Qwen2.5-Coder-32B-Instruct-AWQ | Dense, 32B | AWQ 4-bit | link |
| DeepSeek-Coder-V2-Lite-Instruct | MoE, 16B / 2.4B active | BF16 | link |

Docker image options

| Image | vLLM | CUDA | Notes |
|---|---|---|---|
| vllm/vllm-openai:v0.18.0-cu130 | 0.18.0 | 13.0 | Current — supports Qwen3-Coder-Next (requires ≥ 0.15.0) |
| nvcr.io/nvidia/vllm:26.01-py3 | 0.13.0 | 13.1.1 | NVIDIA NGC official, stable for older models |
| vllm/vllm-openai:cu130-nightly-aarch64 | nightly | 13.0 | Upstream daily builds |
| scitrera/dgx-spark-vllm:0.14.1-t4 | 0.14.1 | 13.1.0 | Avoid — FlashInfer non_blocking=None bug crashes on startup |

Entrypoint note: vllm/vllm-openai images use vllm serve as their entrypoint. The command: in docker-compose.yml must pass only the model + flags, not vllm serve.


Testing

cd test
pip install -r requirements.txt

# Run all tests against local server
pytest test_vllm.py -v

# Run only the bubble sort suite
pytest test_vllm.py::TestBubbleSort -v

# Target a remote server
VLLM_BASE_URL=http://192.168.1.10:8000 pytest test_vllm.py -v

The test suite (13 tests):

  1. Health — /health endpoint, model listing, model name availability
  2. Basic completion — chat, streaming, token count reporting
  3. Bubble sort — generates bubble_sort(), extracts and executes the code, verifies correctness on 5 input cases including empty list, already-sorted, reverse-sorted, and strings
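The extract-and-execute step can be sketched as follows. This is an illustrative reconstruction, not the actual code in test/test_vllm.py: it pulls the first fenced code block out of the model's reply, executes it, and checks bubble_sort() against Python's sorted().

```python
# Hedged sketch: verify a model-generated bubble_sort() implementation.
import re

def check_bubble_sort(reply: str) -> None:
    # Grab the first fenced code block from the model's reply.
    match = re.search(r"```(?:python)?\n(.*?)```", reply, re.DOTALL)
    assert match, "no code block in model reply"
    namespace: dict = {}
    exec(match.group(1), namespace)  # defines bubble_sort() in namespace
    bubble_sort = namespace["bubble_sort"]
    # Edge cases: empty, single element, unsorted, reverse-sorted, strings.
    for case in ([], [1], [3, 1, 2], [5, 4, 3, 2, 1], ["b", "a", "c"]):
        assert bubble_sort(list(case)) == sorted(case), f"failed on {case}"
```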

Benchmarking

cd benchmark
pip install -r requirements.txt

# Single-request benchmark across 5 prompt sizes
python3 benchmark.py

# With concurrency test
python3 benchmark.py --concurrency 16 --max-tokens 512

# Save JSON results
python3 benchmark.py --concurrency 8 --output results.json

# Target a remote server
python3 benchmark.py --url http://192.168.1.10:8000 --concurrency 4

Metrics: TTFT (mean/p50/p95), output tok/s, aggregate tok/s, P50 latency under load.
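As a sketch of where TTFT and decode tok/s come from, both fall out of a single streaming request: TTFT is the delay until the first content chunk, and decode speed is chunks per second after that (vLLM typically streams roughly one token per chunk). Assumes this repo's defaults on localhost:8000.

```python
# Measure TTFT and approximate decode tok/s from one streaming request.
import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-required")

start = time.time()
first_token_at = None
chunks = 0
stream = client.chat.completions.create(
    model="coding-model",
    messages=[{"role": "user", "content": "Write bubble sort in Python."}],
    max_tokens=512,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.time()  # TTFT = time to first content chunk
        chunks += 1
end = time.time()

print(f"TTFT: {(first_token_at - start) * 1000:.0f} ms")
print(f"decode: {chunks / (end - first_token_at):.1f} tok/s (1 chunk ~ 1 token)")
```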


Useful commands

# Start / stop
./start.sh
./stop.sh

# Live logs (model loading progress, errors)
docker compose logs -f

# Container shell
docker compose exec vllm bash

# GPU and memory usage
nvidia-smi
free -h

# List served models
curl http://localhost:8000/v1/models | python3 -m json.tool

# Quick chat via curl
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"coding-model","messages":[{"role":"user","content":"Write bubble sort in Python."}],"max_tokens":1024}' \
  | python3 -m json.tool

Cleanup and recovery

# Stop and remove the container (weights stay cached)
docker compose down

# Remove the Docker image (~20 GB) to free disk space
docker rmi vllm/vllm-openai:v0.18.0-cu130

# Remove cached model weights (~40 GB for Qwen3-Coder-Next)
rm -rf ~/.cache/huggingface/hub/models--cyankiwi--Qwen3-Coder-Next-AWQ-4bit

# Full recovery from scratch
git clone https://github.com/shamily/vllmhost.git
cd vllmhost && cp .env.example .env
./start.sh   # re-downloads image and model weights automatically

Troubleshooting

num_gpu_blocks=0 with num_gpu_blocks_override=256

The KV cache profiler returned 0 blocks (model is large relative to memory budget). vLLM overrides to 256 minimum. To get more KV cache, raise --gpu-memory-utilization to 0.85 in docker-compose.yml — with 128 GB there is headroom.

System freezes / OOM

Lower --gpu-memory-utilization (try 0.65). GB10 unified memory OOM affects the whole system, not just the container.

CUDA error: no kernel image or SM 12.1 errors

You're using a CUDA 12.x image. GB10 is SM 12.1 and requires CUDA 13.x. Switch to vllm/vllm-openai:v0.18.0-cu130 or nvcr.io/nvidia/vllm:26.01-py3.

unrecognized arguments: serve <model>

The vllm/vllm-openai image already has vllm serve as its entrypoint. The command: in docker-compose.yml must NOT include vllm serve — only the model ID and flags.

Tests fail with "Model 'coding-model' not found"

The API name is set by --served-model-name coding-model in docker-compose.yml. Override with VLLM_MODEL=<name> when running tests if you changed it.

Slow throughput on dense FP16 models

Dense FP16 models are bottlenecked by LPDDR5X bandwidth (~273 GB/s). A 7B BF16 model tops out at ~19 tok/s theoretical. Use AWQ 4-bit quantization or MoE models to improve throughput.


Architecture notes

  • SM 12.1 ≠ datacenter Blackwell (SM 10.x): The GB10 is a consumer/edge Blackwell variant. B100/B200 are SM 10.0/10.1 and use different software paths.
  • Unified memory: 128 GB is shared between ARM CPU and GPU via NVLink-C2C. There is no separate VRAM pool. nvidia-smi will not show a dedicated VRAM number.
  • FLASH_ATTN backend: vLLM 0.18.0 auto-selects FLASH_ATTN on GB10 (FlashAttention 2). FLASHINFER is listed as an option but has a known bug in the community builds (non_blocking=None TypeError on warmup).
  • MoE on GB10: Only the active expert weights (~3B for Qwen3-Coder-Next) are hot in the compute path per token, but all 40 GB must fit in memory. The unified memory pool handles this cleanly.
