Self-hosted vLLM OpenAI-compatible inference server for the NVIDIA DGX Spark or ASUS Ascent GX10, both built on the NVIDIA GB10 Grace Blackwell Superchip (ARM64, 128 GB unified memory, SM 12.1).
| Hardware | Detail |
|---|---|
| Machine | ASUS Ascent GX10 (same platform as NVIDIA DGX Spark) |
| Chip | NVIDIA GB10 Grace Blackwell Superchip |
| GPU architecture | Blackwell SM 12.1 — requires CUDA 13.x (CUDA 12.x incompatible) |
| Memory | 128 GB LPDDR5X unified — CPU and GPU share the same pool |
| Memory bandwidth | ~273 GB/s (LPDDR5X vs 3.3 TB/s HBM3 on H100) |
| CPU | 20-core ARM (10× Cortex-X925 + 10× Cortex-A725) |
| OS | NVIDIA DGX OS (Ubuntu-based) |
| Model | Detail |
|---|---|
| Base model | Qwen/Qwen3-Coder-Next |
| Architecture | MoE — 512 experts total, 10+1 active per token |
| Total parameters | 80B |
| Active parameters per token | ~3B |
| Context window | 256K tokens |
| Quantization | AWQ 4-bit (compressed-tensors / Marlin MoE backend) |
| Disk size | ~40 GB |
| License | Apache 2.0 |
| vLLM requirement | ≥ 0.15.0 |
| Optimized for | Agentic coding, tool calling, Claude Code / Cline / Qwen Code |
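Since the model is tuned for tool calling, here is a minimal sketch of the standard OpenAI-style `tools` payload the chat endpoint accepts. The `run_shell` tool is hypothetical (not part of this repo), and depending on the vLLM version the server may also need `--enable-auto-tool-choice` and a `--tool-call-parser` flag for the model to emit structured tool calls:

```python
# Hypothetical tool definition for an agentic coding loop. The schema is the
# standard OpenAI "tools" shape; the tool itself is illustrative only.
run_shell_tool = {
    "type": "function",
    "function": {
        "name": "run_shell",  # hypothetical tool name, not part of this repo
        "description": "Run a shell command and return its stdout.",
        "parameters": {
            "type": "object",
            "properties": {
                "command": {"type": "string", "description": "Command to execute"},
            },
            "required": ["command"],
        },
    },
}

# Request body an agent would POST to /v1/chat/completions
# (model name matches --served-model-name in docker-compose.yml):
request = {
    "model": "coding-model",
    "messages": [{"role": "user", "content": "List the files in the repo."}],
    "tools": [run_shell_tool],
}
```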
All benchmarks: ASUS Ascent GX10, vllm/vllm-openai:v0.18.0-cu130, FLASH_ATTN backend, 512 max output tokens.
| Model | Architecture | Quant | tok/s | TTFT p50 |
|---|---|---|---|---|
| Qwen2.5-Coder-7B-Instruct | Dense, 7B | BF16 | 13 | 81 ms |
| Qwen2.5-Coder-7B-Instruct-AWQ | Dense, 7B | AWQ 4-bit | 46 | 34 ms |
| Qwen3-Coder-Next-AWQ-4bit | MoE, 80B / 3B active | AWQ 4-bit | 33.7 | 110 ms |
The 7B AWQ has higher raw throughput because LPDDR5X bandwidth is the bottleneck — smaller model = faster decode. The 80B MoE model is slower per token but delivers dramatically higher quality; it activates only 3B params per token so it's competitive with models many times smaller.
Concurrent throughput — Qwen2.5-Coder-7B-Instruct-AWQ
| Concurrent users | Aggregate tok/s | P50 latency |
|---|---|---|
| 1 | 46 | 10.7 s |
| 4 | 192.8 | 10.1 s |
| 8 | 376.2 | 10.4 s |
| 16 | 700.4 | 11.2 s |
| 32 | 1,179.7 | 13.3 s |
Concurrent throughput — Qwen3-Coder-Next-AWQ-4bit ← current
| Concurrent users | Aggregate tok/s | P50 latency |
|---|---|---|
| 1 | 33.7 | 15.2 s |
| 4 | 126.3 | 16.2 s |
| 8 | 210.0 | 19.5 s |
| 16 | 370.9 | 22.1 s |
| 32 | 542.0 | 30.2 s |
vLLM batches all concurrent requests together. --max-num-seqs is set to 128 in docker-compose.yml.
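The batching behaviour above can be exercised with a simple fan-out client. This is a sketch, not part of the repo: `call` is any coroutine that issues one chat completion (e.g. via `openai.AsyncOpenAI`); a stub stands in for the server so the pattern is runnable as-is.

```python
import asyncio

async def fan_out(call, prompts):
    """Send all prompts concurrently; vLLM's scheduler batches them
    into shared forward passes (up to --max-num-seqs sequences)."""
    return await asyncio.gather(*(call(p) for p in prompts))

# Real usage would wrap the OpenAI client, e.g.:
#   from openai import AsyncOpenAI
#   client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="not-required")
#   async def call(prompt):
#       r = await client.chat.completions.create(
#           model="coding-model",
#           messages=[{"role": "user", "content": prompt}],
#           max_tokens=512,
#       )
#       return r.choices[0].message.content

# Offline demo with a stub "model":
async def stub_call(prompt):
    await asyncio.sleep(0)  # stands in for network + decode time
    return f"echo: {prompt}"

results = asyncio.run(fan_out(stub_call, [f"task {i}" for i in range(8)]))
```

This mirrors what the benchmark's `--concurrency` flag does: all requests land in the same scheduler batch, which is why aggregate tok/s grows far faster than per-request latency degrades.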
# 1. Clone
git clone https://github.com/shamily/vllmhost.git
cd vllmhost
# 2. Add HuggingFace token (only needed for gated models — Qwen3-Coder-Next is open)
cp .env.example .env
# Edit .env if you need HF_TOKEN
# 3. Start (first run downloads ~40 GB of model weights)
chmod +x start.sh stop.sh
./start.sh
# 4. Test
cd test && pip install -r requirements.txt
pytest test_vllm.py -v # 13/13 tests pass
# 5. Benchmark
cd benchmark && pip install -r requirements.txt
python3 benchmark.py --concurrency 8

The server exposes an OpenAI-compatible API on port 8000:
- Local: http://localhost:8000/v1
- LAN: http://<your-ip>:8000/v1
The container uses `network_mode: host` — vLLM binds directly to your machine's real IP. No port forwarding needed.
from openai import OpenAI

client = OpenAI(
    base_url="http://<ascent-gx10-ip>:8000/v1",
    api_key="not-required",
)
response = client.chat.completions.create(
    model="coding-model",
    messages=[{"role": "user", "content": "Write bubble sort in Python."}],
    max_tokens=2048,
)
print(response.choices[0].message.content)

Compatible with any OpenAI-compatible client: openai Python SDK, curl, Continue.dev, Open WebUI, Cline, Claude Code, etc.
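The test suite also exercises streaming. A sketch of consuming a streamed response with the same client (the live call is shown commented; delta chunks can carry `None` content on role-only or final chunks, so the accumulator filters them):

```python
# Join streamed delta fragments, skipping None/empty chunks.
def join_deltas(pieces):
    return "".join(p for p in pieces if p)

# Against a live server (assumes `client` from the example above):
#   stream = client.chat.completions.create(
#       model="coding-model",
#       messages=[{"role": "user", "content": "Write bubble sort in Python."}],
#       max_tokens=2048,
#       stream=True,
#   )
#   text = join_deltas(chunk.choices[0].delta.content for chunk in stream)
#   print(text)
```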
All non-secret configuration is in docker-compose.yml — edit it directly. Only HF_TOKEN goes in .env (gitignored).
Key flags in the `command:` section:
| Flag | Current value | Notes |
|---|---|---|
| model | cyankiwi/Qwen3-Coder-Next-AWQ-4bit | HuggingFace model ID |
| --served-model-name | coding-model | Name used in API calls |
| --gpu-memory-utilization | 0.75 | Keep ≤ 0.80 — unified memory OOM freezes the whole system |
| --max-num-seqs | 128 | Max concurrent sequences in the scheduler |
| --max-model-len | 32768 | Max context length (prompt + output). Model supports 256K — raise if needed |
| --port | 8000 | API port |
Edit the model name in docker-compose.yml and restart:
./stop.sh && ./start.sh

| Model | Architecture | Quant | HuggingFace |
|---|---|---|---|
| Qwen2.5-Coder-7B-Instruct-AWQ | Dense, 7B | AWQ 4-bit | link |
| Qwen2.5-Coder-14B-Instruct-AWQ | Dense, 14B | AWQ 4-bit | link |
| Qwen2.5-Coder-32B-Instruct-AWQ | Dense, 32B | AWQ 4-bit | link |
| DeepSeek-Coder-V2-Lite-Instruct | MoE 16B / 2.4B active | BF16 | link |
| Image | vLLM | CUDA | Notes |
|---|---|---|---|
| vllm/vllm-openai:v0.18.0-cu130 | 0.18.0 | 13.0 | Current — supports Qwen3-Coder-Next (requires ≥ 0.15.0) |
| nvcr.io/nvidia/vllm:26.01-py3 | 0.13.0 | 13.1.1 | NVIDIA NGC official, stable for older models |
| vllm/vllm-openai:cu130-nightly-aarch64 | nightly | 13.0 | Upstream daily builds |
| scitrera/dgx-spark-vllm:0.14.1-t4 | 0.14.1 | 13.1.0 | Avoid — FlashInfer non_blocking=None bug crashes on startup |
Entrypoint note: vllm/vllm-openai images use `vllm serve` as their entrypoint. The `command:` in docker-compose.yml must pass only the model + flags, not `vllm serve`.
cd test
pip install -r requirements.txt
# Run all tests against local server
pytest test_vllm.py -v
# Run only the bubble sort suite
pytest test_vllm.py::TestBubbleSort -v
# Target a remote server
VLLM_BASE_URL=http://192.168.1.10:8000 pytest test_vllm.py -v

The test suite (13 tests):
- Health — `/health` endpoint, model listing, model name availability
- Basic completion — chat, streaming, token count reporting
- Bubble sort — generates `bubble_sort()`, extracts and executes the code, verifies correctness on 5 input cases including empty list, already-sorted, reverse-sorted, and strings
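The extract-and-execute check can be sketched in a few lines. This is an illustration of the technique, not the repo's actual test code; `FENCE` is built dynamically only to avoid literal triple backticks inside this example:

```python
import re

FENCE = "`" * 3  # markdown code fence

def extract_code(markdown: str) -> str:
    """Pull the first fenced Python block out of a model response."""
    pattern = FENCE + r"(?:python)?\n(.*?)" + FENCE
    match = re.search(pattern, markdown, re.DOTALL)
    return match.group(1) if match else markdown

def check_bubble_sort(response_text: str) -> bool:
    """Exec the extracted code and verify bubble_sort on edge cases."""
    namespace = {}
    exec(extract_code(response_text), namespace)  # trusted test context only
    bubble_sort = namespace["bubble_sort"]
    cases = [[], [1, 2, 3], [3, 2, 1], [5, 1, 4, 2], ["b", "a", "c"]]
    return all(bubble_sort(list(c)) == sorted(c) for c in cases)

# Synthetic "model response" to run the check against:
code = (
    "def bubble_sort(a):\n"
    "    a = list(a)\n"
    "    for i in range(len(a)):\n"
    "        for j in range(len(a) - 1 - i):\n"
    "            if a[j] > a[j + 1]:\n"
    "                a[j], a[j + 1] = a[j + 1], a[j]\n"
    "    return a\n"
)
sample = f"Here you go:\n{FENCE}python\n{code}{FENCE}\n"
```

Executing model-generated code is only safe here because the output comes from your own local server in a throwaway test context.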
cd benchmark
pip install -r requirements.txt
# Single-request benchmark across 5 prompt sizes
python3 benchmark.py
# With concurrency test
python3 benchmark.py --concurrency 16 --max-tokens 512
# Save JSON results
python3 benchmark.py --concurrency 8 --output results.json
# Target a remote server
python3 benchmark.py --url http://192.168.1.10:8000 --concurrency 4

Metrics: TTFT (mean/p50/p95), output tok/s, aggregate tok/s, P50 latency under load.
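The metrics fall out of per-token arrival timestamps. A minimal sketch (not the repo's benchmark code) of the two core calculations, fed with synthetic timestamps as they might be collected from a streaming response:

```python
def ttft(t_start: float, token_times: list[float]) -> float:
    """Time to first token, in seconds."""
    return token_times[0] - t_start

def output_tok_s(token_times: list[float]) -> float:
    """Decode throughput over the generation, excluding TTFT."""
    if len(token_times) < 2:
        return 0.0
    return (len(token_times) - 1) / (token_times[-1] - token_times[0])

# Synthetic run: first token after 100 ms, then one token every 25 ms.
t0 = 0.0
times = [0.1 + 0.025 * i for i in range(41)]  # 41 tokens -> ~40 tok/s decode
```

Aggregate tok/s under concurrency is just the sum of per-request output tok/s over the same wall-clock window; p50/p95 are percentiles over the per-request TTFT values.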
# Start / stop
./start.sh
./stop.sh
# Live logs (model loading progress, errors)
docker compose logs -f
# Container shell
docker compose exec vllm bash
# GPU and memory usage
nvidia-smi
free -h
# List served models
curl http://localhost:8000/v1/models | python3 -m json.tool
# Quick chat via curl
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"coding-model","messages":[{"role":"user","content":"Write bubble sort in Python."}],"max_tokens":1024}' \
| python3 -m json.tool

# Stop and remove the container (weights stay cached)
docker compose down
# Remove the Docker image (~20 GB) to free disk space
docker rmi vllm/vllm-openai:v0.18.0-cu130
# Remove cached model weights (~40 GB for Qwen3-Coder-Next)
rm -rf ~/.cache/huggingface/hub/models--cyankiwi--Qwen3-Coder-Next-AWQ-4bit
# Full recovery from scratch
git clone https://github.com/shamily/vllmhost.git
cd vllmhost && cp .env.example .env
./start.sh # re-downloads image and model weights automatically

The KV cache profiler returned 0 blocks (model is large relative to memory budget). vLLM overrides to 256 minimum. To get more KV cache, raise --gpu-memory-utilization to 0.85 in docker-compose.yml — with 128 GB there is headroom.
Lower --gpu-memory-utilization (try 0.65). GB10 unified memory OOM affects the whole system, not just the container.
You're using a CUDA 12.x image. GB10 is SM 12.1 and requires CUDA 13.x. Switch to vllm/vllm-openai:v0.18.0-cu130 or nvcr.io/nvidia/vllm:26.01-py3.
The vllm/vllm-openai image already has vllm serve as its entrypoint. The command: in docker-compose.yml must NOT include vllm serve — only the model ID and flags.
The API name is set by --served-model-name coding-model in docker-compose.yml. Override with VLLM_MODEL=<name> when running tests if you changed it.
Dense FP16 models are bottlenecked by LPDDR5X bandwidth (~273 GB/s). A 7B BF16 model tops out at ~19 tok/s theoretical. Use AWQ 4-bit quantization or MoE models to improve throughput.
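The ~19 tok/s figure is simple roofline arithmetic: single-stream decode must stream every (active) weight through memory once per token, so throughput is bounded by bandwidth divided by bytes per token. A quick sanity check of the numbers used in this document:

```python
# Memory-bandwidth roofline for decode: each token reads the weights once.
BANDWIDTH_GB_S = 273  # GB10 LPDDR5X, approximate

def decode_ceiling(params_b: float, bytes_per_param: float) -> float:
    """Upper bound on single-stream decode tok/s."""
    return BANDWIDTH_GB_S / (params_b * bytes_per_param)

print(decode_ceiling(7, 2.0))  # 7B BF16     -> ~19.5 tok/s (the ~19 above)
print(decode_ceiling(7, 0.5))  # 7B AWQ 4-bit -> ~78 tok/s ceiling (measured: 46)
print(decode_ceiling(3, 0.5))  # ~3B active MoE weights -> ~182 tok/s ceiling
```

Measured numbers land well below the ceiling (46 vs ~78 for 7B AWQ, 33.7 vs ~182 for the MoE) because attention/KV-cache reads, dequantization, and MoE routing add traffic and compute the roofline ignores.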
- SM 12.1 ≠ datacenter Blackwell (SM 10.x): The GB10 is a consumer/edge Blackwell variant. B100/B200 are SM 10.0/10.1 and use different software paths.
- Unified memory: 128 GB is shared between ARM CPU and GPU via NVLink-C2C. There is no separate VRAM pool. `nvidia-smi` will not show a dedicated VRAM number.
- FLASH_ATTN backend: vLLM 0.18.0 auto-selects `FLASH_ATTN` on GB10 (FlashAttention 2). FLASHINFER is listed as an option but has a known bug in the community builds (`non_blocking=None` TypeError on warmup).
- MoE on GB10: Only the active expert weights (~3B for Qwen3-Coder-Next) are hot in the compute path per token, but all 40 GB must fit in memory. The unified memory pool handles this cleanly.