Self-hosted vLLM OpenAI-compatible inference server for the NVIDIA DGX Spark or ASUS Ascent GX10, both built on the NVIDIA GB10 Grace Blackwell Superchip (ARM64, 128 GB unified memory, SM 12.1).
| Hardware | Detail |
|---|---|
| Machine | ASUS Ascent GX10 (same platform as NVIDIA DGX Spark) |
| Chip | NVIDIA GB10 Grace Blackwell Superchip |
| GPU architecture | Blackwell SM 12.1 — requires CUDA 13.x (CUDA 12.x incompatible) |
| Memory | 128 GB LPDDR5X unified — CPU and GPU share the same pool |
| Memory bandwidth | ~273 GB/s (LPDDR5X vs 3.3 TB/s HBM3 on H100) |
| CPU | 20-core ARM (10× Cortex-X925 + 10× Cortex-A725) |
| OS | NVIDIA DGX OS (Ubuntu-based) |
| Model | Detail |
|---|---|
| Base model | Qwen/Qwen3-Coder-Next |
| Architecture | MoE — 512 experts total, 10+1 active per token |
| Total parameters | 80B |
| Active parameters per token | ~3B |
| Context window | 256K tokens |
| Quantization | AWQ 4-bit (compressed-tensors / Marlin MoE backend) |
| Disk size | ~40 GB |
| License | Apache 2.0 |
| vLLM requirement | ≥ 0.15.0 |
| Optimized for | Agentic coding, tool calling, Claude Code / Cline / Qwen Code |
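Since the model is tuned for tool calling, here is a minimal sketch of the standard OpenAI-style `tools` payload the chat endpoint accepts. The `run_shell` tool is hypothetical (not part of this repo), and depending on the vLLM version the server may also need `--enable-auto-tool-choice` and a `--tool-call-parser` flag for the model to emit structured tool calls:

```python
# Hypothetical tool definition for an agentic coding loop. The schema is the
# standard OpenAI "tools" shape; the tool itself is illustrative only.
run_shell_tool = {
    "type": "function",
    "function": {
        "name": "run_shell",  # hypothetical tool name, not part of this repo
        "description": "Run a shell command and return its stdout.",
        "parameters": {
            "type": "object",
            "properties": {
                "command": {"type": "string", "description": "Command to execute"},
            },
            "required": ["command"],
        },
    },
}

# Request body an agent would POST to /v1/chat/completions
# (model name matches --served-model-name in docker-compose.yml):
request = {
    "model": "coding-model",
    "messages": [{"role": "user", "content": "List the files in the repo."}],
    "tools": [run_shell_tool],
}
```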
All benchmarks: ASUS Ascent GX10, vllm/vllm-openai:v0.18.0-cu130, FLASH_ATTN backend, 512 max output tokens.
| Model | Architecture | Quant | tok/s | TTFT p50 |
|---|---|---|---|---|
| Qwen2.5-Coder-7B-Instruct | Dense, 7B | BF16 | 13 | 81 ms |
| Qwen2.5-Coder-7B-Instruct-AWQ | Dense, 7B | AWQ 4-bit | 46 | 34 ms |
| Qwen3-Coder-Next-AWQ-4bit | MoE, 80B / 3B active | AWQ 4-bit | 33.7 | 110 ms |
The 7B AWQ has higher raw throughput because LPDDR5X bandwidth is the bottleneck — smaller model = faster decode. The 80B MoE model is slower per token but delivers dramatically higher quality; it activates only 3B params per token so it's competitive with models many times smaller.
Concurrent throughput — Qwen2.5-Coder-7B-Instruct-AWQ
| Concurrent users | Aggregate tok/s | P50 latency |
|---|---|---|
| 1 | 46 | 10.7 s |
| 4 | 192.8 | 10.1 s |
| 8 | 376.2 | 10.4 s |
| 16 | 700.4 | 11.2 s |
| 32 | 1,179.7 | 13.3 s |
Concurrent throughput — Qwen3-Coder-Next-AWQ-4bit ← current
| Concurrent users | Aggregate tok/s | P50 latency |
|---|---|---|
| 1 | 33.7 | 15.2 s |
| 4 | 126.3 | 16.2 s |
| 8 | 210.0 | 19.5 s |
| 16 | 370.9 | 22.1 s |
| 32 | 542.0 | 30.2 s |
vLLM batches all concurrent requests together. --max-num-seqs is set to 128 in docker-compose.yml.
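The batching behaviour above can be exercised with a simple fan-out client. This is a sketch, not part of the repo: `call` is any coroutine that issues one chat completion (e.g. via `openai.AsyncOpenAI`); a stub stands in for the server so the pattern is runnable as-is.

```python
import asyncio

async def fan_out(call, prompts):
    """Send all prompts concurrently; vLLM's scheduler batches them
    into shared forward passes (up to --max-num-seqs sequences)."""
    return await asyncio.gather(*(call(p) for p in prompts))

# Real usage would wrap the OpenAI client, e.g.:
#   from openai import AsyncOpenAI
#   client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="not-required")
#   async def call(prompt):
#       r = await client.chat.completions.create(
#           model="coding-model",
#           messages=[{"role": "user", "content": prompt}],
#           max_tokens=512,
#       )
#       return r.choices[0].message.content

# Offline demo with a stub "model":
async def stub_call(prompt):
    await asyncio.sleep(0)  # stands in for network + decode time
    return f"echo: {prompt}"

results = asyncio.run(fan_out(stub_call, [f"task {i}" for i in range(8)]))
```

This mirrors what the benchmark's `--concurrency` flag does: all requests land in the same scheduler batch, which is why aggregate tok/s grows far faster than per-request latency degrades.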
# 1. Clone
git clone https://github.com/shamily/vllmhost.git
cd vllmhost
# 2. Add HuggingFace token (only needed for gated models — Qwen3-Coder-Next is open)
cp .env.example .env
# Edit .env if you need HF_TOKEN
# 3. Start (first run downloads ~40 GB of model weights)
chmod +x start.sh stop.sh
./start.sh
# 4. Test
cd test && pip install -r requirements.txt
pytest test_vllm.py -v # 13/13 tests pass
# 5. Benchmark
cd benchmark && pip install -r requirements.txt
python3 benchmark.py --concurrency 8

The server exposes an OpenAI-compatible API on port 8000:
- Local: http://localhost:8000/v1
- LAN: http://<your-ip>:8000/v1
The container uses `network_mode: host` — vLLM binds directly to your machine's real IP. No port forwarding needed.
from openai import OpenAI

client = OpenAI(
    base_url="http://<ascent-gx10-ip>:8000/v1",
    api_key="not-required",
)
response = client.chat.completions.create(
    model="coding-model",
    messages=[{"role": "user", "content": "Write bubble sort in Python."}],
    max_tokens=2048,
)
print(response.choices[0].message.content)

Compatible with any OpenAI-compatible client: openai Python SDK, curl, Continue.dev, Open WebUI, Cline, Claude Code, etc.
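The test suite also exercises streaming. A sketch of consuming a streamed response with the same client (the live call is shown commented; delta chunks can carry `None` content on role-only or final chunks, so the accumulator filters them):

```python
# Join streamed delta fragments, skipping None/empty chunks.
def join_deltas(pieces):
    return "".join(p for p in pieces if p)

# Against a live server (assumes `client` from the example above):
#   stream = client.chat.completions.create(
#       model="coding-model",
#       messages=[{"role": "user", "content": "Write bubble sort in Python."}],
#       max_tokens=2048,
#       stream=True,
#   )
#   text = join_deltas(chunk.choices[0].delta.content for chunk in stream)
#   print(text)
```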
All non-secret configuration is in docker-compose.yml — edit it directly. Only HF_TOKEN goes in .env (gitignored).
Key flags in the `command:` section:
| Flag | Current value | Notes |
|---|---|---|
| model | cyankiwi/Qwen3-Coder-Next-AWQ-4bit | HuggingFace model ID |
| --served-model-name | coding-model | Name used in API calls |
| --gpu-memory-utilization | 0.75 | Keep ≤ 0.80 — unified memory OOM freezes the whole system |
| --max-num-seqs | 128 | Max concurrent sequences in the scheduler |
| --max-model-len | 32768 | Max context length (prompt + output). Model supports 256K — raise if needed |
| --port | 8000 | API port |
Edit the model name in docker-compose.yml and restart:
./stop.sh && ./start.sh

| Model | Architecture | Quant | HuggingFace |
|---|---|---|---|
| Qwen2.5-Coder-7B-Instruct-AWQ | Dense, 7B | AWQ 4-bit | link |
| Qwen2.5-Coder-14B-Instruct-AWQ | Dense, 14B | AWQ 4-bit | link |
| Qwen2.5-Coder-32B-Instruct-AWQ | Dense, 32B | AWQ 4-bit | link |
| DeepSeek-Coder-V2-Lite-Instruct | MoE 16B / 2.4B active | BF16 | link |
| Image | vLLM | CUDA | Notes |
|---|---|---|---|
| vllm/vllm-openai:v0.18.0-cu130 | 0.18.0 | 13.0 | Current — supports Qwen3-Coder-Next (requires ≥ 0.15.0) |
| nvcr.io/nvidia/vllm:26.01-py3 | 0.13.0 | 13.1.1 | NVIDIA NGC official, stable for older models |
| vllm/vllm-openai:cu130-nightly-aarch64 | nightly | 13.0 | Upstream daily builds |
| scitrera/dgx-spark-vllm:0.14.1-t4 | 0.14.1 | 13.1.0 | Avoid — FlashInfer non_blocking=None bug crashes on startup |
Entrypoint note: vllm/vllm-openai images use `vllm serve` as their entrypoint. The `command:` in docker-compose.yml must pass only the model + flags, not `vllm serve`.
cd test
pip install -r requirements.txt
# Run all tests against local server
pytest test_vllm.py -v
# Run only the bubble sort suite
pytest test_vllm.py::TestBubbleSort -v
# Target a remote server
VLLM_BASE_URL=http://192.168.1.10:8000 pytest test_vllm.py -v

The test suite (13 tests):
- Health — `/health` endpoint, model listing, model name availability
- Basic completion — chat, streaming, token count reporting
- Bubble sort — generates `bubble_sort()`, extracts and executes the code, verifies correctness on 5 input cases including empty list, already-sorted, reverse-sorted, and strings
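The extract-and-execute check can be sketched in a few lines. This is an illustration of the technique, not the repo's actual test code; `FENCE` is built dynamically only to avoid literal triple backticks inside this example:

```python
import re

FENCE = "`" * 3  # markdown code fence

def extract_code(markdown: str) -> str:
    """Pull the first fenced Python block out of a model response."""
    pattern = FENCE + r"(?:python)?\n(.*?)" + FENCE
    match = re.search(pattern, markdown, re.DOTALL)
    return match.group(1) if match else markdown

def check_bubble_sort(response_text: str) -> bool:
    """Exec the extracted code and verify bubble_sort on edge cases."""
    namespace = {}
    exec(extract_code(response_text), namespace)  # trusted test context only
    bubble_sort = namespace["bubble_sort"]
    cases = [[], [1, 2, 3], [3, 2, 1], [5, 1, 4, 2], ["b", "a", "c"]]
    return all(bubble_sort(list(c)) == sorted(c) for c in cases)

# Synthetic "model response" to run the check against:
code = (
    "def bubble_sort(a):\n"
    "    a = list(a)\n"
    "    for i in range(len(a)):\n"
    "        for j in range(len(a) - 1 - i):\n"
    "            if a[j] > a[j + 1]:\n"
    "                a[j], a[j + 1] = a[j + 1], a[j]\n"
    "    return a\n"
)
sample = f"Here you go:\n{FENCE}python\n{code}{FENCE}\n"
```

Executing model-generated code is only safe here because the output comes from your own local server in a throwaway test context.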
cd benchmark
pip install -r requirements.txt
# Single-request benchmark across 5 prompt sizes
python3 benchmark.py
# With concurrency test
python3 benchmark.py --concurrency 16 --max-tokens 512
# Save JSON results
python3 benchmark.py --concurrency 8 --output results.json
# Target a remote server
python3 benchmark.py --url http://192.168.1.10:8000 --concurrency 4

Metrics: TTFT (mean/p50/p95), output tok/s, aggregate tok/s, P50 latency under load.
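The metrics fall out of per-token arrival timestamps. A minimal sketch (not the repo's benchmark code) of the two core calculations, fed with synthetic timestamps as they might be collected from a streaming response:

```python
def ttft(t_start: float, token_times: list[float]) -> float:
    """Time to first token, in seconds."""
    return token_times[0] - t_start

def output_tok_s(token_times: list[float]) -> float:
    """Decode throughput over the generation, excluding TTFT."""
    if len(token_times) < 2:
        return 0.0
    return (len(token_times) - 1) / (token_times[-1] - token_times[0])

# Synthetic run: first token after 100 ms, then one token every 25 ms.
t0 = 0.0
times = [0.1 + 0.025 * i for i in range(41)]  # 41 tokens -> ~40 tok/s decode
```

Aggregate tok/s under concurrency is just the sum of per-request output tok/s over the same wall-clock window; p50/p95 are percentiles over the per-request TTFT values.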
# Start / stop
./start.sh
./stop.sh
# Live logs (model loading progress, errors)
docker compose logs -f
# Container shell
docker compose exec vllm bash
# GPU and memory usage
nvidia-smi
free -h
# List served models
curl http://localhost:8000/v1/models | python3 -m json.tool
# Quick chat via curl
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"coding-model","messages":[{"role":"user","content":"Write bubble sort in Python."}],"max_tokens":1024}' \
| python3 -m json.tool

# Stop and remove the container (weights stay cached)
docker compose down
# Remove the Docker image (~20 GB) to free disk space
docker rmi vllm/vllm-openai:v0.18.0-cu130
# Remove cached model weights (~40 GB for Qwen3-Coder-Next)
rm -rf ~/.cache/huggingface/hub/models--cyankiwi--Qwen3-Coder-Next-AWQ-4bit
# Full recovery from scratch
git clone https://github.com/shamily/vllmhost.git
cd vllmhost && cp .env.example .env
./start.sh # re-downloads image and model weights automatically

The KV cache profiler returned 0 blocks (model is large relative to memory budget). vLLM overrides to 256 minimum. To get more KV cache, raise --gpu-memory-utilization to 0.85 in docker-compose.yml — with 128 GB there is headroom.
Lower --gpu-memory-utilization (try 0.65). GB10 unified memory OOM affects the whole system, not just the container.
You're using a CUDA 12.x image. GB10 is SM 12.1 and requires CUDA 13.x. Switch to vllm/vllm-openai:v0.18.0-cu130 or nvcr.io/nvidia/vllm:26.01-py3.
The vllm/vllm-openai image already has vllm serve as its entrypoint. The command: in docker-compose.yml must NOT include vllm serve — only the model ID and flags.
The API name is set by --served-model-name coding-model in docker-compose.yml. Override with VLLM_MODEL=<name> when running tests if you changed it.
Dense FP16 models are bottlenecked by LPDDR5X bandwidth (~273 GB/s). A 7B BF16 model tops out at ~19 tok/s theoretical. Use AWQ 4-bit quantization or MoE models to improve throughput.
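The ~19 tok/s figure is simple roofline arithmetic: single-stream decode must stream every (active) weight through memory once per token, so throughput is bounded by bandwidth divided by bytes per token. A quick sanity check of the numbers used in this document:

```python
# Memory-bandwidth roofline for decode: each token reads the weights once.
BANDWIDTH_GB_S = 273  # GB10 LPDDR5X, approximate

def decode_ceiling(params_b: float, bytes_per_param: float) -> float:
    """Upper bound on single-stream decode tok/s."""
    return BANDWIDTH_GB_S / (params_b * bytes_per_param)

print(decode_ceiling(7, 2.0))  # 7B BF16     -> ~19.5 tok/s (the ~19 above)
print(decode_ceiling(7, 0.5))  # 7B AWQ 4-bit -> ~78 tok/s ceiling (measured: 46)
print(decode_ceiling(3, 0.5))  # ~3B active MoE weights -> ~182 tok/s ceiling
```

Measured numbers land well below the ceiling (46 vs ~78 for 7B AWQ, 33.7 vs ~182 for the MoE) because attention/KV-cache reads, dequantization, and MoE routing add traffic and compute the roofline ignores.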
- SM 12.1 ≠ datacenter Blackwell (SM 10.x): The GB10 is a consumer/edge Blackwell variant. B100/B200 are SM 10.0/10.1 and use different software paths.
- Unified memory: 128 GB is shared between ARM CPU and GPU via NVLink-C2C. There is no separate VRAM pool. `nvidia-smi` will not show a dedicated VRAM number.
- FLASH_ATTN backend: vLLM 0.18.0 auto-selects `FLASH_ATTN` on GB10 (FlashAttention 2). FLASHINFER is listed as an option but has a known bug in the community builds (`non_blocking=None` TypeError on warmup).
- MoE on GB10: Only the active expert weights (~3B for Qwen3-Coder-Next) are hot in the compute path per token, but all 40 GB must fit in memory. The unified memory pool handles this cleanly.