Description
When benchmarking llama-server using draftbench, TPS is intermittently reported as 0.0 even though 512 tokens are generated successfully.
This happens both in baseline mode and when using draft models.
The final averaged result shows non-zero tok/s, but per-request TPS is often 0.0.
Environment
GPU: NVIDIA Tesla P40 (24GB)
llama.cpp: built from source
Server command:

```
/home/mabe/llama.cpp/build/bin/llama-server \
  -m /home/mabe/llama.cpp/models/base/Qwen3-14B-Q4_K_M.gguf \
  --host 127.0.0.1 \
  --port 8080 \
  -ngl 99 \
  -c 32768
```
Models
Base:
Qwen3-14B-Q4_K_M.gguf
Draft models tested (Qwen3):
4B Q4_K_M
1.7B Q4_K_M
0.6B Q4_K_M
Observed Behavior
Example output (baseline):
[14B Q4_K_M baseline] request 1/9 ttft=21.310s tps=0.0 tokens=512
[14B Q4_K_M baseline] request 2/9 ttft=21.978s tps=0.0 tokens=512
[14B Q4_K_M baseline] request 3/9 ttft=22.509s tps=0.0 tokens=512
[14B Q4_K_M baseline] request 4/9 ttft=22.750s tps=0.0 tokens=512
[14B Q4_K_M baseline] request 5/9 ttft=16.623s tps=84.5 tokens=512
[14B Q4_K_M baseline] request 6/9 ttft=22.388s tps=0.0 tokens=512
...
Result: 9.38 tok/s
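A quick sanity check suggests the averaged result may simply be the mean of the per-request tps values with the zeros included. This is a hypothetical reconstruction (the elided requests 7-9 are assumed to have also reported 0.0, which is not shown in the log):

```python
# Hypothetical reconstruction of the baseline run above:
# per-request tps values, assuming the elided requests 7-9 also reported 0.0.
per_request_tps = [0.0, 0.0, 0.0, 0.0, 84.5, 0.0, 0.0, 0.0, 0.0]

# Mean over all requests, zeros included.
avg = sum(per_request_tps) / len(per_request_tps)
print(f"{avg:.2f} tok/s")  # ~9.39, close to the reported 9.38
```

If this matches draftbench's averaging, the 9.38 tok/s figure reflects one valid measurement diluted by eight zeros rather than actual throughput.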
Speculative decoding example (14B + 1.7B):
[14B Q4_K_M + 1.7B Q4_K_M] request 2/9 ttft=24.521s tps=0.0 tokens=512
[14B Q4_K_M + 1.7B Q4_K_M] request 5/9 ttft=20.797s tps=104.3 tokens=512
...
Occasionally, a request shows realistic TPS (e.g., 84.5 or 104.3).
Additional Context
Context size: 32768
-ngl 99 (full GPU offload)
GPU utilization appears normal.
No CUDA errors.
No OOM conditions.
Questions
Is TPS computed from usage["completion_tokens"]?
Does llama-server reliably provide usage during streaming?
Could the benchmark be miscounting tokens when usage is absent?
Is there a known incompatibility between draftbench and llama-server streaming responses?
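To make the first three questions concrete, here is a minimal sketch of how a streaming benchmark *could* compute per-request TPS and end up at 0.0 when the server omits `usage`. None of draftbench's internals are shown in this report, so `compute_tps`, `chunks`, and `gen_start` are hypothetical names, not draftbench's actual code:

```python
import time

def compute_tps(chunks, gen_start):
    """Hypothetical per-request TPS calculation for an OpenAI-style
    streamed response. `chunks` is a list of parsed SSE JSON objects;
    `gen_start` is the time the first token arrived.

    A naive implementation reads usage["completion_tokens"] from the
    final chunk and silently falls back to 0 when usage is absent,
    yielding tps=0.0 even though tokens were generated.
    """
    elapsed = time.monotonic() - gen_start
    usage = chunks[-1].get("usage") if chunks else None

    # Fragile path: 0 whenever the server omits usage in streaming mode.
    completion_tokens = (usage or {}).get("completion_tokens", 0)

    # More robust fallback: count content-bearing delta chunks directly.
    if completion_tokens == 0:
        completion_tokens = sum(
            1 for c in chunks
            if c.get("choices", [{}])[0].get("delta", {}).get("content")
        )

    return completion_tokens / elapsed if elapsed > 0 else 0.0
```

Note that under the OpenAI streaming convention, `usage` is only emitted when the request sets `stream_options: {"include_usage": true}`; whether a given llama-server build honors that option (and on every request, or only intermittently) would need to be verified, which may explain why some requests show realistic TPS and others show 0.0.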