Description
When benchmarking llama-server using draftbench, TPS is intermittently reported as 0.0 even though 512 tokens are generated successfully.
This happens both in baseline mode and when using draft models.
The final averaged result shows non-zero tok/s, but per-request TPS is often 0.0.
Environment
GPU: NVIDIA Tesla P40 (24GB)
llama.cpp: built from source
Server command:

```
/home/mabe/llama.cpp/build/bin/llama-server \
  -m /home/mabe/llama.cpp/models/base/Qwen3-14B-Q4_K_M.gguf \
  --host 127.0.0.1 \
  --port 8080 \
  -ngl 99 \
  -c 32768
```
Models
Base:
Qwen3-14B-Q4_K_M.gguf
Draft models tested (Qwen3):
4B Q4_K_M
1.7B Q4_K_M
0.6B Q4_K_M
Observed Behavior
Example output (baseline):
[14B Q4_K_M baseline] request 1/9 ttft=21.310s tps=0.0 tokens=512
[14B Q4_K_M baseline] request 2/9 ttft=21.978s tps=0.0 tokens=512
[14B Q4_K_M baseline] request 3/9 ttft=22.509s tps=0.0 tokens=512
[14B Q4_K_M baseline] request 4/9 ttft=22.750s tps=0.0 tokens=512
[14B Q4_K_M baseline] request 5/9 ttft=16.623s tps=84.5 tokens=512
[14B Q4_K_M baseline] request 6/9 ttft=22.388s tps=0.0 tokens=512
...
Result: 9.38 tok/s
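A quick sanity check suggests the averaged result may simply be the mean of the per-request tps values with the zeros included. This is a hypothetical reconstruction (the elided requests 7-9 are assumed to have also reported 0.0, which is not shown in the log):

```python
# Hypothetical reconstruction of the baseline run above:
# per-request tps values, assuming the elided requests 7-9 also reported 0.0.
per_request_tps = [0.0, 0.0, 0.0, 0.0, 84.5, 0.0, 0.0, 0.0, 0.0]

# Mean over all requests, zeros included.
avg = sum(per_request_tps) / len(per_request_tps)
print(f"{avg:.2f} tok/s")  # ~9.39, close to the reported 9.38
```

If this matches draftbench's averaging, the 9.38 tok/s figure reflects one valid measurement diluted by eight zeros rather than actual throughput.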
Speculative decoding example (14B + 1.7B):
[14B Q4_K_M + 1.7B Q4_K_M] request 2/9 ttft=24.521s tps=0.0 tokens=512
[14B Q4_K_M + 1.7B Q4_K_M] request 5/9 ttft=20.797s tps=104.3 tokens=512
...
Occasionally, a request shows realistic TPS (e.g., 84.5 or 104.3).
Additional Context
Context size: 32768
-ngl 99 (full GPU offload)
GPU utilization appears normal.
No CUDA errors.
No OOM conditions.
Questions
Is TPS computed from usage["completion_tokens"]?
Does llama-server reliably provide usage during streaming?
Could the benchmark be miscounting tokens when usage is absent?
Is there a known incompatibility between draftbench and llama-server streaming responses?
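To make the first three questions concrete, here is a minimal sketch of how a streaming benchmark *could* compute per-request TPS and end up at 0.0 when the server omits `usage`. None of draftbench's internals are shown in this report, so `compute_tps`, `chunks`, and `gen_start` are hypothetical names, not draftbench's actual code:

```python
import time

def compute_tps(chunks, gen_start):
    """Hypothetical per-request TPS calculation for an OpenAI-style
    streamed response. `chunks` is a list of parsed SSE JSON objects;
    `gen_start` is the time the first token arrived.

    A naive implementation reads usage["completion_tokens"] from the
    final chunk and silently falls back to 0 when usage is absent,
    yielding tps=0.0 even though tokens were generated.
    """
    elapsed = time.monotonic() - gen_start
    usage = chunks[-1].get("usage") if chunks else None

    # Fragile path: 0 whenever the server omits usage in streaming mode.
    completion_tokens = (usage or {}).get("completion_tokens", 0)

    # More robust fallback: count content-bearing delta chunks directly.
    if completion_tokens == 0:
        completion_tokens = sum(
            1 for c in chunks
            if c.get("choices", [{}])[0].get("delta", {}).get("content")
        )

    return completion_tokens / elapsed if elapsed > 0 else 0.0
```

Note that under the OpenAI streaming convention, `usage` is only emitted when the request sets `stream_options: {"include_usage": true}`; whether a given llama-server build honors that option (and on every request, or only intermittently) would need to be verified, which may explain why some requests show realistic TPS and others show 0.0.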