
TPS intermittently reported as 0.0 when benchmarking llama-server (Tesla P40, Qwen3 14B + Draft models) #3

@Mabe75

Description


When benchmarking llama-server using draftbench, TPS is intermittently reported as 0.0 even though 512 tokens are generated successfully.

This happens both in baseline mode and when using draft models.

The final averaged result shows non-zero tok/s, but per-request TPS is often 0.0.

Environment

GPU: NVIDIA Tesla P40 (24GB)

llama.cpp built

Server command:

/home/mabe/llama.cpp/build/bin/llama-server \
  -m /home/mabe/llama.cpp/models/base/Qwen3-14B-Q4_K_M.gguf \
  --host 127.0.0.1 \
  --port 8080 \
  -ngl 99 \
  -c 32768
Models

Base:

Qwen3-14B-Q4_K_M.gguf

Draft models tested ('Qwen")

4B Q4_K_M

1.7B Q4_K_M

0.6B Q4_K_M

Observed Behavior

Example output (baseline):

[14B Q4_K_M baseline] request 1/9 ttft=21.310s tps=0.0 tokens=512
[14B Q4_K_M baseline] request 2/9 ttft=21.978s tps=0.0 tokens=512
[14B Q4_K_M baseline] request 3/9 ttft=22.509s tps=0.0 tokens=512
[14B Q4_K_M baseline] request 4/9 ttft=22.750s tps=0.0 tokens=512
[14B Q4_K_M baseline] request 5/9 ttft=16.623s tps=84.5 tokens=512
[14B Q4_K_M baseline] request 6/9 ttft=22.388s tps=0.0 tokens=512
...
Result: 9.38 tok/s
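For what it's worth, the reported 9.38 tok/s average is roughly what a plain mean over the per-request values would give if only request 5 recorded a non-zero TPS (requests 7–9 are not shown in the log and are assumed 0.0 here), which suggests the zeros are being folded into the final average rather than discarded:

```python
# Per-request TPS values from the baseline run above; requests 7-9
# are not shown in the log and are assumed to be 0.0 here.
tps = [0.0, 0.0, 0.0, 0.0, 84.5, 0.0, 0.0, 0.0, 0.0]

avg = sum(tps) / len(tps)
print(f"{avg:.2f} tok/s")  # ~9.39 tok/s, close to the reported 9.38
```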

Speculative decoding example (14B + 1.7B):

[14B Q4_K_M + 1.7B Q4_K_M] request 2/9 ttft=24.521s tps=0.0 tokens=512
[14B Q4_K_M + 1.7B Q4_K_M] request 5/9 ttft=20.797s tps=104.3 tokens=512
...

Occasionally, a request shows realistic TPS (e.g., 84.5 or 104.3).

Additional Context

Context size: 32768

-ngl 99 (full GPU offload)

GPU utilization appears normal.

No CUDA errors.

No OOM conditions.

Questions

Is TPS computed from usage["completion_tokens"]?

Does llama-server reliably provide usage during streaming?

Could the benchmark be miscounting tokens when usage is absent?

Is there a known incompatibility between draftbench and llama-server streaming responses?
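One way the pattern above could arise: in the OpenAI-compatible streaming API, `usage` typically appears only in the final SSE chunk, and only when the client asks for it (e.g. via `stream_options.include_usage`); whether llama-server emits it may depend on version and request options. A minimal sketch of the suspected failure mode, assuming a benchmark that divides a token count by decode time (`per_request_tps` is a hypothetical helper, not draftbench's actual code):

```python
def per_request_tps(usage_tokens, streamed_tokens, ttft, total_time):
    # Prefer the server-reported count; fall back to tokens counted
    # from the streamed deltas when `usage` never arrived.
    n = usage_tokens if usage_tokens is not None else streamed_tokens
    gen_time = total_time - ttft  # decode time, excluding prefill
    return n / gen_time if gen_time > 0 else 0.0

# With usage absent but a streamed-token fallback, TPS looks realistic:
print(round(per_request_tps(None, 512, 21.3, 27.4), 1))  # 83.9

# A benchmark that reads only usage["completion_tokens"] and treats a
# missing field as 0 would report tps=0.0 despite 512 tokens generated:
print(round(per_request_tps(0, 512, 21.3, 27.4), 1))  # 0.0
```

If that is what is happening, the intermittent non-zero values would correspond to the requests where the final chunk happened to include `usage`.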
