Implement more verbose inference metrics #5321

@gyliu513

Description

🚀 Describe the new functionality needed

Currently, completions and chat/completions expose only a few metrics; we should add the following:

llama_stack_tokens_per_second{...}

llama_stack_inference_duration_seconds{...}

llama_stack_time_to_first_token_seconds{...}

💡 Why is this needed? What if we don't build it?

  1. llama_stack_inference_duration_seconds: End-to-end latency from request arrival to completion. This differs from the generic request_duration_seconds in that it measures only the inference time, excluding routing, auth, and middleware overhead. Critical for SLA monitoring and for comparing provider performance.

  2. llama_stack_time_to_first_token_seconds (streaming only): The time a user waits before seeing any output. This is the primary UX metric for streaming applications: a request might take 5 s in total, but if the first token arrives in 200 ms the experience feels responsive. Without it, operators can't distinguish "slow start + fast generation" from "fast start + slow generation."

  3. llama_stack_tokens_per_second: Output throughput per request. This directly measures generation speed and is the standard metric for comparing inference backends. Combined with the model and provider attributes, it lets operators answer "which provider gives the best throughput for model X?" and detect throughput regressions.
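For streaming requests, all three values can be captured by wrapping the chunk iterator. A hedged sketch under assumed names, where `record` is a hypothetical `callback(metric_name, value)` standing in for the real metrics sink:

```python
import time
from collections.abc import Iterable, Iterator


def instrument_stream(chunks: Iterable, record) -> Iterator:
    """Yield chunks unchanged while timing the stream (illustrative sketch).

    `record(name, value)` is an assumed callback, not a real Llama Stack API.
    Here each chunk is counted as one token for simplicity; a real
    implementation would read token counts from the response chunks.
    """
    start = time.perf_counter()
    first_token_at = None
    n_tokens = 0
    for chunk in chunks:
        if first_token_at is None:
            first_token_at = time.perf_counter()
            record("llama_stack_time_to_first_token_seconds",
                   first_token_at - start)
        n_tokens += 1
        yield chunk
    duration = time.perf_counter() - start
    record("llama_stack_inference_duration_seconds", duration)
    if duration > 0:
        record("llama_stack_tokens_per_second", n_tokens / duration)
```

Because the wrapper is a generator, the final duration and throughput samples are only recorded once the caller drains the stream, which matches when a streaming request actually completes.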

All three metrics carry model, provider, stream, and status attributes, enabling per-model, per-provider, and streaming-vs-non-streaming breakdowns in dashboards like Grafana.
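As a sketch of how those attributes enable the breakdowns, the sample below keys an in-memory store by the metric name plus attribute set (the metric name is from this issue; the store itself is a stand-in for a real exporter such as a Prometheus client):

```python
from collections import defaultdict


def record(store, name, value, attributes):
    # Key each sample by metric name plus its sorted attribute pairs, so
    # per-model / per-provider / streaming-vs-non-streaming series stay separate.
    key = (name, tuple(sorted(attributes.items())))
    store[key].append(value)


store = defaultdict(list)
record(store, "llama_stack_tokens_per_second", 42.0,
       {"model": "llama-3", "provider": "ollama", "stream": "true", "status": "ok"})
record(store, "llama_stack_tokens_per_second", 38.0,
       {"model": "llama-3", "provider": "vllm", "stream": "true", "status": "ok"})
# Two providers for the same model produce two distinct series, which is
# exactly the "which provider gives the best throughput?" breakdown.
```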

Related: #2596

Labels: enhancement (New feature or request)