🚀 Describe the new functionality needed
Currently, completions and chat/completions expose a few metrics, but we need to add the following:
- `llama_stack_tokens_per_second{...}`
- `llama_stack_inference_duration_seconds{...}`
- `llama_stack_time_to_first_token_seconds{...}`
💡 Why is this needed? What if we don't build it?
- `llama_stack_inference_duration_seconds`: End-to-end latency from request arrival to completion. This differs from the generic `request_duration_seconds` because it measures only the inference time, excluding routing, auth, and middleware overhead. Critical for SLA monitoring and provider performance comparison.
- `llama_stack_time_to_first_token_seconds` (streaming only): The time a user waits before seeing any output. This is the primary UX metric for streaming applications: a request might complete in 5s total, but if the first token arrives in 200ms, the experience feels responsive. Without this, operators can't distinguish "slow start + fast generation" from "fast start + slow generation."
- `llama_stack_tokens_per_second`: Output throughput per request. This directly measures generation speed and is the standard metric for comparing inference backends. Combined with the `model` and `provider` attributes, it lets operators answer "which provider gives the best throughput for model X?" and detect throughput degradation.
All three metrics carry `model`, `provider`, `stream`, and `status` attributes, enabling per-model, per-provider, and streaming-vs-non-streaming breakdowns in dashboards like Grafana.
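As a rough illustration of how these three values relate, here is a minimal Python sketch that wraps a streaming token iterator and computes duration, time-to-first-token, and tokens-per-second with wall-clock timers. The function and attribute names are hypothetical and not part of the llama-stack codebase; a real implementation would record these into its telemetry pipeline rather than return a dict.

```python
import time

def run_with_metrics(token_stream, model, provider):
    # Hypothetical wrapper for illustration: consumes a streaming token
    # iterator and derives the three proposed metrics from timestamps.
    attrs = {"model": model, "provider": provider, "stream": True, "status": "success"}
    start = time.perf_counter()
    ttft = None
    tokens = []
    for token in token_stream:
        if ttft is None:
            # llama_stack_time_to_first_token_seconds: delay until first output
            ttft = time.perf_counter() - start
        tokens.append(token)
    # llama_stack_inference_duration_seconds: full generation time
    duration = time.perf_counter() - start
    # llama_stack_tokens_per_second: per-request output throughput
    tps = len(tokens) / duration if duration > 0 else 0.0
    return tokens, {
        "time_to_first_token_seconds": ttft,
        "inference_duration_seconds": duration,
        "tokens_per_second": tps,
        "attributes": attrs,
    }

def fake_stream():
    # Stand-in for a provider's streaming response
    for t in ["Hello", ",", " world"]:
        time.sleep(0.01)
        yield t

tokens, metrics = run_with_metrics(fake_stream(), model="llama-3", provider="ollama")
```

Note that the duration timer starts when the wrapper is entered, so TTFT is always less than or equal to the total duration, which is what makes the "slow start + fast generation" vs. "fast start + slow generation" distinction above observable.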
Other thoughts
No response
Related: #2596