Implement more verbose inference metrics #5321

@gyliu513

Description

🚀 Describe the new functionality needed

Currently, completions and chat/completions expose only a few metrics; we should add the following:

llama_stack_tokens_per_second{...}

llama_stack_inference_duration_seconds{...}

llama_stack_time_to_first_token_seconds{...}

💡 Why is this needed? What if we don't build it?

  1. llama_stack_inference_duration_seconds: End-to-end latency from request arrival to completion. This differs from the generic request_duration_seconds in that it measures only the inference time, excluding routing, auth, and middleware overhead. Critical for SLA monitoring and for comparing provider performance.

  2. llama_stack_time_to_first_token_seconds (streaming only): The time a user waits before seeing any output. This is the primary UX metric for streaming applications: a request might take 5 s in total, but if the first token arrives in 200 ms the experience feels responsive. Without it, operators can't distinguish "slow start + fast generation" from "fast start + slow generation."

  3. llama_stack_tokens_per_second: Output throughput per request. This directly measures generation speed and is the standard metric for comparing inference backends. Combined with the model and provider attributes, it lets operators answer "which provider gives the best throughput for model X?" and detect throughput regressions.
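For streaming requests, all three values can be captured by wrapping the chunk iterator. A hedged sketch under assumed names, where `record` is a hypothetical `callback(metric_name, value)` standing in for the real metrics sink:

```python
import time
from collections.abc import Iterable, Iterator


def instrument_stream(chunks: Iterable, record) -> Iterator:
    """Yield chunks unchanged while timing the stream (illustrative sketch).

    `record(name, value)` is an assumed callback, not a real Llama Stack API.
    Here each chunk is counted as one token for simplicity; a real
    implementation would read token counts from the response chunks.
    """
    start = time.perf_counter()
    first_token_at = None
    n_tokens = 0
    for chunk in chunks:
        if first_token_at is None:
            first_token_at = time.perf_counter()
            record("llama_stack_time_to_first_token_seconds",
                   first_token_at - start)
        n_tokens += 1
        yield chunk
    duration = time.perf_counter() - start
    record("llama_stack_inference_duration_seconds", duration)
    if duration > 0:
        record("llama_stack_tokens_per_second", n_tokens / duration)
```

Because the wrapper is a generator, the final duration and throughput samples are only recorded once the caller drains the stream, which matches when a streaming request actually completes.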

All three metrics carry model, provider, stream, and status attributes, enabling per-model, per-provider, and streaming-vs-non-streaming breakdowns in dashboards like Grafana.
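As a sketch of how those attributes enable the breakdowns, the sample below keys an in-memory store by the metric name plus attribute set (the metric name is from this issue; the store itself is a stand-in for a real exporter such as a Prometheus client):

```python
from collections import defaultdict


def record(store, name, value, attributes):
    # Key each sample by metric name plus its sorted attribute pairs, so
    # per-model / per-provider / streaming-vs-non-streaming series stay separate.
    key = (name, tuple(sorted(attributes.items())))
    store[key].append(value)


store = defaultdict(list)
record(store, "llama_stack_tokens_per_second", 42.0,
       {"model": "llama-3", "provider": "ollama", "stream": "true", "status": "ok"})
record(store, "llama_stack_tokens_per_second", 38.0,
       {"model": "llama-3", "provider": "vllm", "stream": "true", "status": "ok"})
# Two providers for the same model produce two distinct series, which is
# exactly the "which provider gives the best throughput?" breakdown.
```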

Related: #2596

Labels: enhancement (New feature or request)