
feat: Add inference metrics#5320

Open
gyliu513 wants to merge 1 commit into llamastack:main from gyliu513:inference

Conversation

Contributor

@gyliu513 gyliu513 commented Mar 26, 2026

What does this PR do?

Fixed #5321

Summary

  • Add three new OpenTelemetry inference metrics to track LLM serving performance:
    • `llama_stack.inference.duration_seconds`: end-to-end inference latency (streaming and non-streaming)
    • `llama_stack.inference.time_to_first_token_seconds`: time to first content token (streaming only)
    • `llama_stack.inference.tokens_per_second`: output token throughput (completion_tokens / duration)
  • All metrics carry model, provider, stream, and status attributes
  • Add Grafana dashboard and kind cluster deployment script
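The three measurements can be sketched in plain Python to make their definitions concrete. This is an illustrative stand-in, not code from this PR: the `measure_stream` helper is hypothetical, and each chunk is treated as one output token for simplicity.

```python
import time


def measure_stream(chunks):
    """Measure duration, time-to-first-token, and tokens/sec over an
    iterable of streamed output chunks (hypothetical helper; one chunk
    is counted as one token here for illustration)."""
    start = time.monotonic()
    ttft = None
    tokens = 0
    for _chunk in chunks:
        if ttft is None:
            # First content token arrived: record time to first token.
            ttft = time.monotonic() - start
        tokens += 1
    duration = time.monotonic() - start
    return {
        "duration_seconds": duration,
        "time_to_first_token_seconds": ttft,  # None for empty streams
        "tokens_per_second": tokens / duration if duration > 0 else 0.0,
    }
```

In the actual PR these values would be recorded on the corresponding OpenTelemetry histograms with `model`, `provider`, `stream`, and `status` attributes attached.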

Query Examples

```shell
# P50 Tokens Per Second
curl -s --get 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.50, sum by (le) (rate(llama_stack_llama_stack_inference_tokens_per_second_bucket[5m])))'

# P95 Inference Duration
curl -s --get 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.95, sum by (le) (rate(llama_stack_llama_stack_inference_duration_seconds_bucket[5m])))'

# P95 Time to First Token
curl -s --get 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.95, sum by (le) (rate(llama_stack_llama_stack_inference_time_to_first_token_seconds_bucket{stream="true"}[5m])))'
```
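The same instant queries can be issued programmatically. A minimal sketch using only the Python standard library, assuming the Prometheus server from the curl examples at `localhost:9090` (the helper names here are hypothetical):

```python
import json
import urllib.parse
import urllib.request

PROM_BASE = "http://localhost:9090"  # Prometheus address from the curl examples


def build_query_url(promql: str, base: str = PROM_BASE) -> str:
    """URL-encode a PromQL expression for the instant-query HTTP API,
    mirroring curl's --data-urlencode behavior."""
    return base + "/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def prom_query(promql: str, base: str = PROM_BASE):
    """Run an instant query and return the result vector.
    Requires a reachable Prometheus server."""
    with urllib.request.urlopen(build_query_url(promql, base)) as resp:
        body = json.load(resp)
    if body.get("status") != "success":
        raise RuntimeError(f"query failed: {body}")
    return body["data"]["result"]
```

For example, `prom_query('histogram_quantile(0.95, sum by (le) (rate(llama_stack_llama_stack_inference_duration_seconds_bucket[5m])))')` would return the P95 inference-duration vector.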

@meta-cla meta-cla bot added the CLA Signed label Mar 26, 2026
@gyliu513 gyliu513 marked this pull request as draft March 26, 2026 15:46
@gyliu513 gyliu513 marked this pull request as ready for review March 26, 2026 19:38


Development

Successfully merging this pull request may close these issues.

Implement more verbose inference metrics
