
Add P50/P95/P99 Latency Metrics #4

@DevilsAutumn

Description


Add support for P50, P95, and P99 latency metrics in the benchmarking pipeline. These percentile latencies will provide a clearer understanding of how LLMs perform under varying loads and request patterns. Since LLM response times can vary significantly depending on prompt length, model size, and hardware, percentile metrics are essential for accurate and fair evaluation.

Why This Matters

  • P50 represents typical generation latency for common workloads.
  • P95 captures slower responses due to edge cases (long prompts, cache misses, cold starts).
  • P99 exposes tail latency, which is critical for production use-cases where worst-case performance impacts user experience.
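To make the three metrics concrete, here is a minimal sketch of how they separate typical, slow, and tail latencies on a skewed sample. It uses the nearest-rank percentile method; the sample values are illustrative, not measurements from this project.

```python
import math

def percentile(latencies, p):
    """Nearest-rank percentile: the smallest sample with at least
    p% of all samples at or below it."""
    if not latencies:
        raise ValueError("no latency samples")
    ordered = sorted(latencies)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]

# 100 simulated latencies in ms: mostly fast, a few slow, one outlier
latencies = [20] * 50 + [40] * 45 + [200] * 4 + [1500]

print(percentile(latencies, 50))  # 20  -> typical request
print(percentile(latencies, 95))  # 40  -> slower edge cases
print(percentile(latencies, 99))  # 200 -> tail latency
```

Note how the mean (about 44 ms here) hides the tail entirely, while P99 surfaces it; that is why percentiles rather than averages are proposed.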

Scope

  • Collect raw latency data per model inference/generation call.
  • Compute P50, P95, and P99 percentiles using a single, well-defined percentile method (e.g. nearest-rank or linear interpolation) so results are reproducible across runs.
  • Expose metrics in the benchmark results object and output reports (JSON/CSV/Markdown).
  • Add documentation explaining these metrics and how to interpret them.
  • Update the CLI output to display percentile latencies.
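The collect-then-summarize flow above could be sketched as follows. This is a hypothetical interface, not the pipeline's actual API: `benchmark` and `generate` are placeholder names, and the JSON keys are one possible shape for the results object.

```python
import json
import math
import time

def percentile(samples, p):
    """Nearest-rank percentile over a non-empty sample list."""
    ordered = sorted(samples)
    return ordered[math.ceil(p / 100 * len(ordered)) - 1]

def benchmark(generate, prompts):
    """Time each inference call and summarize the raw latencies.
    `generate` stands in for any model inference/generation callable."""
    latencies_ms = []
    for prompt in prompts:
        start = time.perf_counter()
        generate(prompt)
        latencies_ms.append((time.perf_counter() - start) * 1000)
    # Percentiles go into the results object alongside the raw count
    return {
        "samples": len(latencies_ms),
        "p50_ms": percentile(latencies_ms, 50),
        "p95_ms": percentile(latencies_ms, 95),
        "p99_ms": percentile(latencies_ms, 99),
    }

# Dummy "model" that just sleeps ~1 ms per call
results = benchmark(lambda prompt: time.sleep(0.001), ["hello"] * 20)
print(json.dumps(results, indent=2))  # JSON report output
```

The same dict can feed the CSV/Markdown reports and the CLI table, so every output format reads identical numbers.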
