[Feature]: Add inference performance validation with AIPerf #448
Description
Feature Summary
Add an inference-throughput performance validator that benchmarks Dynamo vLLM inference endpoints using AIPerf, complementing the existing nccl-all-reduce-bw training performance validator.
Problem/Use Case
AICR currently supports training performance validation via NCCL bandwidth tests (`aicr validate --phase performance`), but has no equivalent for inference workloads. Users deploying Dynamo-based inference stacks cannot validate that their inference endpoints meet throughput and latency requirements as part of the AICR validation pipeline.
Without inference performance validation:
- Broken GPU drivers, misconfigured DRA, or CUDA errors in inference deployments go undetected
- No automated go/no-go gate for inference stack readiness
- No baseline performance numbers for comparison across deployments
Proposed Solution
Add an inference-throughput check to the performance phase that:
- Discovers or deploys an inference workload:
  - If a Dynamo frontend service is already running, benchmarks against it (scoped to DynamoGraphDeployment namespaces to avoid benchmarking the wrong service on shared clusters)
  - If no endpoint exists, auto-deploys a `DynamoGraphDeployment` with Qwen/Qwen3-0.6B (1 worker per GPU, single node), benchmarks, then cleans up
- Runs AIPerf as a K8s Job with dynamic concurrency (16 × worker_count), measuring:
  - Output token throughput (tokens/sec)
  - Time to first token p99 (ms)
- Evaluates constraints from the recipe overlay (with 10% tolerance):

  ```yaml
  validation:
    performance:
      checks:
        - inference-throughput
      constraints:
        - name: inference-throughput
          value: ">= 5000"
        - name: inference-ttft-p99
          value: "<= 200"
  ```
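The 10% tolerance rule can be sketched as below. This is a hypothetical illustration, not the actual implementation: the `evaluate` function and its signature are invented here, and it assumes constraints are simple `<operator> <number>` expressions like those in the overlay above.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// evaluate checks a measured value against a constraint expression such as
// ">= 5000" or "<= 200", relaxing the threshold by the given tolerance
// (0.10 for the 10% described above). Hypothetical sketch only.
func evaluate(measured float64, expr string, tolerance float64) (bool, error) {
	parts := strings.Fields(expr)
	if len(parts) != 2 {
		return false, fmt.Errorf("malformed constraint %q", expr)
	}
	threshold, err := strconv.ParseFloat(parts[1], 64)
	if err != nil {
		return false, fmt.Errorf("bad threshold in %q: %w", expr, err)
	}
	switch parts[0] {
	case ">=":
		// Lower bound relaxed downward: 5000 tok/s passes at >= 4500.
		return measured >= threshold*(1-tolerance), nil
	case "<=":
		// Upper bound relaxed upward: 200 ms passes at <= 220.
		return measured <= threshold*(1+tolerance), nil
	default:
		return false, fmt.Errorf("unsupported operator %q", parts[0])
	}
}

func main() {
	ok, _ := evaluate(5667, ">= 5000", 0.10)
	fmt.Println(ok) // true: 5667 clears the relaxed floor of 4500
	ok, _ = evaluate(210, "<= 200", 0.10)
	fmt.Println(ok) // true: 210 is within the relaxed ceiling of 220
}
```

The tolerance relaxes rather than tightens each bound, so borderline runs (e.g. a TTFT p99 of 210 ms against a 200 ms constraint) do not flap between pass and fail.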
Tested on EKS (H100, Qwen/Qwen3-0.6B)
| Scenario | Workers | Throughput (tok/s) | TTFT p99 (ms) | Result |
|---|---|---|---|---|
| 1 GPU, auto-deploy | 1 | 5,667 | 84 | PASS |
| 1 GPU, existing workload | 1 | 6,039 | 58 | PASS |
| 8 GPUs, auto-deploy (single node) | 8 | 37,961 | 146 | PASS |
| 16 GPUs, auto-deploy (2 nodes) | 16 | 74,927 | 120 | PASS |
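What "near-linear scaling" means in the table above is that per-worker throughput stays roughly flat as workers are added. A quick calculation over the auto-deploy rows (numbers taken directly from the table):

```go
package main

import "fmt"

func main() {
	// Output token throughput from the auto-deploy rows of the table above
	// (EKS, H100, Qwen/Qwen3-0.6B).
	results := []struct {
		workers    int
		throughput float64 // tokens/sec
	}{
		{1, 5667},
		{8, 37961},
		{16, 74927},
	}
	for _, r := range results {
		// Per-worker throughput stays in the same ballpark across scales.
		fmt.Printf("%2d workers: %6.0f tok/s total, %.0f tok/s per worker\n",
			r.workers, r.throughput, r.throughput/float64(r.workers))
	}
}
```

Per-worker throughput drops only modestly from 1 to 16 workers (roughly 5,667 to ~4,683 tok/s), which also explains why a fixed ">= 5000" constraint is evaluated against aggregate throughput rather than per-worker numbers.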
Success Criteria
- `aicr validate --phase performance` runs the inference-throughput check for inference+dynamo recipes
- Auto-deploy path creates workload, benchmarks, cleans up (idempotent, handles partial failures)
- Existing workload path discovers and benchmarks scoped to the correct service
- Default constraints (>= 5000 tok/s, <= 200ms TTFT p99) catch broken deployments
- Near-linear GPU scaling observed (8.1x with 8 GPUs)
Alternatives Considered
- inference-perf (separate load generator) — similar to AIPerf but less integrated with Dynamo ecosystem
- Manual benchmarking — not automated, not part of validation pipeline
- Larger model (Llama-3.1-8B) — considered for more representative benchmarks, but Qwen3-0.6B is preferred for smoke testing (fast model load, small image, matches Dynamo deploy template defaults)
Implementation
Branch: feat/inference-perf-validator (yuan fork)
Files: 8 changed, +1043 lines
- validators/performance/inference_throughput.go — CheckFunc
- validators/performance/inference_throughput_constraint.go — Core pipeline
- validators/performance/testdata/inference/{dynamo-deployment,queue}.yaml — Templates
- recipes/validators/catalog.yaml — Catalog entry
- recipes/overlays/h100-eks-ubuntu-inference-dynamo.yaml — Overlay constraints
- pkg/defaults/timeouts.go — Timeout constants
- validators/performance/main.go — Registration