[Feature]: Add inference performance validation with AIPerf #448

@yuanchen8911

Description

Feature Summary

Add an inference-throughput performance validator that benchmarks Dynamo vLLM inference endpoints using AIPerf, complementing the existing nccl-all-reduce-bw training performance validator.

Problem/Use Case

AICR currently supports training performance validation via NCCL bandwidth tests (aicr validate --phase performance), but has no equivalent for inference workloads. Users deploying Dynamo-based inference stacks cannot validate that their inference endpoints meet throughput and latency requirements as part of the AICR validation pipeline.

Without inference performance validation:

  • Broken GPU drivers, misconfigured DRA, or CUDA errors in inference deployments go undetected
  • No automated go/no-go gate for inference stack readiness
  • No baseline performance numbers for comparison across deployments

Proposed Solution

Add an inference-throughput check to the performance phase that:

  1. Discovers or deploys an inference workload:

    • If a Dynamo frontend service is already running, benchmarks against it (scoped to DynamoGraphDeployment namespaces to avoid benchmarking the wrong service on shared clusters)
    • If no endpoint exists, auto-deploys a DynamoGraphDeployment with Qwen/Qwen3-0.6B (1 worker per GPU, single node), benchmarks, then cleans up
  2. Runs AIPerf as a K8s Job with dynamic concurrency (16 × worker_count), measuring:

    • Output token throughput (tokens/sec)
    • Time to first token p99 (ms)
  3. Evaluates constraints from the recipe overlay (with 10% tolerance):

```yaml
validation:
  performance:
    checks:
      - inference-throughput
    constraints:
      - name: inference-throughput
        value: ">= 5000"
      - name: inference-ttft-p99
        value: "<= 200"
```
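The tolerance rule in step 3 could be sketched as follows. This is a minimal illustration, not the actual code in `inference_throughput_constraint.go`; the `evaluateConstraint` name and signature are hypothetical:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// evaluateConstraint checks a constraint expression such as ">= 5000" against
// a measured value, allowing a relative tolerance (10% in the proposal).
// Hypothetical sketch — names and error handling are illustrative only.
func evaluateConstraint(expr string, measured, tolerance float64) (bool, error) {
	fields := strings.Fields(expr)
	if len(fields) != 2 {
		return false, fmt.Errorf("malformed constraint %q", expr)
	}
	threshold, err := strconv.ParseFloat(fields[1], 64)
	if err != nil {
		return false, fmt.Errorf("bad threshold in %q: %w", expr, err)
	}
	switch fields[0] {
	case ">=":
		// Lower bound: the measurement may fall short of the threshold by
		// up to tolerance*threshold before the check fails.
		return measured >= threshold*(1-tolerance), nil
	case "<=":
		// Upper bound: the measurement may exceed the threshold by up to
		// tolerance*threshold before the check fails.
		return measured <= threshold*(1+tolerance), nil
	default:
		return false, fmt.Errorf("unsupported operator %q", fields[0])
	}
}

func main() {
	ok, _ := evaluateConstraint(">= 5000", 4600, 0.10) // 4600 >= 4500
	fmt.Println(ok)
	ok, _ = evaluateConstraint("<= 200", 230, 0.10) // 230 > 220
	fmt.Println(ok)
}
```

With a 10% tolerance, a throughput of 4,600 tok/s still passes the `>= 5000` constraint (effective floor 4,500), while a TTFT p99 of 230 ms fails `<= 200` (effective ceiling 220 ms).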

Tested on EKS (H100, Qwen/Qwen3-0.6B)

| Scenario | Workers | Throughput (tok/s) | TTFT p99 (ms) | Result |
|----------|---------|--------------------|---------------|--------|
| 1 GPU, auto-deploy | 1 | 5,667 | 84 | PASS |
| 1 GPU, existing workload | 1 | 6,039 | 58 | PASS |
| 8 GPUs, auto-deploy (single node) | 8 | 37,961 | 146 | PASS |
| 16 GPUs, auto-deploy (2 nodes) | 16 | 74,927 | 120 | PASS |
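The AIPerf Job sizes its load to the deployment: concurrency is 16 × worker_count, so the scenarios above run at 16, 16, 128, and 256 concurrent requests respectively. A trivial sketch of that rule (the `aiperfConcurrency` helper is illustrative, not AICR's actual code):

```go
package main

import "fmt"

// aiperfConcurrency applies the proposal's dynamic-concurrency rule:
// 16 simultaneous requests per vLLM worker.
func aiperfConcurrency(workers int) int {
	return 16 * workers
}

func main() {
	for _, workers := range []int{1, 8, 16} {
		fmt.Printf("%2d worker(s) -> concurrency %d\n", workers, aiperfConcurrency(workers))
	}
}
```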

Success Criteria

  • aicr validate --phase performance runs the inference-throughput check for inference+dynamo recipes
  • Auto-deploy path creates the workload, benchmarks, and cleans up (idempotent, handles partial failures)
  • Existing-workload path discovers the frontend service and benchmarks it, scoped to the correct service
  • Default constraints (>= 5000 tok/s, <= 200 ms TTFT p99) catch broken deployments
  • Near-linear GPU scaling observed (6.7× throughput with 8 GPUs vs. the single-GPU auto-deploy run)
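The namespace scoping behind the existing-workload path (benchmark only services in namespaces that own a DynamoGraphDeployment, so a shared cluster never benchmarks the wrong service) can be illustrated with a minimal sketch; the `Service` type and `pickFrontendService` helper are hypothetical stand-ins, not AICR's actual API:

```go
package main

import "fmt"

// Service is a minimal stand-in for a discovered Kubernetes Service.
type Service struct {
	Namespace string
	Name      string
}

// pickFrontendService returns the first candidate frontend service that lives
// in a namespace owning a DynamoGraphDeployment. Candidates in other
// namespaces are skipped so unrelated services are never benchmarked.
func pickFrontendService(candidates []Service, dgdNamespaces map[string]bool) (Service, bool) {
	for _, svc := range candidates {
		if dgdNamespaces[svc.Namespace] {
			return svc, true
		}
	}
	return Service{}, false // no match: caller falls back to the auto-deploy path
}

func main() {
	dgdNS := map[string]bool{"dynamo-prod": true}
	candidates := []Service{
		{Namespace: "default", Name: "unrelated-frontend"},
		{Namespace: "dynamo-prod", Name: "dynamo-frontend"},
	}
	if svc, ok := pickFrontendService(candidates, dgdNS); ok {
		fmt.Printf("benchmarking %s/%s\n", svc.Namespace, svc.Name)
	}
}
```

When no candidate survives the namespace filter, the check would take the auto-deploy path described in the proposed solution.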

Alternatives Considered

  • inference-perf (separate load generator) — similar to AIPerf but less integrated with Dynamo ecosystem
  • Manual benchmarking — not automated, not part of validation pipeline
  • Larger model (Llama-3.1-8B) — considered for more representative benchmarks, but Qwen3-0.6B is preferred for smoke testing (fast model load, small image, matches Dynamo deploy template defaults)

Implementation

Branch: feat/inference-perf-validator (yuan fork)

Files: 8 changed, +1043 lines

  • validators/performance/inference_throughput.go — CheckFunc
  • validators/performance/inference_throughput_constraint.go — Core pipeline
  • validators/performance/testdata/inference/{dynamo-deployment,queue}.yaml — Templates
  • recipes/validators/catalog.yaml — Catalog entry
  • recipes/overlays/h100-eks-ubuntu-inference-dynamo.yaml — Overlay constraints
  • pkg/defaults/timeouts.go — Timeout constants
  • validators/performance/main.go — Registration
