[Feature]: Add inference performance validation with AIPerf #448
Description
Feature Summary
Add an inference-throughput performance validator that benchmarks Dynamo vLLM inference endpoints using AIPerf, complementing the existing nccl-all-reduce-bw training performance validator.
Problem/Use Case
AICR currently supports training performance validation via NCCL bandwidth tests (`aicr validate --phase performance`), but has no equivalent for inference workloads. Users deploying Dynamo-based inference stacks cannot validate that their inference endpoints meet throughput and latency requirements as part of the AICR validation pipeline.
Without inference performance validation:
- Broken GPU drivers, misconfigured DRA, or CUDA errors in inference deployments go undetected
- No automated go/no-go gate for inference stack readiness
- No baseline performance numbers for comparison across deployments
Proposed Solution
Add an inference-throughput check to the performance phase that:
- Discovers or deploys an inference workload:
  - If a Dynamo frontend service is already running, benchmarks against it (scoped to DynamoGraphDeployment namespaces to avoid benchmarking the wrong service on shared clusters)
  - If no endpoint exists, auto-deploys a `DynamoGraphDeployment` with Qwen/Qwen3-0.6B (1 worker per GPU, single node), benchmarks, then cleans up
- Runs AIPerf as a K8s Job with dynamic concurrency (16 × worker_count), measuring:
  - Output token throughput (tokens/sec)
  - Time to first token p99 (ms)
- Evaluates constraints from the recipe overlay (with 10% tolerance):

  ```yaml
  validation:
    performance:
      checks:
        - inference-throughput
      constraints:
        - name: inference-throughput
          value: ">= 5000"
        - name: inference-ttft-p99
          value: "<= 200"
  ```
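The 10% tolerance rule can be sketched as below. This is a hypothetical illustration, not the actual implementation: the `evaluate` function and its signature are invented here, and it assumes constraints are simple `<operator> <number>` expressions like those in the overlay above.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// evaluate checks a measured value against a constraint expression such as
// ">= 5000" or "<= 200", relaxing the threshold by the given tolerance
// (0.10 for the 10% described above). Hypothetical sketch only.
func evaluate(measured float64, expr string, tolerance float64) (bool, error) {
	parts := strings.Fields(expr)
	if len(parts) != 2 {
		return false, fmt.Errorf("malformed constraint %q", expr)
	}
	threshold, err := strconv.ParseFloat(parts[1], 64)
	if err != nil {
		return false, fmt.Errorf("bad threshold in %q: %w", expr, err)
	}
	switch parts[0] {
	case ">=":
		// Lower bound relaxed downward: 5000 tok/s passes at >= 4500.
		return measured >= threshold*(1-tolerance), nil
	case "<=":
		// Upper bound relaxed upward: 200 ms passes at <= 220.
		return measured <= threshold*(1+tolerance), nil
	default:
		return false, fmt.Errorf("unsupported operator %q", parts[0])
	}
}

func main() {
	ok, _ := evaluate(5667, ">= 5000", 0.10)
	fmt.Println(ok) // true: 5667 clears the relaxed floor of 4500
	ok, _ = evaluate(210, "<= 200", 0.10)
	fmt.Println(ok) // true: 210 is within the relaxed ceiling of 220
}
```

The tolerance relaxes rather than tightens each bound, so borderline runs (e.g. a TTFT p99 of 210 ms against a 200 ms constraint) do not flap between pass and fail.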
Tested on EKS (H100, Qwen/Qwen3-0.6B)
| Scenario | Workers | Throughput (tok/s) | TTFT p99 (ms) | Result |
|---|---|---|---|---|
| 1 GPU, auto-deploy | 1 | 5,667 | 84 | PASS |
| 1 GPU, existing workload | 1 | 6,039 | 58 | PASS |
| 8 GPUs, auto-deploy (single node) | 8 | 37,961 | 146 | PASS |
| 16 GPUs, auto-deploy (2 nodes) | 16 | 74,927 | 120 | PASS |
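What "near-linear scaling" means in the table above is that per-worker throughput stays roughly flat as workers are added. A quick calculation over the auto-deploy rows (numbers taken directly from the table):

```go
package main

import "fmt"

func main() {
	// Output token throughput from the auto-deploy rows of the table above
	// (EKS, H100, Qwen/Qwen3-0.6B).
	results := []struct {
		workers    int
		throughput float64 // tokens/sec
	}{
		{1, 5667},
		{8, 37961},
		{16, 74927},
	}
	for _, r := range results {
		// Per-worker throughput stays in the same ballpark across scales.
		fmt.Printf("%2d workers: %6.0f tok/s total, %.0f tok/s per worker\n",
			r.workers, r.throughput, r.throughput/float64(r.workers))
	}
}
```

Per-worker throughput drops only modestly from 1 to 16 workers (roughly 5,667 to ~4,683 tok/s), which also explains why a fixed ">= 5000" constraint is evaluated against aggregate throughput rather than per-worker numbers.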
Success Criteria
- `aicr validate --phase performance` runs the inference-throughput check for inference+dynamo recipes
- Auto-deploy path creates workload, benchmarks, cleans up (idempotent, handles partial failures)
- Existing workload path discovers and benchmarks scoped to the correct service
- Default constraints (>= 5000 tok/s, <= 200ms TTFT p99) catch broken deployments
- Near-linear GPU scaling observed (8.1x with 8 GPUs)
Alternatives Considered
- inference-perf (separate load generator) — similar to AIPerf but less integrated with Dynamo ecosystem
- Manual benchmarking — not automated, not part of validation pipeline
- Larger model (Llama-3.1-8B) — considered for more representative benchmarks, but Qwen3-0.6B is preferred for smoke testing (fast model load, small image, matches Dynamo deploy template defaults)
Implementation
Branch: feat/inference-perf-validator (yuan fork)
Files: 8 changed, +1043 lines
- validators/performance/inference_throughput.go — CheckFunc
- validators/performance/inference_throughput_constraint.go — Core pipeline
- validators/performance/testdata/inference/{dynamo-deployment,queue}.yaml — Templates
- recipes/validators/catalog.yaml — Catalog entry
- recipes/overlays/h100-eks-ubuntu-inference-dynamo.yaml — Overlay constraints
- pkg/defaults/timeouts.go — Timeout constants
- validators/performance/main.go — Registration