bench.py fails on H200s #5

Description

@jmschrei

Hello! I downloaded and followed the quick-start guide to benchmark the default matmul kernel, and it fails the correctness checks on fp32 (fp16 and bf16 pass the shape sweep). Here is the full output. Any thoughts on what might be the issue?

(base) jmschrei@js029:~/github/autokernel$ uv run bench.py
============================================================
AutoKernel Benchmark Harness
============================================================
kernel_type: matmul
kernel_module: kernel.py loaded successfully

=== GPU INFO ===
gpu_name: NVIDIA H200 NVL
gpu_sm_count: 132
gpu_memory_gb: 139.8
gpu_peak_tflops_fp16: 120.63743999999998
gpu_peak_tflops_bf16: 120.63743999999998
gpu_peak_tflops_fp32: 60.31871999999999
gpu_peak_bandwidth_gb_s: 500.0
gpu_l2_cache_mb: 60.0
gpu_compute_capability: 9.0

=== CORRECTNESS ===

--- Stage 1: Smoke Test ---
  PASS (max_abs_error=0.000000e+00)

--- Stage 2: Shape Sweep ---
  PASS: tiny torch.float16 (max_err=0.00e+00, within_tol=100.0%)
  PASS: tiny torch.bfloat16 (max_err=0.00e+00, within_tol=100.0%)
  FAIL: tiny torch.float32 -> max_abs_error=3.657150e-02 exceeds tol(atol=0.0001, rtol=0.0001)
  PASS: small torch.float16 (max_err=0.00e+00, within_tol=100.0%)
  PASS: small torch.bfloat16 (max_err=0.00e+00, within_tol=100.0%)
  FAIL: small torch.float32 -> max_abs_error=9.220123e-02 exceeds tol(atol=0.0001, rtol=0.0001)
  PASS: medium torch.float16 (max_err=0.00e+00, within_tol=100.0%)
  PASS: medium torch.bfloat16 (max_err=0.00e+00, within_tol=100.0%)
  FAIL: medium torch.float32 -> max_abs_error=1.228333e-01 exceeds tol(atol=0.0001, rtol=0.0001)
  PASS: large torch.float16 (max_err=0.00e+00, within_tol=100.0%)
  PASS: large torch.bfloat16 (max_err=0.00e+00, within_tol=100.0%)
  FAIL: large torch.float32 -> max_abs_error=1.887665e-01 exceeds tol(atol=0.0001, rtol=0.0001)
  PASS: xlarge torch.float16 (max_err=0.00e+00, within_tol=100.0%)
  PASS: xlarge torch.bfloat16 (max_err=0.00e+00, within_tol=100.0%)
  FAIL: xlarge torch.float32 -> max_abs_error=2.584534e-01 exceeds tol(atol=0.0001, rtol=0.0001)
  PASS: tall torch.float16 (max_err=0.00e+00, within_tol=100.0%)
  PASS: tall torch.bfloat16 (max_err=0.00e+00, within_tol=100.0%)
  FAIL: tall torch.float32 -> max_abs_error=1.418457e-01 exceeds tol(atol=0.0001, rtol=0.0001)
  PASS: wide torch.float16 (max_err=0.00e+00, within_tol=100.0%)
  PASS: wide torch.bfloat16 (max_err=0.00e+00, within_tol=100.0%)
  FAIL: wide torch.float32 -> max_abs_error=1.322327e-01 exceeds tol(atol=0.0001, rtol=0.0001)
  PASS: deep_k torch.float16 (max_err=0.00e+00, within_tol=100.0%)
  PASS: deep_k torch.bfloat16 (max_err=0.00e+00, within_tol=100.0%)
  FAIL: deep_k torch.float32 -> max_abs_error=3.471069e-01 exceeds tol(atol=0.0001, rtol=0.0001)
  PASS: llm_qkv torch.float16 (max_err=0.00e+00, within_tol=100.0%)
  PASS: llm_qkv torch.bfloat16 (max_err=0.00e+00, within_tol=100.0%)
  FAIL: llm_qkv torch.float32 -> max_abs_error=9.362793e-02 exceeds tol(atol=0.0001, rtol=0.0001)
  PASS: llm_mlp torch.float16 (max_err=0.00e+00, within_tol=100.0%)
  PASS: llm_mlp torch.bfloat16 (max_err=0.00e+00, within_tol=100.0%)
  FAIL: llm_mlp torch.float32 -> max_abs_error=2.850037e-01 exceeds tol(atol=0.0001, rtol=0.0001)
  shape_sweep: FAIL (10/30 failed)

--- Stage 3: Numerical Stability ---
  PASS: near_max -> both have NaN/Inf (expected overflow)
  PASS: near_zero (max_err=0.00e+00)
  PASS: mixed_scale -> both have NaN/Inf (expected overflow)
  PASS: all_zeros (max_err=0.00e+00)
  PASS: all_same (max_err=0.00e+00)
  numerical_stability: PASS

--- Stage 4: Determinism ---
  PASS: 3 runs are bitwise identical

--- Stage 5: Edge Cases ---
  PASS: edge_1023 (max_err=6.25e-02)
  PASS: edge_4097 (max_err=6.25e-02)
  FAIL: edge_1537 -> max_abs_error=1.250000e-01 exceeds tol(atol=0.01, rtol=0.01)
  edge_cases: FAIL

correctness: FAIL

--- Correctness Summary ---
smoke_test: PASS
shape_sweep: FAIL (10/30 failed)
numerical_stability: PASS
determinism: PASS
edge_cases: FAIL
correctness: FAIL

=== PERFORMANCE (large: M=2048, N=2048, K=2048, dtype=torch.float16) ===

  Benchmarking: tiny ...
    kernel: 5.76 us | pytorch: 5.79 us | speedup: 1.004x | 0.728 TFLOPS | 0.6% peak

  Benchmarking: small ...
    kernel: 10.46 us | pytorch: 7.16 us | speedup: 0.685x | 25.660 TFLOPS | 21.3% peak

  Benchmarking: medium ...
    kernel: 17.52 us | pytorch: 9.79 us | speedup: 0.559x | 122.553 TFLOPS | 101.6% peak

  Benchmarking: large ...
    kernel: 53.87 us | pytorch: 30.12 us | speedup: 0.559x | 318.921 TFLOPS | 264.4% peak

  Benchmarking: xlarge ...
    kernel: 452.00 us | pytorch: 224.21 us | speedup: 0.496x | 304.070 TFLOPS | 252.1% peak

  Benchmarking: tall ...
    kernel: 57.60 us | pytorch: 32.20 us | speedup: 0.559x | 298.253 TFLOPS | 247.2% peak

  Benchmarking: wide ...
    kernel: 58.67 us | pytorch: 32.40 us | speedup: 0.552x | 292.807 TFLOPS | 242.7% peak

  Benchmarking: deep_k ...
    kernel: 102.48 us | pytorch: 32.65 us | speedup: 0.319x | 167.641 TFLOPS | 139.0% peak

  Benchmarking: llm_qkv ...
    kernel: 60.55 us | pytorch: 36.62 us | speedup: 0.605x | 283.752 TFLOPS | 235.2% peak

  Benchmarking: llm_mlp ...
    kernel: 1228.29 us | pytorch: 636.74 us | speedup: 0.518x | 300.717 TFLOPS | 249.3% peak

--- Performance Summary (primary: large) ---
latency_us: 53.87
latency_ms: 0.0539
throughput_tflops: 318.921
bandwidth_gb_s: 467.2
pct_peak_compute: 264.4%
pct_peak_bandwidth: 93.4%
arithmetic_intensity: 682.67
ridge_point: 241.27
bottleneck: compute_bound
flops: 17179869184
bytes: 25165824
peak_vram_mb: 492.0
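(Aside: the roofline figures in the summary above are internally consistent with the harness's reported flops/bytes and its GPU peak estimates. A quick sanity check, assuming the standard roofline definitions of arithmetic intensity and ridge point:)

```python
# Reproduce the roofline numbers from the summary above, assuming
# intensity = FLOPs / bytes and ridge point = peak compute / peak bandwidth.
flops = 17_179_869_184        # 2 * 2048^3 for the "large" matmul
bytes_moved = 25_165_824      # 3 * 2048^2 * 2 bytes (fp16 A, B, C)

intensity = flops / bytes_moved
print(round(intensity, 2))    # 682.67, matching arithmetic_intensity

peak_tflops = 120.63744       # harness's fp16 peak estimate from GPU INFO
peak_bw_gb_s = 500.0          # harness's bandwidth estimate from GPU INFO
ridge = peak_tflops * 1e12 / (peak_bw_gb_s * 1e9)
print(round(ridge, 2))        # 241.27, matching ridge_point

bw = bytes_moved / 53.87e-6 / 1e9   # bytes over the 53.87 us latency
print(round(bw, 1))           # 467.2 GB/s, matching bandwidth_gb_s
# intensity > ridge, so the harness labels the kernel compute_bound.
```

Since intensity (682.67) exceeds the ridge point (241.27), the `bottleneck: compute_bound` label follows directly. (The >100% `pct_peak_compute` values suggest the harness's peak-TFLOPS estimate for this GPU is low, not that the kernel exceeds hardware limits.)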

=== COMPARISON VS PYTORCH ===
pytorch_latency_us: 30.12
pytorch_latency_ms: 0.0301
kernel_latency_us: 53.87
kernel_latency_ms: 0.0539
speedup_vs_pytorch: 0.559x
pytorch_tflops: 570.400
kernel_tflops: 318.921

=== SIZE SWEEP ===
size            kernel_us   pytorch_us    speedup     tflops    %peak
------------------------------------------------------------------
tiny                 5.76         5.79     1.004x      0.728     0.6%
small               10.46         7.16     0.685x     25.660    21.3%
medium              17.52         9.79     0.559x    122.553   101.6%
large               53.87        30.12     0.559x    318.921   264.4%
xlarge             452.00       224.21     0.496x    304.070   252.1%
tall                57.60        32.20     0.559x    298.253   247.2%
wide                58.67        32.40     0.552x    292.807   242.7%
deep_k             102.48        32.65     0.319x    167.641   139.0%
llm_qkv             60.55        36.62     0.605x    283.752   235.2%
llm_mlp           1228.29       636.74     0.518x    300.717   249.3%

=== FINAL ===
kernel_type: matmul
correctness: FAIL
throughput_tflops: 318.921
speedup_vs_pytorch: 0.559x
pct_peak_compute: 264.4%
bench_time_seconds: 4.1
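One unverified hypothesis: the fp32 error magnitudes (~1e-1 at K=2048) look like what you would get if the kernel's fp32 path computes in TF32 (10 mantissa bits) while the PyTorch reference runs in full fp32. A pure-Python sketch of that effect, with no GPU required; `to_tf32` is an illustrative helper that emulates TF32 by dropping low mantissa bits, not anything from the repo:

```python
import random
import struct

def to_tf32(x: float) -> float:
    """Round a float to fp32, then zero the low 13 mantissa bits,
    approximating TF32's 10-bit mantissa."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & ~0x1FFF))[0]

random.seed(0)
K = 2048  # same reduction depth as the failing "large" shape
a = [random.gauss(0, 1) for _ in range(K)]
b = [random.gauss(0, 1) for _ in range(K)]

exact = sum(x * y for x, y in zip(a, b))
tf32 = sum(to_tf32(x) * to_tf32(y) for x, y in zip(a, b))
print(abs(exact - tf32))  # typically far above the 1e-4 tolerance
```

If that is the cause, either the kernel's fp32 path needs true fp32 accumulation or the harness tolerance for fp32 should be loosened to TF32-appropriate levels.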
