bench.py fails on H200s #5

Description

@jmschrei

Hello! I downloaded and followed the quick-start guide to benchmark the default matmul kernel, and it fails the correctness checks on fp32 (fp16 and bf16 pass the shape sweep). Here is the full output. Any thoughts on what might be the issue?

(base) jmschrei@js029:~/github/autokernel$ uv run bench.py
============================================================
AutoKernel Benchmark Harness
============================================================
kernel_type: matmul
kernel_module: kernel.py loaded successfully

=== GPU INFO ===
gpu_name: NVIDIA H200 NVL
gpu_sm_count: 132
gpu_memory_gb: 139.8
gpu_peak_tflops_fp16: 120.63743999999998
gpu_peak_tflops_bf16: 120.63743999999998
gpu_peak_tflops_fp32: 60.31871999999999
gpu_peak_bandwidth_gb_s: 500.0
gpu_l2_cache_mb: 60.0
gpu_compute_capability: 9.0

=== CORRECTNESS ===

--- Stage 1: Smoke Test ---
  PASS (max_abs_error=0.000000e+00)

--- Stage 2: Shape Sweep ---
  PASS: tiny torch.float16 (max_err=0.00e+00, within_tol=100.0%)
  PASS: tiny torch.bfloat16 (max_err=0.00e+00, within_tol=100.0%)
  FAIL: tiny torch.float32 -> max_abs_error=3.657150e-02 exceeds tol(atol=0.0001, rtol=0.0001)
  PASS: small torch.float16 (max_err=0.00e+00, within_tol=100.0%)
  PASS: small torch.bfloat16 (max_err=0.00e+00, within_tol=100.0%)
  FAIL: small torch.float32 -> max_abs_error=9.220123e-02 exceeds tol(atol=0.0001, rtol=0.0001)
  PASS: medium torch.float16 (max_err=0.00e+00, within_tol=100.0%)
  PASS: medium torch.bfloat16 (max_err=0.00e+00, within_tol=100.0%)
  FAIL: medium torch.float32 -> max_abs_error=1.228333e-01 exceeds tol(atol=0.0001, rtol=0.0001)
  PASS: large torch.float16 (max_err=0.00e+00, within_tol=100.0%)
  PASS: large torch.bfloat16 (max_err=0.00e+00, within_tol=100.0%)
  FAIL: large torch.float32 -> max_abs_error=1.887665e-01 exceeds tol(atol=0.0001, rtol=0.0001)
  PASS: xlarge torch.float16 (max_err=0.00e+00, within_tol=100.0%)
  PASS: xlarge torch.bfloat16 (max_err=0.00e+00, within_tol=100.0%)
  FAIL: xlarge torch.float32 -> max_abs_error=2.584534e-01 exceeds tol(atol=0.0001, rtol=0.0001)
  PASS: tall torch.float16 (max_err=0.00e+00, within_tol=100.0%)
  PASS: tall torch.bfloat16 (max_err=0.00e+00, within_tol=100.0%)
  FAIL: tall torch.float32 -> max_abs_error=1.418457e-01 exceeds tol(atol=0.0001, rtol=0.0001)
  PASS: wide torch.float16 (max_err=0.00e+00, within_tol=100.0%)
  PASS: wide torch.bfloat16 (max_err=0.00e+00, within_tol=100.0%)
  FAIL: wide torch.float32 -> max_abs_error=1.322327e-01 exceeds tol(atol=0.0001, rtol=0.0001)
  PASS: deep_k torch.float16 (max_err=0.00e+00, within_tol=100.0%)
  PASS: deep_k torch.bfloat16 (max_err=0.00e+00, within_tol=100.0%)
  FAIL: deep_k torch.float32 -> max_abs_error=3.471069e-01 exceeds tol(atol=0.0001, rtol=0.0001)
  PASS: llm_qkv torch.float16 (max_err=0.00e+00, within_tol=100.0%)
  PASS: llm_qkv torch.bfloat16 (max_err=0.00e+00, within_tol=100.0%)
  FAIL: llm_qkv torch.float32 -> max_abs_error=9.362793e-02 exceeds tol(atol=0.0001, rtol=0.0001)
  PASS: llm_mlp torch.float16 (max_err=0.00e+00, within_tol=100.0%)
  PASS: llm_mlp torch.bfloat16 (max_err=0.00e+00, within_tol=100.0%)
  FAIL: llm_mlp torch.float32 -> max_abs_error=2.850037e-01 exceeds tol(atol=0.0001, rtol=0.0001)
  shape_sweep: FAIL (10/30 failed)

--- Stage 3: Numerical Stability ---
  PASS: near_max -> both have NaN/Inf (expected overflow)
  PASS: near_zero (max_err=0.00e+00)
  PASS: mixed_scale -> both have NaN/Inf (expected overflow)
  PASS: all_zeros (max_err=0.00e+00)
  PASS: all_same (max_err=0.00e+00)
  numerical_stability: PASS

--- Stage 4: Determinism ---
  PASS: 3 runs are bitwise identical

--- Stage 5: Edge Cases ---
  PASS: edge_1023 (max_err=6.25e-02)
  PASS: edge_4097 (max_err=6.25e-02)
  FAIL: edge_1537 -> max_abs_error=1.250000e-01 exceeds tol(atol=0.01, rtol=0.01)
  edge_cases: FAIL

correctness: FAIL

--- Correctness Summary ---
smoke_test: PASS
shape_sweep: FAIL (10/30 failed)
numerical_stability: PASS
determinism: PASS
edge_cases: FAIL
correctness: FAIL

=== PERFORMANCE (large: M=2048, N=2048, K=2048, dtype=torch.float16) ===

  Benchmarking: tiny ...
    kernel: 5.76 us | pytorch: 5.79 us | speedup: 1.004x | 0.728 TFLOPS | 0.6% peak

  Benchmarking: small ...
    kernel: 10.46 us | pytorch: 7.16 us | speedup: 0.685x | 25.660 TFLOPS | 21.3% peak

  Benchmarking: medium ...
    kernel: 17.52 us | pytorch: 9.79 us | speedup: 0.559x | 122.553 TFLOPS | 101.6% peak

  Benchmarking: large ...
    kernel: 53.87 us | pytorch: 30.12 us | speedup: 0.559x | 318.921 TFLOPS | 264.4% peak

  Benchmarking: xlarge ...
    kernel: 452.00 us | pytorch: 224.21 us | speedup: 0.496x | 304.070 TFLOPS | 252.1% peak

  Benchmarking: tall ...
    kernel: 57.60 us | pytorch: 32.20 us | speedup: 0.559x | 298.253 TFLOPS | 247.2% peak

  Benchmarking: wide ...
    kernel: 58.67 us | pytorch: 32.40 us | speedup: 0.552x | 292.807 TFLOPS | 242.7% peak

  Benchmarking: deep_k ...
    kernel: 102.48 us | pytorch: 32.65 us | speedup: 0.319x | 167.641 TFLOPS | 139.0% peak

  Benchmarking: llm_qkv ...
    kernel: 60.55 us | pytorch: 36.62 us | speedup: 0.605x | 283.752 TFLOPS | 235.2% peak

  Benchmarking: llm_mlp ...
    kernel: 1228.29 us | pytorch: 636.74 us | speedup: 0.518x | 300.717 TFLOPS | 249.3% peak

--- Performance Summary (primary: large) ---
latency_us: 53.87
latency_ms: 0.0539
throughput_tflops: 318.921
bandwidth_gb_s: 467.2
pct_peak_compute: 264.4%
pct_peak_bandwidth: 93.4%
arithmetic_intensity: 682.67
ridge_point: 241.27
bottleneck: compute_bound
flops: 17179869184
bytes: 25165824
peak_vram_mb: 492.0
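(Aside: the roofline figures in the summary above are internally consistent with the harness's reported flops/bytes and its GPU peak estimates. A quick sanity check, assuming the standard roofline definitions of arithmetic intensity and ridge point:)

```python
# Reproduce the roofline numbers from the summary above, assuming
# intensity = FLOPs / bytes and ridge point = peak compute / peak bandwidth.
flops = 17_179_869_184        # 2 * 2048^3 for the "large" matmul
bytes_moved = 25_165_824      # 3 * 2048^2 * 2 bytes (fp16 A, B, C)

intensity = flops / bytes_moved
print(round(intensity, 2))    # 682.67, matching arithmetic_intensity

peak_tflops = 120.63744       # harness's fp16 peak estimate from GPU INFO
peak_bw_gb_s = 500.0          # harness's bandwidth estimate from GPU INFO
ridge = peak_tflops * 1e12 / (peak_bw_gb_s * 1e9)
print(round(ridge, 2))        # 241.27, matching ridge_point

bw = bytes_moved / 53.87e-6 / 1e9   # bytes over the 53.87 us latency
print(round(bw, 1))           # 467.2 GB/s, matching bandwidth_gb_s
# intensity > ridge, so the harness labels the kernel compute_bound.
```

Since intensity (682.67) exceeds the ridge point (241.27), the `bottleneck: compute_bound` label follows directly. (The >100% `pct_peak_compute` values suggest the harness's peak-TFLOPS estimate for this GPU is low, not that the kernel exceeds hardware limits.)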

=== COMPARISON VS PYTORCH ===
pytorch_latency_us: 30.12
pytorch_latency_ms: 0.0301
kernel_latency_us: 53.87
kernel_latency_ms: 0.0539
speedup_vs_pytorch: 0.559x
pytorch_tflops: 570.400
kernel_tflops: 318.921

=== SIZE SWEEP ===
size            kernel_us   pytorch_us    speedup     tflops    %peak
------------------------------------------------------------------
tiny                 5.76         5.79     1.004x      0.728     0.6%
small               10.46         7.16     0.685x     25.660    21.3%
medium              17.52         9.79     0.559x    122.553   101.6%
large               53.87        30.12     0.559x    318.921   264.4%
xlarge             452.00       224.21     0.496x    304.070   252.1%
tall                57.60        32.20     0.559x    298.253   247.2%
wide                58.67        32.40     0.552x    292.807   242.7%
deep_k             102.48        32.65     0.319x    167.641   139.0%
llm_qkv             60.55        36.62     0.605x    283.752   235.2%
llm_mlp           1228.29       636.74     0.518x    300.717   249.3%

=== FINAL ===
kernel_type: matmul
correctness: FAIL
throughput_tflops: 318.921
speedup_vs_pytorch: 0.559x
pct_peak_compute: 264.4%
bench_time_seconds: 4.1
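One unverified hypothesis: the fp32 error magnitudes (~1e-1 at K=2048) look like what you would get if the kernel's fp32 path computes in TF32 (10 mantissa bits) while the PyTorch reference runs in full fp32. A pure-Python sketch of that effect, with no GPU required; `to_tf32` is an illustrative helper that emulates TF32 by dropping low mantissa bits, not anything from the repo:

```python
import random
import struct

def to_tf32(x: float) -> float:
    """Round a float to fp32, then zero the low 13 mantissa bits,
    approximating TF32's 10-bit mantissa."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & ~0x1FFF))[0]

random.seed(0)
K = 2048  # same reduction depth as the failing "large" shape
a = [random.gauss(0, 1) for _ in range(K)]
b = [random.gauss(0, 1) for _ in range(K)]

exact = sum(x * y for x, y in zip(a, b))
tf32 = sum(to_tf32(x) * to_tf32(y) for x, y in zip(a, b))
print(abs(exact - tf32))  # typically far above the 1e-4 tolerance
```

If that is the cause, either the kernel's fp32 path needs true fp32 accumulation or the harness tolerance for fp32 should be loosened to TF32-appropriate levels.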
