Open
Description
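Hello! I downloaded AutoKernel and followed the quick-start guide to benchmark the default matmul kernel, but the correctness checks fail: every float32 case in the shape sweep (10/30) exceeds the tolerance, and the edge_1537 edge case fails as well. The full output is below. Any thoughts on what might be the issue?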
(base) jmschrei@js029:~/github/autokernel$ uv run bench.py
============================================================
AutoKernel Benchmark Harness
============================================================
kernel_type: matmul
kernel_module: kernel.py loaded successfully
=== GPU INFO ===
gpu_name: NVIDIA H200 NVL
gpu_sm_count: 132
gpu_memory_gb: 139.8
gpu_peak_tflops_fp16: 120.63743999999998
gpu_peak_tflops_bf16: 120.63743999999998
gpu_peak_tflops_fp32: 60.31871999999999
gpu_peak_bandwidth_gb_s: 500.0
gpu_l2_cache_mb: 60.0
gpu_compute_capability: 9.0
=== CORRECTNESS ===
--- Stage 1: Smoke Test ---
PASS (max_abs_error=0.000000e+00)
--- Stage 2: Shape Sweep ---
PASS: tiny torch.float16 (max_err=0.00e+00, within_tol=100.0%)
PASS: tiny torch.bfloat16 (max_err=0.00e+00, within_tol=100.0%)
FAIL: tiny torch.float32 -> max_abs_error=3.657150e-02 exceeds tol(atol=0.0001, rtol=0.0001)
PASS: small torch.float16 (max_err=0.00e+00, within_tol=100.0%)
PASS: small torch.bfloat16 (max_err=0.00e+00, within_tol=100.0%)
FAIL: small torch.float32 -> max_abs_error=9.220123e-02 exceeds tol(atol=0.0001, rtol=0.0001)
PASS: medium torch.float16 (max_err=0.00e+00, within_tol=100.0%)
PASS: medium torch.bfloat16 (max_err=0.00e+00, within_tol=100.0%)
FAIL: medium torch.float32 -> max_abs_error=1.228333e-01 exceeds tol(atol=0.0001, rtol=0.0001)
PASS: large torch.float16 (max_err=0.00e+00, within_tol=100.0%)
PASS: large torch.bfloat16 (max_err=0.00e+00, within_tol=100.0%)
FAIL: large torch.float32 -> max_abs_error=1.887665e-01 exceeds tol(atol=0.0001, rtol=0.0001)
PASS: xlarge torch.float16 (max_err=0.00e+00, within_tol=100.0%)
PASS: xlarge torch.bfloat16 (max_err=0.00e+00, within_tol=100.0%)
FAIL: xlarge torch.float32 -> max_abs_error=2.584534e-01 exceeds tol(atol=0.0001, rtol=0.0001)
PASS: tall torch.float16 (max_err=0.00e+00, within_tol=100.0%)
PASS: tall torch.bfloat16 (max_err=0.00e+00, within_tol=100.0%)
FAIL: tall torch.float32 -> max_abs_error=1.418457e-01 exceeds tol(atol=0.0001, rtol=0.0001)
PASS: wide torch.float16 (max_err=0.00e+00, within_tol=100.0%)
PASS: wide torch.bfloat16 (max_err=0.00e+00, within_tol=100.0%)
FAIL: wide torch.float32 -> max_abs_error=1.322327e-01 exceeds tol(atol=0.0001, rtol=0.0001)
PASS: deep_k torch.float16 (max_err=0.00e+00, within_tol=100.0%)
PASS: deep_k torch.bfloat16 (max_err=0.00e+00, within_tol=100.0%)
FAIL: deep_k torch.float32 -> max_abs_error=3.471069e-01 exceeds tol(atol=0.0001, rtol=0.0001)
PASS: llm_qkv torch.float16 (max_err=0.00e+00, within_tol=100.0%)
PASS: llm_qkv torch.bfloat16 (max_err=0.00e+00, within_tol=100.0%)
FAIL: llm_qkv torch.float32 -> max_abs_error=9.362793e-02 exceeds tol(atol=0.0001, rtol=0.0001)
PASS: llm_mlp torch.float16 (max_err=0.00e+00, within_tol=100.0%)
PASS: llm_mlp torch.bfloat16 (max_err=0.00e+00, within_tol=100.0%)
FAIL: llm_mlp torch.float32 -> max_abs_error=2.850037e-01 exceeds tol(atol=0.0001, rtol=0.0001)
shape_sweep: FAIL (10/30 failed)
--- Stage 3: Numerical Stability ---
PASS: near_max -> both have NaN/Inf (expected overflow)
PASS: near_zero (max_err=0.00e+00)
PASS: mixed_scale -> both have NaN/Inf (expected overflow)
PASS: all_zeros (max_err=0.00e+00)
PASS: all_same (max_err=0.00e+00)
numerical_stability: PASS
--- Stage 4: Determinism ---
PASS: 3 runs are bitwise identical
--- Stage 5: Edge Cases ---
PASS: edge_1023 (max_err=6.25e-02)
PASS: edge_4097 (max_err=6.25e-02)
FAIL: edge_1537 -> max_abs_error=1.250000e-01 exceeds tol(atol=0.01, rtol=0.01)
edge_cases: FAIL
correctness: FAIL
--- Correctness Summary ---
smoke_test: PASS
shape_sweep: FAIL (10/30 failed)
numerical_stability: PASS
determinism: PASS
edge_cases: FAIL
correctness: FAIL
=== PERFORMANCE (large: M=2048, N=2048, K=2048, dtype=torch.float16) ===
Benchmarking: tiny ...
kernel: 5.76 us | pytorch: 5.79 us | speedup: 1.004x | 0.728 TFLOPS | 0.6% peak
Benchmarking: small ...
kernel: 10.46 us | pytorch: 7.16 us | speedup: 0.685x | 25.660 TFLOPS | 21.3% peak
Benchmarking: medium ...
kernel: 17.52 us | pytorch: 9.79 us | speedup: 0.559x | 122.553 TFLOPS | 101.6% peak
Benchmarking: large ...
kernel: 53.87 us | pytorch: 30.12 us | speedup: 0.559x | 318.921 TFLOPS | 264.4% peak
Benchmarking: xlarge ...
kernel: 452.00 us | pytorch: 224.21 us | speedup: 0.496x | 304.070 TFLOPS | 252.1% peak
Benchmarking: tall ...
kernel: 57.60 us | pytorch: 32.20 us | speedup: 0.559x | 298.253 TFLOPS | 247.2% peak
Benchmarking: wide ...
kernel: 58.67 us | pytorch: 32.40 us | speedup: 0.552x | 292.807 TFLOPS | 242.7% peak
Benchmarking: deep_k ...
kernel: 102.48 us | pytorch: 32.65 us | speedup: 0.319x | 167.641 TFLOPS | 139.0% peak
Benchmarking: llm_qkv ...
kernel: 60.55 us | pytorch: 36.62 us | speedup: 0.605x | 283.752 TFLOPS | 235.2% peak
Benchmarking: llm_mlp ...
kernel: 1228.29 us | pytorch: 636.74 us | speedup: 0.518x | 300.717 TFLOPS | 249.3% peak
--- Performance Summary (primary: large) ---
latency_us: 53.87
latency_ms: 0.0539
throughput_tflops: 318.921
bandwidth_gb_s: 467.2
pct_peak_compute: 264.4%
pct_peak_bandwidth: 93.4%
arithmetic_intensity: 682.67
ridge_point: 241.27
bottleneck: compute_bound
flops: 17179869184
bytes: 25165824
peak_vram_mb: 492.0
=== COMPARISON VS PYTORCH ===
pytorch_latency_us: 30.12
pytorch_latency_ms: 0.0301
kernel_latency_us: 53.87
kernel_latency_ms: 0.0539
speedup_vs_pytorch: 0.559x
pytorch_tflops: 570.400
kernel_tflops: 318.921
=== SIZE SWEEP ===
size         kernel_us  pytorch_us   speedup    tflops     %peak
------------------------------------------------------------------
tiny              5.76        5.79    1.004x     0.728      0.6%
small            10.46        7.16    0.685x    25.660     21.3%
medium           17.52        9.79    0.559x   122.553    101.6%
large            53.87       30.12    0.559x   318.921    264.4%
xlarge          452.00      224.21    0.496x   304.070    252.1%
tall             57.60       32.20    0.559x   298.253    247.2%
wide             58.67       32.40    0.552x   292.807    242.7%
deep_k          102.48       32.65    0.319x   167.641    139.0%
llm_qkv          60.55       36.62    0.605x   283.752    235.2%
llm_mlp        1228.29      636.74    0.518x   300.717    249.3%
=== FINAL ===
kernel_type: matmul
correctness: FAIL
throughput_tflops: 318.921
speedup_vs_pytorch: 0.559x
pct_peak_compute: 264.4%
bench_time_seconds: 4.1
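For reference, the performance-summary figures above can be reproduced from the problem shape and the reported latency. This is a minimal sketch, not the harness's actual code: the variable names are made up for illustration, and the byte count assumes each of the three fp16 matrices is moved exactly once, which happens to match the `bytes` value in the log.

```python
# Inputs copied from the log above (primary benchmark: large, fp16).
M = N = K = 2048
latency_s = 53.87e-6                  # latency_us: 53.87
peak_tflops_fp16 = 120.63743999999998 # gpu_peak_tflops_fp16
peak_bandwidth_gb_s = 500.0           # gpu_peak_bandwidth_gb_s
BYTES_PER_ELEMENT = 2                 # fp16

# A matmul performs one multiply and one add per (m, n, k) triple.
flops = 2 * M * N * K

# Assumption: A (MxK), B (KxN) and C (MxN) are each moved once.
bytes_moved = (M * K + K * N + M * N) * BYTES_PER_ELEMENT

# Roofline quantities: FLOPs per byte, and the intensity at which
# the compute roof meets the bandwidth roof.
arithmetic_intensity = flops / bytes_moved
ridge_point = (peak_tflops_fp16 * 1e12) / (peak_bandwidth_gb_s * 1e9)

throughput_tflops = flops / latency_s / 1e12
bandwidth_gb_s = bytes_moved / latency_s / 1e9
pct_peak_compute = 100 * throughput_tflops / peak_tflops_fp16
pct_peak_bandwidth = 100 * bandwidth_gb_s / peak_bandwidth_gb_s

print(f"flops: {flops}")
print(f"bytes: {bytes_moved}")
print(f"arithmetic_intensity: {arithmetic_intensity:.2f}")
print(f"ridge_point: {ridge_point:.2f}")
print(f"throughput_tflops: {throughput_tflops:.3f}")
print(f"bandwidth_gb_s: {bandwidth_gb_s:.1f}")
print(f"pct_peak_compute: {pct_peak_compute:.1f}%")
print(f"pct_peak_bandwidth: {pct_peak_bandwidth:.1f}%")
```

The throughput differs from the log in the last digits because the latency here is the rounded value. Since `pct_peak_compute` comes out well above 100% (and `pytorch_tflops` is higher still), either the detected `gpu_peak_tflops_fp16` or the timing looks off; that may be a separate problem from the fp32 correctness failures.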