- Add native CuTile RMSNorm implementation (rms_norm_cutile.py) with a dual-strategy forward (gather/scatter + TMA) and autotuning
- Add CuTile RMSNorm unit tests and a kernel-level comparison benchmark
- Add Qwen3-8B FSDP multi-GPU benchmark script
- Fix pyproject.toml: use the stable cu130 index with explicit source pinning to prevent markupsafe/triton resolution failures on cp312
- Fix e2e/__init__.py: remove references to deleted modules (qwen_8b_liger, geglu) that caused ModuleNotFoundError on import
- Disable CuTile rope in comparison_small (the tileiras compiler fails on the B200 sm_100 architecture)
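For context on what the kernel must compute, a minimal NumPy reference for RMSNorm (the function name and signature here are illustrative, not the package's API) is:

```python
import numpy as np

def rms_norm_reference(x, weight, eps=1e-6):
    """RMSNorm over the last axis: y = x / sqrt(mean(x^2) + eps) * weight.

    x: (M, N) activations, weight: (N,) learned scale.
    Returns the normalized output and rstd (the reciprocal RMS per row,
    which the persistent kernel saves for the backward pass).
    """
    # Accumulate the mean-square in float64 to mimic a higher-precision reduction.
    rstd = 1.0 / np.sqrt(np.mean(x.astype(np.float64) ** 2, axis=-1, keepdims=True) + eps)
    return (x * rstd * weight).astype(x.dtype), rstd
```

The dual-strategy forward mentioned above only changes *how* rows are loaded (gather/scatter vs. TMA bulk copies); both paths must match this math.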
Rewrote CuTile RMSNorm with aggressive Blackwell B200 optimizations.

Forward kernels:
- Static-persistent 2D TMA: grid-stride loop over multi-row tiles, weight pre-loaded once, allow_tma=False stores (+30%), latency hints
- Gather/scatter 1D: tiled column loop for non-power-of-2 N (no padding waste)
- TMA 1-row: bulk DMA load for small M with power-of-2 N

Key optimizations:
- Heuristic config selection tuned per M/N regime (TMA occ=4 for small M, persistent tile_m=8 for medium M, persistent tile_m=4 for large M)
- High-padding routing: >20% waste routes to the gather kernel (fixes N=5120)
- Tile size capping: max(2, min(8, 32768 // TILE_N)) avoids register spills
- Python dispatch minimization: cached NUM_SMS, stream, and dtype as the dict key
- Rstd saved from the persistent kernel for backward correctness

Benchmark results (B200, bf16, vs Quack cuteDSL):
- Average: 1.06x slower (down from ~1.75x initially)
- Best: CuTile beats Quack on 5/16 configs (M=8192: 1.06x faster)
- Worst: 1.24x slower (M=2048, N=5120; was 1.45x)

Also adds a kernel-level benchmark: rms_norm_quack_vs_cutile.py
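The heuristic dispatch described above (padding-waste routing, per-regime kernel choice, tile-size capping) can be sketched in plain Python; the M thresholds and function names here are illustrative assumptions, not the actual dispatch code:

```python
def next_pow2(n: int) -> int:
    """Smallest power of two >= n (TILE_N must be a power of two)."""
    return 1 << (n - 1).bit_length()

def select_rms_norm_config(M: int, N: int):
    """Hypothetical per-M/N-regime kernel selection, mirroring the commit's heuristics."""
    TILE_N = next_pow2(N)
    padding_waste = (TILE_N - N) / TILE_N
    # >20% padding waste routes to the gather/scatter kernel (e.g. N=5120 pads to 8192).
    if padding_waste > 0.20:
        return ("gather_scatter", {"tile_n": 1024})
    if M <= 512:  # small-M cutoff is an assumption
        # Small M with power-of-2 N: 1-row bulk-DMA TMA kernel at occupancy 4.
        return ("tma_1row", {"occupancy": 4})
    # Persistent 2D TMA: tile_m=8 for medium M, 4 for large M,
    # capped by max(2, min(., 32768 // TILE_N)) to avoid register spills.
    preferred = 8 if M <= 4096 else 4
    tile_m = max(2, min(preferred, 32768 // TILE_N))
    return ("persistent_2d_tma", {"tile_m": tile_m, "tile_n": TILE_N})
```

For example, M=2048, N=5120 pads to TILE_N=8192 (37.5% waste) and so routes to the gather kernel, which is the N=5120 fix mentioned above.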
… update benchmarks

- Rename rms_norm_cutile.py → rms_norm.py (replaces the old Quack-based version)
- Update all imports in __init__.py, autotune.py, and tests
- Delete temporary experiment scripts (bench_bwd_*, ncu_bwd_*, profile_*, test_dw_*)
- Update E2E benchmark charts (PyTorch vs Liger vs Bastile on Qwen3-8B)

E2E results on B200 (Qwen3-8B, bf16, batch_size=1):
- seq=1024: Bastile +18.2% throughput vs PyTorch
- seq=2048: Bastile +20.7% throughput vs PyTorch
- seq=4096: Bastile +27.1% throughput vs PyTorch
- seq=8192: Bastile +32.1% throughput vs PyTorch, 34 GB less memory
- seq=16384: Bastile runs; PyTorch OOMs

Bastile beats Liger by 1.1-1.9x across all configs
…to charts

Deleted:
- CONTEXT.md (outdated session briefing)
- test_minimal_kernel.py (temp debug script)
- tests/ops/test_geglu.py (imports nonexistent gpt_oss_moe)
- tests/ops/test_rms_norm_cutile.py (stale duplicate of test_rms_norm.py)
- tests/benchmarks/e2e/comparison_small.py (superseded by qwen_8b_seqlen.py)
- tests/benchmarks/e2e/profile_kernels.py (outdated small-model profiler)

Cleaned:
- configs.py: removed unused GEGLUConfig, RMSNormConfig, SwiGLUConfig
- tests/ops/run_all.py: removed dead test_geglu reference
- tests/benchmarks/e2e/__init__.py: removed deleted-file references
- tests/benchmarks/run_all.py: fixed recursive self-import bug
- tests/benchmarks/kernel/rms_norm.py: fixed outdated 4-arg call to rms_norm()

Added:
- tests/benchmarks/kernel/bench_fused_lce.py: fused LCE kernel benchmark
- qwen_8b_seqlen.py: auto-generates bar charts to assets/ after each run

Updated:
- fused_linear_cross_entropy.py: removed the _ce_pytorch fallback, CuTile-only path
- Benchmark charts with the latest results
- CuTile fused MoE GEMM kernel for GPT-OSS-20B (128 experts, top-4)
- Fuses token-routing gather + MMA + bias + routing-weight multiply
- Optimized backward: vectorized scatter + bmm for weight grads (2.8x faster than the re-forward + autograd approach)
- CuTile gate kernel with fused forward/backward (fast sigmoid)
- Autotuned tile sizes with disk caching
- Forward: ~0.94x vs CUTLASS _grouped_mm (near parity)
- Forward+backward: ~1.07x faster than the HF grouped_mm baseline
- Includes benchmarks, profiling scripts, and correctness tests
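The fused sequence (gather routed tokens, expert GEMM, bias, routing-weight multiply, scatter back) can be expressed as an unfused NumPy reference; names and shapes here are assumptions for illustration, not the kernel's API:

```python
import numpy as np

def moe_experts_reference(x, w, b, topk_idx, topk_weight):
    """Unfused reference for the fused MoE GEMM.

    x: (T, d) tokens, w: (E, d, h) expert weights, b: (E, h) expert biases,
    topk_idx: (T, k) selected expert ids, topk_weight: (T, k) routing weights.
    """
    T, _ = topk_idx.shape
    out = np.zeros((T, w.shape[2]), dtype=x.dtype)
    for e in range(w.shape[0]):
        rows, slots = np.nonzero(topk_idx == e)    # gather: tokens routed to expert e
        if rows.size == 0:
            continue
        y = x[rows] @ w[e] + b[e]                  # MMA + bias
        # routing-weight multiply + scatter back to token order
        out[rows] += topk_weight[rows, slots][:, None] * y
    return out
```

The fused kernel performs the gather, GEMM, bias, and weight-multiply in one pass per expert group instead of materializing the gathered activations.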
- Add CuTile fused load-balancing loss kernel that eliminates torch.cat over all layers' router logits + full softmax(128) + topk + one_hot, replacing them with per-layer fused softmax+topk accumulation via atomic_add
- Add FusedRouterFunction (autograd): linear + topk + softmax with a proper backward (softmax Jacobian + scatter + linear backward)
- Fix dotted target_attr handling in _apply_patch/_reset_patch so registry patches like "GptOssForCausalLM.forward" work correctly
- Remove the dead moe_gate patch from apply() -- the moe_experts kernel already includes the gate internally via moe_gate_forward_cutile
- Add profiling tools (profile_gpt_oss.py, profile_router.py, quick_bench.py)
- Add correctness tests for fused router forward/backward/loss
- Update benchmark charts (4-layer GPT-OSS-20B results)
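The unfused path the kernel replaces (cat of all layers' router logits, full softmax over the 128 experts, topk, one_hot) corresponds to the standard switch-style auxiliary loss; a NumPy reference, with illustrative names, might look like:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def load_balancing_loss_reference(router_logits, num_experts, top_k):
    """Unfused reference: router_logits is a list of (T, E) arrays, one per layer,
    which the original path concatenated with torch.cat."""
    logits = np.concatenate(router_logits, axis=0)   # (L*T, E) -- the torch.cat
    probs = softmax(logits)                          # full softmax over E experts
    topk = np.argsort(-probs, axis=-1)[:, :top_k]    # topk expert ids
    mask = np.zeros_like(probs)
    np.put_along_axis(mask, topk, 1.0, axis=-1)      # one_hot expert-load mask
    tokens_per_expert = mask.mean(axis=0)            # fraction routed to each expert
    prob_per_expert = probs.mean(axis=0)             # mean router probability
    return num_experts * np.sum(tokens_per_expert * prob_per_expert)
```

The fused kernel computes the same quantity without materializing the concatenation: each layer's softmax+topk contribution is accumulated into the running expert-load and mean-probability sums via atomic_add.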