- Add native CuTile RMSNorm implementation (rms_norm_cutile.py) with a dual-strategy forward (gather/scatter + TMA) and autotuning
- Add CuTile RMSNorm unit tests and a kernel-level comparison benchmark
- Add Qwen3-8B FSDP multi-GPU benchmark script
- Fix pyproject.toml: use the stable cu130 index with explicit source pinning to prevent markupsafe/triton resolution failures on cp312
- Fix e2e/__init__.py: remove references to deleted modules (qwen_8b_liger, geglu) that caused ModuleNotFoundError on import
- Disable CuTile rope in comparison_small (the tileiras compiler fails on the B200 sm_100 architecture)
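For context on what the kernel must compute, a minimal NumPy reference for RMSNorm (the function name and signature here are illustrative, not the package's API) is:

```python
import numpy as np

def rms_norm_reference(x, weight, eps=1e-6):
    """RMSNorm over the last axis: y = x / sqrt(mean(x^2) + eps) * weight.

    x: (M, N) activations, weight: (N,) learned scale.
    Returns the normalized output and rstd (the reciprocal RMS per row,
    which the persistent kernel saves for the backward pass).
    """
    # Accumulate the mean-square in float64 to mimic a higher-precision reduction.
    rstd = 1.0 / np.sqrt(np.mean(x.astype(np.float64) ** 2, axis=-1, keepdims=True) + eps)
    return (x * rstd * weight).astype(x.dtype), rstd
```

The dual-strategy forward mentioned above only changes *how* rows are loaded (gather/scatter vs. TMA bulk copies); both paths must match this math.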
Rewrote CuTile RMSNorm with aggressive Blackwell B200 optimizations.

Forward kernels:
- Static-persistent 2D TMA: grid-stride loop over multi-row tiles, weight pre-loaded once, allow_tma=False stores (+30%), latency hints
- Gather/scatter 1D: tiled column loop for non-power-of-2 N (no padding waste)
- TMA 1-row: bulk DMA load for small M with power-of-2 N

Key optimizations:
- Heuristic config selection tuned per M/N regime (TMA occ=4 for small M, persistent tile_m=8 for medium M, persistent tile_m=4 for large M)
- High-padding routing: >20% waste routes to the gather kernel (fixes N=5120)
- Tile size capping: max(2, min(8, 32768 // TILE_N)) avoids register spills
- Python dispatch minimization: cached NUM_SMS, stream, and dtype as the dict key
- Rstd saved from the persistent kernel for backward correctness

Benchmark results (B200, bf16, vs Quack cuteDSL):
- Average: 1.06x slower (down from ~1.75x initially)
- Best: CuTile beats Quack on 5/16 configs (M=8192: 1.06x faster)
- Worst: 1.24x slower (M=2048, N=5120; was 1.45x)

Also adds a kernel-level benchmark: rms_norm_quack_vs_cutile.py
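The heuristic dispatch described above (padding-waste routing, per-regime kernel choice, tile-size capping) can be sketched in plain Python; the M thresholds and function names here are illustrative assumptions, not the actual dispatch code:

```python
def next_pow2(n: int) -> int:
    """Smallest power of two >= n (TILE_N must be a power of two)."""
    return 1 << (n - 1).bit_length()

def select_rms_norm_config(M: int, N: int):
    """Hypothetical per-M/N-regime kernel selection, mirroring the commit's heuristics."""
    TILE_N = next_pow2(N)
    padding_waste = (TILE_N - N) / TILE_N
    # >20% padding waste routes to the gather/scatter kernel (e.g. N=5120 pads to 8192).
    if padding_waste > 0.20:
        return ("gather_scatter", {"tile_n": 1024})
    if M <= 512:  # small-M cutoff is an assumption
        # Small M with power-of-2 N: 1-row bulk-DMA TMA kernel at occupancy 4.
        return ("tma_1row", {"occupancy": 4})
    # Persistent 2D TMA: tile_m=8 for medium M, 4 for large M,
    # capped by max(2, min(., 32768 // TILE_N)) to avoid register spills.
    preferred = 8 if M <= 4096 else 4
    tile_m = max(2, min(preferred, 32768 // TILE_N))
    return ("persistent_2d_tma", {"tile_m": tile_m, "tile_n": TILE_N})
```

For example, M=2048, N=5120 pads to TILE_N=8192 (37.5% waste) and so routes to the gather kernel, which is the N=5120 fix mentioned above.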
… update benchmarks

- Rename rms_norm_cutile.py → rms_norm.py (replaces the old Quack-based version)
- Update all imports in __init__.py, autotune.py, and tests
- Delete temporary experiment scripts (bench_bwd_*, ncu_bwd_*, profile_*, test_dw_*)
- Update E2E benchmark charts (PyTorch vs Liger vs Bastile on Qwen3-8B)

E2E results on B200 (Qwen3-8B, bf16, batch_size=1):
- seq=1024: Bastile +18.2% throughput vs PyTorch
- seq=2048: Bastile +20.7% throughput vs PyTorch
- seq=4096: Bastile +27.1% throughput vs PyTorch
- seq=8192: Bastile +32.1% throughput vs PyTorch, 34 GB less memory
- seq=16384: Bastile runs; PyTorch OOMs

Bastile beats Liger by 1.1-1.9x across all configs
…to charts

Deleted:
- CONTEXT.md (outdated session briefing)
- test_minimal_kernel.py (temp debug script)
- tests/ops/test_geglu.py (imports nonexistent gpt_oss_moe)
- tests/ops/test_rms_norm_cutile.py (stale duplicate of test_rms_norm.py)
- tests/benchmarks/e2e/comparison_small.py (superseded by qwen_8b_seqlen.py)
- tests/benchmarks/e2e/profile_kernels.py (outdated small-model profiler)

Cleaned:
- configs.py: removed unused GEGLUConfig, RMSNormConfig, SwiGLUConfig
- tests/ops/run_all.py: removed dead test_geglu reference
- tests/benchmarks/e2e/__init__.py: removed deleted-file references
- tests/benchmarks/run_all.py: fixed recursive self-import bug
- tests/benchmarks/kernel/rms_norm.py: fixed outdated 4-arg call to rms_norm()

Added:
- tests/benchmarks/kernel/bench_fused_lce.py: fused LCE kernel benchmark
- qwen_8b_seqlen.py: auto-generates bar charts to assets/ after each run

Updated:
- fused_linear_cross_entropy.py: removed the _ce_pytorch fallback, CuTile-only path
- Benchmark charts with the latest results
- CuTile fused MoE GEMM kernel for GPT-OSS-20B (128 experts, top-4)
- Fuses token-routing gather + MMA + bias + routing-weight multiply
- Optimized backward: vectorized scatter + bmm for weight grads (2.8x faster than the re-forward + autograd approach)
- CuTile gate kernel with fused forward/backward (fast sigmoid)
- Autotuned tile sizes with disk caching
- Forward: ~0.94x vs CUTLASS _grouped_mm (near parity)
- Forward+backward: ~1.07x faster than the HF grouped_mm baseline
- Includes benchmarks, profiling scripts, and correctness tests
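The fused sequence (gather routed tokens, expert GEMM, bias, routing-weight multiply, scatter back) can be expressed as an unfused NumPy reference; names and shapes here are assumptions for illustration, not the kernel's API:

```python
import numpy as np

def moe_experts_reference(x, w, b, topk_idx, topk_weight):
    """Unfused reference for the fused MoE GEMM.

    x: (T, d) tokens, w: (E, d, h) expert weights, b: (E, h) expert biases,
    topk_idx: (T, k) selected expert ids, topk_weight: (T, k) routing weights.
    """
    T, _ = topk_idx.shape
    out = np.zeros((T, w.shape[2]), dtype=x.dtype)
    for e in range(w.shape[0]):
        rows, slots = np.nonzero(topk_idx == e)    # gather: tokens routed to expert e
        if rows.size == 0:
            continue
        y = x[rows] @ w[e] + b[e]                  # MMA + bias
        # routing-weight multiply + scatter back to token order
        out[rows] += topk_weight[rows, slots][:, None] * y
    return out
```

The fused kernel performs the gather, GEMM, bias, and weight-multiply in one pass per expert group instead of materializing the gathered activations.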
- Add CuTile fused load-balancing loss kernel that eliminates torch.cat over all layers' router logits + full softmax(128) + topk + one_hot, replacing them with per-layer fused softmax+topk accumulation via atomic_add
- Add FusedRouterFunction (autograd): linear + topk + softmax with a proper backward (softmax Jacobian + scatter + linear backward)
- Fix dotted target_attr handling in _apply_patch/_reset_patch so registry patches like "GptOssForCausalLM.forward" work correctly
- Remove the dead moe_gate patch from apply() -- the moe_experts kernel already includes the gate internally via moe_gate_forward_cutile
- Add profiling tools (profile_gpt_oss.py, profile_router.py, quick_bench.py)
- Add correctness tests for fused router forward/backward/loss
- Update benchmark charts (4-layer GPT-OSS-20B results)
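The unfused path the kernel replaces (cat of all layers' router logits, full softmax over the 128 experts, topk, one_hot) corresponds to the standard switch-style auxiliary loss; a NumPy reference, with illustrative names, might look like:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def load_balancing_loss_reference(router_logits, num_experts, top_k):
    """Unfused reference: router_logits is a list of (T, E) arrays, one per layer,
    which the original path concatenated with torch.cat."""
    logits = np.concatenate(router_logits, axis=0)   # (L*T, E) -- the torch.cat
    probs = softmax(logits)                          # full softmax over E experts
    topk = np.argsort(-probs, axis=-1)[:, :top_k]    # topk expert ids
    mask = np.zeros_like(probs)
    np.put_along_axis(mask, topk, 1.0, axis=-1)      # one_hot expert-load mask
    tokens_per_expert = mask.mean(axis=0)            # fraction routed to each expert
    prob_per_expert = probs.mean(axis=0)             # mean router probability
    return num_experts * np.sum(tokens_per_expert * prob_per_expert)
```

The fused kernel computes the same quantity without materializing the concatenation: each layer's softmax+topk contribution is accumulated into the running expert-load and mean-probability sums via atomic_add.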