🚀 Feature: Fused Attention Kernel
Overview
Implement a fused attention kernel that improves performance by combining the QKV projection, attention score computation, softmax, and output projection into a single optimized kernel.
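For reference, here is a minimal sketch of the unfused path that the fused kernel would replace, written in plain PyTorch. The names (`ReferenceAttention`, `d_model`, `n_heads`) are illustrative and not taken from the existing code; each comment marks a step that currently runs as its own kernel and round-trips through HBM.

```python
# Sketch of the unfused reference path (illustrative names, not the repo's API).
import math
import torch
import torch.nn as nn

class ReferenceAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv_proj = nn.Linear(d_model, 3 * d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, s, _ = x.shape
        qkv = self.qkv_proj(x)                          # step 1: QKV projection
        q, k, v = qkv.chunk(3, dim=-1)
        q = q.view(b, s, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(b, s, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(b, s, self.n_heads, self.d_head).transpose(1, 2)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)  # step 2: S x S scores in HBM
        probs = scores.softmax(dim=-1)                              # step 3: softmax
        ctx = probs @ v                                             # step 4: weighted sum
        ctx = ctx.transpose(1, 2).reshape(b, s, -1)
        return self.out_proj(ctx)                                   # step 5: output projection
```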
Motivation
- Reduce memory bandwidth overhead
- Minimize kernel launch latency
- Improve end-to-end attention throughput
- Align with FlashAttention-style optimizations
Scope
- Single-GPU support initially
- FP16 / BF16 (FP8 optional later)
- CUDA kernel (Triton or CUDA C++)
- Drop-in replacement for existing attention path
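One possible shape for the drop-in integration, sketched under assumptions: `attention_forward` and `fused_attention_forward` are hypothetical names, and the gating simply mirrors the scope above (CUDA, FP16/BF16 only at first), falling back to the existing path otherwise.

```python
# Hypothetical dispatch wrapper; all names are illustrative, not from the repo.
import torch

def fused_attention_forward(x: torch.Tensor, module) -> torch.Tensor:
    # Placeholder for the planned fused kernel entry point.
    raise NotImplementedError("fused attention kernel not implemented yet")

def attention_forward(x: torch.Tensor, module) -> torch.Tensor:
    use_fused = (
        x.is_cuda                                       # single-GPU CUDA only at first
        and x.dtype in (torch.float16, torch.bfloat16)  # FP16 / BF16 scope
    )
    if use_fused:
        return fused_attention_forward(x, module)
    return module(x)  # existing unfused attention path
```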
Planned Approach
- Implement fused QKV + attention + output
- Use shared memory / SRAM tiling (see the sketch after this list)
- Optimize for RTX 3060 (SM86)
- Benchmark against baseline attention
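As a rough illustration of the tiling idea only (not the planned Triton/CUDA implementation), here is a single-head, non-causal sketch in plain PyTorch using FlashAttention-style online softmax; the tile size and function name are placeholders.

```python
# Illustrative tiled attention with online softmax (single head, no causal mask).
import math
import torch

def tiled_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, block: int = 64) -> torch.Tensor:
    # q, k, v: (seq, d_head). K/V are processed in tiles so the full (seq x seq)
    # score matrix is never materialized; running max/sum statistics are kept instead.
    s, d = q.shape
    scale = 1.0 / math.sqrt(d)
    out = torch.zeros_like(q)
    row_max = torch.full((s, 1), float("-inf"), dtype=q.dtype, device=q.device)
    row_sum = torch.zeros((s, 1), dtype=q.dtype, device=q.device)
    for start in range(0, s, block):
        kb = k[start:start + block]                  # one K tile
        vb = v[start:start + block]                  # one V tile
        scores = (q @ kb.T) * scale                  # partial scores for this tile
        new_max = torch.maximum(row_max, scores.max(dim=-1, keepdim=True).values)
        correction = torch.exp(row_max - new_max)    # rescale previous accumulators
        p = torch.exp(scores - new_max)
        row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
        out = out * correction + p @ vb
        row_max = new_max
    return out / row_sum
```

The actual kernel would keep the query tile, the running statistics, and the output accumulator in shared memory/registers while streaming K/V tiles from HBM, which is where the bandwidth savings come from.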
Acceptance Criteria
- Correctness matches reference attention
- Lower latency than unfused implementation
- Reduced memory reads/writes
- Benchmark CSV generated
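A hedged sketch of how these criteria could be checked: compare the fused path against the reference within FP16-level tolerance and write per-implementation timings to a CSV. `ref_fn` and `fused_fn` are assumed callables, and the file name and columns are placeholders.

```python
# Illustrative correctness check + latency benchmark writing a CSV.
import csv
import time
import torch

def check_and_benchmark(ref_fn, fused_fn, x: torch.Tensor, iters: int = 50,
                        path: str = "attention_bench.csv") -> None:
    # Correctness: fused output must match the reference within FP16-level tolerance.
    torch.testing.assert_close(fused_fn(x), ref_fn(x), rtol=1e-2, atol=1e-2)

    def time_fn(fn) -> float:
        if x.is_cuda:
            torch.cuda.synchronize()
        t0 = time.perf_counter()
        for _ in range(iters):
            fn(x)
        if x.is_cuda:
            torch.cuda.synchronize()
        return (time.perf_counter() - t0) / iters * 1e3  # ms per call

    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["impl", "latency_ms"])
        writer.writerow(["reference", time_fn(ref_fn)])
        writer.writerow(["fused", time_fn(fused_fn)])
```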
Out of Scope (for now)
- Multi-GPU
- Backward pass
- Training support
Notes
This is a planned feature and will be implemented in a future PR.