Implement optimized rotary position embedding (RoPE) kernels for attention,
supporting both the prefill and decode paths.
Focus on minimizing compute overhead and improving data locality while
applying the rotary embeddings to the query and key tensors.
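As a starting point, a minimal sketch of an in-place decode-path kernel is shown below. It assumes float32 q/k in a contiguous [num_tokens, num_heads, head_dim] layout, the interleaved (x_2i, x_2i+1) pairing, and the same head count for q and k; the names rope_inplace_kernel and rope_inplace are placeholders, not an existing API.

```cuda
// Sketch: in-place RoPE for the decode path (one block per token).
// Assumes contiguous [num_tokens, num_heads, head_dim] q and k buffers.
#include <cuda_runtime.h>
#include <math.h>

__global__ void rope_inplace_kernel(float* __restrict__ q,
                                    float* __restrict__ k,
                                    const int* __restrict__ positions, // absolute position per token
                                    int num_heads,
                                    int head_dim,
                                    float theta_base /* typically 10000.0f */) {
    int token = blockIdx.x;          // one block per token
    int pos = positions[token];
    int pairs = head_dim / 2;

    // Each thread rotates one (even, odd) channel pair of one head.
    for (int idx = threadIdx.x; idx < num_heads * pairs; idx += blockDim.x) {
        int head = idx / pairs;
        int i    = idx % pairs;

        // theta_i = pos * base^(-2i / head_dim)
        float inv_freq = powf(theta_base, -2.0f * i / (float)head_dim);
        float angle = pos * inv_freq;
        float s, c;
        sincosf(angle, &s, &c);

        int base = (token * num_heads + head) * head_dim + 2 * i;

        float q0 = q[base], q1 = q[base + 1];
        q[base]     = q0 * c - q1 * s;
        q[base + 1] = q0 * s + q1 * c;

        float k0 = k[base], k1 = k[base + 1];
        k[base]     = k0 * c - k1 * s;
        k[base + 1] = k0 * s + k1 * c;
    }
}

// Hypothetical launch helper. One block per token keeps a token's pairs in
// registers; adjacent threads touch adjacent 8-byte chunks, so global
// accesses stay coalesced.
void rope_inplace(float* q, float* k, const int* positions,
                  int num_tokens, int num_heads, int head_dim,
                  float theta_base, cudaStream_t stream) {
    dim3 grid(num_tokens);
    dim3 block(256);
    rope_inplace_kernel<<<grid, block, 0, stream>>>(
        q, k, positions, num_heads, head_dim, theta_base);
}
```

An optimized version would likely precompute the inverse frequencies (or a cos/sin cache) instead of calling powf per element, and handle fp16/bf16 inputs; this sketch only fixes the data layout and parallelization pattern.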
Planned Benchmarks
- RoPE overhead relative to the attention kernel itself (see the timing sketch after this list)
- Prefill vs. decode performance
- Kernel fusion opportunities (e.g., applying RoPE as part of the attention QK load)
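A minimal timing harness for the overhead benchmark could look like the sketch below. It links against the rope_inplace helper from the kernel sketch above; the tensor sizes and iteration count are placeholders, and the attention baseline would be timed the same way for comparison.

```cuda
// Sketch: CUDA-event timing of the RoPE launcher (placeholder sizes).
#include <cuda_runtime.h>
#include <cstdio>

// Launcher defined in the RoPE kernel sketch above.
extern void rope_inplace(float* q, float* k, const int* positions,
                         int num_tokens, int num_heads, int head_dim,
                         float theta_base, cudaStream_t stream);

int main() {
    const int num_tokens = 4096, num_heads = 32, head_dim = 128;
    const size_t elems = (size_t)num_tokens * num_heads * head_dim;

    float *q, *k;
    int *positions;
    cudaMalloc(&q, elems * sizeof(float));
    cudaMalloc(&k, elems * sizeof(float));
    cudaMalloc(&positions, num_tokens * sizeof(int));
    cudaMemset(q, 0, elems * sizeof(float));
    cudaMemset(k, 0, elems * sizeof(float));
    cudaMemset(positions, 0, num_tokens * sizeof(int));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm up once, then time a batch of launches with CUDA events.
    rope_inplace(q, k, positions, num_tokens, num_heads, head_dim, 10000.0f, 0);
    const int iters = 100;
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        rope_inplace(q, k, positions, num_tokens, num_heads, head_dim, 10000.0f, 0);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("RoPE: %.3f us per call\n", 1000.0f * ms / iters);
    // Divide by the attention kernel's time (measured identically) to get
    // RoPE overhead as a fraction of attention.

    cudaFree(q); cudaFree(k); cudaFree(positions);
    cudaEventDestroy(start); cudaEventDestroy(stop);
    return 0;
}
```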
Learning Objectives
- Positional encoding math (see the rotation formula after this list)
- Kernel fusion with attention
- Register and shared memory usage patterns
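For the math objective, the rotation the kernel applies can be written per channel pair as below (interleaved formulation from the original RoPE paper, with the conventional base of 10000; m is the token position and d the head dimension):

```latex
% RoPE applied to one (even, odd) channel pair of q or k at position m.
\[
\theta_i = 10000^{-2i/d}, \qquad
\begin{pmatrix} x'_{2i} \\ x'_{2i+1} \end{pmatrix}
=
\begin{pmatrix}
\cos(m\,\theta_i) & -\sin(m\,\theta_i) \\
\sin(m\,\theta_i) & \phantom{-}\cos(m\,\theta_i)
\end{pmatrix}
\begin{pmatrix} x_{2i} \\ x_{2i+1} \end{pmatrix}
\]
```

The key property is that after rotating both q and k this way, their dot product depends on positions only through the offset between them, which is what makes it natural to fuse the rotation directly into the attention kernel rather than adding a separate positional bias.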