Open
Description
Implement a fused decode kernel combining logits processing, sampling,
and token selection into a single GPU launch.
This feature aims to reduce kernel launch overhead and improve
token-level throughput during autoregressive decoding.
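
As a rough illustration of what the fused kernel would need to compute, here is a minimal CPU reference for one decode step, written as a single function. It is a sketch under assumptions, not the planned implementation: the name `fused_decode_step`, the choice of temperature scaling and top-k as the "logits processing" stage, and the caller-supplied uniform number `u` are all hypothetical. A reference like this could serve as a correctness oracle when comparing against the GPU kernel's output.

```python
import math


def fused_decode_step(logits, temperature=1.0, top_k=0, u=0.5):
    """Hypothetical reference for one decode step on a single sequence.

    Combines logits processing (temperature, top-k), sampling, and token
    selection in one call. `u` is a uniform number in [0, 1) supplied by
    the caller so the result is deterministic and reproducible.
    """
    if temperature <= 0.0:
        # Greedy decoding: pick the index of the largest logit.
        return max(range(len(logits)), key=lambda i: logits[i])

    # Logits processing: temperature scaling.
    scaled = [x / temperature for x in logits]

    # Logits processing: keep only the top_k candidates (0 keeps all).
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]

    # Sampling: unnormalized softmax weights, shifted by the max for stability.
    peak = scaled[order[0]]
    weights = [math.exp(scaled[i] - peak) for i in order]

    # Token selection: inverse-CDF lookup over the unnormalized weights.
    target = u * sum(weights)
    cumulative = 0.0
    for token_id, weight in zip(order, weights):
        cumulative += weight
        if target < cumulative:
            return token_id
    return order[-1]  # Guard against floating-point round-off.
```

For example, `fused_decode_step([1.0, 3.0, 2.0], temperature=0.0)` selects token `1` greedily. Passing `u` in explicitly, rather than drawing it inside the function, keeps the reference comparable to a kernel that receives its random numbers from the host or from a seeded per-thread generator.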
Planned Benchmarks
- Tokens/sec improvement
- Kernel launch count reduction
- Latency per decode step
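
The three metrics above could be collected with a small harness along these lines. This is a sketch under assumptions: `benchmark_decode`, `launch_reduction`, and the `launches_per_step` parameter are hypothetical names, and the launch count is passed in by the caller (in practice it would come from a GPU profiler). For GPU work, `step_fn` must block until the device has finished the step, otherwise the timer only measures the asynchronous launch.

```python
import time


def benchmark_decode(step_fn, num_steps, launches_per_step):
    """Time `num_steps` calls of `step_fn`, one generated token per call.

    `launches_per_step` is supplied by the caller (an assumption of this
    sketch); it is not measured here.
    """
    latencies = []
    for _ in range(num_steps):
        start = time.perf_counter()
        step_fn()  # Must be synchronous with respect to the GPU.
        latencies.append(time.perf_counter() - start)

    total = sum(latencies)
    return {
        "tokens_per_sec": num_steps / total if total > 0 else float("inf"),
        "mean_latency_ms": 1000.0 * total / num_steps,
        "kernel_launches": num_steps * launches_per_step,
    }


def launch_reduction(baseline, fused):
    """Fraction of kernel launches removed by the fused path."""
    return 1.0 - fused["kernel_launches"] / baseline["kernel_launches"]
```

Running the harness once with the unfused decode path and once with the fused kernel gives directly comparable numbers for all three planned benchmarks.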
Learning Objectives
- Kernel fusion techniques
- Launch overhead analysis
- Decode path optimization