Description
Implement custom GPU kernels for token sampling during inference, covering
top-k, top-p (nucleus), and temperature-based sampling.
The goal is to reduce CPU-GPU synchronization and avoid framework-level
sampling overhead.
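A CPU reference for the three sampling modes may help when checking kernel output. The sketch below is only an illustration, not the planned kernel: the function name `filtered_probs` and its parameters are hypothetical, and it assumes one common ordering (temperature scaling, then top-k with renormalization, then top-p). Frameworks differ on these details, so the ordering should be matched to whichever framework is used for comparison.

```python
import math


def filtered_probs(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return the renormalized distribution after temperature, top-k, and top-p.

    Hypothetical reference helper. top_k <= 0 disables top-k and
    top_p >= 1.0 disables top-p.
    """
    if temperature <= 0.0:
        raise ValueError("temperature must be positive")

    # Temperature scaling followed by a max-subtracted (stable) softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Candidate indices by descending probability; ties go to the lower index.
    order = sorted(range(len(probs)), key=lambda i: (-probs[i], i))

    # Top-k: keep only the k most likely tokens.
    if top_k > 0:
        order = order[:top_k]
    mass_k = sum(probs[i] for i in order)

    # Top-p: keep the shortest prefix whose renormalized mass reaches top_p.
    kept = []
    cumulative = 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i] / mass_k
        if top_p < 1.0 and cumulative >= top_p:
            break

    # Renormalize over the surviving tokens; everything else gets zero.
    mass = sum(probs[i] for i in kept)
    out = [0.0] * len(probs)
    for i in kept:
        out[i] = probs[i] / mass
    return out
```

A GPU kernel would avoid the full sort and the host-side loop, but its per-token distribution can be compared against this function within a floating-point tolerance.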
Planned Benchmarks
- Sampling latency per token
- End-to-end decode latency
- Comparison with framework sampling
Learning Objectives
- Parallel prefix sums
- Warp-level reductions
- Sampling stability and numerical behavior
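To make the prefix-sum objective concrete, here is a minimal CPU sketch of a Hillis-Steele inclusive scan followed by inverse-CDF sampling. It is an assumption about how the pieces could fit together, not the design of the kernels: the names `inclusive_scan` and `sample_index` are hypothetical, and each pass of the outer loop stands in for one step that a GPU would run across all threads at once.

```python
def inclusive_scan(values):
    """Hillis-Steele inclusive prefix sum, executed sequentially.

    Each pass doubles the offset, so n elements need about log2(n) passes.
    """
    out = list(values)
    n = len(out)
    offset = 1
    while offset < n:
        # Snapshot so every "thread" reads the previous pass before any write.
        prev = list(out)
        for i in range(offset, n):
            out[i] = prev[i] + prev[i - offset]
        offset *= 2
    return out


def sample_index(probs, u):
    """Pick an index by inverse-CDF sampling with a uniform draw u in [0, 1)."""
    if not probs:
        raise ValueError("probs must be non-empty")
    cdf = inclusive_scan(probs)
    # Scale by the computed total rather than assuming it is exactly 1.0,
    # which tolerates rounding error accumulated in the scan.
    threshold = u * cdf[-1]
    for i, c in enumerate(cdf):
        if c > threshold and probs[i] > 0.0:
            return i
    # Fallback for u at or beyond 1.0: last token with non-zero probability.
    for i in range(len(probs) - 1, -1, -1):
        if probs[i] > 0.0:
            return i
    raise ValueError("all probabilities are zero")
```

Scaling the threshold by the scanned total and skipping zero-probability entries are two small guards against the numerical issues listed above: a parallel scan sums in a different order than a sequential loop, so the final CDF value is rarely exactly 1.0.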