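Investigate and implement techniques to reduce kernel launch overhead,
including kernel fusion (fewer, larger kernels) and persistent kernels
(launched once, then fed work in a loop).
Focus on inference workloads with small batch sizes, where each kernel does
little work and the fixed per-launch cost is a larger share of total latency.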
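
A rough cost model may help show why small batches are the interesting case. The sketch below is only an illustration: it assumes a fixed per-launch cost, compute time that scales linearly with batch size, and launches that are fully serialized with compute (real launches are asynchronous, so this is pessimistic). The `Workload` class, the function names, and all numbers are hypothetical, not measurements.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    num_kernels: int             # kernels launched per inference step
    launch_overhead_us: float    # assumed fixed cost per launch (made-up value)
    compute_us_per_item: float   # assumed device time per kernel per batch item


def step_latency_us(w: Workload, batch_size: int) -> float:
    """Latency of one step if launch cost and compute never overlap."""
    launch = w.num_kernels * w.launch_overhead_us
    compute = w.num_kernels * w.compute_us_per_item * batch_size
    return launch + compute


def overhead_fraction(w: Workload, batch_size: int) -> float:
    """Share of the step latency spent on launch overhead."""
    launch = w.num_kernels * w.launch_overhead_us
    return launch / step_latency_us(w, batch_size)


def fuse(w: Workload, group_size: int) -> Workload:
    """Model fusing every `group_size` kernels into one.

    Assumes total compute is unchanged; only the launch count drops.
    """
    fused_kernels = -(-w.num_kernels // group_size)  # ceiling division
    per_kernel = w.compute_us_per_item * w.num_kernels / fused_kernels
    return Workload(fused_kernels, w.launch_overhead_us, per_kernel)


if __name__ == "__main__":
    base = Workload(num_kernels=100, launch_overhead_us=5.0, compute_us_per_item=0.5)
    for batch in (1, 8, 64):
        print(batch, round(overhead_fraction(base, batch), 3))
    print(step_latency_us(fuse(base, 4), 1))
```

Under these assumptions the overhead share shrinks as the batch grows, which is why the small-batch regime is where fusion and persistent kernels are expected to matter most.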
Planned Benchmarks
- Kernel launch count
- CPU-side overhead
- End-to-end latency impact
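
One possible way to collect these three numbers together is a small host-side harness. The sketch below is an assumption about how such a harness could look, not a description of an existing tool: `launch_fn` stands in for whatever call actually enqueues a kernel, `sync_fn` is a hypothetical hook for the framework's device synchronization, and `StepMetrics`, `instrument`, and `measure_step` are illustrative names.

```python
import time
from dataclasses import dataclass


@dataclass
class StepMetrics:
    launch_count: int = 0
    cpu_launch_s: float = 0.0   # host time spent inside launch calls
    end_to_end_s: float = 0.0   # wall-clock time for the whole step


def instrument(launch_fn, metrics):
    """Wrap a launch function so each call is counted and timed on the host."""
    def wrapped(*args, **kwargs):
        start = time.perf_counter()
        try:
            return launch_fn(*args, **kwargs)
        finally:
            metrics.cpu_launch_s += time.perf_counter() - start
            metrics.launch_count += 1
    return wrapped


def measure_step(step_fn, launch_fn, sync_fn=lambda: None):
    """Run one step and return launch count, CPU-side time, and total time."""
    metrics = StepMetrics()
    launch = instrument(launch_fn, metrics)
    start = time.perf_counter()
    step_fn(launch)
    sync_fn()  # wait for outstanding device work before stopping the clock
    metrics.end_to_end_s = time.perf_counter() - start
    return metrics


if __name__ == "__main__":
    def fake_launch(name):      # placeholder: a real launch would enqueue a kernel
        return None

    def step(launch):
        for i in range(8):
            launch(f"kernel_{i}")

    print(measure_step(step, fake_launch))
```

Comparing the metrics for an unfused step against a fused or persistent variant of the same step would then show whether a lower launch count actually translates into lower end-to-end latency.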
Learning Objectives
- Launch overhead sources
- Persistent kernel design
- Scheduling strategies
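
The control structure behind a persistent kernel can be sketched with a host-side analogy. The example below is not GPU code: it uses a Python thread as a stand-in for a kernel that is started once and then loops over a work queue until it sees a stop sentinel. All names (`persistent_worker`, `run_persistent`, `_STOP`) are hypothetical, and the FIFO queue is just one simple scheduling strategy among those the project might compare.

```python
import queue
import threading

_STOP = object()  # sentinel telling the persistent worker to exit


def persistent_worker(work_q, results):
    """Started once; pulls work items in a loop instead of being relaunched."""
    while True:
        item = work_q.get()
        if item is _STOP:
            break
        fn, arg = item
        results.append(fn(arg))


def run_persistent(tasks):
    """Run (fn, arg) tasks on a single long-lived worker, in FIFO order."""
    work_q = queue.Queue()
    results = []
    worker = threading.Thread(target=persistent_worker, args=(work_q, results))
    worker.start()            # the only "launch"
    for task in tasks:
        work_q.put(task)      # submitting work is just an enqueue
    work_q.put(_STOP)
    worker.join()
    return results


if __name__ == "__main__":
    print(run_persistent([(lambda x: x * 2, i) for i in range(5)]))
```

The design trade-off this illustrates is that the start-up cost is paid once, while each unit of work costs only a queue operation; in exchange, the worker occupies its resources for its whole lifetime, even when the queue is empty.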