Feature: Decode Fusion Kernel #5

Implement a fused decode kernel combining logits processing, sampling,
and token selection into a single GPU launch.
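As a starting point, the three stages can be written as one CPU reference function that a fused kernel could later be validated against. This is only a sketch: the issue does not say which logits processors are in scope, so temperature scaling and top-k filtering are assumed here, and the name `fused_decode_step` and its parameters are hypothetical.

```python
import math


def fused_decode_step(logits, temperature=1.0, top_k=0, u=0.5):
    """CPU reference for one decode step over a single row of logits.

    Assumptions (not from the issue): logits processing means temperature
    scaling plus optional top-k filtering, and sampling uses a caller-supplied
    uniform random number u in [0, 1).
    """
    if temperature <= 0.0:
        # Greedy decoding: token selection is a plain argmax.
        return max(range(len(logits)), key=lambda i: logits[i])

    # Stage 1: logits processing (temperature scaling, then top-k filtering).
    scaled = [x / temperature for x in logits]
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]

    # Stage 2: unnormalized softmax weights, shifted by the max for stability.
    peak = scaled[order[0]]
    weights = [math.exp(scaled[i] - peak) for i in order]
    total = sum(weights)

    # Stage 3: token selection by inverse-CDF sampling over the kept tokens.
    threshold = u * total
    cumulative = 0.0
    for token_id, weight in zip(order, weights):
        cumulative += weight
        if cumulative > threshold:
            return token_id
    return order[-1]  # Guard against rounding when u is very close to 1.


# Example: greedy selection picks the largest logit.
print(fused_decode_step([1.0, 3.0, 2.0], temperature=0.0))
```

Passing `u` in explicitly keeps the reference deterministic, which makes it easier to compare against a GPU kernel that is fed the same random number.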

This feature aims to reduce kernel launch overhead and improve
token-level throughput during autoregressive decoding.
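A rough way to reason about the expected gain is a fixed-cost model: each launch pays a constant overhead on top of the compute time. The sketch below assumes that model and uses placeholder numbers, not measurements; `step_latency_us` and `fusion_speedup` are hypothetical helper names.

```python
def step_latency_us(num_launches, launch_overhead_us, compute_us):
    """Estimated latency of one decode step under a fixed per-launch cost."""
    return num_launches * launch_overhead_us + compute_us


def fusion_speedup(launches_before, launches_after, launch_overhead_us, compute_us):
    """Ratio of unfused to fused step latency, assuming compute time is unchanged."""
    before = step_latency_us(launches_before, launch_overhead_us, compute_us)
    after = step_latency_us(launches_after, launch_overhead_us, compute_us)
    return before / after


# Placeholder inputs: 3 separate launches fused into 1, with 5 us of
# overhead per launch and 20 us of compute per step.
print(step_latency_us(3, 5.0, 20.0))      # 35.0
print(step_latency_us(1, 5.0, 20.0))      # 25.0
print(fusion_speedup(3, 1, 5.0, 20.0))
```

The model ignores any change in compute time from fusion itself (for example, fewer round trips through global memory), so it should be read as a lower-effort estimate rather than a prediction.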

Planned Benchmarks
- Tokens/sec improvement
- Kernel launch count reduction
- Latency per decode step
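The three metrics above could be collected with a small harness like the following. It is a minimal sketch: `benchmark_decode` and `launch_count_reduction` are hypothetical names, `step_fn` stands in for whatever callable runs one decode step, and the launch counts are assumed to come from a profiler rather than from this code. Because GPU kernel launches are asynchronous, a real `step_fn` would need to synchronize with the device before returning for the wall-clock timing to be meaningful.

```python
import time


def benchmark_decode(step_fn, num_steps):
    """Time num_steps calls to step_fn, assuming one token per step."""
    start = time.perf_counter()
    for _ in range(num_steps):
        step_fn()
    elapsed_s = time.perf_counter() - start
    return {
        "tokens": num_steps,
        "tokens_per_sec": num_steps / elapsed_s if elapsed_s > 0 else float("inf"),
        "latency_per_step_ms": 1000.0 * elapsed_s / num_steps,
    }


def launch_count_reduction(launches_before, launches_after):
    """Fraction of kernel launches removed per decode step."""
    return (launches_before - launches_after) / launches_before


# Example with a stand-in step function (a real one would run the decode kernel).
result = benchmark_decode(lambda: sum(range(100)), num_steps=1000)
print(result["tokens"], result["tokens_per_sec"] > 0)
print(launch_count_reduction(4, 1))  # 0.75
```

Running the same harness on the unfused and fused paths gives the before/after pair needed for the tokens/sec improvement figure.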

Learning Objectives
- Kernel fusion techniques
- Launch overhead analysis
- Decode path optimization

