Feature: Decode Fusion Kernel #5

@ShlokVFX

Description

Implement a fused decode kernel combining logits processing, sampling,
and token selection into a single GPU launch.

This feature aims to reduce kernel launch overhead and improve
token-level throughput during autoregressive decoding.
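
As a rough illustration of what "fused" could mean here, the sketch below is a CPU reference in plain Python, not the proposed GPU kernel. It folds temperature scaling, sampling, and token selection into one pass over the logits by using the Gumbel-max trick, where the argmax of `logit / T + G` is distributed as `softmax(logit / T)`. The issue does not specify a sampling method, so the choice of Gumbel-max, the function name `fused_decode_step`, and its parameters are all assumptions. A reference like this might be useful for checking a kernel's output.

```python
import math
import random


def fused_decode_step(logits, temperature=1.0, rng=None):
    """Hypothetical reference: scale, sample, and select a token in one pass.

    temperature == 0 means greedy decoding (plain argmax).
    """
    if not logits:
        raise ValueError("logits must be non-empty")
    rng = rng or random.Random()

    best_token, best_score = -1, -math.inf
    for token_id, logit in enumerate(logits):
        if temperature > 0.0:
            # Clamp the uniform draw away from 0 and 1 so both logs are finite.
            u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
            gumbel = -math.log(-math.log(u))
            score = logit / temperature + gumbel
        else:
            score = logit
        # Running argmax: no separate softmax or normalization pass is needed.
        if score > best_score:
            best_token, best_score = token_id, score
    return best_token
```

On a GPU the same per-element work would typically be split across threads and followed by an argmax reduction, all within a single launch; that mapping is not shown here.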

Planned Benchmarks

  • Tokens/sec improvement
  • Kernel launch count reduction
  • Latency per decode step
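
The three metrics above can be derived from two raw measurements per configuration: the per-step wall-clock latencies and the number of kernel launches per step. The helper below is a minimal sketch of that arithmetic. The names `decode_metrics` and `compare` are hypothetical, and it assumes one token is produced per decode step.

```python
from statistics import mean


def decode_metrics(step_latencies_s, launches_per_step):
    """Summarize one configuration from measured per-step latencies (seconds)."""
    latency_s = mean(step_latencies_s)
    return {
        "latency_ms": latency_s * 1e3,
        # Assumes one token per decode step (batch size 1, no speculation).
        "tokens_per_sec": 1.0 / latency_s,
        "launches_per_step": launches_per_step,
    }


def compare(baseline, fused):
    """Relate the fused configuration to the unfused baseline."""
    return {
        "throughput_speedup": fused["tokens_per_sec"] / baseline["tokens_per_sec"],
        "launches_saved_per_step": (
            baseline["launches_per_step"] - fused["launches_per_step"]
        ),
        "latency_saved_ms": baseline["latency_ms"] - fused["latency_ms"],
    }
```

The latencies would come from timing the baseline and fused decode paths under the same conditions, and the launch counts from a profiler trace.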

Learning Objectives

  • Kernel fusion techniques
  • Launch overhead analysis
  • Decode path optimization
