Feature: Custom Sampling Kernels #4

@ShlokVFX

Description

Implement custom token sampling kernels for inference, including
top-k, top-p, and temperature-based sampling.

The goal is to reduce CPU-GPU synchronization and avoid framework-level
sampling overhead.
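
For validating the kernels, a small CPU reference is handy. The sketch below is one possible reference, not the planned kernel: it applies temperature, then top-k, then top-p, and draws a token by inverse CDF. The function name `sample_token`, the filter order, and the choice to treat `temperature <= 0` as greedy decoding are all assumptions made for illustration, and it expects a non-empty list of logits.

```python
import math
import random

def sample_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=random):
    """CPU reference: temperature -> top-k -> top-p -> draw one token id."""
    if temperature <= 0.0:
        # Assumption: non-positive temperature means greedy decoding.
        return max(range(len(logits)), key=lambda i: logits[i])

    # Sort token ids by descending logit; top-k keeps the first k entries.
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]

    # Numerically stable softmax over the surviving tokens.
    scaled = [logits[i] / temperature for i in order]
    peak = scaled[0]  # order is sorted, so the first entry is the maximum
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Top-p (nucleus): keep the smallest prefix whose mass reaches top_p.
    cumulative, cutoff = 0.0, len(probs)
    for n, p in enumerate(probs, start=1):
        cumulative += p
        if cumulative >= top_p:
            cutoff = n
            break
    order, probs = order[:cutoff], probs[:cutoff]

    # Inverse-CDF draw over the (implicitly renormalized) nucleus.
    target = rng.random() * sum(probs)
    running = 0.0
    for token, p in zip(order, probs):
        running += p
        if target < running:
            return token
    return order[-1]  # guard against floating-point round-off
```

Running the GPU kernel and this reference on the same logits with the same uniform random number gives a direct correctness check before any latency numbers are collected.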

Planned Benchmarks

  • Sampling latency per token
  • End-to-end decode latency
  • Comparison with framework sampling
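
A minimal sketch of how the per-token latency numbers could be collected on the host side. `time_per_call` is a hypothetical helper, and the warmup and iteration counts are arbitrary defaults. When timing GPU work, the device has to be synchronized before each clock read (or device-side events used instead), otherwise only the launch overhead is measured.

```python
import statistics
import time

def time_per_call(fn, warmup=10, iters=100):
    """Return (median, p95) wall-clock latency of fn() in microseconds."""
    for _ in range(warmup):
        fn()  # warm caches and any lazy initialization before measuring
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e6)
    samples.sort()
    p95 = samples[min(len(samples) - 1, int(0.95 * len(samples)))]
    return statistics.median(samples), p95
```

The same harness can wrap both the custom sampler and the framework sampler, so the comparison uses identical measurement overhead.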

Learning Objectives

  • Parallel prefix sums
  • Warp-level reductions
  • Sampling stability and numerical behavior
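
The three objectives fit together in one sampling step: a reduction finds the maximum logit for a stable softmax, a prefix sum builds the CDF, and an inverse-CDF lookup picks the token. The sketch below models that data flow on the CPU in plain Python. It is an illustration of the access patterns, not GPU code: the function names are hypothetical, the scan is the Hillis-Steele variant (one of several options), and the reduction assumes a power-of-two lane count such as a 32-lane warp.

```python
import math

def inclusive_scan(values):
    """Hillis-Steele inclusive prefix sum: about log2(n) passes over the data."""
    data = list(values)
    offset = 1
    while offset < len(data):
        # Each "thread" i reads the previous pass, so a new list is built
        # per pass (the CPU analogue of double buffering).
        data = [data[i] + data[i - offset] if i >= offset else data[i]
                for i in range(len(data))]
        offset *= 2
    return data

def warp_reduce_max(lanes):
    """Tree reduction in the shuffle-down pattern; lane 0 ends with the max."""
    width = len(lanes)  # assumed to be a power of two, e.g. 32
    vals = list(lanes)
    offset = width // 2
    while offset > 0:
        # Lane i combines its value with the value held by lane i + offset.
        vals = [max(vals[i], vals[i + offset]) if i + offset < width else vals[i]
                for i in range(width)]
        offset //= 2
    return vals[0]

def sample_from_logits(logits, u):
    """Stable softmax weights -> prefix-sum CDF -> lookup for u in [0, 1)."""
    # Pad to a multiple of 32 so every simulated warp is full.
    padded = list(logits) + [-math.inf] * (-len(logits) % 32)
    peak = max(warp_reduce_max(padded[i:i + 32])
               for i in range(0, len(padded), 32))
    # Subtracting the peak keeps every exponent <= 0, so exp cannot overflow.
    weights = [math.exp(x - peak) for x in logits]
    cdf = inclusive_scan(weights)
    target = u * cdf[-1]  # scale u instead of normalizing every weight
    for token, c in enumerate(cdf):
        if target < c:
            return token
    return len(cdf) - 1  # guard against floating-point round-off
```

One numerical point worth testing: a parallel scan adds the weights in a different order than a sequential loop, so the CDF can differ in the last few bits, and a draw that lands very close to a bucket boundary may select a different token than the reference.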
