Feature: Paged Attention Kernel #2

@ShlokVFX

Description

Implement a paged attention kernel to support long-context inference
efficiently by paging key/value caches in fixed-size blocks.

The kernel should minimize memory fragmentation and reduce GPU memory
pressure while maintaining low latency.
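
As a sketch of the paging scheme described above, here is a pure-Python reference (not the kernel itself): fixed-size pages are drawn from a shared pool, and a per-sequence block table maps logical page indices to physical pages, so no sequence has to reserve one contiguous max-length buffer. `BLOCK_SIZE`, the class and function names, and the block-table layout are illustrative assumptions, not this repo's API.

```python
import math

BLOCK_SIZE = 4  # tokens per page (small for illustration; production kernels often use 16 or 32)

class PagedKVCache:
    """Minimal paged KV cache sketch: fixed-size pages from a shared free
    pool, with a per-sequence block table mapping logical page index to
    physical page id. This avoids reserving a contiguous max-length buffer
    per sequence, which is the main source of fragmentation."""

    def __init__(self, num_pages):
        # Physical pool: each page holds up to BLOCK_SIZE (key, value) pairs.
        self.pages = [[] for _ in range(num_pages)]
        self.free = list(range(num_pages))
        self.block_tables = {}  # seq_id -> list of physical page ids

    def append(self, seq_id, k, v):
        table = self.block_tables.setdefault(seq_id, [])
        if not table or len(self.pages[table[-1]]) == BLOCK_SIZE:
            table.append(self.free.pop())  # current page full: grab a fresh physical page
        self.pages[table[-1]].append((k, v))

    def gather(self, seq_id):
        """Reassemble the logically contiguous K/V stream for one sequence."""
        ks, vs = [], []
        for page_id in self.block_tables.get(seq_id, []):
            for k, v in self.pages[page_id]:
                ks.append(k)
                vs.append(v)
        return ks, vs

def paged_attention(q, cache, seq_id):
    """Reference single-query attention over the paged cache (unfused, for clarity)."""
    ks, vs = cache.gather(seq_id)
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in ks]
    m = max(scores)
    ws = [math.exp(s - m) for s in scores]
    z = sum(ws)
    return [sum(w * v[i] for w, v in zip(ws, vs)) / z for i in range(len(vs[0]))]
```

A real kernel would instead index the page pool directly from the block table inside the attention loop rather than materializing a gathered copy; the sketch only shows the bookkeeping.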

Planned Benchmarks

  • Latency vs sequence length
  • Memory usage vs KV cache size
  • Comparison against baseline attention

Learning Objectives

  • KV cache paging strategies
  • Memory locality and access patterns
  • Tradeoffs between fragmentation and throughput
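
To make the fragmentation/throughput tradeoff concrete, a back-of-envelope comparison of reserved KV-cache memory under contiguous preallocation versus fixed-size pages can be sketched as below. Every constant here (context limit, page size, head count, head dim, fp16) is an illustrative assumption, not a measurement from this project.

```python
MAX_SEQ_LEN = 4096                   # assumed context limit
BLOCK_SIZE = 16                      # assumed tokens per page
BYTES_PER_TOKEN = 2 * 32 * 128 * 2   # K+V, 32 heads, head_dim 128, fp16 (assumed)

def contiguous_reserved(seq_lens):
    """Naive cache: every sequence reserves the full max length up front."""
    return len(seq_lens) * MAX_SEQ_LEN * BYTES_PER_TOKEN

def paged_reserved(seq_lens):
    """Paged cache: only whole pages actually touched are reserved."""
    pages = sum(-(-n // BLOCK_SIZE) for n in seq_lens)  # ceiling division
    return pages * BLOCK_SIZE * BYTES_PER_TOKEN

batch = [100, 700, 1500, 300]        # hypothetical in-flight sequence lengths
print(contiguous_reserved(batch) // 2**20, "MiB contiguous")  # 256 MiB contiguous
print(paged_reserved(batch) // 2**20, "MiB paged")            # 41 MiB paged
```

The paged side trades at most `BLOCK_SIZE - 1` wasted slots per sequence for the freedom to allocate pages anywhere in the pool, which is the tradeoff the last objective refers to.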

Metadata

Assignees: none
Labels: feature (Planned feature request)
Projects: none
Milestone: none
Relationships: none
Development: no branches or pull requests