Status: Open
Labels: feature (Planned feature request)
Description
Implement a paged attention kernel to support long-context inference efficiently by paging key/value (KV) caches. The kernel should minimize memory fragmentation and reduce GPU memory pressure while maintaining low latency.
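As a rough illustration of the paging idea (not the planned kernel itself), a block table can map each sequence's logical KV blocks to fixed-size physical blocks drawn from a shared pool, so freed blocks are reusable without fragmenting a large contiguous allocation. The names below (`BLOCK_SIZE`, `BlockTable`) are illustrative assumptions, not decided interfaces.

```python
# Hypothetical sketch of KV cache paging: a block table mapping
# (seq_id, logical_block) -> physical block index in a shared pool.
from collections import defaultdict

BLOCK_SIZE = 16  # tokens per KV block (assumed granularity)


class BlockTable:
    """Maps each sequence's logical blocks to physical blocks in a KV pool."""

    def __init__(self, num_physical_blocks: int):
        self.free_blocks = list(range(num_physical_blocks))
        self.table = defaultdict(dict)  # seq_id -> {logical_block: physical_block}

    def append_token(self, seq_id: int, token_idx: int) -> int:
        """Return the physical block holding this token, allocating on demand."""
        logical_block = token_idx // BLOCK_SIZE
        blocks = self.table[seq_id]
        if logical_block not in blocks:
            if not self.free_blocks:
                raise RuntimeError("KV pool exhausted")
            blocks[logical_block] = self.free_blocks.pop()
        return blocks[logical_block]

    def free_sequence(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the pool for reuse."""
        self.free_blocks.extend(self.table.pop(seq_id, {}).values())
```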
Planned Benchmarks
- Latency vs sequence length
- Memory usage vs KV cache size
- Comparison against baseline attention
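A minimal sketch of how the latency-vs-sequence-length measurement could be driven is below. `paged_attention` and `baseline_attention` are placeholders for the kernels under comparison, and only CPU-side wall-clock timing is shown; a real GPU benchmark would also need device synchronization and a peak-memory probe.

```python
# Hypothetical benchmark loop; attention_fn is a placeholder for the kernel
# under test, called once per sequence length.
import time


def bench(attention_fn, seq_lens=(1024, 4096, 16384, 65536), iters=10):
    results = {}
    for seq_len in seq_lens:
        attention_fn(seq_len)  # warm-up so one-time setup cost is excluded
        start = time.perf_counter()
        for _ in range(iters):
            attention_fn(seq_len)
        results[seq_len] = (time.perf_counter() - start) / iters
    return results

# Usage (assuming both kernels expose a seq_len-driven entry point):
#   paged = bench(paged_attention)
#   base  = bench(baseline_attention)
```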
Learning Objectives
- KV cache paging strategies
- Memory locality and access patterns
- Tradeoffs between fragmentation and throughput
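To make the access-pattern objective concrete, the lookup path below (reusing the hypothetical `BlockTable` sketch above) shows how a contiguous range of logical tokens becomes a gather over scattered physical blocks, which is where the locality/fragmentation tradeoff shows up.

```python
# Hypothetical lookup path: contiguous logical tokens map to scattered
# physical blocks, so the kernel's KV reads become a gather.
def kv_addresses(table, seq_id, num_tokens, block_size=16):
    """Yield (physical_block, offset) pairs for tokens 0..num_tokens-1."""
    for token_idx in range(num_tokens):
        logical_block, offset = divmod(token_idx, block_size)
        yield table.table[seq_id][logical_block], offset
```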