
Exploring PagedAttention

Run in Google Colab

This was a small project to better understand how key/value (KV) caches work in transformer attention. I focused on comparing two strategies:

  • A paged attention cache, like the one used in vLLM (2023), where memory is allocated in small blocks as needed.
  • A contiguous cache, where memory is preallocated up front in one large chunk.

I limited the scope of the project to just the cache and attention modules so I could study those parts in isolation.
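The difference between the two strategies can be sketched in a few lines. This is a minimal toy version, not the repo's actual classes: the class names, `BLOCK_SIZE`, and the `append`/`allocated_tokens` API are all illustrative. The contiguous cache reserves its full capacity at construction; the paged cache only grabs a new small block when the previous one fills up.

```python
import torch

BLOCK_SIZE = 16   # tokens per page (illustrative)
HEAD_DIM = 64

class ContiguousKVCache:
    """Reserves one large tensor up front; capacity is fixed at init."""
    def __init__(self, max_tokens=2048):
        self.k = torch.zeros(max_tokens, HEAD_DIM)
        self.v = torch.zeros(max_tokens, HEAD_DIM)
        self.len = 0

    def append(self, k, v):
        self.k[self.len] = k
        self.v[self.len] = v
        self.len += 1

    def allocated_tokens(self):
        return self.k.shape[0]          # always max_tokens, used or not

class PagedKVCache:
    """Allocates a small block only when the current one fills up."""
    def __init__(self, block_size=BLOCK_SIZE):
        self.block_size = block_size
        self.k_blocks, self.v_blocks = [], []
        self.len = 0

    def append(self, k, v):
        if self.len % self.block_size == 0:   # current block is full
            self.k_blocks.append(torch.zeros(self.block_size, HEAD_DIM))
            self.v_blocks.append(torch.zeros(self.block_size, HEAD_DIM))
        slot = self.len % self.block_size
        self.k_blocks[-1][slot] = k
        self.v_blocks[-1][slot] = v
        self.len += 1

    def allocated_tokens(self):
        return len(self.k_blocks) * self.block_size
```

After caching a 256-token sequence, the contiguous version still holds 2048 token slots while the paged version holds exactly 256 (16 blocks of 16), which is the memory gap the benchmark below measures.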

What I Built

I wrote a self-attention layer from scratch that supports incremental decoding. This made it easier to experiment with cache behavior in isolation and actually see how memory usage changes as tokens are processed.
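The core idea of incremental decoding is that each step only projects the new token's K/V and appends them to the cache, then attends over everything cached so far. A single-head sketch under that assumption (not the repo's exact module; names and shapes are illustrative):

```python
import torch
import torch.nn.functional as F

class IncrementalSelfAttention(torch.nn.Module):
    """Single-head self-attention decoded one token at a time.
    Only the new token's K/V are computed each step; earlier tokens
    are read back from the cache instead of being recomputed."""
    def __init__(self, d_model=64):
        super().__init__()
        self.q_proj = torch.nn.Linear(d_model, d_model, bias=False)
        self.k_proj = torch.nn.Linear(d_model, d_model, bias=False)
        self.v_proj = torch.nn.Linear(d_model, d_model, bias=False)
        self.k_cache, self.v_cache = [], []
        self.scale = d_model ** -0.5

    @torch.no_grad()
    def step(self, x):                        # x: (d_model,) — the new token
        q = self.q_proj(x)
        self.k_cache.append(self.k_proj(x))   # append, don't recompute history
        self.v_cache.append(self.v_proj(x))
        K = torch.stack(self.k_cache)         # (t, d_model)
        V = torch.stack(self.v_cache)
        attn = F.softmax((K @ q) * self.scale, dim=0)   # weights over t tokens
        return attn @ V                       # (d_model,)
```

Because causal decoding only ever attends to past tokens, caching K/V turns each step from O(t) projections into O(1), at the cost of the memory the rest of this writeup is about.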

I compared two attention modules: one backed by a paged KV cache, the other by a single contiguous block of memory. A benchmarking script compares the two on metrics like latency, memory used, and memory allocated.
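A toy stand-in for that kind of measurement loop is shown below; the function name and the `step_fn` callable are illustrative, not the repo's script. It times a decode loop and counts the bytes held by the tensors it produced (the "memory used" side of the comparison).

```python
import time
import torch

def measure_decode(step_fn, num_steps=256):
    """Time a decode loop and report bytes held by its outputs.
    `step_fn` is any callable returning one tensor per decode step."""
    outputs = []
    t0 = time.perf_counter()
    for _ in range(num_steps):
        outputs.append(step_fn())
    latency_ms = (time.perf_counter() - t0) * 1e3
    used_bytes = sum(t.element_size() * t.nelement() for t in outputs)
    return latency_ms, used_bytes
```

On a GPU, `torch.cuda.memory_allocated()` and `torch.cuda.max_memory_allocated()` are the natural way to separate memory actually used from memory reserved, which is what makes the used-vs-allocated gap in the figure below visible.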

(Figure: benchmark metrics for the paged and contiguous attention modules)

In this run, both attention layers processed the same 256-token sequence. The paged version used significantly less memory, avoiding the preallocated-but-unused space in the contiguous cache; since the contiguous module preallocates a 2048-token block, any shorter sequence leaves the remainder fragmented. The paged module's latency was higher, though, and that is an artifact of my implementation: PyTorch's attention ops expect contiguous blocks of memory, so I had to reassemble the paged KV blocks into a single block on every step. Avoiding that copy would require a custom kernel that reads the disjoint blocks directly, as in the vLLM paper.
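The reassembly step described above amounts to one concatenation per decode step. A minimal sketch (the function name and block layout are illustrative):

```python
import torch

def gather_kv(blocks, num_tokens):
    """Concatenate paged KV blocks into one contiguous tensor.
    This copy happens every decode step, which is the latency cost
    a custom kernel reading the blocks in place would avoid."""
    return torch.cat(blocks, dim=0)[:num_tokens].contiguous()
```

The slice trims the tail of the last, partially filled block so attention only sees the `num_tokens` valid entries.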

From this project, I learned more about transformers, how attention layers work under the hood, and KV caching. It was a fun project because I got a feel for how systems concepts like memory allocation, fragmentation, and reuse show up in ML infrastructure.

If I Were to Continue

I tried to keep the scope of this project minimal, but if I were to continue, I'd try to match more of the features that vLLM offers.

  • Integrate this attention layer into a full transformer decoder block
  • Add support for multiple concurrent sequences (e.g. batched requests)
  • Write custom kernels for attention and KV appends to avoid relying on torch.cat to reconstruct the blocks

I'd also love to take a crack at some of the more realistic needs for an LLM system, like scheduling, parallel sampling, and swapping to disk.

