
Exploring PagedAttention

Run in Google Colab

This was a small project to better understand how key/value (KV) caches work in transformer attention. I focused on comparing two strategies:

  • A paged attention cache, like the one used in vLLM (2023), where memory is allocated in small blocks as needed.
  • A contiguous cache, where memory is preallocated up front in one large chunk.

I limited the scope of the project to just the cache and attention modules so I could study those parts in isolation.
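The difference between the two strategies can be sketched in a few lines. This is a minimal toy version, not the repo's actual classes: the class names, `BLOCK_SIZE`, and the `append`/`allocated_tokens` API are all illustrative. The contiguous cache reserves its full capacity at construction; the paged cache only grabs a new small block when the previous one fills up.

```python
import torch

BLOCK_SIZE = 16   # tokens per page (illustrative)
HEAD_DIM = 64

class ContiguousKVCache:
    """Reserves one large tensor up front; capacity is fixed at init."""
    def __init__(self, max_tokens=2048):
        self.k = torch.zeros(max_tokens, HEAD_DIM)
        self.v = torch.zeros(max_tokens, HEAD_DIM)
        self.len = 0

    def append(self, k, v):
        self.k[self.len] = k
        self.v[self.len] = v
        self.len += 1

    def allocated_tokens(self):
        return self.k.shape[0]          # always max_tokens, used or not

class PagedKVCache:
    """Allocates a small block only when the current one fills up."""
    def __init__(self, block_size=BLOCK_SIZE):
        self.block_size = block_size
        self.k_blocks, self.v_blocks = [], []
        self.len = 0

    def append(self, k, v):
        if self.len % self.block_size == 0:   # current block is full
            self.k_blocks.append(torch.zeros(self.block_size, HEAD_DIM))
            self.v_blocks.append(torch.zeros(self.block_size, HEAD_DIM))
        slot = self.len % self.block_size
        self.k_blocks[-1][slot] = k
        self.v_blocks[-1][slot] = v
        self.len += 1

    def allocated_tokens(self):
        return len(self.k_blocks) * self.block_size
```

After caching a 256-token sequence, the contiguous version still holds 2048 token slots while the paged version holds exactly 256 (16 blocks of 16), which is the memory gap the benchmark below measures.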

What I Built

I wrote a self-attention layer from scratch that supports incremental decoding. This made it easier to experiment with cache behavior in isolation and actually see how memory usage changes as tokens are processed.
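The core idea of incremental decoding is that each step only projects the new token's K/V and appends them to the cache, then attends over everything cached so far. A single-head sketch under that assumption (not the repo's exact module; names and shapes are illustrative):

```python
import torch
import torch.nn.functional as F

class IncrementalSelfAttention(torch.nn.Module):
    """Single-head self-attention decoded one token at a time.
    Only the new token's K/V are computed each step; earlier tokens
    are read back from the cache instead of being recomputed."""
    def __init__(self, d_model=64):
        super().__init__()
        self.q_proj = torch.nn.Linear(d_model, d_model, bias=False)
        self.k_proj = torch.nn.Linear(d_model, d_model, bias=False)
        self.v_proj = torch.nn.Linear(d_model, d_model, bias=False)
        self.k_cache, self.v_cache = [], []
        self.scale = d_model ** -0.5

    @torch.no_grad()
    def step(self, x):                        # x: (d_model,) — the new token
        q = self.q_proj(x)
        self.k_cache.append(self.k_proj(x))   # append, don't recompute history
        self.v_cache.append(self.v_proj(x))
        K = torch.stack(self.k_cache)         # (t, d_model)
        V = torch.stack(self.v_cache)
        attn = F.softmax((K @ q) * self.scale, dim=0)   # weights over t tokens
        return attn @ V                       # (d_model,)
```

Because causal decoding only ever attends to past tokens, caching K/V turns each step from O(t) projections into O(1), at the cost of the memory the rest of this writeup is about.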

I compared two attention modules: one backed by a paged KV cache, the other by a single contiguous block of memory. A benchmarking script compares the two on metrics like latency, memory used, and memory allocated.
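A toy stand-in for that kind of measurement loop is shown below; the function name and the `step_fn` callable are illustrative, not the repo's script. It times a decode loop and counts the bytes held by the tensors it produced (the "memory used" side of the comparison).

```python
import time
import torch

def measure_decode(step_fn, num_steps=256):
    """Time a decode loop and report bytes held by its outputs.
    `step_fn` is any callable returning one tensor per decode step."""
    outputs = []
    t0 = time.perf_counter()
    for _ in range(num_steps):
        outputs.append(step_fn())
    latency_ms = (time.perf_counter() - t0) * 1e3
    used_bytes = sum(t.element_size() * t.nelement() for t in outputs)
    return latency_ms, used_bytes
```

On a GPU, `torch.cuda.memory_allocated()` and `torch.cuda.max_memory_allocated()` are the natural way to separate memory actually used from memory reserved, which is what makes the used-vs-allocated gap in the figure below visible.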

(Figure: benchmark metrics for the paged and contiguous attention modules)

In this run, both attention layers processed the same 256-token sequence. The paged version used significantly less memory, avoiding the preallocated-but-unused space in the contiguous cache; since the contiguous module preallocates a 2048-token block, any shorter sequence leaves the remainder fragmented. The paged module's latency was higher, though, and that is an artifact of my implementation: PyTorch's attention ops expect contiguous blocks of memory, so I had to reassemble the paged KV blocks into a single block on every step. Avoiding that copy would require a custom kernel that reads the disjoint blocks directly, as in the vLLM paper.
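The reassembly step described above amounts to one concatenation per decode step. A minimal sketch (the function name and block layout are illustrative):

```python
import torch

def gather_kv(blocks, num_tokens):
    """Concatenate paged KV blocks into one contiguous tensor.
    This copy happens every decode step, which is the latency cost
    a custom kernel reading the blocks in place would avoid."""
    return torch.cat(blocks, dim=0)[:num_tokens].contiguous()
```

The slice trims the tail of the last, partially filled block so attention only sees the `num_tokens` valid entries.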

From this project, I learned more about transformers, how attention layers work under the hood, and KV caching. It was a fun project because I got a feel for how systems concepts like memory allocation, fragmentation, and reuse show up in ML infrastructure.

If I Were to Continue

I tried to keep the scope of this project minimal, but if I were to continue, I'd try to match more of the features that vLLM offers.

  • Integrate this attention layer into a full transformer decoder block
  • Add support for multiple concurrent sequences (e.g. batched requests)
  • Write custom kernels for attention and KV appends to avoid relying on torch.cat to reconstruct the blocks

I'd also love to take a crack at some of the more realistic needs for an LLM system, like scheduling, parallel sampling, and swapping to disk.

