205 changes: 205 additions & 0 deletions analysis.md
@@ -0,0 +1,205 @@
# CPU Memory Leak Analysis (GitHub Issue #28726)

## Summary

vLLM suffers from continuous CPU memory growth when serving multimodal (VLM)
models with prefix caching enabled (the default). The EngineCore subprocess
RSS grows by ~1.5 GB per 1000 requests and never stabilizes, eventually
causing OOM. The issue appeared between v0.11.0 and v0.11.1.

**Root cause**: A reference cycle in `Request` objects prevents Python's
reference counting from freeing them. A GC optimization introduced in v0.11.1
reduces how often the cyclic garbage collector runs, causing these cyclic
`Request` objects (each holding megabytes of multimodal feature data) to
accumulate far faster than the GC can reclaim them.

## The Reference Cycle

In `vllm/v1/request.py`, when prefix caching is enabled, each `Request` binds
itself into a `functools.partial`:

```python
# vllm/v1/request.py:167-170 (main branch)
self.get_hash_new_full_blocks: Callable[[], list[BlockHash]] | None = None
if block_hasher is not None:
    self.get_hash_new_full_blocks = partial(block_hasher, self)  # <-- cycle
    self.block_hashes = self.get_hash_new_full_blocks()
```

This creates a **reference cycle**:

```text
Request ──(self.get_hash_new_full_blocks)──> partial object
   ^                                              │
   └──────────(partial stores self as arg)────────┘
```

### Why this matters

Python uses two garbage collection mechanisms:

1. **Reference counting** (immediate): When an object's reference count drops
to zero, it is freed instantly. This is the fast path.

2. **Cyclic garbage collector** (deferred): Periodically scans for groups of
objects that reference each other but are unreachable from the rest of the
program. This is the slow path, and it runs based on heuristic thresholds.

The reference cycle means that when the scheduler finishes a request and does
`del self.requests[request_id]`, the `Request` object's reference count
**does not drop to zero** -- the `partial` still holds a reference. The
`partial`'s count doesn't drop to zero either -- the `Request` still holds
it. Both objects are unreachable from the program, but neither can be freed
by reference counting. They become **cyclic garbage**, waiting for the cyclic
GC to detect and collect them.
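The two edges of the cycle, and the fact that `del` alone does not free the object, can be seen in a minimal sketch. `ToyRequest` and `toy_hasher` are hypothetical stand-ins, not vLLM classes; only the `partial(hasher, self)` pattern is taken from the code above.

```python
import gc
import weakref
from functools import partial


def toy_hasher(request):
    """Hypothetical stand-in for vLLM's block hasher."""
    return []


class ToyRequest:
    """Hypothetical stand-in for Request; reproduces only the partial binding."""

    def __init__(self, hasher):
        # Edge 1: the request holds the partial.
        # Edge 2: the partial holds the request in its bound args.
        self.get_hashes = partial(hasher, self)


def survives_del():
    """Return (alive after del, alive after gc.collect()) for one request."""
    was_enabled = gc.isenabled()
    gc.disable()  # rule out an automatic collection during the experiment
    try:
        req = ToyRequest(toy_hasher)
        assert req.get_hashes.args[0] is req  # the partial points back at req
        probe = weakref.ref(req)  # observe the object without keeping it alive
        del req  # drops the only external reference
        alive_after_del = probe() is not None
        gc.collect()  # an explicit cyclic collection finds the garbage
        alive_after_collect = probe() is not None
    finally:
        if was_enabled:
            gc.enable()
    return alive_after_del, alive_after_collect


print(survives_del())
```

The object outlives `del` and is reclaimed only once the cyclic collector runs, which is the deferred path described above.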

### Why it only affects prefix caching

When prefix caching is **disabled**, `block_hasher` is `None`, so the
`partial` is never created. There is no cycle. `Request` objects are freed
immediately by reference counting when the scheduler removes them. This is
why `--no-enable-prefix-caching` prevents the leak.

### Why it only affects multimodal models visibly

Each `Request` object holds a `mm_features: list[MultiModalFeatureSpec]`
field. For vision-language models, this contains the **processed image
feature tensors** -- several megabytes per image. A text-only request has
empty `mm_features` and is only a few kilobytes. When cyclic garbage
accumulates:

- **Text-only**: 100 leaked Request objects ~ a few MB (invisible)
- **VLM with images**: 100 leaked Request objects ~ hundreds of MB to GBs
(causes OOM)

## Why It Became a Problem in v0.11.1

The reference cycle has existed since prefix caching was introduced. In v0.11.0,
it was harmless because the cyclic GC ran frequently enough to clean it up.
Two changes in v0.11.1 broke this equilibrium:

### Change 1: Fewer GC-tracked objects per request (primary cause)

**Commit `acaa2c0a4`** -- *"Reuse empty block lists whenever possible in
KVCacheBlocks to mitigate GC costs"*

This optimization replaced empty `list` objects (`[]`) with empty `tuple`
objects (`()`) in `KVCacheBlocks`. Empty tuples are **not tracked by the
cyclic GC** (CPython optimization), while empty lists are. As a result, each
request allocates fewer GC-tracked objects over its lifetime.
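The tracking difference can be checked directly with the standard `gc.is_tracked()` call. This is a small sketch of CPython behavior, not vLLM code; the helper name `tracking_status` is made up for illustration.

```python
import gc


def tracking_status():
    """Report whether empty containers are tracked by the cyclic GC."""
    return {
        "empty_list": gc.is_tracked([]),   # each [] is a new, tracked object
        "empty_tuple": gc.is_tracked(()),  # () is never tracked in CPython
        # CPython reuses a single empty tuple, so no allocation happens either.
        "empty_tuple_is_shared": tuple() is tuple(),
    }


print(tracking_status())
```

Both properties are CPython implementation details, so other interpreters may behave differently.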

Python's cyclic GC uses a generational scheme with thresholds (default:
`(700, 10, 10)`):

- **Generation 0** collection triggers when tracked allocations minus
deallocations since the last gen-0 collection exceed 700.
- **Generation 1** triggers every 10 gen-0 collections.
- **Generation 2** triggers every 10 gen-1 collections (every 100 gen-0's).
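As a rough worked example of these thresholds, the sketch below converts a net number of tracked objects per request into the number of requests between collections of each generation. The per-request figures (70 and 35) are hypothetical and chosen only to make the arithmetic visible. The model also ignores any additional heuristics CPython applies before running a full collection.

```python
def requests_between_collections(net_tracked_per_request, thresholds=(700, 10, 10)):
    """Approximate requests served between gen-0, gen-1 and gen-2 collections.

    net_tracked_per_request is a hypothetical figure: tracked allocations
    minus deallocations contributed by one request.
    """
    t0, t1, t2 = thresholds
    gen0 = t0 / net_tracked_per_request  # requests needed to reach threshold 0
    gen1 = gen0 * t1                     # one gen-1 run per t1 gen-0 runs
    gen2 = gen1 * t2                     # one gen-2 run per t2 gen-1 runs
    return gen0, gen1, gen2


# Hypothetical: halving the tracked objects per request doubles every interval.
print(requests_between_collections(70))  # (10.0, 100.0, 1000.0)
print(requests_between_collections(35))  # (20.0, 200.0, 2000.0)
```

Under this simplified model, any reduction in tracked allocations stretches the gen-2 interval by the same factor, so cyclic garbage that survives into older generations is held proportionally longer.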

With fewer tracked objects created per request, it takes longer for the
generation-0 threshold (700 objects) to be reached. This means:

- Gen-0 collections happen less often
- Gen-1 and gen-2 collections happen much less often
- Cyclic garbage from `Request` objects accumulates longer before being swept

In v0.11.0, the extra `list` objects from `KVCacheBlocks` kept the GC
running frequently. Gen-2 collections (which sweep long-lived cyclic
garbage) ran often enough that the leaked `Request` memory stabilized.
In v0.11.1, gen-2 collections became too infrequent, and memory grew
without bound.
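The effect on collection frequency can be observed empirically with `gc.callbacks`. The sketch below simulates requests that each retain a few empty containers and counts how many automatic collections start. It is a toy model, not a measurement of vLLM; the function name and the request and container counts are arbitrary assumptions.

```python
import gc


def count_collections(make_container, n_requests=2000, per_request=8):
    """Count automatic GC runs while simulated requests retain containers."""
    runs = 0

    def on_gc(phase, info):
        nonlocal runs
        if phase == "start":  # fires once at the start of every collection
            runs += 1

    keep = []  # retain containers so allocations are not offset by frees
    was_enabled = gc.isenabled()
    gc.enable()
    gc.collect()  # start from a clean allocation count
    gc.callbacks.append(on_gc)
    try:
        for _ in range(n_requests):
            for _ in range(per_request):
                keep.append(make_container())
    finally:
        gc.callbacks.remove(on_gc)
        if not was_enabled:
            gc.disable()
    return runs


with_lists = count_collections(list)    # list() returns a new tracked []
with_tuples = count_collections(tuple)  # tuple() returns the untracked ()
print(gc.get_threshold(), with_lists, with_tuples)
```

The exact counts depend on the interpreter version and its thresholds, but the list variant should trigger clearly more collections than the tuple variant.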

### Change 2: Earlier gc.freeze() (secondary contributor)

**Commit `b30372cbd`** -- *"Move gc.freeze logic from EngineCoreProc to
EngineCore for better coverage"*

`gc.freeze()` moves all currently tracked objects into a permanent
generation that the GC never scans. This was moved from the end of
`EngineCoreProc.__init__()` to the end of `EngineCore.__init__()`,
freezing objects earlier. While this doesn't directly prevent collection
of new `Request` objects, the different freeze timing subtly changes the
GC's generation accounting, further reducing the frequency of collections
on unfrozen objects.
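What `gc.freeze()` does to objects that exist at freeze time can be illustrated in isolation. This sketch shows only the semantics of the standard `gc.freeze()` and `gc.unfreeze()` calls on a toy self-referencing object; `Node` is hypothetical, and the example makes no claim about how the freeze interacts with vLLM's generation counters.

```python
import gc
import weakref


class Node:
    """Hypothetical object used to build a one-element reference cycle."""


def frozen_cycle_survives():
    """Return (alive while frozen, alive after unfreeze) for cyclic garbage."""
    was_enabled = gc.isenabled()
    gc.disable()  # keep automatic collections out of the experiment
    try:
        gc.collect()
        node = Node()
        node.self_ref = node  # cycle: node -> node
        probe = weakref.ref(node)
        del node              # now unreachable cyclic garbage

        gc.freeze()           # moves all tracked objects to the permanent generation
        gc.collect()          # full collection skips the permanent generation
        alive_while_frozen = probe() is not None

        gc.unfreeze()         # returns frozen objects to the oldest generation
        gc.collect()
        alive_after_unfreeze = probe() is not None
    finally:
        if was_enabled:
            gc.enable()
    return alive_while_frozen, alive_after_unfreeze


print(frozen_cycle_survives())
```

Objects created after the freeze are unaffected, which matches the statement above that the freeze does not directly prevent collection of new `Request` objects.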

## Reproduction Results

Using `Qwen/Qwen2.5-VL-3B-Instruct` with the `lmarena-ai/VisionArena-Chat`
dataset (real user-uploaded images), 1000 prompts per round, prefix caching
enabled:

### main branch (leak present)

```text
Round   Reqs  Total(GB)   EC(GB)  EC delta  EC round   Time
----------------------------------------------------------------------
idle       0       3.63     3.63       ---       ---
1       1000      10.97    10.97     +7.33     +7.33    66s
2       2000      14.34    14.34    +10.71     +3.37    57s
3       3000      15.94    15.94    +12.31     +1.60    55s
4       4000      16.91    16.91    +13.28     +0.97    58s
5       5000      17.38    17.38    +13.74     +0.47    58s

EngineCore final RSS: 14.70 GB (started at 2.40 GB)
Growth rate: +1.60 GB/round average -- NEVER STABILIZES
```

### fix-cpu-leak branch (leak fixed)

```text
Round   Reqs  Total(GB)   EC(GB)  EC delta  EC round   Time
----------------------------------------------------------------------
idle       0       3.63     3.63       ---       ---
1       1000       9.86     9.86     +6.22     +6.22    67s
2       2000      10.50    10.50     +6.86     +0.64    56s
3       3000      10.55    10.55     +6.91     +0.05    57s
4       4000      10.55    10.55     +6.92     +0.01    56s
5       5000      10.64    10.64     +7.01     +0.09    56s

EngineCore final RSS: 7.51 GB (started at 2.41 GB)
Growth rate: +0.20 GB/round average -- STABLE after round 1
```

The fix roughly halves EngineCore RSS (7.51 GB vs 14.70 GB) after
5000 multimodal requests.
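The average growth rates quoted under the two tables appear to be the mean per-round growth of the `EC(GB)` column over rounds 2 to 5, treating round 1 as warm-up. That reading is an assumption, but the arithmetic below reproduces both figures from the table values.

```python
# EC(GB) after rounds 1..5, copied from the two tables above.
main_ec = [10.97, 14.34, 15.94, 16.91, 17.38]
fix_ec = [9.86, 10.50, 10.55, 10.55, 10.64]


def mean_growth_after_warmup(ec_gb):
    """Mean per-round growth in GB across consecutive rounds."""
    deltas = [later - earlier for earlier, later in zip(ec_gb, ec_gb[1:])]
    return sum(deltas) / len(deltas)


print(f"main: {mean_growth_after_warmup(main_ec):.4f} GB/round")
print(f"fix:  {mean_growth_after_warmup(fix_ec):.4f} GB/round")
```

This gives about 1.60 GB/round on main and about 0.195 GB/round on the fix branch, consistent with the quoted +1.60 and +0.20.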

## The Fix

**Break the reference cycle** by storing `block_hasher` directly without
`partial`, and passing `self` explicitly at call sites:

```python
# BEFORE (creates cycle):
self.get_hash_new_full_blocks = partial(block_hasher, self)
self.block_hashes = self.get_hash_new_full_blocks()

# AFTER (no cycle):
self._block_hasher = block_hasher
self.block_hashes = self._block_hasher(self)
```

Without the cycle, `Request` objects are freed **immediately** by reference
counting when the scheduler removes them -- no cyclic GC needed. This
eliminates the leak regardless of GC frequency or `gc.freeze()` behavior.

### Files changed

- **`vllm/v1/request.py`** -- Store `_block_hasher` instead of
`partial(block_hasher, self)`. Update `append_output_token_ids()` to call
`self._block_hasher(self)`.
- **`vllm/v1/core/sched/scheduler.py`** -- Update session block hash call
site from `session.get_hash_new_full_blocks()` to
`session._block_hasher(session)`.
- **`tests/v1/core/test_async_scheduler.py`** -- Update test call site.

### Verification (unit test)

With cyclic GC disabled (`gc.disable()`), create 100 Request objects with
prefix caching and delete all external references:

| | main (cycle) | fix (no cycle) |
|---|---|---|
| Objects alive after `del` | **100** (all leaked) | **0** (all freed) |
| Freed by `gc.collect()` | 100 | 0 (nothing to collect) |
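The shape of this check can be sketched without vLLM. The classes below are hypothetical toys that mirror only the BEFORE and AFTER binding patterns from the fix; this is not the actual unit test, just one way such a test could be structured.

```python
import gc
import weakref
from functools import partial


def toy_hasher(request):
    """Hypothetical stand-in for the block hasher."""
    return []


class CyclicRequest:
    """Mirrors the BEFORE pattern: the partial captures self."""

    def __init__(self, hasher):
        self.get_hashes = partial(hasher, self)
        self.hashes = self.get_hashes()


class AcyclicRequest:
    """Mirrors the AFTER pattern: store the hasher, pass self at call time."""

    def __init__(self, hasher):
        self._hasher = hasher
        self.hashes = self._hasher(self)


def leak_counts(cls, n=100):
    """Return (objects alive after del, objects freed by gc.collect())."""
    was_enabled = gc.isenabled()
    gc.disable()  # only reference counting may free objects from here on
    try:
        requests = [cls(toy_hasher) for _ in range(n)]
        probes = [weakref.ref(req) for req in requests]
        del requests  # drop all external references
        alive = sum(p() is not None for p in probes)
        gc.collect()  # explicit cyclic collection
        still_alive = sum(p() is not None for p in probes)
    finally:
        if was_enabled:
            gc.enable()
    return alive, alive - still_alive


print("cycle:   ", leak_counts(CyclicRequest))
print("no cycle:", leak_counts(AcyclicRequest))
```

With the cyclic variant every object survives `del` and is reclaimed only by the explicit collection; with the acyclic variant nothing is left for the collector.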

All 137 existing tests pass (86 scheduler + 43 kv_cache_utils + 8 async
scheduler).