[Bugfix][MultiConnector] Fix MultiConnector for SupportsHMA sub-connectors#36549
ZhanqiuHu wants to merge 2 commits into vllm-project:main
Conversation
`MultiConnector.request_finished` calls `c.request_finished()` on all sub-connectors, but `NixlConnector` (SupportsHMA) only overrides `request_finished_all_groups`. The call falls through to the base-class no-op, so `kv_transfer_params` is never generated and NIXL transfers silently fail.

Fix: check whether a sub-connector implements `SupportsHMA` and dispatch to `request_finished_all_groups` accordingly.

Fixes vllm-project#36547
Signed-off-by: Zhanqiu Hu <zh338@cornell.edu>
Code Review
This pull request fixes a bug where MultiConnector failed to correctly handle SupportsHMA sub-connectors like NixlConnector, causing silent failures. The change correctly dispatches to request_finished_all_groups for HMA-aware connectors. However, this fix is incomplete for models with multiple KV cache groups, as MultiConnector does not implement the SupportsHMA interface and can only handle a single block group. I've left a critical comment with a suggestion for a more robust implementation to prevent potential data loss or resource leaks in multi-group scenarios.
```python
if isinstance(c, SupportsHMA):
    async_save, txfer_params = c.request_finished_all_groups(
        request, (blocks,)
    )
else:
    async_save, txfer_params = c.request_finished(request, blocks)
```
This change correctly dispatches to request_finished_all_groups for SupportsHMA sub-connectors, but it only handles a single group of KV cache blocks. The request_finished method's signature (blocks: list[int]) means it can only receive one group.
For models with multiple KV cache groups (e.g., using sliding window attention), this implementation will only process one group. The KV cache for other groups will not be saved or handled by HMA-aware connectors, which can lead to silent data loss or resource leaks.
A more robust solution is for MultiConnector to implement the SupportsHMA interface. This would allow it to receive all block groups from the scheduler and delegate them correctly. This would involve:
- Adding `SupportsHMA` to `MultiConnector`'s base classes.
- Implementing `request_finished_all_groups` in `MultiConnector` to handle `tuple[list[int], ...]`, delegating to sub-connectors appropriately.
- Updating `request_finished` to delegate to the new method for backward compatibility.
Here is an example of a more robust implementation:
```python
# In vllm/distributed/kv_transfer/kv_connector/v1/multi_connector.py

# Add SupportsHMA to the class definition
class MultiConnector(KVConnectorBase_V1, SupportsHMA):
    # ... (rest of the class)

    # Keep the existing request_finished for backward compatibility
    def request_finished(
        self,
        request: "Request",
        blocks: list[int],
    ) -> tuple[bool, dict[str, Any] | None]:
        return self.request_finished_all_groups(request, (blocks,))

    # Implement request_finished_all_groups
    def request_finished_all_groups(
        self,
        request: "Request",
        block_ids: tuple[list[int], ...],
    ) -> tuple[bool, dict[str, Any] | None]:
        async_saves = 0
        kv_txfer_params = None
        for c in self._connectors:
            if isinstance(c, SupportsHMA):
                async_save, txfer_params = c.request_finished_all_groups(
                    request, block_ids
                )
            else:
                # For non-HMA connectors, pass only the first group of blocks.
                async_save, txfer_params = c.request_finished(
                    request, block_ids[0] if block_ids else []
                )
            if async_save:
                async_saves += 1
            if txfer_params is not None:
                if kv_txfer_params is not None:
                    raise RuntimeError(
                        "Only one connector can produce KV transfer params"
                    )
                kv_txfer_params = txfer_params
        if async_saves > 1:
            self._extra_async_saves[request.request_id] = async_saves - 1
        # Clean up other state for this request.
        self._requests_to_connector.pop(request.request_id, None)
        return async_saves > 0, kv_txfer_params
```

Add a `request_finished` shim to `NixlConnector` that wraps `block_ids` and delegates to `request_finished_all_groups`. This ensures `MultiConnector` (and any other caller) can use the standard `request_finished` interface. Reverts the `MultiConnector` `isinstance` check in favor of this approach.

Fixes vllm-project#36547
Signed-off-by: Zhanqiu Hu <zh338@cornell.edu>
Hi @NickLucche, I just updated the fix. Added a `request_finished` shim to `NixlConnector`.
Summary
Fixes #36547
`MultiConnector.request_finished` calls `c.request_finished()` on all sub-connectors, but `NixlConnector` (SupportsHMA) only overrides `request_finished_all_groups`. The call falls through to the base-class no-op, so `kv_transfer_params` is never generated and NIXL transfers silently fail.

Fix: check whether a sub-connector implements `SupportsHMA` and dispatch to `request_finished_all_groups` accordingly.

Test plan
See #36547 for reproduction steps, test script, and detailed output. Validated with a 2-GPU prefill/decode setup using `Qwen/Qwen3-0.6B`.

Test result (Are CI test cases needed for this?)

Before fix:
- `external_kv_transfer = 0` after proxy request (NIXL never fires)
- `local_compute`

After fix:
- `external_kv_transfer = 38` (all prompt tokens transferred via NIXL)
- `local_compute = 1` (only the recomputed last token)