
[Bugfix][MultiConnector] Fix MultiConnector for SupportsHMA sub-connectors#36549

Open
ZhanqiuHu wants to merge 2 commits intovllm-project:mainfrom
ZhanqiuHu:fix/multi-connector-hma-request-finished

Conversation


ZhanqiuHu (Contributor) commented on Mar 9, 2026

Summary

Fixes #36547

MultiConnector.request_finished calls c.request_finished() on all sub-connectors, but NixlConnector (SupportsHMA) only overrides request_finished_all_groups. The call falls through to the base class no-op, so kv_transfer_params is never generated and NIXL transfers silently fail.

Fix: Check if sub-connector implements SupportsHMA and dispatch to request_finished_all_groups accordingly.

Test plan

See #36547 for reproduction steps, test script, and detailed output. Validated with a 2-GPU prefill/decode setup using Qwen/Qwen3-0.6B.

Test result (Are CI test cases needed for this?)

Before fix:

  • external_kv_transfer = 0 after proxy request (NIXL never fires)
  • All prompt tokens fall back to local_compute

After fix:

  • external_kv_transfer = 38 (all prompt tokens transferred via NIXL)
  • local_compute = 1 (only the recomputed last token)

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@ZhanqiuHu ZhanqiuHu changed the title [Bugfix] Fix MultiConnector.request_finished for SupportsHMA sub-connectors [Bugfix][MultiConnector] Fix MultiConnector for SupportsHMA sub-connectors Mar 9, 2026
@mergify mergify bot added bug Something isn't working kv-connector labels Mar 9, 2026
gemini-code-assist bot left a comment

Code Review

This pull request fixes a bug where MultiConnector failed to correctly handle SupportsHMA sub-connectors like NixlConnector, causing silent failures. The change correctly dispatches to request_finished_all_groups for HMA-aware connectors. However, this fix is incomplete for models with multiple KV cache groups, as MultiConnector does not implement the SupportsHMA interface and can only handle a single block group. I've left a critical comment with a suggestion for a more robust implementation to prevent potential data loss or resource leaks in multi-group scenarios.

Comment on lines +397 to +402
if isinstance(c, SupportsHMA):
    async_save, txfer_params = c.request_finished_all_groups(
        request, (blocks,)
    )
else:
    async_save, txfer_params = c.request_finished(request, blocks)

critical

This change correctly dispatches to request_finished_all_groups for SupportsHMA sub-connectors, but it only handles a single group of KV cache blocks. The request_finished method's signature (blocks: list[int]) means it can only receive one group.

For models with multiple KV cache groups (e.g., using sliding window attention), this implementation will only process one group. The KV cache for other groups will not be saved or handled by HMA-aware connectors, which can lead to silent data loss or resource leaks.

A more robust solution is for MultiConnector to implement the SupportsHMA interface. This would allow it to receive all block groups from the scheduler and delegate them correctly. This would involve:

  1. Adding SupportsHMA to MultiConnector's base classes.
  2. Implementing request_finished_all_groups in MultiConnector to handle tuple[list[int], ...], delegating to sub-connectors appropriately.
  3. Updating request_finished to delegate to the new method for backward compatibility.

Here is an example of a more robust implementation:

# In vllm/distributed/kv_transfer/kv_connector/v1/multi_connector.py

# Add SupportsHMA to class definition
class MultiConnector(KVConnectorBase_V1, SupportsHMA):
    # ... (rest of the class)

    # Keep existing request_finished for backward compatibility
    def request_finished(
        self,
        request: "Request",
        blocks: list[int],
    ) -> tuple[bool, dict[str, Any] | None]:
        return self.request_finished_all_groups(request, (blocks,))

    # Implement request_finished_all_groups
    def request_finished_all_groups(
        self,
        request: "Request",
        block_ids: tuple[list[int], ...],
    ) -> tuple[bool, dict[str, Any] | None]:
        async_saves = 0
        kv_txfer_params = None
        for c in self._connectors:
            if isinstance(c, SupportsHMA):
                async_save, txfer_params = c.request_finished_all_groups(
                    request, block_ids
                )
            else:
                # For non-HMA connectors, pass only the first group of blocks.
                async_save, txfer_params = c.request_finished(
                    request, block_ids[0] if block_ids else []
                )

            if async_save:
                async_saves += 1
            if txfer_params is not None:
                if kv_txfer_params is not None:
                    raise RuntimeError(
                        "Only one connector can produce KV transfer params"
                    )
                kv_txfer_params = txfer_params
        
        if async_saves > 1:
            self._extra_async_saves[request.request_id] = async_saves - 1

        # Clean up other state for this request.
        self._requests_to_connector.pop(request.request_id, None)

        return async_saves > 0, kv_txfer_params

Add request_finished shim to NixlConnector that wraps block_ids and
delegates to request_finished_all_groups. This ensures MultiConnector
(and any other caller) can use the standard request_finished interface.

Reverts the MultiConnector isinstance check in favor of this approach.

Fixes vllm-project#36547

Signed-off-by: Zhanqiu Hu <zh338@cornell.edu>
ZhanqiuHu force-pushed the fix/multi-connector-hma-request-finished branch from b5972ff to 29820cc on March 10, 2026 16:05
ZhanqiuHu (Contributor, Author) commented:

Hi @NickLucche, I just updated the fix. Added a request_finished() function in nixl_connector.py.



Development

Successfully merging this pull request may close these issues.

[Bug] MultiConnector: NIXL transfers silently broken after HMA migration
