Add tensor IPC transfer mechanism for multimodal data #32104
brandonpelfrey wants to merge 48 commits into vllm-project:main
Conversation
Code Review
This pull request introduces a significant feature: an IPC/SHMEM pathway for multimodal tensors to improve performance in multi-GPU setups. The changes are extensive, touching configuration, argument parsing, engine core logic, and serialization utilities. The addition of comprehensive tests for the new IPC queue functionality is commendable. I've identified a critical issue in the engine core logic that could break data parallelism for non-MoE models, along with a couple of important bug fixes for handling tensors on CUDA devices in multimodal data processing.
Hi @brandonpelfrey, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
Just to be sure, the benchmarks in the PR description were run without GPU preprocessing, right?
Correct. There is no GPU preprocessing on the API Server. Note: I'm resolving some of the bot-identified issues at the moment (formatting, etc.).
Force-pushed from 8ead3dd to 96e9bcf.
Force-pushed purely to resolve DCO.
Documentation preview: https://vllm--32104.org.readthedocs.build/en/32104/
Apologies for pinging many reviewers. I made a mistake in my merge from main; I will resolve this and then request the correct reviewers again.
Signed-off-by: Brandon Pelfrey <bpelfrey@nvidia.com>
This pull request has merge conflicts that must be resolved before it can be merged.
@njhill I've addressed your concerns. Please let me know if I've missed or misunderstood anything. As mentioned above, my mistake added a number of other reviewers; they could/should be removed.
njhill left a comment:
@brandonpelfrey apologies again for taking so long to re-review. Again it's quite a large PR with nontrivial changes to core parts of the code, so needed to dedicate some time to it.
Generally it would be appreciated if you could spend more time on self-review.
And it would still be good to understand how we are thinking about reconciling this with the existing mm shm-based tensor propagation (i.e. from #20452).
I still feel that this can be better decoupled. For example, the encoder/decoder should not need to know about the request id, tensor id, etc.; that can all be handled by the TensorIpcSender/TensorIpcReceiver.
The handling in these should be very similar to how aux_buffers are already handled, just that this list of buffers is consumed/provided by the ipc sender/receivers (possibly via callbacks), and all their logic can live in the other file.
Also similar to the interface for pickle PEP 574 https://peps.python.org/pep-0574/
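For reference, a minimal stdlib sketch of the PEP 574 interface mentioned above: pickle protocol 5 keeps large buffers out of the serialized byte stream via a `buffer_callback`, and the receiving side supplies them back at load time. This is the same shape as having the IPC sender/receiver consume/provide the encoder's buffer list (here the `buffers` list stands in for that role).

```python
import pickle

# The payload (standing in for a multimodal tensor) never enters the stream;
# buffer_callback collects it out-of-band. A callback returning a falsy
# value (list.append returns None) marks the buffer as out-of-band.
payload = bytearray(b"decoded-video-frame")
buffers: list[pickle.PickleBuffer] = []
stream = pickle.dumps(
    pickle.PickleBuffer(payload), protocol=5, buffer_callback=buffers.append
)

# The receiving side supplies the buffers separately; the stream itself only
# carries a reference, not a copy of the payload.
restored = pickle.loads(stream, buffers=buffers)
assert bytes(restored) == bytes(payload)
```

The appeal of this shape is that the serializer stays ignorant of where the buffers come from or go to; only the sender/receiver layer handles transport.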
```python
if isinstance(arr, TensorIpcHandle):
    return self._decode_ipc_queue_tensor(arr)
# Check if this is a dict that represents a TensorIpcHandle
# (msgspec serializes dataclasses as dicts without type info)
if (
    isinstance(arr, dict)
    and "tensor_id" in arr
    and "shape" in arr
    and "dtype" in arr
    and "device" in arr
):
    # Convert dict to TensorIpcHandle and decode it
    handle = TensorIpcHandle(**arr)
    return self._decode_ipc_queue_tensor(handle)
```
I don't think these will ever happen. TensorIpcHandle will get serialized by msgspec as a list.
action item: I'll use a unit test + coverage to ensure the logic is actually exercised.
```python
if isinstance(obj, TensorIpcHandle):
    return self._decode_ipc_queue_tensor(obj)
# Check if this is a dict that represents a TensorIpcHandle
# (msgspec serializes dataclasses as dicts without type info
# in nested structures)
if (
    isinstance(obj, dict)
    and "tensor_id" in obj
    and "shape" in obj
    and "dtype" in obj
    and "device" in obj
):
    # Convert dict to TensorIpcHandle and decode it
    # Handle both new format (with request_id) and old format (without)
    handle = TensorIpcHandle(**obj)
    return self._decode_ipc_queue_tensor(handle)
```
Same comment as above, I don't think this will actually happen. But it doesn't look like you cover the list case here?
```python
def cleanup_request_tensors(self, request_id: str) -> int:
    """Remove all orphaned tensors associated with a request.

    Pass-through to the TensorIpcReceiver. Returns 0 if no receiver.
    """
    if self.tensor_ipc_receiver is None:
        return 0
    return self.tensor_ipc_receiver.cleanup_request_tensors(request_id)
```
This should be handled outside of this class (per the other comment, this class shouldn't need to know about request_id).
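A minimal sketch of that decoupling, with all names hypothetical: the receiver owns the request_id-to-tensors bookkeeping, and the engine's abort path calls it directly, so the decoder class stays request-agnostic.

```python
# Hypothetical sketch only: the receiver keeps per-request bookkeeping so the
# decoder never sees request ids. Names are illustrative, not the PR's API.
from collections import defaultdict


class TensorIpcReceiver:
    def __init__(self) -> None:
        self._by_request: dict[str, list[object]] = defaultdict(list)

    def register(self, request_id: str, tensor: object) -> None:
        # Called by the transport layer as tensors arrive for a request.
        self._by_request[request_id].append(tensor)

    def cleanup_request_tensors(self, request_id: str) -> int:
        # Drop every tensor still held for this request; return how many.
        return len(self._by_request.pop(request_id, []))


# The engine core's abort path talks to the receiver directly.
rx = TensorIpcReceiver()
rx.register("req-1", "tensor-a")
rx.register("req-1", "tensor-b")
assert rx.cleanup_request_tensors("req-1") == 2
assert rx.cleanup_request_tensors("req-1") == 0
```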
Introduce Multimodal Content Tensor IPC/SHMEM Data Path
Following on from a request to break down the RFC/PR in #31925, this PR introduces an IPC/SHMEM pathway for sending multimodal content from the API Server to CoreEngine processes via multiprocessing Queues. Part of the intention of this change is to reduce the size of the original PR and introduce easier-to-review components that are required for the complete solution.
Note that this pathway is only used when the multimodal processing cache is disabled: the cache mechanism replaces tensors with integers, so the tensors never travel over this new multiprocessing queue mechanism. While this is a known limitation, the pathway is still useful in many situations involving single-input-prompt tasks, e.g. video captioning.
Purpose
In the above-mentioned RFC/PR, we demonstrated a method for enabling multi-GPU scaling in video-decode-heavy workloads. This requires a means of passing HW video decode results (sitting in VRAM in the API Server process) to the CoreEngine process(es). When used with CUDA-device tensors, this IPC/SHMEM mechanism provides fast data transfer and avoids any GPU->CPU->GPU copies.
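As an illustration of the general idea (stdlib only; the PR itself uses torch.multiprocessing.Queue and CUDA IPC handles, not this code): only a small handle crosses the process boundary, while the payload stays in shared memory and is never copied through the message channel.

```python
from multiprocessing import shared_memory

# Producer side: place the payload (e.g. a decoded video frame) in shared
# memory; only the segment name and size need to be sent over a queue.
frame = b"frame-data"
shm = shared_memory.SharedMemory(create=True, size=len(frame))
shm.buf[: len(frame)] = frame
handle = (shm.name, len(frame))  # the tiny message that actually crosses IPC

# Consumer side: attach by name and read the payload directly, without the
# payload itself ever being serialized into the message.
name, size = handle
view = shared_memory.SharedMemory(name=name)
received = bytes(view.buf[:size])
view.close()

shm.close()
shm.unlink()
assert received == frame
```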
Test Plan
For testing, I am relying on existing CI coverage. Functional testing includes both a vllm serve + vllm bench combination (see below) and using this PR together with the above-mentioned PR to demonstrate that it also works in the GPU zero-copy case.
Serve command and Bench Commands
Test Results
CPU: AMD EPYC 9124 16-Core Processor
GPU: H100
Memory: 512GB
uname -r: 6.14.0-37-generic

Without IPC Tensor Datapath enabled
With IPC Tensor Datapath enabled
After multiple runs, it appears that the performance is approximately identical (within noise).
Note
Implements zero-copy IPC for multimodal tensors and wires it through engine/client startup with configs and tests.

- Adds `TensorIpcData`/`TensorIpcHandle` and updates `MsgpackEncoder`/`MsgpackDecoder` to send/receive CUDA and CPU tensors via `torch.multiprocessing.Queue` (per-engine queues), routed by `set_target_engine`; falls back to standard serialization when disabled/unavailable
- Queues are created per DP engine in `vllm.v1.engine.utils`, included in handshake metadata (index only), passed to `EngineCoreProc`/`DPEngineCoreProc`, and used by input decoders; `CoreEngineClient` configures the encoder with queues and the IPC setting
- `MultiModalConfig` gains `max_concurrent_videos` and `multimodal_tensor_ipc`; `ModelConfig`/arg parsing expose `--maximum-concurrent-videos` and `--enable/--disable-multimodal-tensor-ipc`; new env `VLLM_MULTIMODAL_TENSOR_IPC` (default True)
- `tensor_queues` are shared with API workers via `APIServerProcessManager`; CLI `serve` passes queues
- Adds `tests/v1/test_tensor_ipc_queue.py` for CUDA/CPU IPC, multiple producers, buffer management, and IPC disablement

Written by Cursor Bugbot for commit d527841933b4dcd62a95d7e1ab58455f9b0cc88f. This will update automatically on new commits.
TensorIpcData/TensorIpcHandleand updatesMsgpackEncoder/MsgpackDecoderto send/receive tensors via per-enginetorch.multiprocessing.Queue(routing viaset_target_engine, request context, buffering, and cleanup; fallback to standard serialization). Fixes decoding of non-multimodaltorch.Tensorfields when IPC is enabled.vllm/v1/engine/utils.py, queue index included in handshake, passed intoEngineCoreProc/DPEngineCoreProc, and consumed by core input decoders; client configures encoder with queues and uses a request-scoped context.EngineCoreProc.abort_requestsand passestensor_queuesto API workers inserve/APIServerProcessManager.multimodal_tensor_ipcinMultiModalConfig/ModelConfig, CLI--enable/--disable-multimodal-tensor-ipc, and envVLLM_MULTIMODAL_TENSOR_IPC.tests/v1/test_tensor_ipc_queue.py, plus cases intest_serial_utils.py) cover CPU/CUDA IPC, multi-producer behavior, buffer management/cleanup, IPC disabled mode, and encoder request context.Written by Cursor Bugbot for commit c756621. This will update automatically on new commits. Configure here.