[Fix] Fix SGLang tensor parallel rank bug and TP IPC broadcast logic#234
rishabhsinha17 wants to merge 2 commits into ovg-project:main
Conversation
This PR fixes a critical bug in SGLang's tensor parallel support and improves overall code quality:

## Bug Fix

- Fix the `start_worker_listener_thread()` call to use `tp_rank` instead of `torch.cuda.current_device()` in the SGLang integration. The socket path is based on rank, not device ID, so using the wrong value causes IPC communication failures in multi-GPU setups.
- Add an `is_worker` parameter to SGLang's `init_kvcached()` to match vLLM's behavior: only workers should start the listener thread.
- Add `_tp_size` global state to the SGLang interfaces and pass it to `KVCacheManager` for proper tensor parallel coordination.
- Update the SGLang patches to detect `tp_rank` and `tp_size` from SGLang's distributed state via `get_tp_group()`.

## Code Quality Improvements

- Replace bare `except` clauses in `autopatch.py` with specific exception handling (`ImportError` vs. other exceptions) and add debug/warning logging.
- Add comprehensive docstrings to all public API functions in both the vLLM and SGLang integration modules.
- Add type hints to global module state variables.
- Add input validation for the `limit-percent` CLI command (0-100 range).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
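The rank-vs-device distinction in the first fix can be made concrete with a small sketch. This is illustrative only (the socket-path helper and its format string below are assumptions, not the actual kvcached code): when each TP worker process is launched with `CUDA_VISIBLE_DEVICES` remapping, every process can see its own GPU as device 0, so keying the socket path by `torch.cuda.current_device()` collides across ranks, while keying it by `tp_rank` stays unique.

```python
# Illustrative sketch, not the actual kvcached code: why the listener socket
# must be keyed by TP rank, not by CUDA device ID.

def get_worker_socket_path(rank: int) -> str:
    # Hypothetical rank-based socket path, mirroring the fix's description.
    return f"/tmp/kvcached_worker_{rank}.sock"

tp_ranks = [0, 1, 2, 3]
# With CUDA_VISIBLE_DEVICES remapping, each worker process may see its own
# GPU as device 0, so current_device() returns 0 in every process:
device_ids = [0, 0, 0, 0]

paths_by_rank = {get_worker_socket_path(r) for r in tp_ranks}
paths_by_device = {get_worker_socket_path(d) for d in device_ids}
assert len(paths_by_rank) == 4    # one unique socket per worker
assert len(paths_by_device) == 1  # all workers collide on the same path
```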
This commit fixes critical bugs in the tensor parallel IPC logic:

## Bug Fixes

1. **Broadcast functions skip rank 0**: The `broadcast_*` functions now send messages only to worker ranks (1 to `tp_size - 1`), not to rank 0, which has no listener thread. Previously the code iterated `range(tp_size)` instead of `range(1, tp_size)`.
2. **Local + remote check for `kv_tensors_created`**: The coordinator now checks locally first, then broadcasts to workers. Previously, in TP mode it only checked remote workers without checking locally.
3. **Map/unmap always runs locally first**: The page allocator now always calls `map_to_kv_tensors`/`unmap_from_kv_tensors` locally on rank 0, then broadcasts to workers if `tp_size > 1`. The original code had an incorrect either/or structure.

## Improvements

- Add debug logging when SGLang TP detection falls back to single-GPU mode, to aid debugging of TP configuration issues.
- Reset `_tp_size` in `shutdown_kvcached()` for proper cleanup (vLLM).
- Add type hints to global module state variables (vLLM).
Pull request overview
This PR fixes tensor-parallel (TP) coordination bugs by ensuring rank 0 acts as the coordinator (local operations first) and by preventing IPC broadcasts from incorrectly targeting rank 0, plus it improves integration APIs and error handling/logging.
Changes:
- Fix TP IPC broadcasts to target only worker ranks (1..tp_size-1) and update coordinator logic to always do local work first.
- Update KVCacheManager initialization gating to check KV tensor readiness locally and across workers.
- Improve the SGLang/vLLM integration APIs (add `is_worker`, track `_tp_size`, reset state on shutdown) and tighten CLI/autopatch behavior.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| kvcached/tp_ipc_util.py | Broadcast helpers now skip rank 0 and handle tp_size<=1 early. |
| kvcached/page_allocator.py | Coordinator now maps/unmaps locally first, then broadcasts to workers. |
| kvcached/kv_cache_manager.py | KV tensor readiness check now combines local + worker readiness in TP mode. |
| kvcached/integration/vllm/interfaces.py | Adds docstrings/type hints, supports is_worker, tracks _tp_size, resets on shutdown. |
| kvcached/integration/sglang/patches.py | Detects TP rank/size from SGLang and passes is_worker/TP params into init. |
| kvcached/integration/sglang/interfaces.py | Adds is_worker + _tp_size plumbing and cleanup parity with vLLM. |
| kvcached/cli/kvctl.py | Adds range validation for limit-percent and improves error output. |
| kvcached/autopatch.py | Adds module docstring and replaces bare except with structured logging. |
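The `limit-percent` range validation mentioned for `kvctl.py` can be sketched as follows. This is a hedged illustration (the option name and use of `argparse` are assumptions about the CLI's structure, not the actual `kvctl.py` code): a custom type function rejects out-of-range values before the command runs.

```python
# Illustrative sketch of a 0-100 range check for a limit-percent CLI option.
# Option names and parser structure are assumptions, not the real kvctl.py.
import argparse

def percent(value: str) -> float:
    p = float(value)
    if not 0 <= p <= 100:
        raise argparse.ArgumentTypeError(
            f"limit-percent must be in [0, 100], got {p}")
    return p

parser = argparse.ArgumentParser()
parser.add_argument("--limit-percent", type=percent)

args = parser.parse_args(["--limit-percent", "75"])
assert args.limit_percent == 75.0
```

Values such as `150` then fail at parse time with a clear error message instead of propagating a bad limit into the allocator.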
```python
offsets = [pid * self.page_size for pid in page_ids]
if self.tp_size > 1:  # map pages across all tensor parallel workers.
    # Always map locally on rank 0 (coordinator)
    map_to_kv_tensors(offsets)
```
map_to_kv_tensors() returns a boolean success flag (see csrc/torch_bindings.cpp), but the result is ignored here. If mapping fails on rank 0 (e.g., KV tensors not created yet), this will silently continue and then potentially broadcast, leaving ranks in an inconsistent state. Consider checking the return value and raising a RuntimeError on failure before broadcasting.
Suggested change:

```diff
-map_to_kv_tensors(offsets)
+success = map_to_kv_tensors(offsets)
+if not success:
+    raise RuntimeError(
+        f"Failed to map KV tensors for {len(offsets)} page offset(s) on coordinator rank 0"
+    )
```
```python
# Always unmap locally on rank 0 (coordinator)
if self.async_sched:
    torch.cuda.synchronize()
unmap_from_kv_tensors(offsets)
```
unmap_from_kv_tensors() returns a boolean success flag (see csrc/torch_bindings.cpp), but the result is ignored here. If unmapping fails locally, the code will still broadcast the unmap and proceed as if successful. Consider checking the return value and raising a RuntimeError on failure before broadcasting.
Suggested change:

```diff
-unmap_from_kv_tensors(offsets)
+success = unmap_from_kv_tensors(offsets)
+if not success:
+    raise RuntimeError(f"Failed to unmap KV tensors for offsets: {offsets}")
```
```diff
 def _check_kv_tensors_created():
+    # Always check locally on rank 0 (coordinator)
+    local_created = kv_tensors_created()
     if self.tp_size > 1:
-        return broadcast_kv_tensors_created(self.tp_size)
-    else:
-        return kv_tensors_created()
+        # Also check all workers (ranks 1..tp_size-1)
+        workers_created = broadcast_kv_tensors_created(self.tp_size)
+        return local_created and workers_created
+    return local_created
```
This changes the TP readiness check semantics (local + workers) but there’s no test coverage ensuring the coordinator combines local_created and workers_created correctly (especially for tp_size>1). Consider adding a unit test in tests/test_kvcache_manager.py that mocks kv_tensors_created and broadcast_kv_tensors_created to assert the AND behavior and that the local check is performed.
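The suggested test could take roughly the following shape. This is a sketch only: it exercises a stand-in for the coordinator logic with mocked readiness functions, since the real test would need to patch `kv_tensors_created` and `broadcast_kv_tensors_created` at their actual import sites in `kvcached/kv_cache_manager.py` (paths assumed, not verified here).

```python
# Sketch of the suggested unit test. The function below is a stand-in for
# the coordinator's readiness check; a real test would mock.patch the
# module-level kv_tensors_created / broadcast_kv_tensors_created instead.
from unittest import mock

def check_kv_tensors_created(tp_size, kv_tensors_created,
                             broadcast_kv_tensors_created):
    # Mirrors the new coordinator semantics: local check first, then workers.
    local_created = kv_tensors_created()
    if tp_size > 1:
        return local_created and broadcast_kv_tensors_created(tp_size)
    return local_created

# TP mode: result is the AND of local and worker readiness.
local = mock.Mock(return_value=True)
workers = mock.Mock(return_value=False)
assert check_kv_tensors_created(2, local, workers) is False
local.assert_called_once()    # local check is always performed
workers.assert_called_once_with(2)

# Single-GPU mode: only the local check runs.
local = mock.Mock(return_value=True)
workers = mock.Mock(return_value=True)
assert check_kv_tensors_created(1, local, workers) is True
workers.assert_not_called()
```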
```python
import logging
from importlib import import_module

logger = logging.getLogger(__name__)
```
This module uses `logging.getLogger(__name__)` while the rest of the codebase consistently uses `get_kvcached_logger()` (e.g., `kvcached/utils.py:get_kvcached_logger`, `kvcached/integration/*`). Using the standard logger may result in missing handlers/formatting and unintended propagation into the host application's logging. Consider switching to `get_kvcached_logger("kvcached")` (or similar) for consistent configuration and to avoid log duplication.
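A hedged sketch of what the switch could look like. The `get_kvcached_logger` import path follows the comment above (`kvcached/utils.py`), but its exact signature is an assumption; the `ImportError` fallback keeps the snippet runnable outside the repo.

```python
# Sketch: adopting the project's logger helper in autopatch.py.
# get_kvcached_logger's location/signature is assumed from the review
# comment above; the fallback below is only so this snippet runs standalone.
import logging

try:
    from kvcached.utils import get_kvcached_logger  # assumed project helper
except ImportError:
    def get_kvcached_logger(name: str = "kvcached") -> logging.Logger:
        return logging.getLogger(name)

logger = get_kvcached_logger("kvcached")
logger.debug("autopatch module loaded")
```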
Summary

This PR fixes critical bugs in SGLang's tensor parallel support and the core TP IPC broadcast logic.

Bug Fixes

1. **SGLang TP rank bug**: Fix `start_worker_listener_thread()` to use `tp_rank` instead of `torch.cuda.current_device()`. The socket path is based on rank, not device ID, causing IPC failures in multi-GPU setups.
2. **Broadcast functions sending to rank 0**: The `broadcast_*` functions were iterating `range(tp_size)` instead of `range(1, tp_size)`, sending messages to rank 0, which has no listener thread.
3. **Local + remote check for `kv_tensors_created`**: In TP mode, the coordinator now checks locally first, then broadcasts to workers. Previously it only checked remote workers.
4. **Map/unmap always runs locally first**: The page allocator now always calls `map_to_kv_tensors`/`unmap_from_kv_tensors` locally on rank 0, then broadcasts to workers if `tp_size > 1`.

Improvements

- Add an `is_worker` parameter to SGLang's `init_kvcached()` to match vLLM's behavior
- Add `_tp_size` global state to the SGLang interfaces for proper TP coordination
- Detect `tp_rank` and `tp_size` from `get_tp_group()`
- Reset `_tp_size` in `shutdown_kvcached()` for proper cleanup
- Replace bare `except` clauses with specific exception handling

Test Plan
Supersedes #226