
[Fix] Fix SGLang tensor parallel rank bug and TP IPC broadcast logic#234

Open
rishabhsinha17 wants to merge 2 commits into ovg-project:main from rishabhsinha17:fix/sglang-tp-rank-and-improvements

Conversation

@rishabhsinha17

Summary

This PR fixes critical bugs in SGLang's tensor parallel support and the core TP IPC broadcast logic.

Bug Fixes

  1. SGLang TP rank bug: Fix start_worker_listener_thread() to use tp_rank instead of torch.cuda.current_device(). The socket path is derived from the rank, not the device ID, so passing the device ID caused IPC failures in multi-GPU setups.

  2. Broadcast functions sending to rank 0: The broadcast_* functions were iterating range(tp_size) instead of range(1, tp_size), so they also sent messages to rank 0, which has no listener thread.

  3. Local + remote check for kv_tensors_created: In TP mode, the coordinator now checks locally first, then broadcasts to workers. Previously it only checked remote workers.

  4. Map/unmap always runs locally first: The page allocator now always calls map_to_kv_tensors/unmap_from_kv_tensors locally on rank 0, then broadcasts to workers if tp_size > 1.
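The rank-0 coordinator pattern fixed in items 2-4 can be sketched as follows. All names here are illustrative stand-ins, not the actual kvcached API; map_local and send abstract the real local mapping call and IPC send.

```python
# Illustrative sketch of the fixed coordinator pattern; map_local and
# send are stand-ins for the real kvcached primitives.

def broadcast_map_to_kv_tensors(tp_size, offsets, send):
    """Send a map request to worker ranks only.

    Rank 0 is the coordinator and runs no listener thread, so the
    loop must be range(1, tp_size), not range(tp_size).
    """
    for rank in range(1, tp_size):
        send(rank, ("map_to_kv_tensors", offsets))

def map_pages(page_ids, page_size, tp_size, map_local, send):
    offsets = [pid * page_size for pid in page_ids]
    # Always map locally on rank 0 (coordinator) first...
    map_local(offsets)
    # ...then broadcast to the workers when running tensor parallel.
    if tp_size > 1:
        broadcast_map_to_kv_tensors(tp_size, offsets, send)
```

The same "local first, then broadcast to ranks 1..tp_size-1" shape applies to unmap and to the readiness check.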

Improvements

  • Add is_worker parameter to SGLang's init_kvcached() to match vLLM's behavior
  • Add _tp_size global state to SGLang interfaces for proper TP coordination
  • Update SGLang patches to detect tp_rank and tp_size from get_tp_group()
  • Add debug logging when SGLang TP detection falls back to single-GPU mode
  • Reset _tp_size in shutdown_kvcached() for proper cleanup
  • Add type hints to global module state variables
  • Replace bare except clauses with specific exception handling

Test Plan

  • Verify Python syntax is valid
  • Test with SGLang single-GPU inference
  • Test with SGLang tensor parallel (multi-GPU) inference
  • Test with vLLM to ensure no regressions

Supersedes #226

yurekami and others added 2 commits December 27, 2025 23:44
This PR fixes a critical bug in SGLang's tensor parallel support and
improves overall code quality:

## Bug Fix
- Fix `start_worker_listener_thread()` call to use `tp_rank` instead of
  `torch.cuda.current_device()` in SGLang integration. The socket path
  is based on rank, not device ID, so using the wrong value causes IPC
  communication failures in multi-GPU setups.
- Add `is_worker` parameter to SGLang's `init_kvcached()` to match vLLM's
  behavior - only workers should start the listener thread.
- Add `_tp_size` global state to SGLang interfaces and pass it to
  `KVCacheManager` for proper tensor parallel coordination.
- Update SGLang patches to detect `tp_rank` and `tp_size` from SGLang's
  distributed state via `get_tp_group()`.

## Code Quality Improvements
- Replace bare `except` clauses in `autopatch.py` with specific exception
  handling (ImportError vs other exceptions) and add debug/warning logging.
- Add comprehensive docstrings to all public API functions in both vLLM
  and SGLang integration modules.
- Add type hints to global module state variables.
- Add input validation for `limit-percent` CLI command (0-100 range).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

This commit fixes critical bugs in the tensor parallel IPC logic:

## Bug Fixes

1. **Broadcast functions skip rank 0**: The broadcast_* functions now
   correctly send messages only to worker ranks (1 to tp_size-1), not
   to rank 0 which has no listener thread. Previously, the code was
   iterating `range(tp_size)` instead of `range(1, tp_size)`.

2. **Local + remote check for kv_tensors_created**: The coordinator
   now checks locally first, then broadcasts to workers. Previously
   in TP mode, it only checked remote workers without checking locally.

3. **Map/unmap always runs locally first**: The page allocator now
   always calls map_to_kv_tensors/unmap_from_kv_tensors locally on
   rank 0, then broadcasts to workers if tp_size > 1. The original
   code had an either/or structure that was incorrect.

## Improvements

- Add debug logging when SGLang TP detection falls back to single-GPU
  mode to aid debugging of TP configuration issues.
- Reset _tp_size in shutdown_kvcached() for proper cleanup (vLLM).
- Add type hints to global module state variables (vLLM).
Copilot AI review requested due to automatic review settings January 26, 2026 07:52

Copilot AI left a comment


Pull request overview

This PR fixes tensor-parallel (TP) coordination bugs by ensuring rank 0 acts as the coordinator (local operations first) and by preventing IPC broadcasts from incorrectly targeting rank 0, plus it improves integration APIs and error handling/logging.

Changes:

  • Fix TP IPC broadcasts to target only worker ranks (1..tp_size-1) and update coordinator logic to always do local work first.
  • Update KVCacheManager initialization gating to check KV tensor readiness locally and across workers.
  • Improve SGLang/vLLM integration APIs (add is_worker, track _tp_size, reset state on shutdown) and tighten CLI/autopatch behavior.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

Summary per file:

  • kvcached/tp_ipc_util.py: Broadcast helpers now skip rank 0 and handle tp_size <= 1 early.
  • kvcached/page_allocator.py: Coordinator now maps/unmaps locally first, then broadcasts to workers.
  • kvcached/kv_cache_manager.py: KV tensor readiness check now combines local + worker readiness in TP mode.
  • kvcached/integration/vllm/interfaces.py: Adds docstrings/type hints, supports is_worker, tracks _tp_size, resets on shutdown.
  • kvcached/integration/sglang/patches.py: Detects TP rank/size from SGLang and passes is_worker/TP params into init.
  • kvcached/integration/sglang/interfaces.py: Adds is_worker + _tp_size plumbing and cleanup parity with vLLM.
  • kvcached/cli/kvctl.py: Adds range validation for limit-percent and improves error output.
  • kvcached/autopatch.py: Adds module docstring and replaces bare except with structured logging.


offsets = [pid * self.page_size for pid in page_ids]
if self.tp_size > 1:  # map pages across all tensor parallel workers.
    # Always map locally on rank 0 (coordinator)
    map_to_kv_tensors(offsets)

Copilot AI Jan 26, 2026


map_to_kv_tensors() returns a boolean success flag (see csrc/torch_bindings.cpp), but the result is ignored here. If mapping fails on rank 0 (e.g., KV tensors not created yet), this will silently continue and then potentially broadcast, leaving ranks in an inconsistent state. Consider checking the return value and raising a RuntimeError on failure before broadcasting.

Suggested change:

-map_to_kv_tensors(offsets)
+success = map_to_kv_tensors(offsets)
+if not success:
+    raise RuntimeError(
+        f"Failed to map KV tensors for {len(offsets)} page offset(s) on coordinator rank 0"
+    )

# Always unmap locally on rank 0 (coordinator)
if self.async_sched:
    torch.cuda.synchronize()
unmap_from_kv_tensors(offsets)

Copilot AI Jan 26, 2026


unmap_from_kv_tensors() returns a boolean success flag (see csrc/torch_bindings.cpp), but the result is ignored here. If unmapping fails locally, the code will still broadcast the unmap and proceed as if successful. Consider checking the return value and raising a RuntimeError on failure before broadcasting.

Suggested change:

-unmap_from_kv_tensors(offsets)
+success = unmap_from_kv_tensors(offsets)
+if not success:
+    raise RuntimeError(f"Failed to unmap KV tensors for offsets: {offsets}")

Comment on lines 107 to +114
 def _check_kv_tensors_created():
+    # Always check locally on rank 0 (coordinator)
+    local_created = kv_tensors_created()
     if self.tp_size > 1:
-        return broadcast_kv_tensors_created(self.tp_size)
-    else:
-        return kv_tensors_created()
+        # Also check all workers (ranks 1..tp_size-1)
+        workers_created = broadcast_kv_tensors_created(self.tp_size)
+        return local_created and workers_created
+    return local_created

Copilot AI Jan 26, 2026


This changes the TP readiness check semantics (local + workers) but there’s no test coverage ensuring the coordinator combines local_created and workers_created correctly (especially for tp_size>1). Consider adding a unit test in tests/test_kvcache_manager.py that mocks kv_tensors_created and broadcast_kv_tensors_created to assert the AND behavior and that the local check is performed.
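A minimal version of that test could look like this. The helper below re-implements the coordinator logic inline because the real module paths are assumptions; in the actual suite you would mock kv_tensors_created and broadcast_kv_tensors_created instead.

```python
# Sketch of the suggested unit test; names are assumed from this PR.

def check_kv_tensors_created(tp_size, local_check, workers_check):
    # Always check locally on rank 0 (coordinator) first.
    local_created = local_check()
    if tp_size > 1:
        # Combine with the readiness of worker ranks 1..tp_size-1.
        return local_created and workers_check(tp_size)
    return local_created

def test_tp_readiness_is_anded():
    worker_calls = []

    def workers(n):
        worker_calls.append(n)
        return True

    # Both local and workers ready -> True, workers queried once.
    assert check_kv_tensors_created(4, lambda: True, workers) is True
    assert worker_calls == [4]
    # Local not ready short-circuits: workers are not queried again.
    assert check_kv_tensors_created(4, lambda: False, workers) is False
    assert worker_calls == [4]
    # Single-GPU mode never broadcasts.
    assert check_kv_tensors_created(1, lambda: True, workers) is True
    assert worker_calls == [4]
```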

Comment on lines +10 to +13
import logging
from importlib import import_module

logger = logging.getLogger(__name__)

Copilot AI Jan 26, 2026


This module uses logging.getLogger(name) while the rest of the codebase consistently uses get_kvcached_logger() (e.g., kvcached/utils.py:get_kvcached_logger, kvcached/integration/*). Using the standard logger may result in missing handlers/formatting and unintended propagation into host application logging. Consider switching to get_kvcached_logger("kvcached") (or similar) for consistent configuration and to avoid log duplication.
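As a hedged illustration of that suggestion: the real get_kvcached_logger lives in kvcached/utils.py and its implementation is not shown here, so the stand-in below only demonstrates the intended behavior (a namespaced, pre-configured logger that does not propagate unformatted records into the host application).

```python
import logging

# Stand-in for kvcached's get_kvcached_logger (kvcached/utils.py);
# the real helper's behavior is assumed, not copied.
def get_kvcached_logger(name: str = "kvcached") -> logging.Logger:
    logger = logging.getLogger(name)
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("[kvcached] %(levelname)s %(message)s"))
        logger.addHandler(handler)
        # Avoid duplicate records propagating into the host app's
        # root logger (the issue the review comment describes).
        logger.propagate = False
    return logger

# In autopatch.py, instead of logging.getLogger(__name__):
logger = get_kvcached_logger("kvcached.autopatch")
```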
