
[Fix] Fix SGLang tensor parallel rank bug and TP IPC broadcast logic#234

Open
rishabhsinha17 wants to merge 2 commits into ovg-project:main from rishabhsinha17:fix/sglang-tp-rank-and-improvements

Conversation

@rishabhsinha17

Summary

This PR fixes critical bugs in SGLang's tensor parallel support and the core TP IPC broadcast logic.

Bug Fixes

  1. SGLang TP rank bug: Fix start_worker_listener_thread() to use tp_rank instead of torch.cuda.current_device(). The socket path is derived from the rank, not the device ID, so passing the device ID caused IPC failures in multi-GPU setups.

  2. Broadcast functions sending to rank 0: The broadcast_* functions were iterating range(tp_size) instead of range(1, tp_size), so they also sent messages to rank 0, which has no listener thread.

  3. Local + remote check for kv_tensors_created: In TP mode, the coordinator now checks locally first, then broadcasts to workers. Previously it only checked remote workers.

  4. Map/unmap always runs locally first: The page allocator now always calls map_to_kv_tensors/unmap_from_kv_tensors locally on rank 0, then broadcasts to workers if tp_size > 1.
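The rank-0 coordinator pattern fixed in items 2-4 can be sketched as follows. All names here are illustrative stand-ins, not the actual kvcached API; map_local and send abstract the real local mapping call and IPC send.

```python
# Illustrative sketch of the fixed coordinator pattern; map_local and
# send are stand-ins for the real kvcached primitives.

def broadcast_map_to_kv_tensors(tp_size, offsets, send):
    """Send a map request to worker ranks only.

    Rank 0 is the coordinator and runs no listener thread, so the
    loop must be range(1, tp_size), not range(tp_size).
    """
    for rank in range(1, tp_size):
        send(rank, ("map_to_kv_tensors", offsets))

def map_pages(page_ids, page_size, tp_size, map_local, send):
    offsets = [pid * page_size for pid in page_ids]
    # Always map locally on rank 0 (coordinator) first...
    map_local(offsets)
    # ...then broadcast to the workers when running tensor parallel.
    if tp_size > 1:
        broadcast_map_to_kv_tensors(tp_size, offsets, send)
```

The same "local first, then broadcast to ranks 1..tp_size-1" shape applies to unmap and to the readiness check.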

Improvements

  • Add is_worker parameter to SGLang's init_kvcached() to match vLLM's behavior
  • Add _tp_size global state to SGLang interfaces for proper TP coordination
  • Update SGLang patches to detect tp_rank and tp_size from get_tp_group()
  • Add debug logging when SGLang TP detection falls back to single-GPU mode
  • Reset _tp_size in shutdown_kvcached() for proper cleanup
  • Add type hints to global module state variables
  • Replace bare except clauses with specific exception handling

Test Plan

  • Verify Python syntax is valid
  • Test with SGLang single-GPU inference
  • Test with SGLang tensor parallel (multi-GPU) inference
  • Test with vLLM to ensure no regressions

Supersedes #226

yurekami and others added 2 commits December 27, 2025 23:44
This PR fixes a critical bug in SGLang's tensor parallel support and
improves overall code quality:

## Bug Fix
- Fix `start_worker_listener_thread()` call to use `tp_rank` instead of
  `torch.cuda.current_device()` in SGLang integration. The socket path
  is based on rank, not device ID, so using the wrong value causes IPC
  communication failures in multi-GPU setups.
- Add `is_worker` parameter to SGLang's `init_kvcached()` to match vLLM's
  behavior - only workers should start the listener thread.
- Add `_tp_size` global state to SGLang interfaces and pass it to
  `KVCacheManager` for proper tensor parallel coordination.
- Update SGLang patches to detect `tp_rank` and `tp_size` from SGLang's
  distributed state via `get_tp_group()`.

## Code Quality Improvements
- Replace bare `except` clauses in `autopatch.py` with specific exception
  handling (ImportError vs other exceptions) and add debug/warning logging.
- Add comprehensive docstrings to all public API functions in both vLLM
  and SGLang integration modules.
- Add type hints to global module state variables.
- Add input validation for `limit-percent` CLI command (0-100 range).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

This commit fixes critical bugs in the tensor parallel IPC logic:

## Bug Fixes

1. **Broadcast functions skip rank 0**: The broadcast_* functions now
   correctly send messages only to worker ranks (1 to tp_size-1), not
   to rank 0 which has no listener thread. Previously, the code was
   iterating `range(tp_size)` instead of `range(1, tp_size)`.

2. **Local + remote check for kv_tensors_created**: The coordinator
   now checks locally first, then broadcasts to workers. Previously
   in TP mode, it only checked remote workers without checking locally.

3. **Map/unmap always runs locally first**: The page allocator now
   always calls map_to_kv_tensors/unmap_from_kv_tensors locally on
   rank 0, then broadcasts to workers if tp_size > 1. The original
   code had an either/or structure that was incorrect.

## Improvements

- Add debug logging when SGLang TP detection falls back to single-GPU
  mode to aid debugging of TP configuration issues.
- Reset _tp_size in shutdown_kvcached() for proper cleanup (vLLM).
- Add type hints to global module state variables (vLLM).
Copilot AI review requested due to automatic review settings January 26, 2026 07:52

Copilot AI left a comment


Pull request overview

This PR fixes tensor-parallel (TP) coordination bugs by ensuring rank 0 acts as the coordinator (local operations first) and by preventing IPC broadcasts from incorrectly targeting rank 0, plus it improves integration APIs and error handling/logging.

Changes:

  • Fix TP IPC broadcasts to target only worker ranks (1..tp_size-1) and update coordinator logic to always do local work first.
  • Update KVCacheManager initialization gating to check KV tensor readiness locally and across workers.
  • Improve SGLang/vLLM integration APIs (add is_worker, track _tp_size, reset state on shutdown) and tighten CLI/autopatch behavior.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

Summary per file:

  • kvcached/tp_ipc_util.py: Broadcast helpers now skip rank 0 and handle tp_size <= 1 early.
  • kvcached/page_allocator.py: Coordinator now maps/unmaps locally first, then broadcasts to workers.
  • kvcached/kv_cache_manager.py: KV tensor readiness check now combines local + worker readiness in TP mode.
  • kvcached/integration/vllm/interfaces.py: Adds docstrings/type hints, supports is_worker, tracks _tp_size, resets on shutdown.
  • kvcached/integration/sglang/patches.py: Detects TP rank/size from SGLang and passes is_worker/TP params into init.
  • kvcached/integration/sglang/interfaces.py: Adds is_worker + _tp_size plumbing and cleanup parity with vLLM.
  • kvcached/cli/kvctl.py: Adds range validation for limit-percent and improves error output.
  • kvcached/autopatch.py: Adds module docstring and replaces bare except with structured logging.


offsets = [pid * self.page_size for pid in page_ids]
if self.tp_size > 1:  # map pages across all tensor parallel workers.
    # Always map locally on rank 0 (coordinator)
    map_to_kv_tensors(offsets)

Copilot AI Jan 26, 2026


map_to_kv_tensors() returns a boolean success flag (see csrc/torch_bindings.cpp), but the result is ignored here. If mapping fails on rank 0 (e.g., KV tensors not created yet), this will silently continue and then potentially broadcast, leaving ranks in an inconsistent state. Consider checking the return value and raising a RuntimeError on failure before broadcasting.

Suggested change:

-map_to_kv_tensors(offsets)
+success = map_to_kv_tensors(offsets)
+if not success:
+    raise RuntimeError(
+        f"Failed to map KV tensors for {len(offsets)} page offset(s) on coordinator rank 0"
+    )

# Always unmap locally on rank 0 (coordinator)
if self.async_sched:
    torch.cuda.synchronize()
unmap_from_kv_tensors(offsets)

Copilot AI Jan 26, 2026


unmap_from_kv_tensors() returns a boolean success flag (see csrc/torch_bindings.cpp), but the result is ignored here. If unmapping fails locally, the code will still broadcast the unmap and proceed as if successful. Consider checking the return value and raising a RuntimeError on failure before broadcasting.

Suggested change:

-unmap_from_kv_tensors(offsets)
+success = unmap_from_kv_tensors(offsets)
+if not success:
+    raise RuntimeError(f"Failed to unmap KV tensors for offsets: {offsets}")

Comment on lines 107 to +114
 def _check_kv_tensors_created():
+    # Always check locally on rank 0 (coordinator)
+    local_created = kv_tensors_created()
     if self.tp_size > 1:
-        return broadcast_kv_tensors_created(self.tp_size)
-    else:
-        return kv_tensors_created()
+        # Also check all workers (ranks 1..tp_size-1)
+        workers_created = broadcast_kv_tensors_created(self.tp_size)
+        return local_created and workers_created
+    return local_created

Copilot AI Jan 26, 2026


This changes the TP readiness check semantics (local + workers) but there’s no test coverage ensuring the coordinator combines local_created and workers_created correctly (especially for tp_size>1). Consider adding a unit test in tests/test_kvcache_manager.py that mocks kv_tensors_created and broadcast_kv_tensors_created to assert the AND behavior and that the local check is performed.
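A minimal version of that test could look like this. The helper below re-implements the coordinator logic inline because the real module paths are assumptions; in the actual suite you would mock kv_tensors_created and broadcast_kv_tensors_created instead.

```python
# Sketch of the suggested unit test; names are assumed from this PR.

def check_kv_tensors_created(tp_size, local_check, workers_check):
    # Always check locally on rank 0 (coordinator) first.
    local_created = local_check()
    if tp_size > 1:
        # Combine with the readiness of worker ranks 1..tp_size-1.
        return local_created and workers_check(tp_size)
    return local_created

def test_tp_readiness_is_anded():
    worker_calls = []

    def workers(n):
        worker_calls.append(n)
        return True

    # Both local and workers ready -> True, workers queried once.
    assert check_kv_tensors_created(4, lambda: True, workers) is True
    assert worker_calls == [4]
    # Local not ready short-circuits: workers are not queried again.
    assert check_kv_tensors_created(4, lambda: False, workers) is False
    assert worker_calls == [4]
    # Single-GPU mode never broadcasts.
    assert check_kv_tensors_created(1, lambda: True, workers) is True
    assert worker_calls == [4]
```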

Comment on lines +10 to +13
import logging
from importlib import import_module

logger = logging.getLogger(__name__)

Copilot AI Jan 26, 2026


This module uses logging.getLogger(name) while the rest of the codebase consistently uses get_kvcached_logger() (e.g., kvcached/utils.py:get_kvcached_logger, kvcached/integration/*). Using the standard logger may result in missing handlers/formatting and unintended propagation into host application logging. Consider switching to get_kvcached_logger("kvcached") (or similar) for consistent configuration and to avoid log duplication.
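As a hedged illustration of that suggestion: the real get_kvcached_logger lives in kvcached/utils.py and its implementation is not shown here, so the stand-in below only demonstrates the intended behavior (a namespaced, pre-configured logger that does not propagate unformatted records into the host application).

```python
import logging

# Stand-in for kvcached's get_kvcached_logger (kvcached/utils.py);
# the real helper's behavior is assumed, not copied.
def get_kvcached_logger(name: str = "kvcached") -> logging.Logger:
    logger = logging.getLogger(name)
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("[kvcached] %(levelname)s %(message)s"))
        logger.addHandler(handler)
        # Avoid duplicate records propagating into the host app's
        # root logger (the issue the review comment describes).
        logger.propagate = False
    return logger

# In autopatch.py, instead of logging.getLogger(__name__):
logger = get_kvcached_logger("kvcached.autopatch")
```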
