
[Fix] Fix SGLang tensor parallel rank bug and improve code quality #226

Open
yurekami wants to merge 1 commit into ovg-project:main from yurekami:fix/sglang-tp-rank-and-improvements

Conversation

@yurekami

Summary

This PR fixes a critical bug in SGLang's tensor parallel support and improves overall code quality.

Bug Fix

  • Critical: Fix start_worker_listener_thread() call to use tp_rank instead of torch.cuda.current_device() in SGLang integration. The socket path is based on rank, not device ID, so using the wrong value causes IPC communication failures in multi-GPU tensor parallel setups.
  • Add is_worker parameter to SGLang's init_kvcached() to match vLLM's behavior: only workers should start the listener thread.
  • Add _tp_size global state to SGLang interfaces and pass it to KVCacheManager for proper tensor parallel coordination.
  • Update SGLang patches to detect tp_rank and tp_size from SGLang's distributed state via get_tp_group().

Code Quality Improvements

  • Replace bare except clauses in autopatch.py with specific exception handling (ImportError vs other exceptions) and add debug/warning logging.
  • Add comprehensive docstrings to all public API functions in both vLLM and SGLang integration modules.
  • Add type hints to global module state variables.
  • Add input validation for limit-percent CLI command (0-100 range).
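The exception-handling and validation changes follow a common pattern; a minimal sketch, assuming hypothetical function names (`try_autopatch`, `validate_limit_percent`) rather than kvcached's actual API:

```python
import importlib
import logging

logger = logging.getLogger(__name__)

def try_autopatch(module_name: str) -> bool:
    """Attempt to patch an optional backend, distinguishing exception
    types instead of swallowing everything with a bare `except`."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        # Backend simply isn't installed: expected, log at debug level.
        logger.debug("%s not available, skipping patch", module_name)
        return False
    except Exception:
        # Anything else is a real problem worth surfacing.
        logger.warning("failed to patch %s", module_name, exc_info=True)
        return False
    return True

def validate_limit_percent(value: float) -> float:
    """Reject out-of-range values for the limit-percent command."""
    if not 0 <= value <= 100:
        raise ValueError(f"limit-percent must be in [0, 100], got {value}")
    return value
```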

Test plan

  • Verify Python syntax is valid (checked locally with py_compile)
  • Test with SGLang single-GPU inference
  • Test with SGLang tensor parallel (multi-GPU) inference
  • Test with vLLM to ensure no regressions

🤖 Generated with Claude Code


Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>