Llumnix is a full-stack solution for distributed LLM inference serving. It has been a key part of the LLM serving infrastructure of Alibaba Cloud PAI-EAS, a cloud-native inference serving platform, supporting production-grade inference deployments.
Llumnix provides key functionalities for modern distributed serving deployments (e.g., PD disaggregation, wide EP), including an LLM-specialized request gateway, intelligent and dynamic scheduling, and high-performance KV cache transfer/storage support. With a scheduler + rescheduler architecture and a white-box scheduling design, Llumnix achieves fully dynamic request scheduling and pushes the performance of inference engines to the limit.
Note that with this new repository, we are re-architecting Llumnix toward a more modular and cloud-native design (Llumnix v1). The old Ray-based architecture (Llumnix v0) is a better choice for local deployments and for quick prototyping and experimentation with scheduling ideas.
- Scheduler + rescheduler architecture for fully dynamic request scheduling: initial routing + continuous migration
- Advanced scheduling policies: load balancing, KV-aware, SLO/predictor-based scheduling, adaptive PD disaggregation, etc.
- Dual-mode scheduling
  - Full mode (white-box) for max performance with engine participation
  - Lite mode (black-box) for engine-transparent deployments
- Real-time instance status tracking for optimal scheduling quality
- Modular, extensible policy framework for easily implementing and composing scheduling policies
- LLM-specialized request gateway
  - Tokenizers, diverse request routing / disaggregation protocols, batch inference
  - Traffic management: splitting, mirroring, throttling, etc.
- High-performance KV cache support (see llumnix-kv)
  - Efficient, flexible data plane for KV cache transfer supporting diverse cache layouts and transport protocols (blade-kvt)
  - Unified control plane for PD disaggregation, migration, KV storage (hybrid-connector)
- High availability
  - Fault tolerance for Llumnix components
  - Engine health monitoring and reactive (re-)scheduling upon engine failures
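
The scheduler + rescheduler split listed above (initial routing plus continuous migration) can be illustrated with a toy model. This is a minimal sketch and not Llumnix's implementation: `Instance`, `route`, `reschedule`, the integer request cost, and the `max_gap` threshold are hypothetical names chosen for illustration, and the least-loaded placement with greedy migration stands in for the real scheduling policies.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Instance:
    """Hypothetical engine instance; load is the sum of its request costs."""
    name: str
    requests: Dict[str, int] = field(default_factory=dict)  # request id -> cost

    @property
    def load(self) -> int:
        return sum(self.requests.values())


def route(instances: List[Instance], req_id: str, cost: int) -> Instance:
    """Initial routing: place a new request on the least-loaded instance."""
    target = min(instances, key=lambda inst: inst.load)
    target.requests[req_id] = cost
    return target


def reschedule(instances: List[Instance], max_gap: int) -> List[Tuple[str, str, str]]:
    """Continuous migration: move requests from the most-loaded to the
    least-loaded instance while the load gap exceeds max_gap and the move
    strictly narrows that gap. Assumes positive request costs."""
    moves = []
    while True:
        src = max(instances, key=lambda inst: inst.load)
        dst = min(instances, key=lambda inst: inst.load)
        gap = src.load - dst.load
        if gap <= max_gap or not src.requests:
            break
        # Candidate: the cheapest request on the overloaded instance.
        req_id, cost = min(src.requests.items(), key=lambda kv: kv[1])
        if cost >= gap:
            break  # migrating it would not reduce the imbalance
        del src.requests[req_id]
        dst.requests[req_id] = cost
        moves.append((req_id, src.name, dst.name))
    return moves
```

For example, if requests of cost 4 and 1 remain on one instance while another has drained to zero, `reschedule(instances, max_gap=1)` migrates the cost-1 request and then stops, because moving the cost-4 request would not narrow the gap.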
Llumnix is more than a "router". It has a full-stack design to support advanced scheduling features.
Components:
- LlumSched: scheduler for initial scheduling and rescheduler for continuous rescheduling
- Llumlet: an engine-side process that bridges global components and the inference engine
- Cluster meta store: tracks real-time instance status
- Engine: the inference engine (vLLM/SGLang) with Llumnix utility code for scheduling enhancements (if using full mode)
- Gateway: LLM-specialized capabilities, such as tokenizers, routing protocols, traffic management, batch inference
- Hybrid Connector: unified KV cache control plane, using blade-kvt for KV transfer and external KV storage for offloading
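
To make the interaction between these components more concrete, the sketch below models an engine-side agent reporting status to a meta store, which a scheduler then consults while skipping instances whose reports have gone stale. It is an illustration under stated assumptions, not the Llumnix API: `InstanceStatus`, `MetaStore`, `pick_instance`, the `ttl` staleness window, and the "most free KV blocks" rule are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class InstanceStatus:
    """Hypothetical status record an engine-side agent might report."""
    running_requests: int
    free_kv_blocks: int
    reported_at: float  # timestamp in seconds, supplied by the caller


class MetaStore:
    """Toy in-memory stand-in for a cluster meta store."""

    def __init__(self, ttl: float) -> None:
        self.ttl = ttl  # reports older than this many seconds count as stale
        self._status: Dict[str, InstanceStatus] = {}

    def report(self, instance: str, status: InstanceStatus) -> None:
        """Record the latest status for an instance, replacing any older one."""
        self._status[instance] = status

    def live_instances(self, now: float) -> Dict[str, InstanceStatus]:
        """Return only instances whose last report is within the TTL."""
        return {
            name: status
            for name, status in self._status.items()
            if now - status.reported_at <= self.ttl
        }


def pick_instance(store: MetaStore, now: float) -> Optional[str]:
    """Scheduler side: choose the live instance with the most free KV blocks,
    or None when no instance has reported recently."""
    live = store.live_instances(now)
    if not live:
        return None
    return max(live, key=lambda name: live[name].free_kv_blocks)
```

Passing timestamps explicitly keeps the sketch deterministic; a real deployment would also need persistence and concurrency control, which are omitted here.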
View our documentation to learn more.
Llumnix is licensed under the Apache 2.0 License.
