About

Llumnix is a full-stack solution for distributed LLM inference serving. It has been a key part of the LLM serving infrastructure of Alibaba Cloud PAI-EAS, a cloud-native inference serving platform, supporting production-grade inference deployments.

Llumnix provides key functionality for modern distributed serving deployments (e.g., prefill-decode (PD) disaggregation, wide expert parallelism (EP)), including an LLM-specialized request gateway, intelligent and dynamic scheduling, and high-performance KV cache transfer and storage support. With a scheduler + rescheduler architecture and a white-box scheduling design, Llumnix achieves fully dynamic request scheduling and pushes the performance of inference engines to the limit.
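
The scheduler + rescheduler split can be illustrated with a toy model. The sketch below is not Llumnix code: `Instance`, `dispatch`, and `rebalance` are hypothetical names, load is approximated by the total token count of the requests on an instance, and migration is modeled as moving a request record between dictionaries (a real system would also have to transfer the KV cache).

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    """Hypothetical stand-in for one inference engine instance."""
    name: str
    requests: dict = field(default_factory=dict)  # request id -> token count

    @property
    def load(self) -> int:
        # Assumption: load is simply the sum of tokens held by the instance.
        return sum(self.requests.values())


def dispatch(instances, request_id, tokens):
    """Scheduler step: route a new request to the least-loaded instance."""
    target = min(instances, key=lambda inst: inst.load)
    target.requests[request_id] = tokens
    return target


def rebalance(instances, threshold):
    """Rescheduler step: migrate one request from the most-loaded instance
    to the least-loaded one when their load gap exceeds `threshold`.

    Returns (request_id, source_name, destination_name), or None if no
    migration would help.
    """
    src = max(instances, key=lambda inst: inst.load)
    dst = min(instances, key=lambda inst: inst.load)
    gap = src.load - dst.load
    if gap <= threshold or not src.requests:
        return None
    # Prefer the request whose size is closest to half the gap.
    request_id = min(src.requests, key=lambda r: abs(src.requests[r] - gap / 2))
    if src.requests[request_id] >= gap:
        return None  # moving it would not shrink the imbalance
    dst.requests[request_id] = src.requests.pop(request_id)
    return request_id, src.name, dst.name


instances = [Instance("a"), Instance("b")]
for request_id, tokens in [("r1", 100), ("r2", 100), ("r3", 400), ("r4", 50)]:
    dispatch(instances, request_id, tokens)  # initial routing

moved = rebalance(instances, threshold=100)  # continuous migration
print(moved, [(inst.name, inst.load) for inst in instances])
```

Initial routing alone leaves instance `a` with 500 tokens and `b` with 150, because request sizes are only known at arrival time; the rebalancing pass then moves `r1` to narrow the gap. A production rescheduler would run this continuously and weigh migration cost against the expected benefit.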

Note that with this new repository, we are re-architecting Llumnix toward a more modular and cloud-native design (Llumnix v1). The old Ray-based architecture (Llumnix v0) is a better choice for local deployments and for quick prototyping and experimentation with scheduling ideas.

[Documentation]

Key Features

- Scheduler + rescheduler architecture for fully dynamic request scheduling: initial routing + continuous migration
- Advanced scheduling policies: load balancing, KV-aware, SLO/predictor-based scheduling, adaptive PD disaggregation, etc.
- Dual-mode scheduling
  - Full mode (white-box) for max performance with engine participation
  - Lite mode (black-box) for engine-transparent deployments
- Real-time instance status tracking for optimal scheduling quality
- Modular, extensible policy framework for easily implementing and composing scheduling policies
- LLM-specialized request gateway
  - Tokenizers, diverse request routing / disaggregation protocols, batch inference
  - Traffic management: splitting, mirroring, throttling, etc.
- High-performance KV cache support (see llumnix-kv)
  - Efficient, flexible data plane for KV cache transfer supporting diverse cache layouts and transport protocols (blade-kvt)
  - Unified control plane for PD disaggregation, migration, KV storage (hybrid-connector)
- High availability
  - Fault tolerance for Llumnix components
  - Engine health monitoring and reactive (re-)scheduling upon engine failures
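
As one illustration of how composable scheduling policies could look, the sketch below treats each policy as a scoring function and combines several of them with weights. It is a sketch under stated assumptions, not the Llumnix policy framework: `InstanceStatus`, `load_policy`, `prefix_policy`, `compose`, and `pick_instance` are hypothetical names, and KV awareness is reduced to counting how many leading prompt tokens match a prefix cached on the instance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceStatus:
    """Hypothetical snapshot of one instance as seen by the scheduler."""
    name: str
    running_requests: int
    max_requests: int            # assumed to be greater than zero
    cached_prefixes: tuple = ()  # tuples of token ids; assumed cache summary


def load_policy(inst, prompt):
    """Load balancing: emptier instances score higher, in [0, 1]."""
    return 1.0 - inst.running_requests / inst.max_requests


def prefix_policy(inst, prompt):
    """KV-aware: fraction of the prompt covered by the longest cached prefix."""
    if not prompt:
        return 0.0
    best = 0
    for prefix in inst.cached_prefixes:
        matched = 0
        for cached_token, prompt_token in zip(prefix, prompt):
            if cached_token != prompt_token:
                break
            matched += 1
        best = max(best, matched)
    return best / len(prompt)


def compose(weighted_policies):
    """Build one policy from (weight, policy) pairs by weighted sum."""
    def score(inst, prompt):
        return sum(weight * policy(inst, prompt) for weight, policy in weighted_policies)
    return score


def pick_instance(instances, prompt, policy):
    """Route the prompt to the instance with the highest score."""
    return max(instances, key=lambda inst: policy(inst, prompt))


instances = [
    InstanceStatus("a", running_requests=2, max_requests=8, cached_prefixes=((1, 2, 3, 4),)),
    InstanceStatus("b", running_requests=1, max_requests=8),
]
prompt = [1, 2, 3, 4, 5, 6, 7, 8]

by_load = pick_instance(instances, prompt, load_policy)
kv_aware = pick_instance(instances, prompt, compose([(1.0, load_policy), (1.0, prefix_policy)]))
print(by_load.name, kv_aware.name)
```

With load balancing alone the request goes to the emptier instance `b`; once the prefix score is added, instance `a` wins because half of the prompt is already cached there. Keeping each policy a plain function makes it straightforward to add further terms (for example an SLO-based score) without touching the routing step.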

Architecture

Llumnix is more than a "router". It has a full-stack design to support advanced scheduling features.

Components:

LlumSched: scheduler for initial scheduling and rescheduler for continuous rescheduling
Llumlet: an engine-side process that bridges global components and the inference engine
Cluster meta store: tracks real-time instance status
Engine: the inference engine (vLLM/SGLang) with Llumnix utility code for scheduling enhancements (if using full mode)
Gateway: LLM-specialized capabilities, such as tokenizers, routing protocols, traffic management, batch inference
Hybrid Connector: unified KV cache control plane, using blade-kvt for KV transfer and external KV storage for offloading
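
To make the role of a status store concrete, the following minimal sketch shows how a scheduler might separate healthy instances from failed ones using the age of their last status report. Everything here is an assumption for illustration: `ClusterMetaStore` is a hypothetical in-memory class, status is assumed to arrive as periodic reports from an engine-side process, and a missed report window is treated as a failure. The actual Llumnix components and their interfaces may differ.

```python
import time


class ClusterMetaStore:
    """Hypothetical in-memory store of per-instance status reports."""

    def __init__(self, stale_after, clock=time.monotonic):
        self._stale_after = stale_after  # seconds without a report before an instance counts as failed
        self._clock = clock              # injectable clock, useful for testing
        self._reports = {}               # instance name -> (report time, status dict)

    def report(self, name, status):
        """Record the latest status pushed for an instance."""
        self._reports[name] = (self._clock(), dict(status))

    def healthy(self):
        """Statuses of instances that reported recently enough to schedule onto."""
        now = self._clock()
        return {
            name: status
            for name, (reported_at, status) in self._reports.items()
            if now - reported_at <= self._stale_after
        }

    def failed(self):
        """Names of instances whose last report is too old; candidates for rescheduling."""
        now = self._clock()
        return sorted(
            name
            for name, (reported_at, _) in self._reports.items()
            if now - reported_at > self._stale_after
        )


# Drive the store with a fake clock so the example is deterministic.
now = [0.0]
store = ClusterMetaStore(stale_after=5.0, clock=lambda: now[0])
store.report("prefill-0", {"running": 3})
store.report("decode-0", {"running": 7})
now[0] = 4.0
store.report("decode-0", {"running": 6})  # only decode-0 keeps reporting
now[0] = 8.0
print(store.healthy(), store.failed())
```

At time 8 the last report from `prefill-0` is 8 seconds old and exceeds the 5-second window, so only `decode-0` remains schedulable. A scheduler could consult `healthy()` before each routing decision and use `failed()` to trigger rescheduling of affected requests.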

Getting Started

View our documentation to learn more.

License

Llumnix is licensed under the Apache 2.0 License.
