micro_vllm_x: A lightweight vLLM-like LLM inference engine with a radix-tree-based KV cache, and more.
This project is inspired by nano-vllm and provides an LLM inference framework built from scratch.
The architecture follows an organization similar to vLLM v1, but unlike vLLM, the KV cache system uses SGLang's Radix Cache implementation.
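The core idea of that cache is to index previously computed KV entries by token prefix, so a new request can skip prefill for any prefix it shares with earlier requests. The snippet below is only a conceptual sketch of this idea, with hypothetical names (`PrefixCache`, `match_prefix`): a plain token-level trie rather than a compressed radix tree, and without the KV block management, eviction, and reference counting that SGLang's RadixCache performs.

```python
# Conceptual sketch only (hypothetical names): the longest-cached-prefix lookup
# at the heart of radix-tree prefix caching, reduced to a token-level trie.
from typing import Dict, List


class _Node:
    def __init__(self) -> None:
        self.children: Dict[int, "_Node"] = {}  # next token id -> child node


class PrefixCache:
    def __init__(self) -> None:
        self.root = _Node()

    def match_prefix(self, tokens: List[int]) -> int:
        """Return how many leading tokens already have cached KV entries."""
        node, matched = self.root, 0
        for tok in tokens:
            if tok not in node.children:
                break
            node = node.children[tok]
            matched += 1
        return matched

    def insert(self, tokens: List[int]) -> None:
        """Record a processed sequence so later requests can reuse its prefix."""
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, _Node())


cache = PrefixCache()
cache.insert([1, 2, 3, 4])                # KV for this prompt is now cached
print(cache.match_prefix([1, 2, 3, 9]))   # -> 3 tokens of prefill can be skipped
```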
- Lightweight but complete code implementation
- FCFS scheduling and continuous batching
- OpenAI-compatible API server
- Radix-tree-based prefix caching
- Flash Attention kernel support (FlashInfer implementation)
- Tensor parallelism
- Pipeline parallelism
- CUDA Graph support (decoding phase only)
- Incremental scheduling and stateful workers
- Torch Profiler support
- Asynchronous scheduling
torch>=2.8.0
numpy
triton>=3.0.0
transformers>=4.51.0,<=4.57.3
fastapi>=0.95.0
uvicorn
flashinfer-python>=0.2.7
psutil
pyzmq>=25.0.0
msgspec
cloudpickle
tqdm
git clone https://github.com/weilaiwlai/micro_vllm_x.git && cd micro_vllm_x
Start the API server
example:
python -m ullm.entrypoints.api_server --model Qwen3-0.6B --gpu-memory-utilization 0.9 --tp-size 2 --pp-size 2 --context-len 4096 --host 0.0.0.0 --port 8000
usage: api_server.py [-h] [--host HOST] [--port PORT] --model MODEL [--gpu-memory-utilization GPU_MEMORY_UTILIZATION] [--max-bs MAX_BS] [--tp-size TP_SIZE] [--pp-size PP_SIZE]
[--nccl-port NCCL_PORT] [--device-ids DEVICE_IDS] [--context-len CONTEXT_LEN] [--enforce-eager] [--log-level LOG_LEVEL] [--profile] [--profile-dir PROFILE_DIR]
LLM Distributed OpenAI-Compatible API Server
options:
-h, --help show this help message and exit
--host HOST Host name
--port PORT Port number
--model MODEL Model name
--gpu-memory-utilization GPU_MEMORY_UTILIZATION
GPU memory utilization
--max-bs MAX_BS Maximum batch size
--tp-size TP_SIZE Tensor parallel size
--pp-size PP_SIZE Pipeline parallel size
--nccl-port NCCL_PORT
NCCL port for distributed run
--device-ids DEVICE_IDS
Comma-separated list of GPU device IDs to use
--context-len CONTEXT_LEN
Max context length of the model
--enforce-eager Enforce eager execution, disable CUDA graph
--log-level LOG_LEVEL
Log level for the engine
--profile Enable profiling support
--profile-dir PROFILE_DIR
Directory to save profiling results
--disable-async-scheduling
Disable asynchronous scheduling
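Once the server is running, it can be queried like any OpenAI-compatible endpoint. The example below is a client-side sketch that assumes the standard `/v1/chat/completions` route and the usual OpenAI request fields (and uses the `requests` package, which is not part of the server's dependency list); check the server code for the routes it actually registers.

```python
# Hypothetical client call against the OpenAI-compatible server started above;
# adjust host, port, route, and model name to your deployment.
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "Qwen3-0.6B",
        "messages": [{"role": "user", "content": "Once upon a time"}],
        "max_tokens": 50,
        "temperature": 0.6,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```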
Offline Inference
```python
from ullm import LLM, SamplingParams, EngineConfig
import asyncio


async def main():
    config = EngineConfig(
        model="Qwen3-0.6B",
        gpu_memory_utilization=0.9,
        tp_size=2,
        pp_size=2,
        context_len=4096,
        enforce_eager=False,
    )
    llm = LLM(config)
    prompt = "Once upon a time"
    async for token in llm.generate(
        prompt,
        SamplingParams(
            max_new_tokens=50,
            temperature=0.6,
            top_p=0.95,
            top_k=20,
        ),
    ):
        print(token, end="", flush=True)


if __name__ == "__main__":
    asyncio.run(main())
```

Experiment Environment:
- GPU: A100 40GB
- Model: Qwen3-0.6B
- Number of Requests: 256
- Prompt Length: random 100 ~ 1024
- Generation Length: random 100 ~ 1024
- Script: bench.py
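bench.py itself lives in the repository; the sketch below only illustrates what a throughput benchmark matching the setup above might look like. It is an assumption-laden outline (concurrent submission via `asyncio.gather`, placeholder prompts instead of real tokenized lengths, hypothetical defaults in `EngineConfig`), not the actual script.

```python
# Hypothetical benchmark outline: 256 concurrent requests with random prompt
# and generation lengths, reporting aggregate output-token throughput.
import asyncio
import random
import time

from ullm import LLM, SamplingParams, EngineConfig


async def run_one(llm, prompt, gen_len):
    out = 0
    async for _ in llm.generate(prompt, SamplingParams(max_new_tokens=gen_len)):
        out += 1                      # count streamed output tokens
    return out


async def main():
    llm = LLM(EngineConfig(model="Qwen3-0.6B", gpu_memory_utilization=0.9))
    random.seed(0)
    reqs = [("hello " * random.randint(100, 1024), random.randint(100, 1024))
            for _ in range(256)]

    start = time.perf_counter()
    counts = await asyncio.gather(*(run_one(llm, p, g) for p, g in reqs))
    elapsed = time.perf_counter() - start
    total = sum(counts)
    print(f"{total} tokens in {elapsed:.2f}s -> {total / elapsed:.2f} tokens/s")


if __name__ == "__main__":
    asyncio.run(main())
```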
Results:
| Inference Engine | Output Tokens | Time (s) | Throughput (tokens/s) |
|---|---|---|---|
| vLLM v0.11.0 | 133966 | 18.24 | 7343.96 |
| ours | 133966 | 14.83 | 9032.37 |
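The throughput column is simply output tokens divided by wall-clock time; recomputing it from the table reproduces the reported figures up to rounding of the timings:

```python
# Throughput = output tokens / elapsed seconds, recomputed from the table above.
for name, tokens, seconds in [("vLLM v0.11.0", 133966, 18.24), ("ours", 133966, 14.83)]:
    print(f"{name}: {tokens / seconds:.2f} tokens/s")
# ~7344.6 and ~9033.4 tokens/s; the small gap vs. the table comes from the
# reported times being rounded to two decimals.
```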
Further development is still ongoing.