
micro_vllm_x: A lightweight vLLM-like LLM inference engine with radix-tree based KV cache, and more.


This project is inspired by nano-vllm and provides an LLM inference framework built from scratch.

The overall architecture follows an organization similar to vLLM v1, but unlike vLLM, the KV cache system uses SGLang's Radix Cache design.
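
For intuition, the sketch below shows the core idea behind radix-style prefix caching over token IDs: sequences that share a prefix share the KV entries of that prefix, so a new request only needs to compute attention state for tokens beyond the longest cached match. This is a simplified per-token trie (SGLang's actual radix cache keys edges on token segments and handles eviction and reference counting); all names here are illustrative, not this project's API.

from dataclasses import dataclass, field

@dataclass
class RadixNode:
    children: dict = field(default_factory=dict)  # token id -> RadixNode
    kv_block: object = None                       # handle to the cached KV for this token

class RadixCache:
    """Toy prefix cache: a trie over token IDs (a real radix tree compresses edges)."""

    def __init__(self):
        self.root = RadixNode()

    def match_prefix(self, token_ids):
        """Length of the longest cached prefix; those tokens need no recomputation."""
        node, matched = self.root, 0
        for tok in token_ids:
            if tok not in node.children:
                break
            node = node.children[tok]
            matched += 1
        return matched

    def insert(self, token_ids, kv_blocks):
        """Record a sequence so later requests can reuse the KV of its prefix."""
        node = self.root
        for tok, kv in zip(token_ids, kv_blocks):
            node = node.children.setdefault(tok, RadixNode())
            if node.kv_block is None:
                node.kv_block = kv

cache = RadixCache()
cache.insert([1, 2, 3, 4], ["kv1", "kv2", "kv3", "kv4"])
print(cache.match_prefix([1, 2, 3, 9]))  # -> 3: only token 9 needs fresh KV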

Features

  • Lightweight but complete implementation
  • FCFS scheduling and continuous batching (see the sketch after this list)
  • OpenAI-compatible API server
  • Radix-tree-based prefix caching
  • Flash Attention kernels (FlashInfer implementation)
  • Tensor parallelism
  • Pipeline parallelism
  • CUDA Graph support (decoding phase only)
  • Incremental scheduling and stateful workers
  • Torch Profiler support
  • Asynchronous scheduling
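
As a rough illustration of the scheduling item above: continuous batching re-forms the running batch at every decode step instead of waiting for a whole batch to finish, dropping completed requests and admitting queued ones in first-come-first-served order. The following is a minimal sketch with hypothetical names; it does not mirror the project's internal classes.

from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    prompt: str
    finished: bool = False  # set by the model runner once EOS or max tokens is reached

class FCFSScheduler:
    def __init__(self, max_bs: int):
        self.max_bs = max_bs
        self.waiting = deque()  # arrival order preserved -> FCFS
        self.running = []       # the current continuous batch

    def add(self, req: Request):
        self.waiting.append(req)

    def step(self):
        # Drop finished requests, then top the batch up from the FCFS queue.
        self.running = [r for r in self.running if not r.finished]
        while self.waiting and len(self.running) < self.max_bs:
            self.running.append(self.waiting.popleft())
        return self.running  # requests the worker runs one decode step for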

Requirements

torch>=2.8.0
numpy
triton>=3.0.0
transformers>=4.51.0,<=4.57.3
fastapi>=0.95.0
uvicorn
flashinfer-python>=0.2.7
psutil
pyzmq>=25.0.0
msgspec
cloudpickle
tqdm
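
One way to install these dependencies with pip; the version pins simply mirror the list above (the repository may also ship its own requirements file):

pip install "torch>=2.8.0" numpy "triton>=3.0.0" "transformers>=4.51.0,<=4.57.3" \
    "fastapi>=0.95.0" uvicorn "flashinfer-python>=0.2.7" psutil "pyzmq>=25.0.0" \
    msgspec cloudpickle tqdm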

Installation

git clone https://github.com/weilaiwlai/micro_vllm_x.git && cd micro_vllm_x

Quick Start

Start the API server

Example:
python -m ullm.entrypoints.api_server --model Qwen3-0.6B --gpu-memory-utilization 0.9 --tp-size 2 --pp-size 2 --context-len 4096 --host 0.0.0.0 --port 8000

usage: api_server.py [-h] [--host HOST] [--port PORT] --model MODEL [--gpu-memory-utilization GPU_MEMORY_UTILIZATION] [--max-bs MAX_BS] [--tp-size TP_SIZE] [--pp-size PP_SIZE]
                     [--nccl-port NCCL_PORT] [--device-ids DEVICE_IDS] [--context-len CONTEXT_LEN] [--enforce-eager] [--log-level LOG_LEVEL] [--profile] [--profile-dir PROFILE_DIR]

LLM Distributed OpenAI-Compatible API Server

options:
  -h, --help            show this help message and exit
  --host HOST           Host name
  --port PORT           Port number
  --model MODEL         Model name
  --gpu-memory-utilization GPU_MEMORY_UTILIZATION
                        GPU memory utilization
  --max-bs MAX_BS       Maximum batch size
  --tp-size TP_SIZE     Tensor parallel size
  --pp-size PP_SIZE     Pipeline parallel size
  --nccl-port NCCL_PORT
                        NCCL port for distributed run
  --device-ids DEVICE_IDS
                        Comma-separated list of GPU device IDs to use
  --context-len CONTEXT_LEN
                        Max context length of the model
  --enforce-eager       Enforce eager execution, disable CUDA graph
  --log-level LOG_LEVEL
                        Log level for the engine
  --profile             Enable profiling support
  --profile-dir PROFILE_DIR
                        Directory to save profiling results
  --disable-async-scheduling
                        Disable asynchronous scheduling
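
Once the server is running, it can be queried with any OpenAI-compatible client. A minimal curl example, assuming the server exposes the standard /v1/chat/completions route (the exact endpoint path is not spelled out in this README):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen3-0.6B", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 64, "temperature": 0.6}'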

Offline Inference

from ullm import LLM, SamplingParams, EngineConfig
import asyncio

async def main():
    # Engine configuration: 2-way tensor parallel x 2-way pipeline parallel (4 GPUs total)
    config = EngineConfig(
        model="Qwen3-0.6B",
        gpu_memory_utilization=0.9,
        tp_size=2,
        pp_size=2,
        context_len=4096,
        enforce_eager=False,  # keep CUDA graphs enabled for the decoding phase
    )
    llm = LLM(config)

    prompt = "Once upon a time"
    # Tokens are streamed back asynchronously as they are generated
    async for token in llm.generate(
        prompt,
        SamplingParams(
            max_new_tokens=50,
            temperature=0.6,
            top_p=0.95,
            top_k=20,
        )
    ):
        print(token, end="", flush=True)

if __name__ == "__main__":
    asyncio.run(main())

Benchmarks

Experiment Environment:

  • GPU: A100 40GB
  • Model: Qwen3-0.6B
  • Number of Requests: 256
  • Prompt Length: random, 100 to 1024 tokens
  • Generation Length: random, 100 to 1024 tokens
  • Script: bench.py

Results:

Inference Engine   Output Tokens   Time (s)   Throughput (tokens/s)
vLLM v0.11.0       133966          18.24      7343.96
ours               133966          14.83      9032.37

Further development is still ongoing.
