Releases: Chen-zexi/vllm-cli

v0.2.5

25 Aug 13:46

Added

  • Multi-Model Proxy Server (Experimental): Serve multiple LLMs through a single unified API endpoint
    • Single OpenAI-compatible endpoint for all models
    • Request routing based on model name
    • Save and reuse proxy configurations
  • Dynamic Model Management: Add or remove models at runtime without restarting the proxy
    • Live model registration and unregistration
    • Pre-registration with verification lifecycle
    • Graceful handling of model failures without affecting other models
    • Model state tracking (pending, running, sleeping, stopped)
  • Model Sleep/Wake for GPU Memory Management: Efficient GPU resource distribution
    • Sleep Level 1: CPU offload for faster wake-up
    • Sleep Level 2: Full memory discard for maximum savings
    • Real-time memory usage tracking and reporting
    • Models maintain their ports while sleeping
  • Test Coverage: Added comprehensive tests for multi-model proxy and model registry
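
The model lifecycle and sleep levels above can be sketched as a small state machine. This is an illustrative sketch only, assuming the states and sleep levels listed in these notes; the class and method names are hypothetical, not vllm-cli's actual API.

```python
# Hypothetical sketch of the model-state lifecycle described above
# (pending -> running -> sleeping -> stopped). Names are illustrative.
from enum import Enum


class ModelState(Enum):
    PENDING = "pending"    # registered, awaiting verification
    RUNNING = "running"    # serving requests
    SLEEPING = "sleeping"  # port kept, GPU memory released
    STOPPED = "stopped"    # removed from the proxy


class ManagedModel:
    def __init__(self, name: str, port: int):
        self.name = name
        self.port = port          # kept even while sleeping
        self.state = ModelState.PENDING
        self.sleep_level = 0      # 0 = awake, 1 = CPU offload, 2 = discard

    def start(self) -> None:
        self.state = ModelState.RUNNING

    def sleep(self, level: int) -> None:
        # Level 1 offloads weights to CPU (faster wake-up);
        # level 2 discards them entirely (maximum memory savings).
        assert level in (1, 2)
        self.state = ModelState.SLEEPING
        self.sleep_level = level

    def wake(self) -> None:
        self.state = ModelState.RUNNING
        self.sleep_level = 0


m = ManagedModel("llama-3-8b", port=8001)
m.start()
m.sleep(level=1)
print(m.state.value, m.port)  # the model keeps its port while sleeping
```

Because a sleeping model keeps its port, waking it requires no client-side reconfiguration, and a failure in one model's lifecycle leaves the others untouched.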

Changed

  • Improved error handling with detailed logs when PyTorch is not installed
  • Better server cleanup and process management

Fixed

  • UI navigation improvements and minor display fixes

v0.2.5rc2

24 Aug 22:10

Pre-release

Multi-Model Proxy Server (Experimental)

The Multi-Model Proxy is a new experimental feature that enables serving multiple LLMs through a single unified API endpoint. This feature is currently under active development and available for testing.

What It Does:

  • Single Endpoint - All your models accessible through one API
  • Live Management - Add or remove models without stopping the service
  • Dynamic GPU Management - Efficient GPU resource distribution through vLLM's sleep/wake functionality
  • Interactive Setup - User-friendly wizard guides you through configuration
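
The routing idea behind the single endpoint can be sketched as follows: the proxy reads the "model" field of an OpenAI-compatible request and forwards it to the backend registered under that name. The registry, model names, and ports below are made up for illustration; vllm-cli's internals may differ.

```python
# Minimal sketch of model-name-based request routing. The backend
# registry and model names are illustrative, not vllm-cli's actual API.
import json

BACKENDS = {
    "llama-3-8b": "http://localhost:8001",
    "qwen2-7b": "http://localhost:8002",
}


def route(request_body: str) -> str:
    """Return the backend URL for an OpenAI-style chat request."""
    model = json.loads(request_body)["model"]
    try:
        return BACKENDS[model]
    except KeyError:
        raise ValueError(f"unknown model: {model}")


body = json.dumps({
    "model": "qwen2-7b",
    "messages": [{"role": "user", "content": "Hello"}],
})
print(route(body))  # -> http://localhost:8002
```

From the client's perspective nothing changes: requests use the standard OpenAI request shape, and only the value of "model" decides which server answers.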

You can install the pre-release version with:

pip install --pre --upgrade vllm-cli

v0.2.5rc1

22 Aug 04:26

Pre-release

Added multi-model support through a proxy

v0.2.4

20 Aug 16:43

Added

  • Hardware-Optimized Profiles for GPT-OSS Models: New built-in profiles optimized for different GPU architectures
    • gpt_oss_ampere: Optimized for NVIDIA A100 GPUs
    • gpt_oss_hopper: Optimized for NVIDIA H100/H200 GPUs
    • gpt_oss_blackwell: Optimized for NVIDIA Blackwell (B100/B200) GPUs
    • Based on the official vLLM GPT-OSS recipes
  • Shortcuts System: Save and quickly launch model + profile combinations
    • Quick launch from CLI: vllm-cli serve --shortcut NAME
    • Manage shortcuts through interactive mode or CLI commands
    • Import/export shortcuts for sharing configurations
  • Ollama Model Support: Full integration with Ollama-downloaded models
    • Automatic discovery in user (~/.ollama) and system (/usr/share/ollama) directories
    • GGUF format detection and experimental serving support
  • Environment Variable Management: Two-tier system for complete control
    • Universal environment variables for all servers
    • Profile-specific environment variables (override universal)
    • Clear indication of environment sources when launching
  • GPU Selection: Select specific GPUs for model serving
    • CLI: --device 0,1 to use specific GPUs
    • Interactive UI for GPU selection in advanced settings
    • Automatic tensor_parallel_size adjustment
  • Enhanced System Information: vLLM built-in feature detection
    • Detailed attention backend availability (Flash Attention 2/3, xFormers)
    • Feature compatibility checking per backend
  • Server Cleanup Control: Configure server behavior on CLI exit
  • Extended vLLM Arguments: Added 16+ new arguments for the v1 engine
    • Performance, optimization, API, configuration, and monitoring options
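
The two-tier environment-variable system above can be sketched as a simple layered merge: universal variables apply to every server, and profile-specific variables override them. The variable names and merge function below are examples, not vllm-cli's actual implementation.

```python
# Sketch of the two-tier environment-variable model: universal
# variables apply to all servers; profile-specific variables win.
# Variable names are illustrative only.
import os


def build_server_env(universal: dict, profile: dict) -> dict:
    """Merge env vars for a server launch: profile keys override universal."""
    env = dict(os.environ)   # inherit the current environment
    env.update(universal)    # universal tier: applies to all servers
    env.update(profile)      # profile tier: overrides universal
    return env


universal = {"VLLM_LOGGING_LEVEL": "INFO", "CUDA_VISIBLE_DEVICES": "0,1"}
profile = {"CUDA_VISIBLE_DEVICES": "0"}  # this profile pins a single GPU

env = build_server_env(universal, profile)
print(env["CUDA_VISIBLE_DEVICES"])  # -> 0
```

The same override order explains the "clear indication of environment sources" feature: at launch time each variable can be attributed to the tier it came from.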

Changed

  • Enhanced Quick Serve menu shows last configuration and saved shortcuts
  • Model field excluded from profiles for model-agnostic templates
  • Model cache refresh properly respects TTL settings (>60s)
  • Environment variables available in Custom Configuration menu

Fixed

  • Fixed manual cache refresh functionality
  • Fixed profile creation inconsistency between menus
  • Fixed UI consistency issues with prompt formatting

v0.2.4rc2

19 Aug 23:08

Pre-release

Full Changelog: v0.2.4rc1...v0.2.4rc2

v0.2.4rc1

19 Aug 00:16

Pre-release

Added

  • Ollama Model Support: Full integration with Ollama-downloaded models through hf-model-tool
    • Automatic discovery of Ollama models in user (~/.ollama) and system (/usr/share/ollama) directories
    • GGUF format detection and experimental serving support

Changed

  • Model cache refresh properly respects TTL settings (>60s)
  • Improved path display in model management UI for better clarity

Fixed

  • Fixed duplicate Ollama models appearing in model list
  • Fixed manual cache refresh not working

[Hotfix] v0.2.3

18 Aug 02:30

Fixed

  • Critical: Fixed missing built-in profiles when installing from PyPI - JSON schema files are now properly included in the package distribution

v0.2.2

18 Aug 02:00

Added

  • Model Manifest Support: Introduced models_manifest.json for mapping custom models in a vLLM CLI-native way
  • Documentation: Added custom-model-serving.md, a guide to serving custom models

Fixed

  • Serving models from custom directories now works as expected
  • Fixed some UI issues

[Hotfix] v0.2.1

17 Aug 20:44

  • Critical: Fixed package installation issue - setuptools now correctly includes all sub-packages

v0.2.0

17 Aug 19:42

[0.2.0] - 2025-08-17

Added

  • LoRA Adapter Support: Serve models with LoRA adapters - select base model and multiple LoRA adapters for serving
  • Enhanced Model List Display: Comprehensive model listing showing HuggingFace models, LoRA adapters, and datasets with size information
  • Model Directory Management: Configure and manage custom model directories for automatic model discovery
  • Model Caching: Performance optimization through intelligent caching with TTL for model listings
  • Improved Model Discovery: Integration with hf-model-tool for comprehensive model detection with fallback mechanisms
  • HuggingFace Token Support: Authentication support for accessing gated models with automatic token validation
  • Profile Management Enhancements:
    • View/Edit profiles in unified interface with detailed configuration display
    • Direct editing of built-in profiles with user overrides
    • Reset customized built-in profiles to defaults
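
The TTL-based caching of model listings can be sketched as follows: a scan result is reused until it is older than the TTL, then refreshed. The class and function names are hypothetical, not vllm-cli's actual API.

```python
# Illustrative sketch of TTL caching for model listings: the expensive
# directory scan runs at most once per TTL window. Names are made up.
import time


class TTLCache:
    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._value = None
        self._stamp = None

    def get(self, refresh):
        """Return the cached value, calling refresh() only if it is stale."""
        now = time.monotonic()
        if self._stamp is None or now - self._stamp > self.ttl:
            self._value = refresh()
            self._stamp = now
        return self._value


scans = 0

def scan_models():
    """Stand-in for an expensive model-directory scan."""
    global scans
    scans += 1
    return ["model-a", "model-b"]


cache = TTLCache(ttl_seconds=60)
cache.get(scan_models)
cache.get(scan_models)   # within the TTL: served from cache, no rescan
print(scans)  # -> 1
```

A manual refresh (as fixed in v0.2.4) would simply bypass the staleness check and call the scan directly, resetting the timestamp.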

Changed

  • Refactored model management system with new models/ package structure
  • Enhanced error handling with comprehensive error recovery strategies
  • Improved configuration validation framework with type checking and schemas
  • Updated low_memory profile to use FP8 quantization instead of bitsandbytes

Fixed

  • Better handling of model metadata extraction
  • Improved error messages for better user experience