vLLM Development Environment

A flexible development environment for vLLM that provides:

  • Single-machine tensor parallelism and distributed inference
  • Python-based configuration management
  • Dynamic model deployment system
  • Comprehensive parameter customization
  • RDMA-optimized networking

The environment uses a YAML-based registry for model configurations, supporting all vLLM parameters and features through a robust Python launcher.

Deployment Options

1. Single Machine Setup

  • GPUs: 2x NVIDIA GeForce RTX 3090 Ti (24GB VRAM each)
  • CUDA Version: 12.8
  • Driver: 570.172.08

2. Distributed Cluster Setup

  • Network: RDMA over Converged Ethernet (RoCE)
  • Bandwidth: Up to 12 GB/s inter-node communication
  • Requirements:
    • RDMA-capable NICs (e.g., Mellanox ConnectX)
    • GPUDirect RDMA support
    • High-bandwidth interconnect

For detailed distributed setup information, see multi_node_gpu_cluster_with_rdma/setup.md.

Setup

This project was initialized with the create_project.sh script at the repository root.

Install Dependencies

# Install all dependencies including dev tools
uv sync --all-extras

# Or install without dev dependencies
uv sync

Model Configuration

Models are configured through docker/config/model_registry.yml. The configuration system is fully flexible and supports any vLLM parameter: each YAML key is automatically converted to a CLI argument (prefixed with --).

Common parameters include:

model: "path/to/model"              # --model
dtype: "bfloat16"                  # --dtype
tensor-parallel-size: 2            # --tensor-parallel-size
gpu-memory-utilization: 0.35       # --gpu-memory-utilization
max-num-seqs: 8                    # --max-num-seqs
max-model-len: 131072             # --max-model-len
trust-remote-code: true           # --trust-remote-code (flag only if true)
enable-prefix-caching: true       # --enable-prefix-caching (flag only if true)
description: "..."                # (metadata, not passed to vLLM)

Example registry entry:

gpt-oss-20b:
  model: openai/gpt-oss-20b
  dtype: bfloat16
  tensor-parallel-size: 2
  gpu-memory-utilization: 0.35
  max-num-seqs: 8
  max-model-len: 131072
  swap-space: 4                    # Additional vLLM parameter
  max-num-batched-tokens: 8192     # Additional vLLM parameter
  enable-prefix-caching: true      # Optional feature flag
  description: Lightweight

Any valid vLLM parameter can be added to the configuration. The launcher will automatically convert all parameters (except description) into appropriate command-line arguments.
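
For illustration, the sketch below shows how a registry entry could be mapped to CLI arguments following the convention described above. It is not the actual launcher code (that lives in docker/config/launch.py); the build_vllm_args helper is hypothetical.

# Illustrative sketch only -- the real implementation is docker/config/launch.py.
# Shows the documented convention: every YAML key except "description" becomes
# a --key argument, and boolean true values become bare flags.
import yaml

def build_vllm_args(entry: dict) -> list[str]:
    args = []
    for key, value in entry.items():
        if key == "description":         # metadata, not passed to vLLM
            continue
        if isinstance(value, bool):
            if value:                     # flag only if true
                args.append(f"--{key}")
        else:
            args.extend([f"--{key}", str(value)])
    return args

with open("docker/config/model_registry.yml") as f:
    registry = yaml.safe_load(f)

print(build_vllm_args(registry["gpt-oss-20b"]))
# e.g. ['--model', 'openai/gpt-oss-20b', '--dtype', 'bfloat16', ...]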

Model Launch System

The Python-based launch system (docker/config/launch.py) provides:

  • Robust YAML configuration parsing
  • Validation of model settings
  • Dry-run capability for testing
  • User-friendly error messages
  • Available models listing

Features:

# Normal launch
MODEL_NAME=gpt-oss-20b python launch.py

# Validate configuration
MODEL_NAME=gpt-oss-20b python launch.py --dry-run

# See error messages and available models
MODEL_NAME=invalid-model python launch.py

Deployment

  1. Set the model in your environment:
# In .env file
MODEL_NAME=gpt-oss-20b  # Must match an entry in model_registry.yml
  2. Launch the service:
cd docker/head  # or docker/worker for distributed setup
docker compose up -d

The Python launcher will automatically load and validate the configuration before starting the model.

Activate Environment

# Run commands with uv (recommended)
uv run python script.py

# Or activate the virtual environment
source .venv/bin/activate

Project Structure

.
├── configs/                              # Configuration files
│   ├── example_model.yaml                # Template configuration
│   ├── llama-70b-tp2.yaml                # Tensor parallel config for Llama 70B
│   ├── llama-7b.yaml                     # Single GPU config for Llama 7B
│   ├── mcp_servers.yaml                  # MCP servers configuration
│   └── phi3-mini-with-terminal.yaml      # Configuration for Phi-3 model
├── docker/                               # Docker deployment configurations
│   ├── config/                           # Configuration directory
│   │   ├── launch.py                     # Python-based model launcher
│   │   └── model_registry.yml            # Model configurations
│   ├── head/                             # Ray head node setup
│   │   ├── docker-compose.yml            # Head node container config
│   │   ├── .env.example                  # Environment template for head
│   │   └── README.md                     # Head node documentation
│   ├── stand_alone/                      # Single node deployment
│   │   └── docker-compose.yml            # Standalone container config
│   └── worker/                           # Ray worker node setup
│       ├── docker-compose.yml            # Worker node container config
│       ├── .env.exmple                   # Environment template for worker
│       └── README.md                     # Worker node documentation
├── mixvllm_server/                       # Core server implementation
│   ├── pyproject.toml
│   ├── README.md                         # Server documentation
│   └── src/                              # Source code
│       ├── cli/                          # Command-line interface
│       │   ├── serve_model.py
│       │   └── README.md
│       ├── config/                       # Server model configurations
│       │   ├── gpt-oss-20b.yaml
│       │   └── phi3-mini.yaml
│       └── inference/                    # Inference implementation
│           ├── config.py
│           ├── server.py
│           ├── terminal_server.py
│           ├── utils.py
│           └── README.md
├── mixvllm-chat/                         # Chat interface implementation
│   ├── Dockerfile.terminal
│   ├── pyproject.toml
│   ├── README.md
│   ├── app/                              # Application code
│   │   ├── chat_client.py
│   │   ├── client/                       # Chat client implementation
│   │   │   ├── chat_client.py
│   │   │   ├── chat_engine.py
│   │   │   ├── cli.py
│   │   │   ├── config.py
│   │   │   ├── connection_manager.py
│   │   │   ├── history_manager.py
│   │   │   ├── response_handler.py
│   │   │   ├── tool_manager.py
│   │   │   ├── ui_manager.py
│   │   │   └── utils/
│   │   │       ├── mcp_client.py
│   │   │       └── mcp_tools.py
│   │   ├── config/
│   │   │   └── mcp_servers.yaml
│   │   └── utils/                        # Utility functions
│   │       ├── mcp_client.py
│   │       └── mcp_tools.py
│   └── terminal/                         # Terminal server implementation
│       └── terminal_server_standalone.py
├── multi_node_gpu_cluster_with_rdma/     # RDMA cluster setup guides
│   └── setup.md                          # Detailed RDMA configuration
└── tests/                                # Test suite
    ├── __init__.py
    ├── tensor_parallel.py                # Tensor parallelism tests
    └── test_mcp.py                       # MCP integration tests

Quick Start

1. Test GPU Detection

uv run python .claude/experiments/test_gpu.py

2. Test vLLM Installation

uv run python .claude/experiments/test_vllm.py

3. Run a Model with Tensor Parallelism

from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    tensor_parallel_size=2,  # Use both GPUs
    gpu_memory_utilization=0.90,
    trust_remote_code=True
)

outputs = llm.generate("Hello, my name is")
print(outputs[0].outputs[0].text)
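
The snippet above uses vLLM's default decoding settings; for explicit control you can pass vLLM's SamplingParams, as in this brief sketch:

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    tensor_parallel_size=2,          # Use both GPUs
    gpu_memory_utilization=0.90,
    trust_remote_code=True,
)

# Explicit decoding settings instead of vLLM's defaults
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)
outputs = llm.generate(["Hello, my name is"], params)
print(outputs[0].outputs[0].text)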

Configuration

See configs/example_model.yaml for a complete configuration template.

Development

# Run tests
uv run pytest

# Type checking
uv run mypy mixvllm/

# Linting
uv run ruff check mixvllm/

# Auto-formatting
uv run black mixvllm/

Tensor Parallelism Notes

With 2x RTX 3090 Ti (24GB each = 48GB total):

  • 70B models do not fit in FP16 (~140 GB of weights alone); use quantization (see the arithmetic sketch after this list)
  • 70B models run comfortably with 4-bit quantization (~35 GB of weights)
  • 34B models in FP16 (~68 GB of weights) also exceed the 48 GB total; 8-bit quantization brings them within reach
  • Communication overhead between GPUs is minimal on PCIe 4.0
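
The rough weight-size arithmetic behind these notes can be checked quickly. The sketch below counts only model weights (KV cache and activations add more), so treat the results as lower bounds.

# Rough weight-memory estimates: parameters * bytes per parameter.
# KV cache and activations add to this, so these are lower bounds.
def weight_gb(params_billion: float, bytes_per_param: float) -> float:
    return params_billion * 1e9 * bytes_per_param / 1e9

total_vram_gb = 2 * 24  # 2x RTX 3090 Ti

for label, params_b, bytes_pp in [
    ("70B FP16", 70, 2.0),
    ("70B 4-bit", 70, 0.5),
    ("34B FP16", 34, 2.0),
]:
    gb = weight_gb(params_b, bytes_pp)
    fits = "fits" if gb < total_vram_gb else "does not fit"
    print(f"{label}: ~{gb:.0f} GB of weights -> {fits} in {total_vram_gb} GB")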

Troubleshooting

Out of Memory Errors:

  • Reduce gpu_memory_utilization (try 0.85 or 0.80)
  • Use quantization (4-bit or 8-bit)
  • Reduce max_model_len

Slow Inference:

  • Check GPU utilization with nvidia-smi
  • Verify both GPUs are being used
  • Ensure PCIe link is running at full speed

401 Unauthorized Errors:

  • Set HF_TOKEN environment variable with your HuggingFace token
  • For gated models, request access on the HuggingFace model page
  • Verify token has read permissions: huggingface-cli whoami
  • Some models require accepting terms/conditions on HuggingFace

Authentication

Some models hosted on HuggingFace require authentication. If you encounter 401 Unauthorized errors, set up a token as described below (a quick verification sketch follows the setup steps).

HuggingFace Token Setup

  1. Get a token: Visit https://huggingface.co/settings/tokens to create an access token
  2. Set environment variable:
    export HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxx
  3. Or login via CLI:
    huggingface-cli login
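
To verify that the token is picked up from your environment, you can query the huggingface_hub library directly (usually present as a vLLM/transformers dependency); a minimal sketch:

# Quick sanity check that your HuggingFace token is set and valid.
import os
from huggingface_hub import whoami

token = os.environ.get("HF_TOKEN")
if not token:
    raise SystemExit("HF_TOKEN is not set -- export it or run `huggingface-cli login`")

info = whoami(token=token)   # raises if the token is invalid
print(f"Authenticated as: {info['name']}")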

Gated Models

Some models (especially from OpenAI, Meta, etc.) are gated repositories that require:

  • ✅ Valid HuggingFace account
  • ✅ Explicit access approval on the model page
  • ✅ Proper authentication token

Example with authentication:

HF_TOKEN=$HF_TOKEN uv run mixvllm-serve --config configs/gpt-oss-20b.yaml

Models that may require authentication:

  • openai/gpt-oss-20b (gated)
  • meta-llama/Llama-2-* (gated)
  • meta-llama/Llama-3-* (gated)

Public models (no auth required):

  • microsoft/Phi-3-mini-4k-instruct
  • Most Microsoft and Google models

Model Serving

Serve vLLM models with the serve_model.py script, which provides an OpenAI-compatible API server.

Basic Usage

# Serve Phi-3 Mini on single GPU (no auth required)
uv run mixvllm-serve --model microsoft/Phi-3-mini-4k-instruct --gpus 1

# Serve Llama 2 70B with tensor parallelism (requires HF_TOKEN)
HF_TOKEN=$HF_TOKEN uv run mixvllm-serve --model meta-llama/Llama-2-70b-hf --gpus 2 --trust-remote-code

Using Configuration Files

# Use predefined configurations
uv run mixvllm-serve --config configs/phi3-mini.yaml          # No auth required
uv run mixvllm-serve --config configs/llama-7b.yaml           # May require HF_TOKEN
HF_TOKEN=$HF_TOKEN uv run mixvllm-serve --config configs/llama-70b-tp2.yaml  # Requires HF_TOKEN
HF_TOKEN=$HF_TOKEN uv run mixvllm-serve --config configs/gpt-oss-20b.yaml    # Requires HF_TOKEN

# Override config with CLI options
uv run mixvllm-serve --config configs/phi3-mini.yaml --port 8080

Advanced Options

HF_TOKEN=$HF_TOKEN uv run mixvllm-serve \
  --model meta-llama/Llama-2-70b-hf \
  --gpus 2 \
  --gpu-memory 0.85 \
  --max-model-len 4096 \
  --port 8000 \
  --temperature 0.8 \
  --max-tokens 1024

API Usage

Once running, the server provides an OpenAI-compatible API:

# Health check
curl http://localhost:8000/health

# Chat completion
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "microsoft/Phi-3-mini-4k-instruct",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
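
Because the API is OpenAI-compatible, the official openai Python client can also be pointed at it. A minimal sketch, reusing the model and port from the examples above (the api_key is a placeholder, since the server does not check it unless configured to):

# Minimal OpenAI-client usage against the local vLLM server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# List the models the server is currently serving
for model in client.models.list().data:
    print(model.id)

response = client.chat.completions.create(
    model="microsoft/Phi-3-mini-4k-instruct",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)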

Chat Client

The mixvllm-chat command provides a CLI chat interface for interactive conversations with your served models. It features rich terminal formatting and enhanced input handling similar to modern CLI applications.

# Install dependencies (if not already done)
uv sync

# Start chatting with default settings
uv run mixvllm-chat

# Connect to specific server and model
uv run mixvllm-chat --base-url http://localhost:8000 --model microsoft/Phi-3-mini-4k-instruct

# Enable streaming responses
uv run mixvllm-chat --stream --temperature 0.8

Chat Client Features

  • Rich Terminal UI: Beautiful formatting with colors, panels, and markdown rendering
  • Conversation Context: Maintains chat history for coherent conversations
  • Command Support: /help, /clear, /history, /quit
  • Enhanced Input: History-based auto-completion and navigation (with prompt_toolkit)
  • Streaming Support: Real-time response streaming with live updates
  • Model Auto-detection: Automatically detects available models from server
  • Error Handling: Clear error messages with appropriate formatting

Dependencies

The chat client uses these optional libraries for enhanced UI:

  • rich: Beautiful terminal formatting and colors
  • prompt_toolkit: Enhanced input with history and completion
  • requests: HTTP client for API calls

If these libraries are not available, the client falls back to basic text output.
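
The fallback described above is commonly implemented with guarded imports; the sketch below illustrates the pattern only and is not the client's actual code (which lives under mixvllm-chat/app/client/).

# Simplified illustration of the optional-dependency pattern described above.
try:
    from rich.console import Console
    console = Console()

    def show(message: str) -> None:
        console.print(f"[bold green]{message}[/bold green]")
except ImportError:
    def show(message: str) -> None:   # plain-text fallback when rich is absent
        print(message)

show("Connected to vLLM server")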

Example Chat Session

✓ Connected to vLLM server at http://localhost:8000
✓ Auto-selected model: microsoft/Phi-3-mini-4k-instruct

╭─ Welcome ─────────────────────────────────────────────────────────────────╮
│                                                                            │
│ 🤖 vLLM Chat Client                                                        │
│                                                                            │
│ Configuration:                                                             │
│ • Server: http://localhost:8000                                            │
│ • Model: microsoft/Phi-3-mini-4k-instruct                                  │
│                                                                            │
│ Commands: /help, /clear, /history, /quit                                   │
│ Type your message and press Enter to chat!                                 │
│                                                                            │
╰────────────────────────────────────────────────────────────────────────────╯

You: Hello! How are you today?
╭─ 🤖 Assistant ─────────────────────────────────────────────────────────────╮
│ Hello! I'm doing well, thank you for asking. I'm here and ready to help   │
│ you with any questions or tasks you might have. How can I assist you      │
│ today?                                                                     │
╰────────────────────────────────────────────────────────────────────────────╯

You: Tell me about machine learning
╭─ 🤖 Assistant ─────────────────────────────────────────────────────────────╮
│ Machine learning is a fascinating field that involves teaching computers  │
│ to learn from data and make predictions or decisions without being        │
│ explicitly programmed for each specific task. It's a subset of artificial │
│ intelligence that focuses on algorithms and statistical models that can   │
│ improve their performance as they are exposed to more data.               │
│                                                                            │
│ There are several main types of machine learning:                          │
│                                                                            │
│ 1. **Supervised Learning**: The algorithm learns from labeled training    │
│    data to make predictions on new, unseen data. Examples include          │
│    classification (like spam detection) and regression (like predicting    │
│    house prices).                                                          │
│                                                                            │
│ 2. **Unsupervised Learning**: The algorithm finds patterns in data        │
│    without labeled examples. This includes clustering (grouping similar    │
│    data points) and dimensionality reduction.                              │
│                                                                            │
│ 3. **Reinforcement Learning**: An agent learns through trial and error by │
│    interacting with an environment, receiving rewards or penalties for     │
│    actions.                                                                │
│                                                                            │
│ Machine learning has applications in many fields including computer        │
│ vision, natural language processing, recommendation systems, autonomous    │
│ vehicles, medical diagnosis, and financial trading.                        │
╰────────────────────────────────────────────────────────────────────────────╯

You: /history
╭─ 📝 Conversation History ──────────────────────────────────────────────────╮
│ ┏━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ │
│ ┃ Turn ┃ Role         ┃ Content                                         ┃ │
│ ┡━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ │
│ │ 1    │ User         │ Hello! How are you today?                       │
│ │ 2    │ Assistant    │ Hello! I'm doing well, thank you for asking. ... │
│ │ 3    │ User         │ Tell me about machine learning                  │
│ │ 4    │ Assistant    │ Machine learning is a fascinating field that... │
│ └──────┴──────────────┴─────────────────────────────────────────────────┘ │
╰────────────────────────────────────────────────────────────────────────────╯

You: /quit
👋 Goodbye!

Enhanced Chat Client with MCP Tools

The mixvllm-chat command provides an advanced chat client with MCP (Model Context Protocol) tool integration, enabling the LLM to call external tools during conversations.

Features

  • MCP Tool Integration: Weather queries and other MCP tools
  • Tool Discovery Display: Shows available MCP tools on startup
  • Dual Modes: Simple chat or agent mode with tool calling
  • Rich Terminal UI: Enhanced formatting with panels and colors
  • Conversation Context: Maintains chat history
  • Streaming Support: Real-time response streaming
  • Command System: /help, /clear, /history, /mcp, /quit

Installation

Install additional dependencies for MCP support:

uv sync

Usage

Running the gpt-oss-20b Model

To run the gpt-oss-20b model using your configuration file, use the following command from the mixvllm_server directory:

HF_TOKEN=<your_huggingface_token> uv run mixvllm-serve --config src/config/gpt-oss-20b.yaml

Notes:

  • Replace <your_huggingface_token> with your actual Hugging Face access token.
  • This command will use all settings from src/config/gpt-oss-20b.yaml, including model name, tensor parallelism, GPU memory, and dtype.
  • Make sure your environment has access to the required GPUs and the Hugging Face repository.
  • If you encounter quantization or dtype errors, ensure your config file sets dtype: bfloat16 as shown in the example config.
Basic Chat

# Basic chat with vLLM server (auto-detects model)
uv run mixvllm-chat

# Connect to specific server (auto-detects model)
uv run mixvllm-chat --base-url http://localhost:8000

# Specify model explicitly (optional)
uv run mixvllm-chat --base-url http://localhost:8000 --model microsoft/Phi-3-mini-4k-instruct

MCP Agent Mode

# Enable MCP tools for weather queries (auto-detects model)
uv run mixvllm-chat --enable-mcp

# Full configuration with custom MCP config
uv run mixvllm-chat \
  --enable-mcp \
  --mcp-config configs/mcp_servers.yaml \
  --base-url http://localhost:8000 \
  --stream \
  --temperature 0.8

MCP Tools Available

When MCP mode is enabled, the following tools are available:

  • Weather Queries: Get current weather, forecasts, and historical data
  • Location Support: Supports city names and coordinates
  • Units: Celsius or Fahrenheit temperature units

Example MCP Conversation

✓ Connected to vLLM server at http://localhost:8000
✓ Auto-selected model: microsoft/Phi-3-mini-4k-instruct
✓ MCP tools enabled (2 tools available)

╭─ Welcome ─────────────────────────────────────────────────────────────────╮
│ 🤖 Enhanced vLLM Chat Client (with MCP tools)                             │
│                                                                           │
│ Configuration:                                                            │
│ • Server: http://localhost:8000                                           │
│ • Model: microsoft/Phi-3-mini-4k-instruct                                 │
│ • MCP Tools: Enabled                                                      │
│                                                                           │
│ Available MCP Tools (2):                                                  │
│ • weather_get_hourly_weather - Get hourly weather forecast for a location│
│   using Open-Meteo API (Weather information and forecasts)               │
│ • weather_geocode_location - Get coordinates and timezone information for│
│   a location. (Weather information and forecasts)                         │
│                                                                           │
│ Commands: /help, /clear, /history, /mcp, /quit                            │
│ Type your message and press Enter to chat!                                │
╰────────────────────────────────────────────────────────────────────────────╯

You: What's the weather like in New York?
╭─ 🌤️ Assistant (with tools) ───────────────────────────────────────────────╮
│ The user is asking about the weather in New York. I should use the        │
│ weather_get_weather tool to get current weather information.              │
│                                                                           │
│ Tool Call: weather_get_weather(location="New York", units="celsius")      │
│                                                                           │
│ Tool Result: [weather] Weather for New York: 22°C, Partly Cloudy, Wind 5  │
│ km/h                                                                     │
│                                                                           │
│ Current weather in New York: 22°C with partly cloudy conditions and light │
│ winds at 5 km/h.                                                          │
╰────────────────────────────────────────────────────────────────────────────╯

You: /mcp
╭─ 🔧 MCP Integration Status ───────────────────────────────────────────────╮
│ ┏━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓ │
│ ┃ Server  ┃ Status                                        ┃ Tools       ┃ │
│ ┡━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩ │
│ │ weather │ ✓ Connected (2 tools)                         │ get_hourly_ │
│ │         │                                               │ weather,    │
│ │         │                                               │ geocode_loc │
│ │         │                                               │ ation       │
│ └─────────┴───────────────────────────────────────────────┴─────────────┘ │
╰────────────────────────────────────────────────────────────────────────────╯

Distributed Architecture

Overview

The system supports both standalone and distributed deployment modes, leveraging Ray for cluster management and RDMA for high-performance communication.

graph TB
    subgraph "MixVLLM Architecture"
        Client(["Client"])
        
        subgraph "Head Node"
            HS["HTTP Server"]
            RC["Ray Controller"]
            MS1["Model Shard 1"]
        end
        
        subgraph "Worker Node"
            RW["Ray Worker"]
            MS2["Model Shard 2"]
        end
        
        Client -->|"HTTP/REST"| HS
        HS --> RC
        RC <-->|"RDMA/Ray"| RW
        RC --> MS1
        RW --> MS2
        
        MS1 <-->|"NCCL (12GB/s)"| MS2
    end

Deployment Options

1. Standalone Mode

For single-node deployment:

cd docker/stand_alone
docker-compose up -d

2. Distributed Mode

Head Node:

cd docker/head
cp .env.example .env
# Configure environment
docker-compose up -d

Worker Node:

cd docker/worker
cp .env.exmple .env
# Configure environment
docker-compose up -d

For detailed configuration, see Docker Configuration Guide.

Web Terminal

The web terminal provides browser-based access to CLI tools and is now designed to run as a separate process from the model server. This separation allows for more flexible deployment, improved scalability, and independent management of terminal and model services.

Architecture

  • Model Server: Serves the vLLM API (OpenAI-compatible) on a configurable port (default: 8000).
  • Terminal Server: Runs independently, connects to any model server via HTTP, and provides a full-featured shell and chat interface in the browser (default port: 8888).

Usage

Start the model server:

uv run mixvllm-serve --config configs/gpt-oss-20b.yaml
# or use any supported config/model options

Start the terminal server (in a separate process):

uv run mixvllm-terminal-server --model-server-url http://localhost:8000
# or use --port to change the terminal port

Features

  • Browser-Based Terminal: xterm.js frontend with full shell access
  • Auto-Start Chat: Optionally launches mixvllm-chat connected to your model server
  • Flexible Connection: Terminal server can connect to any OpenAI-compatible API endpoint
  • Separate Ports: Terminal and model server run on independent ports for easier scaling and security
  • Docker Support: Docker Compose can launch both services as separate containers

Docker Deployment

cd docker
docker-compose up -d
# Starts both model-server and terminal-server containers independently

Security Note

The web terminal provides full shell access with the same permissions as the server process. Only enable in trusted environments. For production, consider network restrictions or authentication.

About

Configurable vLLM inference platform with tensor parallelism support for NVIDIA multi-GPU deployments. Built for production LLM serving with both programmatic and declarative configuration.
