A flexible development environment for vLLM that provides:
- Single-machine tensor parallelism and distributed inference
- Python-based configuration management
- Dynamic model deployment system
- Comprehensive parameter customization
- RDMA-optimized networking
The environment uses a YAML-based registry for model configurations, supporting all vLLM parameters and features through a robust Python launcher.
- GPUs: 2x NVIDIA GeForce RTX 3090 Ti (24GB VRAM each)
- CUDA Version: 12.8
- Driver: 570.172.08
- Network: RDMA over Converged Ethernet (RoCE)
- Bandwidth: Up to 12 GB/s inter-node communication
- Requirements:
- RDMA-capable NICs (e.g., Mellanox ConnectX)
- GPUDirect RDMA support
- High-bandwidth interconnect
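If you want to confirm the host actually exposes RDMA-capable devices before diving into the detailed setup, a quick check of the standard Linux sysfs path can help. This is an illustrative sketch; device naming depends on your NIC and installed drivers.

```python
# Quick check for RDMA-capable devices via the standard Linux sysfs path.
# Illustrative only; device names depend on the NIC and installed drivers.
from pathlib import Path

ib_root = Path("/sys/class/infiniband")
devices = sorted(p.name for p in ib_root.iterdir()) if ib_root.exists() else []
print("RDMA devices:", devices or "none found (check rdma-core and NIC drivers)")
```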
For detailed distributed setup information, see multi_node_gpu_cluster_with_rdma/setup.md.
This project was initialized using create_project.sh at the root.
# Install all dependencies including dev tools
uv sync --all-extras
# Or install without dev dependencies
uv sync
Models are configured through docker/config/model_registry.yml. The configuration system is fully flexible and supports any vLLM parameter - each YAML key is automatically converted to a CLI argument (with a `--` prefix).
Common parameters include:
model: "path/to/model" # --model
dtype: "bfloat16" # --dtype
tensor-parallel-size: 2 # --tensor-parallel-size
gpu-memory-utilization: 0.35 # --gpu-memory-utilization
max-num-seqs: 8 # --max-num-seqs
max-model-len: 131072 # --max-model-len
trust-remote-code: true # --trust-remote-code (flag only if true)
enable-prefix-caching: true # --enable-prefix-caching (flag only if true)
description: "..." # (metadata, not passed to vLLM)Example registry entry:
gpt-oss-20b:
model: openai/gpt-oss-20b
dtype: bfloat16
tensor-parallel-size: 2
gpu-memory-utilization: 0.35
max-num-seqs: 8
max-model-len: 131072
swap-space: 4 # Additional vLLM parameter
max-num-batched-tokens: 8192 # Additional vLLM parameter
enable-prefix-caching: true # Optional feature flag
description: Lightweight
Any valid vLLM parameter can be added to the configuration. The launcher will automatically convert all parameters (except description) into appropriate command-line arguments.
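As an illustration of that conversion rule, the sketch below shows how a registry entry could be turned into CLI arguments. It is not the actual docker/config/launch.py implementation; the function name and structure are assumptions.

```python
# Illustrative sketch of the YAML-to-CLI conversion described above.
# Not the actual docker/config/launch.py; names and structure are assumptions.
import os
import yaml

def build_vllm_args(entry: dict) -> list[str]:
    """Turn a registry entry into vLLM command-line arguments."""
    args: list[str] = []
    for key, value in entry.items():
        if key == "description":          # metadata only, never passed to vLLM
            continue
        if isinstance(value, bool):        # boolean keys become bare flags when true
            if value:
                args.append(f"--{key}")
        else:
            args.extend([f"--{key}", str(value)])
    return args

if __name__ == "__main__":
    with open("docker/config/model_registry.yml") as f:
        registry = yaml.safe_load(f)
    entry = registry[os.environ["MODEL_NAME"]]   # e.g. MODEL_NAME=gpt-oss-20b
    print(" ".join(build_vllm_args(entry)))
```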
The Python-based launch system (docker/config/launch.py) provides:
- Robust YAML configuration parsing
- Validation of model settings
- Dry-run capability for testing
- User-friendly error messages
- Available models listing
Features:
# Normal launch
MODEL_NAME=gpt-oss-20b python launch.py
# Validate configuration
MODEL_NAME=gpt-oss-20b python launch.py --dry-run
# See error messages and available models
MODEL_NAME=invalid-model python launch.py
- Set the model in your environment:
# In .env file
MODEL_NAME=gpt-oss-20b # Must match an entry in model_registry.yml
- Launch the service:
cd docker/head # or docker/worker for distributed setup
docker compose up -d
The Python launcher will automatically load and validate the configuration before starting the model.
# Run commands with uv (recommended)
uv run python script.py
# Or activate the virtual environment
source .venv/bin/activate
.
├── configs/ # Configuration files
│ ├── example_model.yaml # Template configuration
│ ├── llama-70b-tp2.yaml # Tensor parallel config for Llama 70B
│ ├── llama-7b.yaml # Single GPU config for Llama 7B
│ ├── mcp_servers.yaml # MCP servers configuration
│ └── phi3-mini-with-terminal.yaml # Configuration for Phi-3 model
├── docker/ # Docker deployment configurations
│ ├── config/ # Configuration directory
│ │ ├── launch.py # Python-based model launcher
│ │ └── model_registry.yml # Model configurations
│ ├── head/ # Ray head node setup
│ │ ├── docker-compose.yml # Head node container config
│ │ ├── .env.example # Environment template for head
│ │ └── README.md # Head node documentation
│ ├── stand_alone/ # Single node deployment
│ │ └── docker-compose.yml # Standalone container config
│ └── worker/ # Ray worker node setup
│ ├── docker-compose.yml # Worker node container config
│ ├── .env.exmple # Environment template for worker
│ └── README.md # Worker node documentation
├── mixvllm_server/ # Core server implementation
│ ├── src/ # Source code
│ │ ├── cli/ # Command-line interface
│ │ ├── config/ # Server configuration
│ │ └── inference/ # Inference implementation
│ └── README.md # Server documentation
├── mixvllm-chat/ # Chat interface implementation
│ ├── app/ # Application code
│ │ ├── client/ # Chat client implementation
│ │ └── utils/ # Utility functions
│ └── terminal/ # Terminal server implementation
├── multi_node_gpu_cluster_with_rdma/ # RDMA cluster setup guides
│ └── setup.md # Detailed RDMA configuration
└── tests/ # Test suite
├── tensor_parallel.py # Tensor parallelism tests
└── test_mcp.py # MCP integration tests
│ ├── docker-compose.yml
│ ├── entrypoint.sh
│ └── README.md
├── mixvllm_server/
│ ├── pyproject.toml
│ ├── README.md
│ └── src/
│ ├── cli/
│ │ ├── serve_model.py
│ │ └── README.md
│ ├── config/
│ │ ├── gpt-oss-20b.yaml
│ │ └── phi3-mini.yaml
│ ├── inference/
│ │ ├── config.py
│ │ ├── server.py
│ │ ├── utils.py
│ │ ├── terminal_server.py
│ │ └── README.md
├── mixvllm-chat/
│ ├── Dockerfile.terminal
│ ├── pyproject.toml
│ ├── README.md
│ ├── app/
│ │ ├── chat_client.py
│ │ ├── client/
│ │ │ ├── chat_client.py
│ │ │ ├── chat_engine.py
│ │ │ ├── cli.py
│ │ │ ├── config.py
│ │ │ ├── connection_manager.py
│ │ │ ├── history_manager.py
│ │ │ ├── response_handler.py
│ │ │ ├── tool_manager.py
│ │ │ ├── ui_manager.py
│ │ │ └── utils/
│ │ │ ├── mcp_client.py
│ │ │ └── mcp_tools.py
│ │ ├── config/
│ │ │ └── mcp_servers.yaml
│ │ └── utils/
│ │ ├── mcp_client.py
│ │ └── mcp_tools.py
│ ├── terminal/
│ │ └── terminal_server_standalone.py
├── tests/
│ ├── __init__.py
│ ├── tensor_parallel.py
│ └── test_mcp.py
uv run python .claude/experiments/test_gpu.py
uv run python .claude/experiments/test_vllm.py
from vllm import LLM
llm = LLM(
model="meta-llama/Llama-2-70b-hf",
tensor_parallel_size=2, # Use both GPUs
gpu_memory_utilization=0.90,
trust_remote_code=True
)
outputs = llm.generate("Hello, my name is")
print(outputs[0].outputs[0].text)
See configs/example_model.yaml for a complete configuration template.
# Run tests
uv run pytest
# Type checking
uv run mypy mixvllm/
# Linting
uv run ruff check mixvllm/
# Auto-formatting
uv run black mixvllm/
With 2x RTX 3090 Ti (24GB each = 48GB total):
- Cannot run 70B models in FP16 (weights alone need ~140GB); use quantization instead
- Can run 70B models in 4-bit quantization comfortably
- Can run 34B models in FP16 easily
- Communication overhead between GPUs is minimal on PCIe 4.0
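As a rough sanity check of these numbers (weights only; KV cache, activations, and CUDA overhead come on top), weight memory is approximately parameter count times bytes per parameter:

```python
# Back-of-the-envelope weight memory: params * bytes-per-param.
# Ignores KV cache, activations, and CUDA overhead.
def weight_gb(params_billion: float, bits_per_param: int) -> float:
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

print(f"70B in FP16:  ~{weight_gb(70, 16):.0f} GB")  # ~140 GB, far beyond 48 GB
print(f"70B in 4-bit: ~{weight_gb(70, 4):.0f} GB")   # ~35 GB, fits across both GPUs
```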
Out of Memory Errors:
- Reduce `gpu_memory_utilization` (try 0.85 or 0.80)
- Use quantization (4-bit or 8-bit)
- Reduce `max_model_len`
Slow Inference:
- Check GPU utilization with `nvidia-smi`
- Verify both GPUs are being used
- Ensure the PCIe link is running at full speed
401 Unauthorized Errors:
- Set the `HF_TOKEN` environment variable with your HuggingFace token
- For gated models, request access on the HuggingFace model page
- Verify the token has read permissions: `huggingface-cli whoami`
- Some models require accepting terms/conditions on HuggingFace
Some models require authentication to download from HuggingFace. If you encounter 401 Unauthorized errors, you need to:
- Get a token: Visit https://huggingface.co/settings/tokens to create an access token
- Set the environment variable: `export HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxx`
- Or log in via the CLI: `huggingface-cli login`
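To confirm from Python that the token is picked up correctly, a small sketch using the huggingface_hub package:

```python
# Sanity-check HuggingFace authentication from Python using huggingface_hub.
import os
from huggingface_hub import login, whoami

login(token=os.environ["HF_TOKEN"])        # raises if the token is invalid
print("Authenticated as:", whoami()["name"])
```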
Some models (especially from OpenAI, Meta, etc.) are gated repositories that require:
- ✅ Valid HuggingFace account
- ✅ Explicit access approval on the model page
- ✅ Proper authentication token
Example with authentication:
HF_TOKEN=$HF_TOKEN uv run mixvllm-serve --config configs/gpt-oss-20b.yaml
Models that may require authentication:
- `openai/gpt-oss-20b` (gated)
- `meta-llama/Llama-2-*` (gated)
- `meta-llama/Llama-3-*` (gated)
Public models (no auth required):
- `microsoft/Phi-3-mini-4k-instruct`
- Most Microsoft and Google models
Serve vLLM models with the serve_model.py script, which provides an OpenAI-compatible API server.
# Serve Phi-3 Mini on single GPU (no auth required)
uv run mixvllm-serve --model microsoft/Phi-3-mini-4k-instruct --gpus 1
# Serve Llama 2 70B with tensor parallelism (requires HF_TOKEN)
HF_TOKEN=$HF_TOKEN uv run mixvllm-serve --model meta-llama/Llama-2-70b-hf --gpus 2 --trust-remote-code
# Use predefined configurations
uv run mixvllm-serve --config configs/phi3-mini.yaml # No auth required
uv run mixvllm-serve --config configs/llama-7b.yaml # May require HF_TOKEN
HF_TOKEN=$HF_TOKEN uv run mixvllm-serve --config configs/llama-70b-tp2.yaml # Requires HF_TOKEN
HF_TOKEN=$HF_TOKEN uv run mixvllm-serve --config configs/gpt-oss-20b.yaml # Requires HF_TOKEN
# Override config with CLI options
uv run mixvllm-serve --config configs/phi3-mini.yaml --port 8080
HF_TOKEN=$HF_TOKEN uv run mixvllm-serve \
--model meta-llama/Llama-2-70b-hf \
--gpus 2 \
--gpu-memory 0.85 \
--max-model-len 4096 \
--port 8000 \
--temperature 0.8 \
--max-tokens 1024
Once running, the server provides an OpenAI-compatible API:
# Health check
curl http://localhost:8000/health
# Chat completion
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "microsoft/Phi-3-mini-4k-instruct",
"messages": [{"role": "user", "content": "Hello!"}]
}'
The mixvllm-chat command provides a CLI chat interface for interactive conversations with your served models. It features rich terminal formatting and enhanced input handling similar to modern CLI applications.
# Install dependencies (if not already done)
uv sync
# Start chatting with default settings
uv run mixvllm-chat
# Connect to specific server and model
uv run mixvllm-chat --base-url http://localhost:8000 --model microsoft/Phi-3-mini-4k-instruct
# Enable streaming responses
uv run mixvllm-chat --stream --temperature 0.8
- Rich Terminal UI: Beautiful formatting with colors, panels, and markdown rendering
- Conversation Context: Maintains chat history for coherent conversations
- Command Support: `/help`, `/clear`, `/history`, `/quit`
- Enhanced Input: History-based auto-completion and navigation (with prompt_toolkit)
- Streaming Support: Real-time response streaming with live updates
- Model Auto-detection: Automatically detects available models from server
- Error Handling: Clear error messages with appropriate formatting
The chat client uses these optional libraries for enhanced UI:
- `rich`: Beautiful terminal formatting and colors
- `prompt_toolkit`: Enhanced input with history and completion
- `requests`: HTTP client for API calls
If these libraries are not available, the client falls back to basic text output.
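The fallback pattern, together with the underlying OpenAI-compatible request, looks roughly like the sketch below. This is illustrative only, not the actual client code; the server URL and model are the defaults used elsewhere in this README.

```python
# Illustrative sketch of the optional-dependency fallback and the underlying
# OpenAI-compatible request; not the actual chat client code.
import requests

try:
    from rich.console import Console
    console = Console()
    def emit(text: str) -> None:
        console.print(f"[bold cyan]Assistant:[/bold cyan] {text}")
except ImportError:                       # rich missing: fall back to plain text
    def emit(text: str) -> None:
        print("Assistant:", text)

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "microsoft/Phi-3-mini-4k-instruct",
        "messages": [{"role": "user", "content": "Hello!"}],
    },
    timeout=60,
)
emit(resp.json()["choices"][0]["message"]["content"])
```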
✓ Connected to vLLM server at http://localhost:8000
✓ Auto-selected model: microsoft/Phi-3-mini-4k-instruct
╭─ Welcome ─────────────────────────────────────────────────────────────────╮
│ │
│ 🤖 vLLM Chat Client │
│ │
│ Configuration: │
│ • Server: http://localhost:8000 │
│ • Model: microsoft/Phi-3-mini-4k-instruct │
│ │
│ Commands: /help, /clear, /history, /quit │
│ Type your message and press Enter to chat! │
│ │
╰────────────────────────────────────────────────────────────────────────────╯
You: Hello! How are you today?
╭─ 🤖 Assistant ─────────────────────────────────────────────────────────────╮
│ Hello! I'm doing well, thank you for asking. I'm here and ready to help │
│ you with any questions or tasks you might have. How can I assist you │
│ today? │
╰────────────────────────────────────────────────────────────────────────────╯
You: Tell me about machine learning
╭─ 🤖 Assistant ─────────────────────────────────────────────────────────────╮
│ Machine learning is a fascinating field that involves teaching computers │
│ to learn from data and make predictions or decisions without being │
│ explicitly programmed for each specific task. It's a subset of artificial │
│ intelligence that focuses on algorithms and statistical models that can │
│ improve their performance as they are exposed to more data. │
│ │
│ There are several main types of machine learning: │
│ │
│ 1. **Supervised Learning**: The algorithm learns from labeled training │
│ data to make predictions on new, unseen data. Examples include │
│ classification (like spam detection) and regression (like predicting │
│ house prices). │
│ │
│ 2. **Unsupervised Learning**: The algorithm finds patterns in data │
│ without labeled examples. This includes clustering (grouping similar │
│ data points) and dimensionality reduction. │
│ │
│ 3. **Reinforcement Learning**: An agent learns through trial and error by │
│ interacting with an environment, receiving rewards or penalties for │
│ actions. │
│ │
│ Machine learning has applications in many fields including computer │
│ vision, natural language processing, recommendation systems, autonomous │
│ vehicles, medical diagnosis, and financial trading. │
╰────────────────────────────────────────────────────────────────────────────╯
You: /history
╭─ 📝 Conversation History ──────────────────────────────────────────────────╮
│ ┏━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ │
│ ┃ Turn ┃ Role ┃ Content ┃ │
│ ┡━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ │
│ │ 1 │ User │ Hello! How are you today? │
│ │ 2 │ Assistant │ Hello! I'm doing well, thank you for asking. ... │
│ │ 3 │ User │ Tell me about machine learning │
│ │ 4 │ Assistant │ Machine learning is a fascinating field that... │
│ └──────┴──────────────┴─────────────────────────────────────────────────┘ │
╰────────────────────────────────────────────────────────────────────────────╯
You: /quit
👋 Goodbye!
The mixvllm-chat command provides an advanced chat client with MCP (Model Context Protocol) tool integration, enabling the LLM to call external tools during conversations.
- MCP Tool Integration: Weather queries and other MCP tools
- Tool Discovery Display: Shows available MCP tools on startup
- Dual Modes: Simple chat or agent mode with tool calling
- Rich Terminal UI: Enhanced formatting with panels and colors
- Conversation Context: Maintains chat history
- Streaming Support: Real-time response streaming
- Command System: `/help`, `/clear`, `/history`, `/mcp`, `/quit`
Install additional dependencies for MCP support:
uv sync
To run the gpt-oss-20b model using your configuration file, use the following command from the mixvllm_server directory:
HF_TOKEN=<your_huggingface_token> uv run mixvllm-serve --config src/config/gpt-oss-20b.yaml
Notes:
- Replace `<your_huggingface_token>` with your actual Hugging Face access token.
- This command will use all settings from `src/config/gpt-oss-20b.yaml`, including model name, tensor parallelism, GPU memory, and dtype.
- Make sure your environment has access to the required GPUs and the Hugging Face repository.
- If you encounter quantization or dtype errors, ensure your config file sets `dtype: bfloat16` as shown in the example config.
# Basic chat with vLLM server (auto-detects model)
uv run mixvllm-chat
# Connect to specific server (auto-detects model)
uv run mixvllm-chat --base-url http://localhost:8000
# Specify model explicitly (optional)
uv run mixvllm-chat --base-url http://localhost:8000 --model microsoft/Phi-3-mini-4k-instruct
# Enable MCP tools for weather queries (auto-detects model)
uv run mixvllm-chat --enable-mcp
# Full configuration with custom MCP config
uv run mixvllm-chat \
--enable-mcp \
--mcp-config configs/mcp_servers.yaml \
--base-url http://localhost:8000 \
--stream \
--temperature 0.8
When MCP mode is enabled, the following tools are available:
- Weather Queries: Get current weather, forecasts, and historical data
- Location Support: Supports city names and coordinates
- Units: Celsius or Fahrenheit temperature units
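Conceptually, each MCP tool is presented to the model as an OpenAI-style function definition. The sketch below shows the general shape for the hourly weather tool; the parameter names are illustrative and not necessarily the exact schema served by the MCP weather server.

```python
# Rough sketch of an MCP tool exposed as an OpenAI-style tool definition.
# Parameter names are illustrative, not the exact MCP server schema.
weather_tool = {
    "type": "function",
    "function": {
        "name": "weather_get_hourly_weather",
        "description": "Get hourly weather forecast for a location using Open-Meteo API",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City name or coordinates"},
                "units": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["location"],
        },
    },
}
# Sent to the server as part of the chat request: {"messages": [...], "tools": [weather_tool]}
```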
✓ Connected to vLLM server at http://localhost:8000
✓ Auto-selected model: microsoft/Phi-3-mini-4k-instruct
✓ MCP tools enabled (2 tools available)
╭─ Welcome ─────────────────────────────────────────────────────────────────╮
│ 🤖 Enhanced vLLM Chat Client (with MCP tools) │
│ │
│ Configuration: │
│ • Server: http://localhost:8000 │
│ • Model: microsoft/Phi-3-mini-4k-instruct │
│ • MCP Tools: Enabled │
│ │
│ Available MCP Tools (2): │
│ • weather_get_hourly_weather - Get hourly weather forecast for a location│
│ using Open-Meteo API (Weather information and forecasts) │
│ • weather_geocode_location - Get coordinates and timezone information for│
│ a location. (Weather information and forecasts) │
│ │
│ Commands: /help, /clear, /history, /mcp, /quit │
│ Type your message and press Enter to chat! │
╰────────────────────────────────────────────────────────────────────────────╯
You: What's the weather like in New York?
╭─ 🌤️ Assistant (with tools) ───────────────────────────────────────────────╮
│ The user is asking about the weather in New York. I should use the │
│ weather_get_weather tool to get current weather information. │
│ │
│ Tool Call: weather_get_weather(location="New York", units="celsius") │
│ │
│ Tool Result: [weather] Weather for New York: 22°C, Partly Cloudy, Wind 5 │
│ km/h │
│ │
│ Current weather in New York: 22°C with partly cloudy conditions and light │
│ winds at 5 km/h. │
╰────────────────────────────────────────────────────────────────────────────╯
You: /mcp
╭─ 🔧 MCP Integration Status ───────────────────────────────────────────────╮
│ ┏━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓ │
│ ┃ Server ┃ Status ┃ Tools ┃ │
│ ┡━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩ │
│ │ weather │ ✓ Connected (2 tools) │ get_hourly_ │
│ │ │ │ weather, │
│ │ │ │ geocode_loc │
│ │ │ │ ation │
│ └─────────┴───────────────────────────────────────────────┴─────────────┘ │
╰────────────────────────────────────────────────────────────────────────────╯
The system supports both standalone and distributed deployment modes, leveraging Ray for cluster management and RDMA for high-performance communication.
graph TB
subgraph "MixVLLM Architecture"
Client(["Client"])
subgraph "Head Node"
HS["HTTP Server"]
RC["Ray Controller"]
MS1["Model Shard 1"]
end
subgraph "Worker Node"
RW["Ray Worker"]
MS2["Model Shard 2"]
end
Client -->|"HTTP/REST"| HS
HS --> RC
RC <-->|"RDMA/Ray"| RW
RC --> MS1
RW --> MS2
MS1 <-->|"NCCL (12GB/s)"| MS2
end
For single-node deployment:
cd docker/stand_alone
docker-compose up -d
Head Node:
cd docker/head
cp .env.example .env
# Configure environment
docker-compose up -d
Worker Node:
cd docker/worker
cp .env.exmple .env
# Configure environment
docker-compose up -d
For detailed configuration, see the Docker Configuration Guide.
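Once the head and worker containers are up, it can be worth confirming that Ray actually sees both GPUs before launching a tensor-parallel model. A quick check with the standard Ray API, run from inside the head container and assuming the Ray cluster has already been started by the compose setup:

```python
# Verify the Ray cluster spans both nodes and sees both GPUs.
# Assumes the head/worker containers have already joined the Ray cluster.
import ray

ray.init(address="auto")                     # attach to the running cluster
resources = ray.cluster_resources()
print("Nodes in cluster:", len(ray.nodes()))
print("GPUs visible to Ray:", resources.get("GPU", 0))
assert resources.get("GPU", 0) >= 2, "Expected both RTX 3090 Ti GPUs to be registered"
```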
The web terminal provides browser-based access to CLI tools and is now designed to run as a separate process from the model server. This separation allows for more flexible deployment, improved scalability, and independent management of terminal and model services.
- Model Server: Serves the vLLM API (OpenAI-compatible) on a configurable port (default: 8000).
- Terminal Server: Runs independently, connects to any model server via HTTP, and provides a full-featured shell and chat interface in the browser (default port: 8888).
Start the model server:
uv run mixvllm-serve --config configs/gpt-oss-20b.yaml
# or use any supported config/model options
Start the terminal server (in a separate process):
uv run mixvllm-terminal-server --model-server-url http://localhost:8000
# or use --port to change the terminal port
- Browser-Based Terminal: xterm.js frontend with full shell access
- Auto-Start Chat: Optionally launches `mixvllm-chat` connected to your model server
- Flexible Connection: Terminal server can connect to any OpenAI-compatible API endpoint
- Separate Ports: Terminal and model server run on independent ports for easier scaling and security
- Docker Support: Docker Compose can launch both services as separate containers
cd docker
docker-compose up -d
# Starts both model-server and terminal-server containers independently
The web terminal provides full shell access with the same permissions as the server process. Only enable it in trusted environments. For production, consider network restrictions or authentication.