A flexible development environment for vLLM that provides:
- Single-machine tensor parallelism and distributed inference
- Python-based configuration management
- Dynamic model deployment system
- Comprehensive parameter customization
- RDMA-optimized networking
The environment uses a YAML-based registry for model configurations, supporting all vLLM parameters and features through a robust Python launcher.
- GPUs: 2x NVIDIA GeForce RTX 3090 Ti (24GB VRAM each)
- CUDA Version: 12.8
- Driver: 570.172.08
- Network: RDMA over Converged Ethernet (RoCE)
- Bandwidth: Up to 12 GB/s inter-node communication
- Requirements:
- RDMA-capable NICs (e.g., Mellanox ConnectX)
- GPUDirect RDMA support
- High-bandwidth interconnect
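If you want to confirm the host actually exposes RDMA-capable devices before diving into the detailed setup, a quick check of the standard Linux sysfs path can help. This is an illustrative sketch; device naming depends on your NIC and installed drivers.

```python
# Quick check for RDMA-capable devices via the standard Linux sysfs path.
# Illustrative only; device names depend on the NIC and installed drivers.
from pathlib import Path

ib_root = Path("/sys/class/infiniband")
devices = sorted(p.name for p in ib_root.iterdir()) if ib_root.exists() else []
print("RDMA devices:", devices or "none found (check rdma-core and NIC drivers)")
```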
For detailed distributed setup information, see multi_node_gpu_cluster_with_rdma/setup.md.
This project was initialized using create_project.sh at the root.
# Install all dependencies including dev tools
uv sync --all-extras
# Or install without dev dependencies
uv sync
Models are configured through docker/config/model_registry.yml. The configuration system is fully flexible and supports any vLLM parameter - each YAML key is automatically converted to a CLI argument (with a `--` prefix).
Common parameters include:
model: "path/to/model" # --model
dtype: "bfloat16" # --dtype
tensor-parallel-size: 2 # --tensor-parallel-size
gpu-memory-utilization: 0.35 # --gpu-memory-utilization
max-num-seqs: 8 # --max-num-seqs
max-model-len: 131072 # --max-model-len
trust-remote-code: true # --trust-remote-code (flag only if true)
enable-prefix-caching: true # --enable-prefix-caching (flag only if true)
description: "..." # (metadata, not passed to vLLM)Example registry entry:
gpt-oss-20b:
model: openai/gpt-oss-20b
dtype: bfloat16
tensor-parallel-size: 2
gpu-memory-utilization: 0.35
max-num-seqs: 8
max-model-len: 131072
swap-space: 4 # Additional vLLM parameter
max-num-batched-tokens: 8192 # Additional vLLM parameter
enable-prefix-caching: true # Optional feature flag
description: Lightweight
Any valid vLLM parameter can be added to the configuration. The launcher will automatically convert all parameters (except description) into appropriate command-line arguments.
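As an illustration of that conversion rule, the sketch below shows how a registry entry could be turned into CLI arguments. It is not the actual docker/config/launch.py implementation; the function name and structure are assumptions.

```python
# Illustrative sketch of the YAML-to-CLI conversion described above.
# Not the actual docker/config/launch.py; names and structure are assumptions.
import os
import yaml

def build_vllm_args(entry: dict) -> list[str]:
    """Turn a registry entry into vLLM command-line arguments."""
    args: list[str] = []
    for key, value in entry.items():
        if key == "description":          # metadata only, never passed to vLLM
            continue
        if isinstance(value, bool):        # boolean keys become bare flags when true
            if value:
                args.append(f"--{key}")
        else:
            args.extend([f"--{key}", str(value)])
    return args

if __name__ == "__main__":
    with open("docker/config/model_registry.yml") as f:
        registry = yaml.safe_load(f)
    entry = registry[os.environ["MODEL_NAME"]]   # e.g. MODEL_NAME=gpt-oss-20b
    print(" ".join(build_vllm_args(entry)))
```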
The Python-based launch system (docker/config/launch.py) provides:
- Robust YAML configuration parsing
- Validation of model settings
- Dry-run capability for testing
- User-friendly error messages
- Available models listing
Features:
# Normal launch
MODEL_NAME=gpt-oss-20b python launch.py
# Validate configuration
MODEL_NAME=gpt-oss-20b python launch.py --dry-run
# See error messages and available models
MODEL_NAME=invalid-model python launch.py
- Set the model in your environment:
# In .env file
MODEL_NAME=gpt-oss-20b # Must match an entry in model_registry.yml
- Launch the service:
cd docker/head # or docker/worker for distributed setup
docker compose up -d
The Python launcher will automatically load and validate the configuration before starting the model.
# Run commands with uv (recommended)
uv run python script.py
# Or activate the virtual environment
source .venv/bin/activate
.
├── configs/ # Configuration files
│ ├── example_model.yaml # Template configuration
│ ├── llama-70b-tp2.yaml # Tensor parallel config for Llama 70B
│ ├── llama-7b.yaml # Single GPU config for Llama 7B
│ ├── mcp_servers.yaml # MCP servers configuration
│ └── phi3-mini-with-terminal.yaml # Configuration for Phi-3 model
├── docker/ # Docker deployment configurations
│ ├── config/ # Configuration directory
│ │ ├── launch.py # Python-based model launcher
│ │ └── model_registry.yml # Model configurations
│ ├── head/ # Ray head node setup
│ │ ├── docker-compose.yml # Head node container config
│ │ ├── .env.example # Environment template for head
│ │ └── README.md # Head node documentation
│ ├── stand_alone/ # Single node deployment
│ │ └── docker-compose.yml # Standalone container config
│ └── worker/ # Ray worker node setup
│ ├── docker-compose.yml # Worker node container config
│ ├── .env.exmple # Environment template for worker
│ └── README.md # Worker node documentation
├── mixvllm_server/ # Core server implementation
│ ├── src/ # Source code
│ │ ├── cli/ # Command-line interface
│ │ ├── config/ # Server configuration
│ │ └── inference/ # Inference implementation
│ └── README.md # Server documentation
├── mixvllm-chat/ # Chat interface implementation
│ ├── app/ # Application code
│ │ ├── client/ # Chat client implementation
│ │ └── utils/ # Utility functions
│ └── terminal/ # Terminal server implementation
├── multi_node_gpu_cluster_with_rdma/ # RDMA cluster setup guides
│ └── setup.md # Detailed RDMA configuration
└── tests/ # Test suite
├── tensor_parallel.py # Tensor parallelism tests
└── test_mcp.py # MCP integration tests
│ ├── docker-compose.yml
│ ├── entrypoint.sh
│ └── README.md
├── mixvllm_server/
│ ├── pyproject.toml
│ ├── README.md
│ └── src/
│ ├── cli/
│ │ ├── serve_model.py
│ │ └── README.md
│ ├── config/
│ │ ├── gpt-oss-20b.yaml
│ │ └── phi3-mini.yaml
│ ├── inference/
│ │ ├── config.py
│ │ ├── server.py
│ │ ├── utils.py
│ │ ├── terminal_server.py
│ │ └── README.md
├── mixvllm-chat/
│ ├── Dockerfile.terminal
│ ├── pyproject.toml
│ ├── README.md
│ ├── app/
│ │ ├── chat_client.py
│ │ ├── client/
│ │ │ ├── chat_client.py
│ │ │ ├── chat_engine.py
│ │ │ ├── cli.py
│ │ │ ├── config.py
│ │ │ ├── connection_manager.py
│ │ │ ├── history_manager.py
│ │ │ ├── response_handler.py
│ │ │ ├── tool_manager.py
│ │ │ ├── ui_manager.py
│ │ │ └── utils/
│ │ │ ├── mcp_client.py
│ │ │ └── mcp_tools.py
│ │ ├── config/
│ │ │ └── mcp_servers.yaml
│ │ └── utils/
│ │ ├── mcp_client.py
│ │ └── mcp_tools.py
│ ├── terminal/
│ │ └── terminal_server_standalone.py
├── tests/
│ ├── __init__.py
│ ├── tensor_parallel.py
│ └── test_mcp.py
uv run python .claude/experiments/test_gpu.py
uv run python .claude/experiments/test_vllm.py
from vllm import LLM
llm = LLM(
model="meta-llama/Llama-2-70b-hf",
tensor_parallel_size=2, # Use both GPUs
gpu_memory_utilization=0.90,
trust_remote_code=True
)
outputs = llm.generate("Hello, my name is")
print(outputs[0].outputs[0].text)
See configs/example_model.yaml for a complete configuration template.
# Run tests
uv run pytest
# Type checking
uv run mypy mixvllm/
# Linting
uv run ruff check mixvllm/
# Auto-formatting
uv run black mixvllm/
With 2x RTX 3090 Ti (24GB each = 48GB total):
- Cannot run 70B models in FP16 (weights alone need ~140GB); use quantization instead
- Can run 70B models in 4-bit quantization comfortably
- Can run 34B models in FP16 easily
- Communication overhead between GPUs is minimal on PCIe 4.0
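As a rough sanity check of these numbers (weights only; KV cache, activations, and CUDA overhead come on top), weight memory is approximately parameter count times bytes per parameter:

```python
# Back-of-the-envelope weight memory: params * bytes-per-param.
# Ignores KV cache, activations, and CUDA overhead.
def weight_gb(params_billion: float, bits_per_param: int) -> float:
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

print(f"70B in FP16:  ~{weight_gb(70, 16):.0f} GB")  # ~140 GB, far beyond 48 GB
print(f"70B in 4-bit: ~{weight_gb(70, 4):.0f} GB")   # ~35 GB, fits across both GPUs
```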
Out of Memory Errors:
- Reduce `gpu_memory_utilization` (try 0.85 or 0.80)
- Use quantization (4-bit or 8-bit)
- Reduce `max_model_len`
Slow Inference:
- Check GPU utilization with `nvidia-smi`
- Verify both GPUs are being used
- Ensure the PCIe link is running at full speed
401 Unauthorized Errors:
- Set the `HF_TOKEN` environment variable with your HuggingFace token
- For gated models, request access on the HuggingFace model page
- Verify the token has read permissions: `huggingface-cli whoami`
- Some models require accepting terms/conditions on HuggingFace
Some models require authentication to download from HuggingFace. If you encounter 401 Unauthorized errors, you need to:
- Get a token: Visit https://huggingface.co/settings/tokens to create an access token
- Set the environment variable: `export HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxx`
- Or log in via the CLI: `huggingface-cli login`
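To confirm from Python that the token is picked up correctly, a small sketch using the huggingface_hub package:

```python
# Sanity-check HuggingFace authentication from Python using huggingface_hub.
import os
from huggingface_hub import login, whoami

login(token=os.environ["HF_TOKEN"])        # raises if the token is invalid
print("Authenticated as:", whoami()["name"])
```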
Some models (especially from OpenAI, Meta, etc.) are gated repositories that require:
- ✅ Valid HuggingFace account
- ✅ Explicit access approval on the model page
- ✅ Proper authentication token
Example with authentication:
HF_TOKEN=$HF_TOKEN uv run mixvllm-serve --config configs/gpt-oss-20b.yaml
Models that may require authentication:
- `openai/gpt-oss-20b` (gated)
- `meta-llama/Llama-2-*` (gated)
- `meta-llama/Llama-3-*` (gated)
Public models (no auth required):
- `microsoft/Phi-3-mini-4k-instruct`
- Most Microsoft and Google models
Serve vLLM models with the serve_model.py script, which provides an OpenAI-compatible API server.
# Serve Phi-3 Mini on single GPU (no auth required)
uv run mixvllm-serve --model microsoft/Phi-3-mini-4k-instruct --gpus 1
# Serve Llama 2 70B with tensor parallelism (requires HF_TOKEN)
HF_TOKEN=$HF_TOKEN uv run mixvllm-serve --model meta-llama/Llama-2-70b-hf --gpus 2 --trust-remote-code
# Use predefined configurations
uv run mixvllm-serve --config configs/phi3-mini.yaml # No auth required
uv run mixvllm-serve --config configs/llama-7b.yaml # May require HF_TOKEN
HF_TOKEN=$HF_TOKEN uv run mixvllm-serve --config configs/llama-70b-tp2.yaml # Requires HF_TOKEN
HF_TOKEN=$HF_TOKEN uv run mixvllm-serve --config configs/gpt-oss-20b.yaml # Requires HF_TOKEN
# Override config with CLI options
uv run mixvllm-serve --config configs/phi3-mini.yaml --port 8080
HF_TOKEN=$HF_TOKEN uv run mixvllm-serve \
--model meta-llama/Llama-2-70b-hf \
--gpus 2 \
--gpu-memory 0.85 \
--max-model-len 4096 \
--port 8000 \
--temperature 0.8 \
--max-tokens 1024
Once running, the server provides an OpenAI-compatible API:
# Health check
curl http://localhost:8000/health
# Chat completion
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "microsoft/Phi-3-mini-4k-instruct",
"messages": [{"role": "user", "content": "Hello!"}]
}'
The mixvllm-chat command provides a CLI chat interface for interactive conversations with your served models. It features rich terminal formatting and enhanced input handling similar to modern CLI applications.
# Install dependencies (if not already done)
uv sync
# Start chatting with default settings
uv run mixvllm-chat
# Connect to specific server and model
uv run mixvllm-chat --base-url http://localhost:8000 --model microsoft/Phi-3-mini-4k-instruct
# Enable streaming responses
uv run mixvllm-chat --stream --temperature 0.8
- Rich Terminal UI: Beautiful formatting with colors, panels, and markdown rendering
- Conversation Context: Maintains chat history for coherent conversations
- Command Support: `/help`, `/clear`, `/history`, `/quit`
- Enhanced Input: History-based auto-completion and navigation (with prompt_toolkit)
- Streaming Support: Real-time response streaming with live updates
- Model Auto-detection: Automatically detects available models from server
- Error Handling: Clear error messages with appropriate formatting
The chat client uses these optional libraries for enhanced UI:
- `rich`: Beautiful terminal formatting and colors
- `prompt_toolkit`: Enhanced input with history and completion
- `requests`: HTTP client for API calls
If these libraries are not available, the client falls back to basic text output.
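The fallback pattern, together with the underlying OpenAI-compatible request, looks roughly like the sketch below. This is illustrative only, not the actual client code; the server URL and model are the defaults used elsewhere in this README.

```python
# Illustrative sketch of the optional-dependency fallback and the underlying
# OpenAI-compatible request; not the actual chat client code.
import requests

try:
    from rich.console import Console
    console = Console()
    def emit(text: str) -> None:
        console.print(f"[bold cyan]Assistant:[/bold cyan] {text}")
except ImportError:                       # rich missing: fall back to plain text
    def emit(text: str) -> None:
        print("Assistant:", text)

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "microsoft/Phi-3-mini-4k-instruct",
        "messages": [{"role": "user", "content": "Hello!"}],
    },
    timeout=60,
)
emit(resp.json()["choices"][0]["message"]["content"])
```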
✓ Connected to vLLM server at http://localhost:8000
✓ Auto-selected model: microsoft/Phi-3-mini-4k-instruct
╭─ Welcome ─────────────────────────────────────────────────────────────────╮
│ │
│ 🤖 vLLM Chat Client │
│ │
│ Configuration: │
│ • Server: http://localhost:8000 │
│ • Model: microsoft/Phi-3-mini-4k-instruct │
│ │
│ Commands: /help, /clear, /history, /quit │
│ Type your message and press Enter to chat! │
│ │
╰────────────────────────────────────────────────────────────────────────────╯
You: Hello! How are you today?
╭─ 🤖 Assistant ─────────────────────────────────────────────────────────────╮
│ Hello! I'm doing well, thank you for asking. I'm here and ready to help │
│ you with any questions or tasks you might have. How can I assist you │
│ today? │
╰────────────────────────────────────────────────────────────────────────────╯
You: Tell me about machine learning
╭─ 🤖 Assistant ─────────────────────────────────────────────────────────────╮
│ Machine learning is a fascinating field that involves teaching computers │
│ to learn from data and make predictions or decisions without being │
│ explicitly programmed for each specific task. It's a subset of artificial │
│ intelligence that focuses on algorithms and statistical models that can │
│ improve their performance as they are exposed to more data. │
│ │
│ There are several main types of machine learning: │
│ │
│ 1. **Supervised Learning**: The algorithm learns from labeled training │
│ data to make predictions on new, unseen data. Examples include │
│ classification (like spam detection) and regression (like predicting │
│ house prices). │
│ │
│ 2. **Unsupervised Learning**: The algorithm finds patterns in data │
│ without labeled examples. This includes clustering (grouping similar │
│ data points) and dimensionality reduction. │
│ │
│ 3. **Reinforcement Learning**: An agent learns through trial and error by │
│ interacting with an environment, receiving rewards or penalties for │
│ actions. │
│ │
│ Machine learning has applications in many fields including computer │
│ vision, natural language processing, recommendation systems, autonomous │
│ vehicles, medical diagnosis, and financial trading. │
╰────────────────────────────────────────────────────────────────────────────╯
You: /history
╭─ 📝 Conversation History ──────────────────────────────────────────────────╮
│ ┏━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ │
│ ┃ Turn ┃ Role ┃ Content ┃ │
│ ┡━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ │
│ │ 1 │ User │ Hello! How are you today? │
│ │ 2 │ Assistant │ Hello! I'm doing well, thank you for asking. ... │
│ │ 3 │ User │ Tell me about machine learning │
│ │ 4 │ Assistant │ Machine learning is a fascinating field that... │
│ └──────┴──────────────┴─────────────────────────────────────────────────┘ │
╰────────────────────────────────────────────────────────────────────────────╯
You: /quit
👋 Goodbye!
The mixvllm-chat command provides an advanced chat client with MCP (Model Context Protocol) tool integration, enabling the LLM to call external tools during conversations.
- MCP Tool Integration: Weather queries and other MCP tools
- Tool Discovery Display: Shows available MCP tools on startup
- Dual Modes: Simple chat or agent mode with tool calling
- Rich Terminal UI: Enhanced formatting with panels and colors
- Conversation Context: Maintains chat history
- Streaming Support: Real-time response streaming
- Command System: `/help`, `/clear`, `/history`, `/mcp`, `/quit`
Install additional dependencies for MCP support:
uv sync
To run the gpt-oss-20b model using your configuration file, use the following command from the mixvllm_server directory:
HF_TOKEN=<your_huggingface_token> uv run mixvllm-serve --config src/config/gpt-oss-20b.yaml
Notes:
- Replace `<your_huggingface_token>` with your actual Hugging Face access token.
- This command will use all settings from `src/config/gpt-oss-20b.yaml`, including model name, tensor parallelism, GPU memory, and dtype.
- Make sure your environment has access to the required GPUs and the Hugging Face repository.
- If you encounter quantization or dtype errors, ensure your config file sets `dtype: bfloat16` as shown in the example config.
# Basic chat with vLLM server (auto-detects model)
uv run mixvllm-chat
# Connect to specific server (auto-detects model)
uv run mixvllm-chat --base-url http://localhost:8000
# Specify model explicitly (optional)
uv run mixvllm-chat --base-url http://localhost:8000 --model microsoft/Phi-3-mini-4k-instruct
# Enable MCP tools for weather queries (auto-detects model)
uv run mixvllm-chat --enable-mcp
# Full configuration with custom MCP config
uv run mixvllm-chat \
--enable-mcp \
--mcp-config configs/mcp_servers.yaml \
--base-url http://localhost:8000 \
--stream \
--temperature 0.8
When MCP mode is enabled, the following tools are available:
- Weather Queries: Get current weather, forecasts, and historical data
- Location Support: Supports city names and coordinates
- Units: Celsius or Fahrenheit temperature units
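Conceptually, each MCP tool is presented to the model as an OpenAI-style function definition. The sketch below shows the general shape for the hourly weather tool; the parameter names are illustrative and not necessarily the exact schema served by the MCP weather server.

```python
# Rough sketch of an MCP tool exposed as an OpenAI-style tool definition.
# Parameter names are illustrative, not the exact MCP server schema.
weather_tool = {
    "type": "function",
    "function": {
        "name": "weather_get_hourly_weather",
        "description": "Get hourly weather forecast for a location using Open-Meteo API",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City name or coordinates"},
                "units": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["location"],
        },
    },
}
# Sent to the server as part of the chat request: {"messages": [...], "tools": [weather_tool]}
```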
✓ Connected to vLLM server at http://localhost:8000
✓ Auto-selected model: microsoft/Phi-3-mini-4k-instruct
✓ MCP tools enabled (2 tools available)
╭─ Welcome ─────────────────────────────────────────────────────────────────╮
│ 🤖 Enhanced vLLM Chat Client (with MCP tools) │
│ │
│ Configuration: │
│ • Server: http://localhost:8000 │
│ • Model: microsoft/Phi-3-mini-4k-instruct │
│ • MCP Tools: Enabled │
│ │
│ Available MCP Tools (2): │
│ • weather_get_hourly_weather - Get hourly weather forecast for a location│
│ using Open-Meteo API (Weather information and forecasts) │
│ • weather_geocode_location - Get coordinates and timezone information for│
│ a location. (Weather information and forecasts) │
│ │
│ Commands: /help, /clear, /history, /mcp, /quit │
│ Type your message and press Enter to chat! │
╰────────────────────────────────────────────────────────────────────────────╯
You: What's the weather like in New York?
╭─ 🌤️ Assistant (with tools) ───────────────────────────────────────────────╮
│ The user is asking about the weather in New York. I should use the │
│ weather_get_weather tool to get current weather information. │
│ │
│ Tool Call: weather_get_weather(location="New York", units="celsius") │
│ │
│ Tool Result: [weather] Weather for New York: 22°C, Partly Cloudy, Wind 5 │
│ km/h │
│ │
│ Current weather in New York: 22°C with partly cloudy conditions and light │
│ winds at 5 km/h. │
╰────────────────────────────────────────────────────────────────────────────╯
You: /mcp
╭─ 🔧 MCP Integration Status ───────────────────────────────────────────────╮
│ ┏━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓ │
│ ┃ Server ┃ Status ┃ Tools ┃ │
│ ┡━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩ │
│ │ weather │ ✓ Connected (2 tools) │ get_hourly_ │
│ │ │ │ weather, │
│ │ │ │ geocode_loc │
│ │ │ │ ation │
│ └─────────┴───────────────────────────────────────────────┴─────────────┘ │
╰────────────────────────────────────────────────────────────────────────────╯
The system supports both standalone and distributed deployment modes, leveraging Ray for cluster management and RDMA for high-performance communication.
graph TB
subgraph "MixVLLM Architecture"
Client(["Client"])
subgraph "Head Node"
HS["HTTP Server"]
RC["Ray Controller"]
MS1["Model Shard 1"]
end
subgraph "Worker Node"
RW["Ray Worker"]
MS2["Model Shard 2"]
end
Client -->|"HTTP/REST"| HS
HS --> RC
RC <-->|"RDMA/Ray"| RW
RC --> MS1
RW --> MS2
MS1 <-->|"NCCL (12GB/s)"| MS2
end
For single-node deployment:
cd docker/stand_alone
docker-compose up -d
Head Node:
cd docker/head
cp .env.example .env
# Configure environment
docker-compose up -d
Worker Node:
cd docker/worker
cp .env.exmple .env
# Configure environment
docker-compose up -d
For detailed configuration, see the Docker Configuration Guide.
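Once the head and worker containers are up, it can be worth confirming that Ray actually sees both GPUs before launching a tensor-parallel model. A quick check with the standard Ray API, run from inside the head container and assuming the Ray cluster has already been started by the compose setup:

```python
# Verify the Ray cluster spans both nodes and sees both GPUs.
# Assumes the head/worker containers have already joined the Ray cluster.
import ray

ray.init(address="auto")                     # attach to the running cluster
resources = ray.cluster_resources()
print("Nodes in cluster:", len(ray.nodes()))
print("GPUs visible to Ray:", resources.get("GPU", 0))
assert resources.get("GPU", 0) >= 2, "Expected both RTX 3090 Ti GPUs to be registered"
```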
The web terminal provides browser-based access to CLI tools and is now designed to run as a separate process from the model server. This separation allows for more flexible deployment, improved scalability, and independent management of terminal and model services.
- Model Server: Serves the vLLM API (OpenAI-compatible) on a configurable port (default: 8000).
- Terminal Server: Runs independently, connects to any model server via HTTP, and provides a full-featured shell and chat interface in the browser (default port: 8888).
Start the model server:
uv run mixvllm-serve --config configs/gpt-oss-20b.yaml
# or use any supported config/model options
Start the terminal server (in a separate process):
uv run mixvllm-terminal-server --model-server-url http://localhost:8000
# or use --port to change the terminal port
- Browser-Based Terminal: xterm.js frontend with full shell access
- Auto-Start Chat: Optionally launches `mixvllm-chat` connected to your model server
- Flexible Connection: Terminal server can connect to any OpenAI-compatible API endpoint
- Separate Ports: Terminal and model server run on independent ports for easier scaling and security
- Docker Support: Docker Compose can launch both services as separate containers
cd docker
docker-compose up -d
# Starts both model-server and terminal-server containers independently
The web terminal provides full shell access with the same permissions as the server process. Only enable it in trusted environments. For production, consider network restrictions or authentication.