vllama is a hybrid server that brings together the best of two worlds: it combines Ollama's versatile model management with the high-speed GPU inference of vLLM. The result is an OpenAI-compatible API that serves local models with optimized performance. It runs on port 11435 as a fast alternative to Ollama (which uses port 11434), allowing you to run both simultaneously.
The server empowers you to use local large language models for programming tasks like code generation, debugging, and code completion. It is designed for efficient local LLM operations and on-device AI, serving as a powerful, private alternative to cloud-based services like GitHub Copilot.
Key Features:
- On-Demand Model Loading & Unloading: Models are loaded on-demand when a request is received and automatically unloaded after 5 minutes of inactivity, freeing up VRAM and making it a true on-demand solution.
- Automatic Context Length Optimization: vllama automatically calculates and maximizes the context length based on your available VRAM, ensuring peak performance without manual tweaking.
- Broad Model Support: All Ollama models are automatically discovered. While vLLM's GGUF support is experimental, many models, including top performers like Devstral and DeepSeek, are proven to work.
- Network-Wide Access: Serve models to your entire local network, enabling agents powered by local LLM and collaborative development.
- Advanced Model Techniques: Supports quantized models, distilled models for local programming, and techniques like model pruning, so you can run models efficiently on your hardware.
- Quick Start
- Development
- Maintenance
- Supported Models
- Integrations with Programming Agents
- Frequently Asked Questions (FAQ)
- Updates
- Logging
- Vision
- Client Integration Notes
- How to contribute
Running vllama inside a Docker container is the recommended method as it provides a consistent and isolated environment.
- Docker: A working Docker installation.
- NVIDIA Container Toolkit: Required to run GPU-accelerated Docker containers. Please see the official installation guide for your distribution.
- Pull the latest Docker image:

  ```bash
  docker pull tomhimanen/vllama:latest
  ```
- Pull known-good images: To make sure vllama works fine on your system, you probably want to pull a few Ollama models that are known to be compatible with vllama before trying other models.

  ```bash
  ollama pull tom_himanen/deepseek-r1-roo-cline-tools:14b   # proven to be compatible
  ollama pull huihui_ai/devstral-abliterated:latest         # proven to be compatible
  ```
- Run the container with a single command: You can download and execute the helper script directly. The script automatically detects your Ollama models path and launches the container as a background service. NB! Always review the script's content before executing it.

  ```bash
  curl -fsSL https://raw.githubusercontent.com/erkkimon/vllama/refs/heads/main/helpers/start_dockerized_vllama.sh | bash
  ```

  vllama will then be available at `http://localhost:11435/v1` and will be exposed to all devices on the network by default. You can test whether the server is alive by running `curl http://localhost:11435/v1/models`. You can also see the logs with the `docker logs -f vllama-service` command.
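Once the server responds, you can also send a test chat completion through the OpenAI-compatible endpoint. This is a minimal sketch that assumes you have already pulled `huihui_ai/devstral-abliterated:latest` with Ollama; substitute any model name returned by `/v1/models`:

```bash
# Smoke test against the OpenAI-compatible chat completions endpoint
curl http://localhost:11435/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "huihui_ai/devstral-abliterated:latest",
        "messages": [{"role": "user", "content": "Write a Python function that reverses a string."}]
      }'
```

Because models are loaded on demand, the first request after startup (or after an idle unload) can take noticeably longer while the model is loaded into VRAM.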
If you want to run vllama directly from the source for development, follow these steps.
- Clone the repository:

  ```bash
  git clone https://github.com/erkkimon/vllama.git
  cd vllama
  ```

- Create a Virtual Environment and Install Dependencies: This requires Python 3.12 or newer.

  ```bash
  python3 -m venv venv312
  source venv312/bin/activate
  pip install -r requirements.txt
  ```

- Run the application:

  ```bash
  python vllama.py
  ```

  The server will start on `http://localhost:11435`.
- Build the Docker image: From within the repository directory, run the build command:

  ```bash
  docker build -t vllama-dev .
  ```

- Run the container using the helper script: Run the helper script as in the Quick Start instructions, but comment out the default `docker run` command at the end of the file and uncomment the command below it.

  ```bash
  ./helpers/start_dockerized_vllama.sh
  ```
Here are the common commands for managing the vllama container service.
- View logs in real-time:

  ```bash
  docker logs -f vllama-service
  ```

- View the last 100 lines of logs:

  ```bash
  docker logs --tail 100 vllama-service
  ```

- Start the service (if it was previously stopped):

  ```bash
  docker start vllama-service
  ```

- Stop the service:

  ```bash
  docker stop vllama-service
  ```

- Remove the service permanently (stops it from starting on boot):

  ```bash
  docker stop vllama-service
  docker rm vllama-service
  ```
To update vllama to the latest version, follow these steps:
- Navigate to your project directory and pull the latest code:

  ```bash
  cd /path/to/vllama
  git pull
  ```

- Stop and remove the old container:

  ```bash
  docker stop vllama-service
  docker rm vllama-service
  ```

- Rebuild the image with the new code:

  ```bash
  docker build -t vllama .
  ```

- Start the new container using the helper script:

  ```bash
  ./helpers/start_dockerized_vllama.sh
  ```
vllama can run any GGUF model available on Ollama, but compatibility ultimately depends on vLLM's support for the model architecture. The table below lists models that have been tested or are good candidates for local coding tasks. If you have managed to run a GGUF model using vLLM, please open an issue with the command you have used for running the GGUF model on vLLM – it helps a lot in integration work. Let's make local programming happen!
| Model Family | Status | Notes |
|---|---|---|
| Devstral | ✅ Proven to Work | Excellent performance for coding and general tasks. |
| Magistral | ✅ Proven to Work | A powerful Mistral-family model, works great. |
| DeepSeek-R1 | ✅ Proven to Work | Great for complex programming and following instructions. |
| DeepSeek-V2 / V3 | ❔ Untested | Promising for code generation and debugging. |
| Mistral / Mistral-Instruct | ❔ Untested | Lightweight and fast, good for code completion. |
| CodeLlama / CodeLlama-Instruct | ❔ Untested | Specifically fine-tuned for programming tasks. |
| Phi-3 (Mini, Small, Medium) | ❔ Untested | Strong reasoning capabilities in a smaller package. |
| Llama-3-Code | ❔ Untested | A powerful contender for local coding performance. |
| Qwen (2.5, 3, 3-VL, 3-Coder) | ❔ Untested | Strong multilingual and coding abilities. |
| Gemma / Gemma-2 | ❔ Untested | Google's open models, good for general purpose and coding. |
| StarCoder / StarCoder2 | ❔ Untested | Trained on a massive corpus of code. |
| WizardCoder | ❔ Untested | Fine-tuned for coding proficiency. |
| GLM / GLM-4 | ❔ Untested | Bilingual models with strong performance. |
| Codestral | ❔ Untested | Mistral's first code-specific model. |
| Kimi K2 | ❔ Untested | Known for its large context window capabilities. |
| Granite-Code | ❔ Untested | IBM's open-source code models. |
| CodeBERT | ❔ Untested | An early but influential code model. |
| Pythia-Coder | ❔ Untested | A model for studying LLM development. |
| Stable-Code | ❔ Untested | From the creators of Stable Diffusion. |
| Mistral-Nemo | ❔ Untested | A powerful new model from Mistral. |
| Llama-3.1 | ❔ Untested | The latest iteration of the Llama family. |
| TabNine-Local | ❔ Untested | Open variants of the popular code completion tool. |
Additionally, vllama supports loading custom GGUF models. If you create a /opt/vllama/models directory on your host system, it will be automatically mounted as a read-only volume inside the Docker container. This feature allows you to use GGUF models that are not available on Ollama Hub. For example, you can download a smaller, efficient model like Devstral-Small-2505-abliterated.i1-IQ2_M.gguf. Using smaller models is particularly useful for GPUs with lower VRAM, as it can free up resources to allow for a larger context window.
Important Note on Custom Model Directory: If you create the /opt/vllama/models directory after the vllama-service container has been initially launched, you must stop and remove the existing container (e.g., docker stop vllama-service && docker rm vllama-service) and then re-run the start_dockerized_vllama.sh helper script. This ensures the new volume mount is correctly applied to the container.
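As a sketch of that workflow (the source location of the GGUF file below is illustrative; use whatever file you have downloaded):

```bash
# Create the custom model directory on the host
sudo mkdir -p /opt/vllama/models

# Copy a GGUF file into it, e.g. the smaller Devstral quant mentioned above
# (~/Downloads is just an example location)
sudo cp ~/Downloads/Devstral-Small-2505-abliterated.i1-IQ2_M.gguf /opt/vllama/models/

# Recreate the container so the read-only volume mount is applied
docker stop vllama-service && docker rm vllama-service
./helpers/start_dockerized_vllama.sh
```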
One of the most powerful uses of vllama is to serve as the brain for local programming agents. This is how to use a local LLM for software development in a modern, automated way.
Roo Code, Cline, Kilo Code and Goose are powerful programming agents that can use vllama for inference. Since vllama provides an OpenAI-compatible API, setting them up is straightforward.
- Start vllama: Ensure your vllama service is running, or start it manually.
- Configure the Agent: In your agent's settings (e.g., in Roo Code's `config.toml`), point the API endpoint to vllama's address.
  - API URL: `http://localhost:11435/v1`
  - Model Name: Select one of the models you have pulled with Ollama (e.g., `huihui_ai/devstral-abliterated:latest`).
  - API Key: You can typically leave this blank.
Now, your agent will use your local GPU for lightning-fast, private code generation and debugging.
The same principle applies to most modern AI agents. vllama can serve as a local, private backend for many popular tools, making it a fantastic alternative to cloud-based services.
- AI Agent Frameworks: For frameworks like LangChain agents, AutoGen, and CrewAI, you can configure the LLM client to point to the vllama endpoint (`http://localhost:11435/v1`); a generic configuration sketch follows this list. This allows you to build complex workflows that run entirely on your hardware.
- Interpreter-Style Agents: Tools like Open-Interpreter and open-source alternatives to Devin-AI can be configured to use a local OpenAI-compatible endpoint, making them perfect companions for vllama.
- IDE Plugins & Tools: Plugins and tools like Aider, Cursor-AI local, Tabby-ML, Continue-dev, and local alternatives to CodeWhisperer often support custom local endpoints. Point them to vllama to get superior performance and privacy compared to their default cloud services.
- Other Coding Assistants: The OpenAI-compatible API allows vllama to be a backend for many other tools, including experimental or less common ones like Claude Code and Kilo Code.
- Advanced Agent Architectures: If you are experimenting with Reflexion agents, ReAct agents for coding, or Tree-of-Thoughts coding, vllama provides the fast, local inference engine you need to power your research.
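For agents and frameworks built on the official OpenAI SDKs, a common configuration pattern is to point the SDK at vllama through environment variables. The exact variable names vary from tool to tool, so treat this as a sketch and check each agent's documentation:

```bash
# Read by the official OpenAI SDKs (and inherited by many tools built on them)
export OPENAI_BASE_URL="http://localhost:11435/v1"
# A placeholder key; as noted above, the key can typically be left blank
export OPENAI_API_KEY="vllama"
```

Tools that expose an explicit "base URL" or "API base" setting can simply be given `http://localhost:11435/v1` there instead.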
Q: Why is my model's context window smaller than expected?
A: vllama prioritizes speed and stability by running inference exclusively on your GPU. To prevent out-of-memory errors, it automatically calculates the maximum possible context window based on the VRAM available after the model is loaded. If you need a larger context window, try a smaller model or a version with more aggressive quantization (e.g., a 4-bit or 5-bit quantized model instead of a 6-bit or 8-bit one). Finding the best quantized LLM for programming tasks often involves balancing performance with context size.
Q: Which local LLM is best for programming?
A: The "best" model depends heavily on your hardware, the specific task (e.g., code completion vs. complex debugging), and personal preference. The goal of vllama is to make it easy to experiment. Check the Supported Models table to explore options.
Q: How can I use a local LLM for software development?
A: Using a local large language model for programming offers several advantages:
- Privacy: Your code never leaves your machine.
- Speed: Inference is performed directly on your GPU, eliminating network latency.
- Offline Access: Continue working without an internet connection, making it a true offline AI solution.
- Customization: You can choose from dozens of open-source models, including fine-tuned or distilled models for local programming, to find the perfect fit for your needs. vllama is the engine that makes this practical and efficient.
Q: Is this a free local LLM for developers?
A: Yes. vllama is an open-source tool that is completely free to use. You provide the hardware, and vllama provides the high-performance inference server. It's part of a growing ecosystem of free, open-source tools designed to democratize access to powerful AI.
- Nov 21, 2025: Support for loading custom GGUF models from the `/opt/vllama/models/` folder has been added.
- Nov 20, 2025: Simplified container installation method from Docker Hub, allowing direct execution of the helper script.
- Nov 17, 2025: Support for the Mistral model family is now officially proven to work, including models like Magistral and Devstral.
- Nov 15, 2025: Added Docker support for a consistent, portable environment and a new helper script to make running it easier.
- Nov 4, 2025: Support for the `deepseek-r1` architecture has been added! 🧠 This allows models like `huihui_ai/deepseek-r1-abliterated:14b` to be used with `vllama`.
- Nov 3, 2025: `vllama` is alive! 🎉 Devstral models are confirmed to work flawlessly, bringing high-speed local inference to the community. 🚀
vllama logs important events and errors to help with debugging and monitoring.
- Service (Production): When running as a systemd service, logs are located in `/opt/vllama/logs/`.
- Development: When running `vllama.py` directly from the repository, logs are located in a `logs/` directory within the project root.
To monitor the logs in real-time, you can use the tail -f command on the appropriate log file.
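For example, assuming a log file named `vllama.log` (the actual file name may differ; list the log directory to check):

```bash
# When running as a service
tail -f /opt/vllama/logs/vllama.log

# When running from the repository
tail -f ./logs/vllama.log
```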
The vision for vllama is to make high-performance AI inference accessible and efficient. By integrating vLLM's advanced GPU optimizations, it addresses common pain points like slow Ollama inference while keeping Ollama's simple workflow. This makes it an ideal solution for local programming and local-model-powered software development, enabling agents powered by local LLMs to run efficiently. Whether you're looking for an OpenAI-compatible vLLM server or a way to unload vLLM models when idle, vllama aims to be the go-to tool for users who want faster inference with Ollama and vLLM.
All pull requests are welcome! Also, if you have successfully run your favorite LLM in GGUF format with vLLM, please share it by creating an issue. It will help a lot with integrating it into vllama!
If you are using clients like Roo Code or Cline, adjust the maximum context window length in the client's settings to match your available VRAM. Setting context condensing to trigger at 80% of the context window size is recommended for optimal performance.