From 8bf6cd030eb87f957c749e4bb36d222e432e8ad4 Mon Sep 17 00:00:00 2001 From: Eric Curtin Date: Tue, 6 Jan 2026 16:46:13 +0000 Subject: [PATCH] Add Docker Model Runner documentation For configuration, IDE integrations, inference engines, and Open WebUI Signed-off-by: Eric Curtin --- .../manuals/ai/compose/models-and-compose.md | 8 +- content/manuals/ai/model-runner/_index.md | 41 ++- .../manuals/ai/model-runner/api-reference.md | 275 +++++++++++---- .../manuals/ai/model-runner/configuration.md | 305 +++++++++++++++++ .../manuals/ai/model-runner/get-started.md | 10 +- .../ai/model-runner/ide-integrations.md | 283 ++++++++++++++++ .../ai/model-runner/inference-engines.md | 319 ++++++++++++++++++ .../ai/model-runner/openwebui-integration.md | 293 ++++++++++++++++ 8 files changed, 1449 insertions(+), 85 deletions(-) create mode 100644 content/manuals/ai/model-runner/configuration.md create mode 100644 content/manuals/ai/model-runner/ide-integrations.md create mode 100644 content/manuals/ai/model-runner/inference-engines.md create mode 100644 content/manuals/ai/model-runner/openwebui-integration.md diff --git a/content/manuals/ai/compose/models-and-compose.md b/content/manuals/ai/compose/models-and-compose.md index 30cfc161a247..c65f342aad45 100644 --- a/content/manuals/ai/compose/models-and-compose.md +++ b/content/manuals/ai/compose/models-and-compose.md @@ -77,7 +77,7 @@ Common configuration options include: > as small as feasible for your specific needs. - `runtime_flags`: A list of raw command-line flags passed to the inference engine when the model is started. - For example, if you use llama.cpp, you can pass any of [the available parameters](https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md). + See [Configuration options](/manuals/ai/model-runner/configuration.md) for commonly used parameters and examples. - Platform-specific options may also be available via extension attributes `x-*` > [!TIP] @@ -364,5 +364,7 @@ services: - [`models` top-level element](/reference/compose-file/models.md) - [`models` attribute](/reference/compose-file/services.md#models) -- [Docker Model Runner documentation](/manuals/ai/model-runner.md) -- [Compose Model Runner documentation](/manuals/ai/compose/models-and-compose.md) +- [Docker Model Runner documentation](/manuals/ai/model-runner/_index.md) +- [Configuration options](/manuals/ai/model-runner/configuration.md) - Context size and runtime parameters +- [Inference engines](/manuals/ai/model-runner/inference-engines.md) - llama.cpp and vLLM details +- [API reference](/manuals/ai/model-runner/api-reference.md) - OpenAI and Ollama-compatible APIs diff --git a/content/manuals/ai/model-runner/_index.md b/content/manuals/ai/model-runner/_index.md index 8c6b2e51e76b..60c95daaf399 100644 --- a/content/manuals/ai/model-runner/_index.md +++ b/content/manuals/ai/model-runner/_index.md @@ -6,7 +6,7 @@ params: group: AI weight: 30 description: Learn how to use Docker Model Runner to manage and run AI models. -keywords: Docker, ai, model runner, docker desktop, docker engine, llm, openai, llama.cpp, vllm, cpu, nvidia, cuda, amd, rocm, vulkan +keywords: Docker, ai, model runner, docker desktop, docker engine, llm, openai, ollama, llama.cpp, vllm, cpu, nvidia, cuda, amd, rocm, vulkan, cline, continue, cursor aliases: - /desktop/features/model-runner/ - /model-runner/ @@ -21,7 +21,7 @@ large language models (LLMs) and other AI models directly from Docker Hub or any OCI-compliant registry. 
With seamless integration into Docker Desktop and Docker -Engine, you can serve models via OpenAI-compatible APIs, package GGUF files as +Engine, you can serve models via OpenAI and Ollama-compatible APIs, package GGUF files as OCI Artifacts, and interact with models from both the command line and graphical interface. @@ -33,10 +33,13 @@ with AI models locally. ## Key features - [Pull and push models to and from Docker Hub](https://hub.docker.com/u/ai) -- Serve models on OpenAI-compatible APIs for easy integration with existing apps -- Support for both llama.cpp and vLLM inference engines (vLLM currently supported on Linux x86_64/amd64 with NVIDIA GPUs only) +- Serve models on [OpenAI and Ollama-compatible APIs](api-reference.md) for easy integration with existing apps +- Support for both [llama.cpp and vLLM inference engines](inference-engines.md) (vLLM on Linux x86_64/amd64 and Windows WSL2 with NVIDIA GPUs) - Package GGUF and Safetensors files as OCI Artifacts and publish them to any Container Registry - Run and interact with AI models directly from the command line or from the Docker Desktop GUI +- [Connect to AI coding tools](ide-integrations.md) like Cline, Continue, Cursor, and Aider +- [Configure context size and model parameters](configuration.md) to tune performance +- [Set up Open WebUI](openwebui-integration.md) for a ChatGPT-like web interface - Manage local models and display logs - Display prompt and response details - Conversational context support for multi-turn interactions @@ -82,9 +85,28 @@ locally. They load into memory only at runtime when a request is made, and unload when not in use to optimize resources. Because models can be large, the initial pull may take some time. After that, they're cached locally for faster access. You can interact with the model using -[OpenAI-compatible APIs](api-reference.md). +[OpenAI and Ollama-compatible APIs](api-reference.md). -Docker Model Runner supports both [llama.cpp](https://github.com/ggerganov/llama.cpp) and [vLLM](https://github.com/vllm-project/vllm) as inference engines, providing flexibility for different model formats and performance requirements. For more details, see the [Docker Model Runner repository](https://github.com/docker/model-runner). +### Inference engines + +Docker Model Runner supports two inference engines: + +| Engine | Best for | Model format | +|--------|----------|--------------| +| [llama.cpp](inference-engines.md#llamacpp) | Local development, resource efficiency | GGUF (quantized) | +| [vLLM](inference-engines.md#vllm) | Production, high throughput | Safetensors | + +llama.cpp is the default engine and works on all platforms. vLLM requires NVIDIA GPUs and is supported on Linux x86_64 and Windows with WSL2. See [Inference engines](inference-engines.md) for detailed comparison and setup. + +### Context size + +Models have a configurable context size (context length) that determines how many tokens they can process. The default varies by model but is typically 2,048-8,192 tokens. You can adjust this per-model: + +```console +$ docker model configure --context-size 8192 ai/qwen2.5-coder +``` + +See [Configuration options](configuration.md) for details on context size and other parameters. > [!TIP] > @@ -120,4 +142,9 @@ Thanks for trying out Docker Model Runner. 
To report bugs or request features, [ ## Next steps -[Get started with DMR](get-started.md) +- [Get started with DMR](get-started.md) - Enable DMR and run your first model +- [API reference](api-reference.md) - OpenAI and Ollama-compatible API documentation +- [Configuration options](configuration.md) - Context size and runtime parameters +- [Inference engines](inference-engines.md) - llama.cpp and vLLM details +- [IDE integrations](ide-integrations.md) - Connect Cline, Continue, Cursor, and more +- [Open WebUI integration](openwebui-integration.md) - Set up a web chat interface diff --git a/content/manuals/ai/model-runner/api-reference.md b/content/manuals/ai/model-runner/api-reference.md index c2b4abcc82d0..dd9bdf506bbc 100644 --- a/content/manuals/ai/model-runner/api-reference.md +++ b/content/manuals/ai/model-runner/api-reference.md @@ -1,30 +1,37 @@ --- title: DMR REST API -description: Reference documentation for the Docker Model Runner REST API endpoints and usage examples. +description: Reference documentation for the Docker Model Runner REST API endpoints, including OpenAI and Ollama compatibility. weight: 30 -keywords: Docker, ai, model runner, rest api, openai, endpoints, documentation +keywords: Docker, ai, model runner, rest api, openai, ollama, endpoints, documentation, cline, continue, cursor --- Once Model Runner is enabled, new API endpoints are available. You can use -these endpoints to interact with a model programmatically. +these endpoints to interact with a model programmatically. Docker Model Runner +provides compatibility with both OpenAI and Ollama API formats. -### Determine the base URL +## Determine the base URL -The base URL to interact with the endpoints depends -on how you run Docker: +The base URL to interact with the endpoints depends on how you run Docker and +which API format you're using. {{< tabs >}} {{< tab name="Docker Desktop">}} -- From containers: `http://model-runner.docker.internal/` -- From host processes: `http://localhost:12434/`, assuming TCP host access is - enabled on the default port (12434). +| Access from | Base URL | +|-------------|----------| +| Containers | `http://model-runner.docker.internal` | +| Host processes (TCP) | `http://localhost:12434` | + +> [!NOTE] +> TCP host access must be enabled. See [Enable Docker Model Runner](get-started.md#enable-docker-model-runner-in-docker-desktop). 
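+
+For a quick check that the endpoint is reachable from the host (assuming the default port 12434), list the available models:
+
+```console
+$ curl http://localhost:12434/engines/v1/models
+```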
{{< /tab >}} {{< tab name="Docker Engine">}} -- From containers: `http://172.17.0.1:12434/` (with `172.17.0.1` representing the host gateway address) -- From host processes: `http://localhost:12434/` +| Access from | Base URL | +|-------------|----------| +| Containers | `http://172.17.0.1:12434` | +| Host processes | `http://localhost:12434` | > [!NOTE] > The `172.17.0.1` interface may not be available by default to containers @@ -35,77 +42,139 @@ on how you run Docker: > extra_hosts: > - "model-runner.docker.internal:host-gateway" > ``` -> Then you can access the Docker Model Runner APIs at http://model-runner.docker.internal:12434/ +> Then you can access the Docker Model Runner APIs at `http://model-runner.docker.internal:12434/` {{< /tab >}} {{}} -### Available DMR endpoints +### Base URLs for third-party tools -- Create a model: +When configuring third-party tools that expect OpenAI-compatible APIs, use these base URLs: - ```text - POST /models/create - ``` +| Tool type | Base URL format | +|-----------|-----------------| +| OpenAI SDK / clients | `http://localhost:12434/engines/v1` | +| Ollama-compatible clients | `http://localhost:12434` | -- List models: +See [IDE and tool integrations](ide-integrations.md) for specific configuration examples. - ```text - GET /models - ``` +## Supported APIs -- Get a model: +Docker Model Runner supports multiple API formats: - ```text - GET /models/{namespace}/{name} - ``` +| API | Description | Use case | +|-----|-------------|----------| +| [OpenAI API](#openai-compatible-api) | OpenAI-compatible chat completions, embeddings | Most AI frameworks and tools | +| [Ollama API](#ollama-compatible-api) | Ollama-compatible endpoints | Tools built for Ollama | +| [DMR API](#dmr-native-endpoints) | Native Docker Model Runner endpoints | Model management | -- Delete a local model: +## OpenAI-compatible API - ```text - DELETE /models/{namespace}/{name} - ``` +DMR implements the OpenAI API specification for maximum compatibility with existing tools and frameworks. -### Available OpenAI endpoints +### Endpoints -DMR supports the following OpenAI endpoints: +| Endpoint | Method | Description | +|----------|--------|-------------| +| `/engines/v1/models` | GET | [List models](https://platform.openai.com/docs/api-reference/models/list) | +| `/engines/v1/models/{namespace}/{name}` | GET | [Retrieve model](https://platform.openai.com/docs/api-reference/models/retrieve) | +| `/engines/v1/chat/completions` | POST | [Create chat completion](https://platform.openai.com/docs/api-reference/chat/create) | +| `/engines/v1/completions` | POST | [Create completion](https://platform.openai.com/docs/api-reference/completions/create) | +| `/engines/v1/embeddings` | POST | [Create embeddings](https://platform.openai.com/docs/api-reference/embeddings/create) | -- [List models](https://platform.openai.com/docs/api-reference/models/list): +> [!NOTE] +> You can optionally include the engine name in the path: `/engines/llama.cpp/v1/chat/completions`. +> This is useful when running multiple inference engines. - ```text - GET /engines/llama.cpp/v1/models - ``` +### Model name format -- [Retrieve model](https://platform.openai.com/docs/api-reference/models/retrieve): +When specifying a model in API requests, use the full model identifier including the namespace: - ```text - GET /engines/llama.cpp/v1/models/{namespace}/{name} - ``` +```json +{ + "model": "ai/smollm2", + "messages": [...] 
+} +``` -- [List chat completions](https://platform.openai.com/docs/api-reference/chat/list): +Common model name formats: +- Docker Hub models: `ai/smollm2`, `ai/llama3.2`, `ai/qwen2.5-coder` +- Tagged versions: `ai/smollm2:360M-Q4_K_M` +- Custom models: `myorg/mymodel` - ```text - POST /engines/llama.cpp/v1/chat/completions - ``` +### Supported parameters -- [Create completions](https://platform.openai.com/docs/api-reference/completions/create): +The following OpenAI API parameters are supported: - ```text - POST /engines/llama.cpp/v1/completions - ``` +| Parameter | Type | Description | +|-----------|------|-------------| +| `model` | string | Required. The model identifier. | +| `messages` | array | Required for chat completions. The conversation history. | +| `prompt` | string | Required for completions. The prompt text. | +| `max_tokens` | integer | Maximum tokens to generate. | +| `temperature` | float | Sampling temperature (0.0-2.0). | +| `top_p` | float | Nucleus sampling parameter (0.0-1.0). | +| `stream` | Boolean | Enable streaming responses. | +| `stop` | string/array | Stop sequences. | +| `presence_penalty` | float | Presence penalty (-2.0 to 2.0). | +| `frequency_penalty` | float | Frequency penalty (-2.0 to 2.0). | +### Limitations and differences from OpenAI -- [Create embeddings](https://platform.openai.com/docs/api-reference/embeddings/create): +Be aware of these differences when using DMR's OpenAI-compatible API: - ```text - POST /engines/llama.cpp/v1/embeddings - ``` +| Feature | DMR behavior | +|---------|--------------| +| API key | Not required. DMR ignores the `Authorization` header. | +| Function calling | Supported with llama.cpp for compatible models. | +| Vision | Supported for multi-modal models (e.g., LLaVA). | +| JSON mode | Supported via `response_format: {"type": "json_object"}`. | +| Logprobs | Supported. | +| Token counting | Uses the model's native token encoder, which may differ from OpenAI's. | -To call these endpoints via a Unix socket (`/var/run/docker.sock`), prefix their path -with `/exp/vDD4.40`. +## Ollama-compatible API -> [!NOTE] -> You can omit `llama.cpp` from the path. For example: `POST /engines/v1/chat/completions`. +DMR also provides Ollama-compatible endpoints for tools and frameworks built for Ollama. 
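+
+In practice, a client that already speaks the Ollama protocol usually only needs its host changed to point at DMR. A minimal sketch, assuming the official `ollama` Python package is installed and `ai/smollm2` has been pulled:
+
+```python
+from ollama import Client
+
+# Point the Ollama client at Docker Model Runner instead of an Ollama daemon.
+client = Client(host="http://localhost:12434")
+
+response = client.chat(
+    model="ai/smollm2",
+    messages=[{"role": "user", "content": "Hello!"}],
+)
+print(response["message"]["content"])
+```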
+ +### Endpoints + +| Endpoint | Method | Description | +|----------|--------|-------------| +| `/api/tags` | GET | List available models | +| `/api/show` | POST | Show model information | +| `/api/chat` | POST | Generate chat completion | +| `/api/generate` | POST | Generate completion | +| `/api/embeddings` | POST | Generate embeddings | + +### Example: Chat with Ollama API + +```bash +curl http://localhost:12434/api/chat \ + -H "Content-Type: application/json" \ + -d '{ + "model": "ai/smollm2", + "messages": [ + {"role": "user", "content": "Hello!"} + ] + }' +``` + +### Example: List models + +```bash +curl http://localhost:12434/api/tags +``` + +## DMR native endpoints + +These endpoints are specific to Docker Model Runner for model management: + +| Endpoint | Method | Description | +|----------|--------|-------------| +| `/models/create` | POST | Pull/create a model | +| `/models` | GET | List local models | +| `/models/{namespace}/{name}` | GET | Get model details | +| `/models/{namespace}/{name}` | DELETE | Delete a local model | ## REST API examples @@ -116,7 +185,7 @@ To call the `chat/completions` OpenAI endpoint from within another container usi ```bash #!/bin/sh -curl http://model-runner.docker.internal/engines/llama.cpp/v1/chat/completions \ +curl http://model-runner.docker.internal/engines/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "ai/smollm2", @@ -149,21 +218,21 @@ To call the `chat/completions` OpenAI endpoint from the host via TCP: ```bash #!/bin/sh - curl http://localhost:12434/engines/llama.cpp/v1/chat/completions \ - -H "Content-Type: application/json" \ - -d '{ - "model": "ai/smollm2", - "messages": [ - { - "role": "system", - "content": "You are a helpful assistant." - }, - { - "role": "user", - "content": "Please write 500 words about the fall of Rome." - } - ] - }' +curl http://localhost:12434/engines/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "ai/smollm2", + "messages": [ + { + "role": "system", + "content": "You are a helpful assistant." + }, + { + "role": "user", + "content": "Please write 500 words about the fall of Rome." 
+ } + ] + }' ``` ### Request from the host using a Unix socket @@ -174,7 +243,7 @@ To call the `chat/completions` OpenAI endpoint through the Docker socket from th #!/bin/sh curl --unix-socket $HOME/.docker/run/docker.sock \ - localhost/exp/vDD4.40/engines/llama.cpp/v1/chat/completions \ + localhost/exp/vDD4.40/engines/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "ai/smollm2", @@ -190,3 +259,65 @@ curl --unix-socket $HOME/.docker/run/docker.sock \ ] }' ``` + +### Streaming responses + +To receive streaming responses, set `stream: true`: + +```bash +curl http://localhost:12434/engines/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "ai/smollm2", + "stream": true, + "messages": [ + {"role": "user", "content": "Count from 1 to 10"} + ] + }' +``` + +## Using with OpenAI SDKs + +### Python + +```python +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:12434/engines/v1", + api_key="not-needed" # DMR doesn't require an API key +) + +response = client.chat.completions.create( + model="ai/smollm2", + messages=[ + {"role": "user", "content": "Hello!"} + ] +) + +print(response.choices[0].message.content) +``` + +### Node.js + +```javascript +import OpenAI from 'openai'; + +const client = new OpenAI({ + baseURL: 'http://localhost:12434/engines/v1', + apiKey: 'not-needed', +}); + +const response = await client.chat.completions.create({ + model: 'ai/smollm2', + messages: [{ role: 'user', content: 'Hello!' }], +}); + +console.log(response.choices[0].message.content); +``` + +## What's next + +- [IDE and tool integrations](ide-integrations.md) - Configure Cline, Continue, Cursor, and other tools +- [Configuration options](configuration.md) - Adjust context size and runtime parameters +- [Inference engines](inference-engines.md) - Learn about llama.cpp and vLLM options diff --git a/content/manuals/ai/model-runner/configuration.md b/content/manuals/ai/model-runner/configuration.md new file mode 100644 index 000000000000..306240638b97 --- /dev/null +++ b/content/manuals/ai/model-runner/configuration.md @@ -0,0 +1,305 @@ +--- +title: Configuration options +description: Configure context size, runtime parameters, and model behavior in Docker Model Runner. +weight: 35 +keywords: Docker, ai, model runner, configuration, context size, context length, tokens, llama.cpp, parameters +--- + +Docker Model Runner provides several configuration options to tune model behavior, +memory usage, and inference performance. This guide covers the key settings and +how to apply them. + +## Context size (context length) + +The context size determines the maximum number of tokens a model can process in +a single request, including both the input prompt and generated output. This is +one of the most important settings affecting memory usage and model capabilities. + +### Default context size + +By default, Docker Model Runner uses a context size that balances capability with +resource efficiency: + +| Engine | Default behavior | +|--------|------------------| +| llama.cpp | 4096 tokens | +| vLLM | Uses the model's maximum trained context size | + +> [!NOTE] +> The actual default varies by model. Most models support between 2,048 and 8,192 +> tokens by default. Some newer models support 32K, 128K, or even larger contexts. 
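+
+As a rough rule of thumb, one token corresponds to about 3-4 characters of English text, and the context budget is shared between the prompt and the generated reply. A back-of-the-envelope sketch (actual counts come from the model's tokenizer):
+
+```python
+def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
+    """Rough token estimate; real counts depend on the model's tokenizer."""
+    return int(len(text) / chars_per_token)
+
+prompt = "Summarize the following design document..."  # hypothetical prompt text
+print(f"~{estimate_tokens(prompt)} tokens of a 4,096-token default budget")
+```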
+ +### Configure context size + +You can adjust context size per model using the `docker model configure` command: + +```console +$ docker model configure --context-size 8192 ai/qwen2.5-coder +``` + +Or in a Compose file: + +```yaml +models: + llm: + model: ai/qwen2.5-coder + context_size: 8192 +``` + +### Context size guidelines + +| Context size | Typical use case | Memory impact | +|--------------|------------------|---------------| +| 2,048 | Simple queries, short code snippets | Low | +| 4,096 | Standard conversations, medium code files | Moderate | +| 8,192 | Long conversations, larger code files | Higher | +| 16,384+ | Extended documents, multi-file context | High | + +> [!IMPORTANT] +> Larger context sizes require more memory (RAM/VRAM). If you experience out-of-memory +> errors, reduce the context size. As a rough guide, each additional 1,000 tokens +> requires approximately 100-500 MB of additional memory, depending on the model size. + +### Check a model's maximum context + +To see a model's configuration including context size: + +```console +$ docker model inspect ai/qwen2.5-coder +``` + +> [!NOTE] +> The `docker model inspect` command shows the model's maximum supported context length +> (e.g., `gemma3.context_length`), not the configured context size. The configured context +> size is what you set with `docker model configure --context-size` and represents the +> actual limit used during inference, which should be less than or equal to the model's +> maximum supported context length. + +## Runtime flags + +Runtime flags let you pass parameters directly to the underlying inference engine. +This provides fine-grained control over model behavior. + +### Using runtime flags + +Runtime flags can be provided through multiple mechanisms: + +#### Using Docker Compose + +In a Compose file: + +```yaml +models: + llm: + model: ai/qwen2.5-coder + context_size: 4096 + runtime_flags: + - "--temp" + - "0.7" + - "--top-p" + - "0.9" +``` + +#### Using Command Line + +With the `docker model configure` command: + +```console +$ docker model configure --runtime-flag "--temp" --runtime-flag "0.7" --runtime-flag "--top-p" --runtime-flag "0.9" ai/qwen2.5-coder +``` + +### Common llama.cpp parameters + +These are the most commonly used llama.cpp parameters. You don't need to look up +the llama.cpp documentation for typical use cases. + +#### Sampling parameters + +| Flag | Description | Default | Range | +|------|-------------|---------|-------| +| `--temp` | Temperature for sampling. Lower = more deterministic, higher = more creative | 0.8 | 0.0-2.0 | +| `--top-k` | Limit sampling to top K tokens. Lower = more focused | 40 | 1-100 | +| `--top-p` | Nucleus sampling threshold. 
Lower = more focused | 0.9 | 0.0-1.0 | +| `--min-p` | Minimum probability threshold | 0.05 | 0.0-1.0 | +| `--repeat-penalty` | Penalty for repeating tokens | 1.1 | 1.0-2.0 | + +**Example: Deterministic output (for code generation)** + +```yaml +runtime_flags: + - "--temp" + - "0" + - "--top-k" + - "1" +``` + +**Example: Creative output (for storytelling)** + +```yaml +runtime_flags: + - "--temp" + - "1.2" + - "--top-p" + - "0.95" +``` + +#### Performance parameters + +| Flag | Description | Default | Notes | +|------|-------------|---------|-------| +| `--threads` | CPU threads for generation | Auto | Set to number of performance cores | +| `--threads-batch` | CPU threads for batch processing | Auto | Usually same as `--threads` | +| `--batch-size` | Batch size for prompt processing | 512 | Higher = faster prompt processing | +| `--mlock` | Lock model in memory | Off | Prevents swapping, requires sufficient RAM | +| `--no-mmap` | Disable memory mapping | Off | May improve performance on some systems | + +**Example: Optimized for multi-core CPU** + +```yaml +runtime_flags: + - "--threads" + - "8" + - "--batch-size" + - "1024" +``` + +#### GPU parameters + +| Flag | Description | Default | Notes | +|------|-------------|---------|-------| +| `--n-gpu-layers` | Layers to offload to GPU | All (if GPU available) | Reduce if running out of VRAM | +| `--main-gpu` | GPU to use for computation | 0 | For multi-GPU systems | +| `--split-mode` | How to split across GPUs | layer | Options: `none`, `layer`, `row` | + +**Example: Partial GPU offload (limited VRAM)** + +```yaml +runtime_flags: + - "--n-gpu-layers" + - "20" +``` + +#### Advanced parameters + +| Flag | Description | Default | +|------|-------------|---------| +| `--rope-scaling` | RoPE scaling method | Auto | +| `--rope-freq-base` | RoPE base frequency | Model default | +| `--rope-freq-scale` | RoPE frequency scale | Model default | +| `--no-prefill-assistant` | Disable assistant pre-fill | Off | +| `--reasoning-budget` | Token budget for reasoning models | 0 (disabled) | + +### vLLM parameters + +When using the vLLM backend, different parameters are available. + +Use `--hf_overrides` to pass HuggingFace model config overrides as JSON: + +```console +$ docker model configure --hf_overrides '{"rope_scaling": {"type": "dynamic", "factor": 2.0}}' ai/model-vllm +``` + +## Configuration presets + +Here are complete configuration examples for common use cases. 
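+
+Each preset below is written as a Compose `models:` block. The same values can also be applied to a standalone model with the `docker model configure` flags shown earlier, for example (mirroring the chat-assistant preset):
+
+```console
+$ docker model configure --context-size 8192 --runtime-flag "--temp" --runtime-flag "0.7" ai/llama3.2
+```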
+ +### Code completion (fast, deterministic) + +```yaml +models: + coder: + model: ai/qwen2.5-coder + context_size: 4096 + runtime_flags: + - "--temp" + - "0.1" + - "--top-k" + - "1" + - "--batch-size" + - "1024" +``` + +### Chat assistant (balanced) + +```yaml +models: + assistant: + model: ai/llama3.2 + context_size: 8192 + runtime_flags: + - "--temp" + - "0.7" + - "--top-p" + - "0.9" + - "--repeat-penalty" + - "1.1" +``` + +### Creative writing (high temperature) + +```yaml +models: + writer: + model: ai/llama3.2 + context_size: 8192 + runtime_flags: + - "--temp" + - "1.2" + - "--top-p" + - "0.95" + - "--repeat-penalty" + - "1.0" +``` + +### Long document analysis (large context) + +```yaml +models: + analyzer: + model: ai/qwen2.5-coder:14B + context_size: 32768 + runtime_flags: + - "--mlock" + - "--batch-size" + - "2048" +``` + +### Low memory system + +```yaml +models: + efficient: + model: ai/smollm2:360M-Q4_K_M + context_size: 2048 + runtime_flags: + - "--threads" + - "4" +``` + +## Environment-based configuration + +You can also configure models via environment variables in containers: + +| Variable | Description | +|----------|-------------| +| `LLM_URL` | Auto-injected URL of the model endpoint | +| `LLM_MODEL` | Auto-injected model identifier | + +See [Models and Compose](/manuals/ai/compose/models-and-compose.md) for details on how these are populated. + +## Reset configuration + +Configuration set via `docker model configure` persists until the model is removed. +To reset configuration: + +```console +$ docker model configure --context-size -1 ai/qwen2.5-coder +``` + +Using `-1` resets to the default value. + +## What's next + +- [Inference engines](inference-engines.md) - Learn about llama.cpp and vLLM +- [API reference](api-reference.md) - API parameters for per-request configuration +- [Models and Compose](/manuals/ai/compose/models-and-compose.md) - Configure models in Compose applications diff --git a/content/manuals/ai/model-runner/get-started.md b/content/manuals/ai/model-runner/get-started.md index 52df8cc8ee91..efd256b23c1c 100644 --- a/content/manuals/ai/model-runner/get-started.md +++ b/content/manuals/ai/model-runner/get-started.md @@ -221,6 +221,10 @@ In Docker Desktop, to inspect the requests and responses for each model: ## Related pages -- [Interact with your model programmatically](./api-reference.md) -- [Models and Compose](../compose/models-and-compose.md) -- [Docker Model Runner CLI reference documentation](/reference/cli/docker/model) \ No newline at end of file +- [API reference](./api-reference.md) - OpenAI and Ollama-compatible API documentation +- [Configuration options](./configuration.md) - Context size and runtime parameters +- [Inference engines](./inference-engines.md) - llama.cpp and vLLM details +- [IDE integrations](./ide-integrations.md) - Connect Cline, Continue, Cursor, and more +- [Open WebUI integration](./openwebui-integration.md) - Set up a web chat interface +- [Models and Compose](../compose/models-and-compose.md) - Use models in Compose applications +- [Docker Model Runner CLI reference](/reference/cli/docker/model) - Complete CLI documentation \ No newline at end of file diff --git a/content/manuals/ai/model-runner/ide-integrations.md b/content/manuals/ai/model-runner/ide-integrations.md new file mode 100644 index 000000000000..dec313f6a641 --- /dev/null +++ b/content/manuals/ai/model-runner/ide-integrations.md @@ -0,0 +1,283 @@ +--- +title: IDE and tool integrations +description: Configure popular AI coding assistants and tools to 
use Docker Model Runner as their backend. +weight: 40 +keywords: Docker, ai, model runner, cline, continue, cursor, vscode, ide, integration, openai, ollama +--- + +Docker Model Runner can serve as a local backend for popular AI coding assistants +and development tools. This guide shows how to configure common tools to use +models running in DMR. + +## Prerequisites + +Before configuring any tool: + +1. [Enable Docker Model Runner](get-started.md#enable-docker-model-runner) in Docker Desktop or Docker Engine. +2. Enable TCP host access: + - Docker Desktop: Enable **host-side TCP support** in Settings > AI, or run: + ```console + $ docker desktop enable model-runner --tcp 12434 + ``` + - Docker Engine: TCP is enabled by default on port 12434. +3. Pull a model: + ```console + $ docker model pull ai/qwen2.5-coder + ``` + +## Cline (VS Code) + +[Cline](https://github.com/cline/cline) is an AI coding assistant for VS Code. + +### Configuration + +1. Open VS Code and go to the Cline extension settings. +2. Select **OpenAI Compatible** as the API provider. +3. Configure the following settings: + +| Setting | Value | +|---------|-------| +| Base URL | `http://localhost:12434/engines/v1` | +| API Key | `not-needed` (or any placeholder value) | +| Model ID | `ai/qwen2.5-coder` (or your preferred model) | + +> [!IMPORTANT] +> The base URL must include `/engines/v1` at the end. Do not include a trailing slash. + +### Troubleshooting Cline + +If Cline fails to connect: + +1. Verify DMR is running: + ```console + $ docker model status + ``` + +2. Test the endpoint directly: + ```console + $ curl http://localhost:12434/engines/v1/models + ``` + +3. Check that CORS is configured if running a web-based version: + - In Docker Desktop Settings > AI, add your origin to **CORS Allowed Origins** + +## Continue (VS Code / JetBrains) + +[Continue](https://continue.dev) is an open-source AI code assistant that works with VS Code and JetBrains IDEs. + +### Configuration + +Edit your Continue configuration file (`~/.continue/config.json`): + +```json +{ + "models": [ + { + "title": "Docker Model Runner", + "provider": "openai", + "model": "ai/qwen2.5-coder", + "apiBase": "http://localhost:12434/engines/v1", + "apiKey": "not-needed" + } + ] +} +``` + +### Using Ollama provider + +Continue also supports the Ollama provider, which works with DMR: + +```json +{ + "models": [ + { + "title": "Docker Model Runner (Ollama)", + "provider": "ollama", + "model": "ai/qwen2.5-coder", + "apiBase": "http://localhost:12434" + } + ] +} +``` + +## Cursor + +[Cursor](https://cursor.sh) is an AI-powered code editor. + +### Configuration + +1. Open Cursor Settings (Cmd/Ctrl + ,). +2. Navigate to **Models** > **OpenAI API Key**. +3. Configure: + + | Setting | Value | + |---------|-------| + | OpenAI API Key | `not-needed` | + | Override OpenAI Base URL | `http://localhost:12434/engines/v1` | + +4. In the model drop-down, enter your model name: `ai/qwen2.5-coder` + +> [!NOTE] +> Some Cursor features may require models with specific capabilities (e.g., function calling). +> Use capable models like `ai/qwen2.5-coder` or `ai/llama3.2` for best results. + +## Zed + +[Zed](https://zed.dev) is a high-performance code editor with AI features. 
+ +### Configuration + +Edit your Zed settings (`~/.config/zed/settings.json`): + +```json +{ + "language_models": { + "openai": { + "api_url": "http://localhost:12434/engines/v1", + "available_models": [ + { + "name": "ai/qwen2.5-coder", + "display_name": "Qwen 2.5 Coder (DMR)", + "max_tokens": 8192 + } + ] + } + } +} +``` + +## Open WebUI + +[Open WebUI](https://github.com/open-webui/open-webui) provides a ChatGPT-like interface for local models. + +See [Open WebUI integration](openwebui-integration.md) for detailed setup instructions. + +## Aider + +[Aider](https://aider.chat) is an AI pair programming tool for the terminal. + +### Configuration + +Set environment variables or use command-line flags: + +```bash +export OPENAI_API_BASE=http://localhost:12434/engines/v1 +export OPENAI_API_KEY=not-needed + +aider --model openai/ai/qwen2.5-coder +``` + +Or in a single command: + +```console +$ aider --openai-api-base http://localhost:12434/engines/v1 \ + --openai-api-key not-needed \ + --model openai/ai/qwen2.5-coder +``` + +## LangChain + +### Python + +```python +from langchain_openai import ChatOpenAI + +llm = ChatOpenAI( + base_url="http://localhost:12434/engines/v1", + api_key="not-needed", + model="ai/qwen2.5-coder" +) + +response = llm.invoke("Write a hello world function in Python") +print(response.content) +``` + +### JavaScript/TypeScript + +```typescript +import { ChatOpenAI } from "@langchain/openai"; + +const model = new ChatOpenAI({ + configuration: { + baseURL: "http://localhost:12434/engines/v1", + }, + apiKey: "not-needed", + modelName: "ai/qwen2.5-coder", +}); + +const response = await model.invoke("Write a hello world function"); +console.log(response.content); +``` + +## LlamaIndex + +```python +from llama_index.llms.openai_like import OpenAILike + +llm = OpenAILike( + api_base="http://localhost:12434/engines/v1", + api_key="not-needed", + model="ai/qwen2.5-coder" +) + +response = llm.complete("Write a hello world function") +print(response.text) +``` + +## Common issues + +### "Connection refused" errors + +1. Ensure Docker Model Runner is enabled and running: + ```console + $ docker model status + ``` + +2. Verify TCP access is enabled: + ```console + $ curl http://localhost:12434/engines/v1/models + ``` + +3. Check if another service is using port 12434. + +### "Model not found" errors + +1. Verify the model is pulled: + ```console + $ docker model list + ``` + +2. Use the full model name including namespace (e.g., `ai/qwen2.5-coder`, not just `qwen2.5-coder`). + +### Slow responses or timeouts + +1. For first requests, models need to load into memory. Subsequent requests are faster. + +2. Consider using a smaller model or adjusting the context size: + ```console + $ docker model configure --context-size 4096 ai/qwen2.5-coder + ``` + +3. Check available system resources (RAM, GPU memory). + +### CORS errors (web-based tools) + +If using browser-based tools, add the origin to CORS allowed origins: + +1. Docker Desktop: Settings > AI > CORS Allowed Origins +2. 
Add your tool's URL (e.g., `http://localhost:3000`) + +## Recommended models by use case + +| Use case | Recommended model | Notes | +|----------|-------------------|-------| +| Code completion | `ai/qwen2.5-coder` | Optimized for coding tasks | +| General assistant | `ai/llama3.2` | Good balance of capabilities | +| Small/fast | `ai/smollm2` | Low resource usage | +| Embeddings | `ai/all-minilm` | For RAG and semantic search | + +## What's next + +- [API reference](api-reference.md) - Full API documentation +- [Configuration options](configuration.md) - Tune model behavior +- [Open WebUI integration](openwebui-integration.md) - Set up a web interface diff --git a/content/manuals/ai/model-runner/inference-engines.md b/content/manuals/ai/model-runner/inference-engines.md new file mode 100644 index 000000000000..8ce93ce278e3 --- /dev/null +++ b/content/manuals/ai/model-runner/inference-engines.md @@ -0,0 +1,319 @@ +--- +title: Inference engines +description: Learn about the llama.cpp and vLLM inference engines in Docker Model Runner. +weight: 50 +keywords: Docker, ai, model runner, llama.cpp, vllm, inference, gguf, safetensors, cuda, gpu +--- + +Docker Model Runner supports two inference engines: **llama.cpp** and **vLLM**. +Each engine has different strengths, supported platforms, and model format +requirements. This guide helps you choose the right engine and configure it for +your use case. + +## Engine comparison + +| Feature | llama.cpp | vLLM | +|---------|-----------|------| +| **Model formats** | GGUF | Safetensors, HuggingFace | +| **Platforms** | All (macOS, Windows, Linux) | Linux x86_64 only | +| **GPU support** | NVIDIA, AMD, Apple Silicon, Vulkan | NVIDIA CUDA only | +| **CPU inference** | Yes | No | +| **Quantization** | Built-in (Q4, Q5, Q8, etc.) | Limited | +| **Memory efficiency** | High (with quantization) | Moderate | +| **Throughput** | Good | High (with batching) | +| **Best for** | Local development, resource-constrained environments | Production, high throughput | + +## llama.cpp + +[llama.cpp](https://github.com/ggerganov/llama.cpp) is the default inference +engine in Docker Model Runner. It's designed for efficient local inference and +supports a wide range of hardware configurations. + +### Platform support + +| Platform | GPU support | Notes | +|----------|-------------|-------| +| macOS (Apple Silicon) | Metal | Automatic GPU acceleration | +| Windows (x64) | NVIDIA CUDA | Requires NVIDIA drivers 576.57+ | +| Windows (ARM64) | Adreno OpenCL | Qualcomm 6xx series and later | +| Linux (x64) | NVIDIA, AMD, Vulkan | Multiple backend options | +| Linux | CPU only | Works on any x64/ARM64 system | + +### Model format: GGUF + +llama.cpp uses the GGUF format, which supports efficient quantization for reduced +memory usage without significant quality loss. + +#### Quantization levels + +| Quantization | Bits per weight | Memory usage | Quality | +|--------------|-----------------|--------------|---------| +| Q2_K | ~2.5 | Lowest | Reduced | +| Q3_K_M | ~3.5 | Minimal | Acceptable | +| Q4_K_M | ~4.5 | Low | Good | +| Q5_K_M | ~5.5 | Moderate | Excellent | +| Q6_K | ~6.5 | Higher | Excellent | +| Q8_0 | 8 | High | Near-original | +| F16 | 16 | Highest | Original | + +**Recommended**: Q4_K_M offers the best balance of quality and memory usage for +most use cases. 
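+
+To get a feel for what these levels mean in memory terms, the weights alone take roughly (parameters × bits per weight) / 8 bytes, with the KV cache and runtime overhead on top. A back-of-the-envelope sketch, for estimation only:
+
+```python
+def approx_weight_gb(params_billions: float, bits_per_weight: float) -> float:
+    """Approximate weight size in GB: params * bits / 8, ignoring KV cache and overhead."""
+    return params_billions * 1e9 * bits_per_weight / 8 / 1e9
+
+# A hypothetical 8B-parameter model at three of the quantization levels above:
+for name, bits in [("Q4_K_M", 4.5), ("Q8_0", 8.0), ("F16", 16.0)]:
+    print(f"{name}: ~{approx_weight_gb(8, bits):.1f} GB of weights")
+```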
+ +#### Pulling quantized models + +Models on Docker Hub often include quantization in the tag: + +```console +$ docker model pull ai/llama3.2:3B-Q4_K_M +``` + +### Using llama.cpp + +llama.cpp is the default engine. No special configuration is required: + +```console +$ docker model run ai/smollm2 +``` + +To explicitly specify llama.cpp when running models: + +```console +$ docker model run ai/smollm2 --backend llama.cpp +``` + +### llama.cpp API endpoints + +When using llama.cpp, API calls use the llama.cpp engine path: + +```text +POST /engines/llama.cpp/v1/chat/completions +``` + +Or without the engine prefix: + +```text +POST /engines/v1/chat/completions +``` + +## vLLM + +[vLLM](https://github.com/vllm-project/vllm) is a high-performance inference +engine optimized for production workloads with high throughput requirements. + +### Platform support + +| Platform | GPU | Support status | +|----------|-----|----------------| +| Linux x86_64 | NVIDIA CUDA | Supported | +| Windows with WSL2 | NVIDIA CUDA | Supported (Docker Desktop 4.54+) | +| macOS | - | Not supported | +| Linux ARM64 | - | Not supported | +| AMD GPUs | - | Not supported | + +> [!IMPORTANT] +> vLLM requires an NVIDIA GPU with CUDA support. It does not support CPU-only +> inference. + +### Model format: Safetensors + +vLLM works with models in Safetensors format, which is the standard format for +HuggingFace models. These models typically use more memory than quantized GGUF +models but may offer better quality and faster inference on powerful hardware. + +### Setting up vLLM + +#### Docker Engine (Linux) + +Install the Model Runner with vLLM backend: + +```console +$ docker model install-runner --backend vllm --gpu cuda +``` + +Verify the installation: + +```console +$ docker model status +Docker Model Runner is running + +Status: +llama.cpp: running llama.cpp version: c22473b +vllm: running vllm version: 0.11.0 +``` + +#### Docker Desktop (Windows with WSL2) + +1. Ensure you have: + - Docker Desktop 4.54 or later + - NVIDIA GPU with updated drivers + - WSL2 enabled + +2. 
Install vLLM backend: + ```console + $ docker model install-runner --backend vllm --gpu cuda + ``` + +### Running models with vLLM + +vLLM models are typically tagged with `-vllm` suffix: + +```console +$ docker model run ai/smollm2-vllm +``` + +To specify the vLLM backend explicitly: + +```console +$ docker model run ai/model --backend vllm +``` + +### vLLM API endpoints + +When using vLLM, specify the engine in the API path: + +```text +POST /engines/vllm/v1/chat/completions +``` + +### vLLM configuration + +#### HuggingFace overrides + +Use `--hf_overrides` to pass model configuration overrides: + +```console +$ docker model configure --hf_overrides '{"max_model_len": 8192}' ai/model-vllm +``` + +#### Common vLLM settings + +| Setting | Description | Example | +|---------|-------------|---------| +| `max_model_len` | Maximum context length | 8192 | +| `gpu_memory_utilization` | Fraction of GPU memory to use | 0.9 | +| `tensor_parallel_size` | GPUs for tensor parallelism | 2 | + +### vLLM and llama.cpp performance comparison + +| Scenario | Recommended engine | +|----------|-------------------| +| Single user, local development | llama.cpp | +| Multiple concurrent requests | vLLM | +| Limited GPU memory | llama.cpp (with quantization) | +| Maximum throughput | vLLM | +| CPU-only system | llama.cpp | +| Apple Silicon Mac | llama.cpp | +| Production deployment | vLLM (if hardware supports it) | + +## Running both engines + +You can run both llama.cpp and vLLM simultaneously. Docker Model Runner routes +requests to the appropriate engine based on the model or explicit engine selection. + +Check which engines are running: + +```console +$ docker model status +Docker Model Runner is running + +Status: +llama.cpp: running llama.cpp version: c22473b +vllm: running vllm version: 0.11.0 +``` + +### Engine-specific API paths + +| Engine | API path | +|--------|----------| +| llama.cpp | `/engines/llama.cpp/v1/...` | +| vLLM | `/engines/vllm/v1/...` | +| Auto-select | `/engines/v1/...` | + +## Managing inference engines + +### Install an engine + +```console +$ docker model install-runner --backend [--gpu ] +``` + +Options: +- `--backend`: `llama.cpp` or `vllm` +- `--gpu`: `cuda`, `rocm`, `vulkan`, or `metal` (depends on platform) + +### Reinstall an engine + +```console +$ docker model reinstall-runner --backend +``` + +### Check engine status + +```console +$ docker model status +``` + +### View engine logs + +```console +$ docker model logs +``` + +## Packaging models for each engine + +### Package a GGUF model (llama.cpp) + +```console +$ docker model package --gguf ./model.gguf --push myorg/mymodel:Q4_K_M +``` + +### Package a Safetensors model (vLLM) + +```console +$ docker model package --safetensors ./model/ --push myorg/mymodel-vllm +``` + +## Troubleshooting + +### vLLM won't start + +1. Verify NVIDIA GPU is available: + ```console + $ nvidia-smi + ``` + +2. Check Docker has GPU access: + ```console + $ docker run --rm --gpus all nvidia/cuda:12.0-base nvidia-smi + ``` + +3. Verify you're on a supported platform (Linux x86_64 or Windows WSL2). + +### llama.cpp is slow + +1. Ensure GPU acceleration is working (check logs for Metal/CUDA messages). + +2. Try a more aggressive quantization: + ```console + $ docker model pull ai/model:Q4_K_M + ``` + +3. Reduce context size: + ```console + $ docker model configure --context-size 2048 ai/model + ``` + +### Out of memory errors + +1. Use a smaller quantization (Q4 instead of Q8). +2. Reduce context size. +3. 
For vLLM, adjust `gpu_memory_utilization`: + ```console + $ docker model configure --hf_overrides '{"gpu_memory_utilization": 0.8}' ai/model + ``` + +## What's next + +- [Configuration options](configuration.md) - Detailed parameter reference +- [API reference](api-reference.md) - API documentation +- [GPU support](/manuals/desktop/features/gpu.md) - GPU configuration for Docker Desktop diff --git a/content/manuals/ai/model-runner/openwebui-integration.md b/content/manuals/ai/model-runner/openwebui-integration.md new file mode 100644 index 000000000000..1e8cdd5805ad --- /dev/null +++ b/content/manuals/ai/model-runner/openwebui-integration.md @@ -0,0 +1,293 @@ +--- +title: Open WebUI integration +description: Set up Open WebUI as a ChatGPT-like interface for Docker Model Runner. +weight: 45 +keywords: Docker, ai, model runner, open webui, openwebui, chat interface, ollama, ui +--- + +[Open WebUI](https://github.com/open-webui/open-webui) is an open-source, +self-hosted web interface that provides a ChatGPT-like experience for local +AI models. You can connect it to Docker Model Runner to get a polished chat +interface for your models. + +## Prerequisites + +- Docker Model Runner enabled with TCP access +- A model pulled (e.g., `docker model pull ai/llama3.2`) + +## Quick start with Docker Compose + +The easiest way to run Open WebUI with Docker Model Runner is using Docker Compose. + +Create a `compose.yaml` file: + +```yaml +services: + open-webui: + image: ghcr.io/open-webui/open-webui:main + ports: + - "3000:8080" + environment: + - OLLAMA_BASE_URL=http://host.docker.internal:12434 + - WEBUI_AUTH=false + extra_hosts: + - "host.docker.internal:host-gateway" + volumes: + - open-webui:/app/backend/data + +volumes: + open-webui: +``` + +Start the services: + +```console +$ docker compose up -d +``` + +Open your browser to [http://localhost:3000](http://localhost:3000). + +## Configuration options + +### Environment variables + +| Variable | Description | Default | +|----------|-------------|---------| +| `OLLAMA_BASE_URL` | URL of Docker Model Runner | Required | +| `WEBUI_AUTH` | Enable authentication | `true` | +| `OPENAI_API_BASE_URL` | Use OpenAI-compatible API instead | - | +| `OPENAI_API_KEY` | API key (use any value for DMR) | - | + +### Using OpenAI-compatible API + +If you prefer to use the OpenAI-compatible API instead of the Ollama API: + +```yaml +services: + open-webui: + image: ghcr.io/open-webui/open-webui:main + ports: + - "3000:8080" + environment: + - OPENAI_API_BASE_URL=http://host.docker.internal:12434/engines/v1 + - OPENAI_API_KEY=not-needed + - WEBUI_AUTH=false + extra_hosts: + - "host.docker.internal:host-gateway" + volumes: + - open-webui:/app/backend/data + +volumes: + open-webui: +``` + +## Network configuration + +### Docker Desktop + +On Docker Desktop, `host.docker.internal` automatically resolves to the host machine. +The previous example works without modification. 
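+
+To confirm that a container can reach DMR through `host.docker.internal` before starting Open WebUI, a quick check (assuming the `curlimages/curl` image and the default TCP port) is:
+
+```console
+$ docker run --rm curlimages/curl http://host.docker.internal:12434/api/tags
+```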
+ +### Docker Engine (Linux) + +On Docker Engine, you may need to configure the network differently: + +```yaml +services: + open-webui: + image: ghcr.io/open-webui/open-webui:main + network_mode: host + environment: + - OLLAMA_BASE_URL=http://localhost:12434 + - WEBUI_AUTH=false + volumes: + - open-webui:/app/backend/data + +volumes: + open-webui: +``` + +Or use the host gateway: + +```yaml +services: + open-webui: + image: ghcr.io/open-webui/open-webui:main + ports: + - "3000:8080" + environment: + - OLLAMA_BASE_URL=http://172.17.0.1:12434 + - WEBUI_AUTH=false + volumes: + - open-webui:/app/backend/data + +volumes: + open-webui: +``` + +## Using Open WebUI + +### Select a model + +1. Open [http://localhost:3000](http://localhost:3000) +2. Select the model drop-down in the top-left +3. Select from your pulled models (they appear with `ai/` prefix) + +### Pull models through the UI + +Open WebUI can pull models directly: + +1. Select the model drop-down +2. Enter a model name: `ai/llama3.2` +3. Select the download icon + +### Chat features + +Open WebUI provides: + +- Multi-turn conversations with context +- Message editing and regeneration +- Code syntax highlighting +- Markdown rendering +- Conversation history and search +- Export conversations + +## Complete example with multiple models + +This example sets up Open WebUI with Docker Model Runner and pre-pulls several models: + +```yaml +services: + open-webui: + image: ghcr.io/open-webui/open-webui:main + ports: + - "3000:8080" + environment: + - OLLAMA_BASE_URL=http://host.docker.internal:12434 + - WEBUI_AUTH=false + - DEFAULT_MODELS=ai/llama3.2 + extra_hosts: + - "host.docker.internal:host-gateway" + volumes: + - open-webui:/app/backend/data + depends_on: + model-setup: + condition: service_completed_successfully + + model-setup: + image: docker:cli + volumes: + - /var/run/docker.sock:/var/run/docker.sock + command: > + sh -c " + docker model pull ai/llama3.2 && + docker model pull ai/qwen2.5-coder && + docker model pull ai/smollm2 + " + +volumes: + open-webui: +``` + +## Enabling authentication + +For multi-user setups or security, enable authentication: + +```yaml +services: + open-webui: + image: ghcr.io/open-webui/open-webui:main + ports: + - "3000:8080" + environment: + - OLLAMA_BASE_URL=http://host.docker.internal:12434 + - WEBUI_AUTH=true + extra_hosts: + - "host.docker.internal:host-gateway" + volumes: + - open-webui:/app/backend/data + +volumes: + open-webui: +``` + +On first visit, you'll create an admin account. + +## Troubleshooting + +### Models don't appear in the drop-down + +1. Verify Docker Model Runner is accessible: + ```console + $ curl http://localhost:12434/api/tags + ``` + +2. Check that models are pulled: + ```console + $ docker model list + ``` + +3. Verify the `OLLAMA_BASE_URL` is correct and accessible from the container. + +### "Connection refused" errors + +1. Ensure TCP access is enabled for Docker Model Runner. + +2. On Docker Desktop, verify `host.docker.internal` resolves: + ```console + $ docker run --rm alpine ping -c 1 host.docker.internal + ``` + +3. On Docker Engine, try using `network_mode: host` or the explicit host IP. + +### Slow response times + +1. First requests load the model into memory, which takes time. + +2. Subsequent requests are much faster. + +3. If consistently slow, consider: + - Using a smaller model + - Reducing context size + - Checking GPU acceleration is working + +### CORS errors + +If running Open WebUI on a different host: + +1. In Docker Desktop, go to Settings > AI +2. 
Add the Open WebUI URL to **CORS Allowed Origins** + +## Customization + +### Custom system prompts + +Open WebUI supports setting system prompts per model. Configure these in the UI under Settings > Models. + +### Model parameters + +Adjust model parameters in the chat interface: + +1. Select the settings icon next to the model name +2. Adjust temperature, top-p, max tokens, etc. + +These settings are passed through to Docker Model Runner. + +## Running on a different port + +To run Open WebUI on a different port: + +```yaml +services: + open-webui: + image: ghcr.io/open-webui/open-webui:main + ports: + - "8080:8080" # Change first port number + # ... rest of config +``` + +## What's next + +- [API reference](api-reference.md) - Learn about the APIs Open WebUI uses +- [Configuration options](configuration.md) - Tune model behavior +- [IDE integrations](ide-integrations.md) - Connect other tools to DMR