
Ollama Service

Ollama is a self-hosted LLM inference server that provides OpenAI-compatible API endpoints for running large language models locally with GPU acceleration.

Overview

Purpose: LLM inference server for AI model serving
Port: 11434 (application), 443 (nginx-tls)
Storage: Configured via ollama_storage (emptyDir)
Access: VPN-only via HTTPS (Tailscale required)
GPU Support: Prefers GPU-enabled nodes via node affinity

Features

  • OpenAI-Compatible API - Drop-in replacement for OpenAI API endpoints
  • GPU Acceleration - Automatic node affinity for GPU workloads
  • VPN-Only Access - Secure access via Tailscale mesh network
  • TLS Termination - nginx reverse proxy with Let's Encrypt certificates
  • Lifecycle Automation - Automatic DNS and node cleanup
  • Model Storage - Configurable model storage size (emptyDir, so models must be re-pulled if the pod is rescheduled)

Architecture

┌──────────────────────────────────────────────┐
│                 Ollama Pod                   │
├──────────────────────────────────────────────┤
│  ┌──────────────┐  ┌──────────────────┐      │
│  │ nginx-tls    │─▶│     Ollama       │      │
│  │ (443)        │  │   (port 11434)   │      │
│  └──────────────┘  └──────────────────┘      │
│                                              │
│  ┌──────────────┐  ┌──────────────────┐      │
│  │ Tailscale    │  │ lifecycle-dns-   │      │
│  │ (VPN sidecar)│  │ create (sidecar) │      │
│  └──────────────┘  └──────────────────┘      │
│                                              │
│  Init Container:                             │
│  ┌──────────────────────────────────────┐    │
│  │ lifecycle-cleanup                    │    │
│  │ - Cleanup old Headscale nodes        │    │
│  │ - Remove stale DNS records           │    │
│  │ - Generate pre-auth key → file       │    │
│  └──────────────────────────────────────┘    │
└──────────────────────────────────────────────┘

Multi-Container Pattern

Ollama uses the modern lifecycle pattern with automated key generation:

  1. Init Container (lifecycle-cleanup):

    • Cleans up old Headscale registrations
    • Removes stale DNS records
    • Generates pre-auth key and writes to /tailscale-auth/authkey
    • No more data.external Terraform dependencies!
  2. Application Container (ollama):

    • Runs Ollama API server on localhost:11434
    • Stores models in /root/.ollama (emptyDir volume)
    • Startup probe ensures service readiness
  3. nginx-tls Container:

    • HTTPS termination with wildcard certificate
    • Proxies requests to localhost:11434
    • IP allowlisting (100.64.0.0/10 VPN range only)
  4. Tailscale Sidecar:

    • Reads pre-auth key from file (generated by init container)
    • Registers with Headscale VPN mesh
    • Provides secure network connectivity
  5. lifecycle-dns-create Sidecar:

    • Waits for Tailscale to register and get IP
    • Creates DNS A record pointing to VPN IP
    • Continuous monitoring and DNS updates
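
The wait-then-register step of the lifecycle-dns-create sidecar can be sketched as a simple poll loop. This is an illustrative sketch, not the actual sidecar code; `get_ip` is a stand-in for whatever command reports the Tailscale VPN IP (e.g. `tailscale ip -4`):

```python
import time

def wait_for_vpn_ip(get_ip, timeout=300, interval=5, sleep=time.sleep):
    """Poll until the Tailscale sidecar reports an IP, then return it.

    get_ip: callable returning the VPN IP as a string, or None/"" if the
    node has not registered with Headscale yet.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        ip = get_ip()
        if ip:
            return ip  # sidecar can now create the DNS A record for this IP
        sleep(interval)
    raise TimeoutError("Tailscale never registered with Headscale")
```

Once an IP is returned, the sidecar creates the DNS A record and keeps monitoring for changes, as described above.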

Configuration

Terraform Variables

# terraform.tfvars
ollama_enabled           = true
ollama_hostname          = "ollama.example.com"
ollama_image             = "ollama/ollama:latest"
ollama_storage           = "50Gi"
ollama_tailscale_enabled = true  # Required for external access

# Resource limits
ollama_cpu_request    = "2"
ollama_memory_request = "4Gi"
ollama_cpu_limit      = "8"
ollama_memory_limit   = "16Gi"

# GPU node affinity (optional)
gpu_accelerator_labels = ["nvidia.com/gpu", "amd.com/gpu"]
cpu_node_label_key     = "node-role.kubernetes.io/control-plane"
cpu_node_label_value   = ""

GPU Node Affinity

Ollama strongly prefers GPU-enabled nodes:

affinity {
  node_affinity {
    preferred_during_scheduling_ignored_during_execution {
      weight = 100  # Strongly prefer GPU nodes
      preference {
        match_expressions {
          key      = "accelerator"
          operator = "In"
          values   = var.gpu_accelerator_labels
        }
      }
    }
    preferred_during_scheduling_ignored_during_execution {
      weight = 50  # Avoid CPU-only nodes
      preference {
        match_expressions {
          key      = var.cpu_node_label_key
          operator = "NotIn"
          values   = [var.cpu_node_label_value]
        }
      }
    }
  }
}
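
Roughly, the scheduler adds the weight of each matching preference to a node's score and picks the highest-scoring node. A toy Python model of the two preferences above (illustrative only, not real scheduler code; note that NotIn also matches nodes that lack the key entirely):

```python
def preference_score(node_labels, gpu_accelerator_labels, cpu_key, cpu_value):
    """Score a node against the two preferred-scheduling rules above."""
    score = 0
    # weight 100: accelerator label In gpu_accelerator_labels
    if node_labels.get("accelerator") in gpu_accelerator_labels:
        score += 100
    # weight 50: cpu_key NotIn [cpu_value] (absent key also counts as NotIn)
    if node_labels.get(cpu_key) != cpu_value:
        score += 50
    return score
```

A GPU worker thus scores 150 while a control-plane (CPU-only) node scores 0, so the pod lands on GPU hardware whenever any is schedulable.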

Access

API Endpoint

# Via VPN (Tailscale required)
curl https://ollama.example.com/api/tags

# Via kubectl port-forward (local development)
kubectl port-forward -n inference ollama-0 11434:11434
curl http://localhost:11434/api/tags

OpenAI-Compatible API

# Chat completion
curl https://ollama.example.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama2",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

# List models
curl https://ollama.example.com/v1/models
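
The same endpoint can be called from Python with nothing but the standard library; a minimal sketch, assuming the example hostname above:

```python
import json
import urllib.request

OLLAMA_URL = "https://ollama.example.com"  # your ollama_hostname

def chat_payload(model, user_message):
    """Build an OpenAI-style chat completion request body."""
    return {"model": model,
            "messages": [{"role": "user", "content": user_message}]}

def chat(model, user_message, base_url=OLLAMA_URL):
    """POST to /v1/chat/completions and return the assistant's reply text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(chat_payload(model, user_message)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Because access is VPN-only, this must run from a Tailscale-connected machine, or behind a kubectl port-forward with base_url="http://localhost:11434".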

Model Management

Pull Models

# Via kubectl exec
kubectl exec -n inference ollama-0 -c ollama -- \
  ollama pull llama2

kubectl exec -n inference ollama-0 -c ollama -- \
  ollama pull codellama

kubectl exec -n inference ollama-0 -c ollama -- \
  ollama pull mistral

List Models

kubectl exec -n inference ollama-0 -c ollama -- ollama list

Delete Model

kubectl exec -n inference ollama-0 -c ollama -- \
  ollama rm <model-name>

Model Information

kubectl exec -n inference ollama-0 -c ollama -- \
  ollama show llama2

Common Operations

View Logs

# Ollama application logs
kubectl logs -n inference ollama-0 -c ollama -f

# nginx TLS logs
kubectl logs -n inference ollama-0 -c nginx-tls -f

# Tailscale VPN logs
kubectl logs -n inference ollama-0 -c tailscale -f

# DNS creation logs
kubectl logs -n inference ollama-0 -c lifecycle-dns-create -f

# Init container logs (cleanup and key generation)
kubectl logs -n inference ollama-0 -c lifecycle-cleanup

Check Tailscale Status

kubectl exec -n inference ollama-0 -c tailscale -- tailscale status

Check Headscale Registrations

kubectl exec -n core headscale-0 -c headscale -- \
  headscale nodes list | grep ollama

Test API Connectivity

# Health check
curl -I https://ollama.example.com/api/tags

# Generate text
curl https://ollama.example.com/api/generate \
  -d '{
    "model": "llama2",
    "prompt": "Why is the sky blue?"
  }'
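
By default the native /api/generate endpoint streams its reply as newline-delimited JSON: a series of objects carrying "response" fragments, ending with one where "done" is true. A small helper to reassemble a streamed reply:

```python
import json

def parse_stream(lines):
    """Join the 'response' fragments of an Ollama NDJSON stream into one string."""
    text = ""
    for line in lines:
        chunk = json.loads(line)
        text += chunk.get("response", "")
        if chunk.get("done"):
            break  # final object; may also carry timing/token statistics
    return text
```

Passing "stream": false in the request body returns a single JSON object instead, which is simpler for curl-based checks like the one above.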

Integration with Open-WebUI

Open-WebUI can use Ollama as an external backend:

# terraform.tfvars
open_webui_ollama_enabled = false  # Use external Ollama instead of bundled

# In Open-WebUI UI:
# Settings → Connections → Ollama API
# URL: https://ollama.example.com

Benefits of external Ollama:

  • GPU acceleration on dedicated nodes
  • Shared across multiple services
  • Independent scaling
  • Better resource isolation

Troubleshooting

Pod Not Starting

# Check pod status
kubectl get pods -n inference ollama-0

# Describe pod for events
kubectl describe pod -n inference ollama-0

# Check init container logs
kubectl logs -n inference ollama-0 -c lifecycle-cleanup

# Check storage
kubectl get pvc -n inference

Tailscale Not Connecting

# Check Tailscale logs
kubectl logs -n inference ollama-0 -c tailscale

# Verify auth key was generated
kubectl exec -n inference ollama-0 -c tailscale -- \
  cat /tailscale-auth/authkey

# Check Headscale is running
kubectl get pods -n core headscale-0

DNS Not Resolving

# Check DNS creation logs
kubectl logs -n inference ollama-0 -c lifecycle-dns-create

# Verify Cloudflare DNS record
dig ollama.example.com

# Check DNS A record
kubectl exec -n inference ollama-0 -c lifecycle-dns-create -- \
  env | grep CLOUDFLARE

Models Not Loading

# Check storage space
kubectl exec -n inference ollama-0 -c ollama -- df -h /root/.ollama

# Verify model files
kubectl exec -n inference ollama-0 -c ollama -- \
  ls -lh /root/.ollama/models

# Pull model manually
kubectl exec -n inference ollama-0 -c ollama -- \
  ollama pull llama2

High Memory Usage

Ollama models are memory-intensive. Adjust limits:

# terraform.tfvars
ollama_memory_request = "8Gi"
ollama_memory_limit   = "32Gi"
ollama_storage        = "100Gi"  # For larger models

Cannot Access via HTTPS

# Check nginx is running
kubectl get pod -n inference ollama-0 -o jsonpath='{.status.containerStatuses[?(@.name=="nginx-tls")].ready}'

# Verify TLS certificate
kubectl get secret -n inference wildcard -o yaml

# Check VPN connection
tailscale status

# Test from VPN client
curl -I https://ollama.example.com

Performance Tuning

GPU Workloads

For GPU-accelerated inference:

ollama_cpu_limit      = "16"
ollama_memory_limit   = "64Gi"
ollama_storage        = "200Gi"

Ensure GPU nodes have appropriate labels:

kubectl label nodes <gpu-node> accelerator=nvidia.com/gpu

Resource Limits

Small models (7B parameters):

ollama_memory_request = "4Gi"
ollama_memory_limit   = "8Gi"

Medium models (13B parameters):

ollama_memory_request = "8Gi"
ollama_memory_limit   = "16Gi"

Large models (70B+ parameters):

ollama_memory_request = "32Gi"
ollama_memory_limit   = "64Gi"
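
These tiers can be captured in a small helper for choosing terraform.tfvars values by model size. It is a heuristic mirroring the table above, not an official sizing formula: actual usage also depends on quantization and context length.

```python
def memory_tier(params_b):
    """Return (request, limit) for a model of ~params_b billion parameters,
    following the small/medium/large tiers above."""
    if params_b <= 7:
        return ("4Gi", "8Gi")
    if params_b <= 13:
        return ("8Gi", "16Gi")
    return ("32Gi", "64Gi")
```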

Security

  • VPN-Only Access - nginx restricts to 100.64.0.0/10 (Tailscale range)
  • TLS Everywhere - Let's Encrypt certificates via cert-manager
  • Pre-auth Keys - Generated per-pod, short-lived (1 hour expiry)
  • RBAC - Minimal permissions for lifecycle containers
  • Network Isolation - No external ingress, VPN-only access

Lifecycle Pattern Migration

Ollama is the reference implementation for the modern lifecycle pattern:

Old Pattern (deprecated):

  • Terraform data.external calls Python script
  • Script generates pre-auth key via kubectl exec
  • Key stored in Kubernetes secret
  • Race conditions and Windows compatibility issues

New Pattern (current):

  • Init container generates pre-auth key at pod startup
  • Key written to shared emptyDir volume
  • Tailscale reads from file (TS_AUTHKEY=file:///tailscale-auth/authkey)
  • No Terraform external dependencies
  • Portable and race-free

See REFACTOR_CHECKLIST.md for migration status of other services.
