
Ollama Service

Ollama is a self-hosted LLM inference server that provides OpenAI-compatible API endpoints for running large language models locally with GPU acceleration.

Overview

Purpose: LLM inference server for AI model serving
Port: 11434 (application), 443 (nginx-tls)
Storage: Configured via ollama_storage (emptyDir)
Access: VPN-only via HTTPS (Tailscale required)
GPU Support: Prefers GPU-enabled nodes via node affinity

Features

  • OpenAI-Compatible API - Drop-in replacement for OpenAI API endpoints
  • GPU Acceleration - Automatic node affinity for GPU workloads
  • VPN-Only Access - Secure access via Tailscale mesh network
  • TLS Termination - nginx reverse proxy with Let's Encrypt certificates
  • Lifecycle Automation - Automatic DNS and node cleanup
  • Model Storage - Configurable model storage size (emptyDir, so models must be re-pulled if the pod is rescheduled)

Architecture

┌──────────────────────────────────────────────┐
│                 Ollama Pod                   │
├──────────────────────────────────────────────┤
│  ┌──────────────┐  ┌──────────────────┐      │
│  │ nginx-tls    │─▶│     Ollama       │      │
│  │ (443)        │  │   (port 11434)   │      │
│  └──────────────┘  └──────────────────┘      │
│                                              │
│  ┌──────────────┐  ┌──────────────────┐      │
│  │ Tailscale    │  │ lifecycle-dns-   │      │
│  │ (VPN sidecar)│  │ create (sidecar) │      │
│  └──────────────┘  └──────────────────┘      │
│                                              │
│  Init Container:                             │
│  ┌──────────────────────────────────────┐    │
│  │ lifecycle-cleanup                    │    │
│  │ - Cleanup old Headscale nodes        │    │
│  │ - Remove stale DNS records           │    │
│  │ - Generate pre-auth key → file       │    │
│  └──────────────────────────────────────┘    │
└──────────────────────────────────────────────┘

Multi-Container Pattern

Ollama uses the modern lifecycle pattern with automated key generation:

  1. Init Container (lifecycle-cleanup):

    • Cleans up old Headscale registrations
    • Removes stale DNS records
    • Generates pre-auth key and writes to /tailscale-auth/authkey
    • No more data.external Terraform dependencies!
  2. Application Container (ollama):

    • Runs Ollama API server on localhost:11434
    • Stores models in /root/.ollama (emptyDir volume)
    • Startup probe ensures service readiness
  3. nginx-tls Container:

    • HTTPS termination with wildcard certificate
    • Proxies requests to localhost:11434
    • IP allowlisting (100.64.0.0/10 VPN range only)
  4. Tailscale Sidecar:

    • Reads pre-auth key from file (generated by init container)
    • Registers with Headscale VPN mesh
    • Provides secure network connectivity
  5. lifecycle-dns-create Sidecar:

    • Waits for Tailscale to register and get IP
    • Creates DNS A record pointing to VPN IP
    • Continuous monitoring and DNS updates
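
The wait-then-register step of the lifecycle-dns-create sidecar can be sketched as a simple poll loop. This is an illustrative sketch, not the actual sidecar code; `get_ip` is a stand-in for whatever command reports the Tailscale VPN IP (e.g. `tailscale ip -4`):

```python
import time

def wait_for_vpn_ip(get_ip, timeout=300, interval=5, sleep=time.sleep):
    """Poll until the Tailscale sidecar reports an IP, then return it.

    get_ip: callable returning the VPN IP as a string, or None/"" if the
    node has not registered with Headscale yet.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        ip = get_ip()
        if ip:
            return ip  # sidecar can now create the DNS A record for this IP
        sleep(interval)
    raise TimeoutError("Tailscale never registered with Headscale")
```

Once an IP is returned, the sidecar creates the DNS A record and keeps monitoring for changes, as described above.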

Configuration

Terraform Variables

# terraform.tfvars
ollama_enabled           = true
ollama_hostname          = "ollama.example.com"
ollama_image             = "ollama/ollama:latest"
ollama_storage           = "50Gi"
ollama_tailscale_enabled = true  # Required for external access

# Resource limits
ollama_cpu_request    = "2"
ollama_memory_request = "4Gi"
ollama_cpu_limit      = "8"
ollama_memory_limit   = "16Gi"

# GPU node affinity (optional)
gpu_accelerator_labels = ["nvidia.com/gpu", "amd.com/gpu"]
cpu_node_label_key     = "node-role.kubernetes.io/control-plane"
cpu_node_label_value   = ""

GPU Node Affinity

Ollama strongly prefers GPU-enabled nodes:

affinity {
  node_affinity {
    preferred_during_scheduling_ignored_during_execution {
      weight = 100  # Strongly prefer GPU nodes
      preference {
        match_expressions {
          key      = "accelerator"
          operator = "In"
          values   = var.gpu_accelerator_labels
        }
      }
    }
    preferred_during_scheduling_ignored_during_execution {
      weight = 50  # Avoid CPU-only nodes
      preference {
        match_expressions {
          key      = var.cpu_node_label_key
          operator = "NotIn"
          values   = [var.cpu_node_label_value]
        }
      }
    }
  }
}
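
Roughly, the scheduler adds the weight of each matching preference to a node's score and picks the highest-scoring node. A toy Python model of the two preferences above (illustrative only, not real scheduler code; note that NotIn also matches nodes that lack the key entirely):

```python
def preference_score(node_labels, gpu_accelerator_labels, cpu_key, cpu_value):
    """Score a node against the two preferred-scheduling rules above."""
    score = 0
    # weight 100: accelerator label In gpu_accelerator_labels
    if node_labels.get("accelerator") in gpu_accelerator_labels:
        score += 100
    # weight 50: cpu_key NotIn [cpu_value] (absent key also counts as NotIn)
    if node_labels.get(cpu_key) != cpu_value:
        score += 50
    return score
```

A GPU worker thus scores 150 while a control-plane (CPU-only) node scores 0, so the pod lands on GPU hardware whenever any is schedulable.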

Access

API Endpoint

# Via VPN (Tailscale required)
curl https://ollama.example.com/api/tags

# Via kubectl port-forward (local development)
kubectl port-forward -n inference ollama-0 11434:11434
curl http://localhost:11434/api/tags

OpenAI-Compatible API

# Chat completion
curl https://ollama.example.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama2",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

# List models
curl https://ollama.example.com/v1/models
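
The same endpoint can be called from Python with nothing but the standard library; a minimal sketch, assuming the example hostname above:

```python
import json
import urllib.request

OLLAMA_URL = "https://ollama.example.com"  # your ollama_hostname

def chat_payload(model, user_message):
    """Build an OpenAI-style chat completion request body."""
    return {"model": model,
            "messages": [{"role": "user", "content": user_message}]}

def chat(model, user_message, base_url=OLLAMA_URL):
    """POST to /v1/chat/completions and return the assistant's reply text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(chat_payload(model, user_message)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Because access is VPN-only, this must run from a Tailscale-connected machine, or behind a kubectl port-forward with base_url="http://localhost:11434".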

Model Management

Pull Models

# Via kubectl exec
kubectl exec -n inference ollama-0 -c ollama -- \
  ollama pull llama2

kubectl exec -n inference ollama-0 -c ollama -- \
  ollama pull codellama

kubectl exec -n inference ollama-0 -c ollama -- \
  ollama pull mistral

List Models

kubectl exec -n inference ollama-0 -c ollama -- ollama list

Delete Model

kubectl exec -n inference ollama-0 -c ollama -- \
  ollama rm <model-name>

Model Information

kubectl exec -n inference ollama-0 -c ollama -- \
  ollama show llama2

Common Operations

View Logs

# Ollama application logs
kubectl logs -n inference ollama-0 -c ollama -f

# nginx TLS logs
kubectl logs -n inference ollama-0 -c nginx-tls -f

# Tailscale VPN logs
kubectl logs -n inference ollama-0 -c tailscale -f

# DNS creation logs
kubectl logs -n inference ollama-0 -c lifecycle-dns-create -f

# Init container logs (cleanup and key generation)
kubectl logs -n inference ollama-0 -c lifecycle-cleanup

Check Tailscale Status

kubectl exec -n inference ollama-0 -c tailscale -- tailscale status

Check Headscale Registrations

kubectl exec -n core headscale-0 -c headscale -- \
  headscale nodes list | grep ollama

Test API Connectivity

# Health check
curl -I https://ollama.example.com/api/tags

# Generate text
curl https://ollama.example.com/api/generate \
  -d '{
    "model": "llama2",
    "prompt": "Why is the sky blue?"
  }'
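
By default the native /api/generate endpoint streams its reply as newline-delimited JSON: a series of objects carrying "response" fragments, ending with one where "done" is true. A small helper to reassemble a streamed reply:

```python
import json

def parse_stream(lines):
    """Join the 'response' fragments of an Ollama NDJSON stream into one string."""
    text = ""
    for line in lines:
        chunk = json.loads(line)
        text += chunk.get("response", "")
        if chunk.get("done"):
            break  # final object; may also carry timing/token statistics
    return text
```

Passing "stream": false in the request body returns a single JSON object instead, which is simpler for curl-based checks like the one above.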

Integration with Open-WebUI

Open-WebUI can use Ollama as an external backend:

# terraform.tfvars
open_webui_ollama_enabled = false  # Use external Ollama instead of bundled

# In Open-WebUI UI:
# Settings → Connections → Ollama API
# URL: https://ollama.example.com

Benefits of external Ollama:

  • GPU acceleration on dedicated nodes
  • Shared across multiple services
  • Independent scaling
  • Better resource isolation

Troubleshooting

Pod Not Starting

# Check pod status
kubectl get pods -n inference ollama-0

# Describe pod for events
kubectl describe pod -n inference ollama-0

# Check init container logs
kubectl logs -n inference ollama-0 -c lifecycle-cleanup

# Check storage
kubectl get pvc -n inference

Tailscale Not Connecting

# Check Tailscale logs
kubectl logs -n inference ollama-0 -c tailscale

# Verify auth key was generated
kubectl exec -n inference ollama-0 -c tailscale -- \
  cat /tailscale-auth/authkey

# Check Headscale is running
kubectl get pods -n core headscale-0

DNS Not Resolving

# Check DNS creation logs
kubectl logs -n inference ollama-0 -c lifecycle-dns-create

# Verify Cloudflare DNS record
dig ollama.example.com

# Check DNS A record
kubectl exec -n inference ollama-0 -c lifecycle-dns-create -- \
  env | grep CLOUDFLARE

Models Not Loading

# Check storage space
kubectl exec -n inference ollama-0 -c ollama -- df -h /root/.ollama

# Verify model files
kubectl exec -n inference ollama-0 -c ollama -- \
  ls -lh /root/.ollama/models

# Pull model manually
kubectl exec -n inference ollama-0 -c ollama -- \
  ollama pull llama2

High Memory Usage

Ollama models are memory-intensive. Adjust limits:

# terraform.tfvars
ollama_memory_request = "8Gi"
ollama_memory_limit   = "32Gi"
ollama_storage        = "100Gi"  # For larger models

Cannot Access via HTTPS

# Check nginx is running
kubectl get pod -n inference ollama-0 -o jsonpath='{.status.containerStatuses[?(@.name=="nginx-tls")].ready}'

# Verify TLS certificate
kubectl get secret -n inference wildcard -o yaml

# Check VPN connection
tailscale status

# Test from VPN client
curl -I https://ollama.example.com

Performance Tuning

GPU Workloads

For GPU-accelerated inference:

ollama_cpu_limit      = "16"
ollama_memory_limit   = "64Gi"
ollama_storage        = "200Gi"

Ensure GPU nodes have appropriate labels:

kubectl label nodes <gpu-node> accelerator=nvidia.com/gpu

Resource Limits

Small models (7B parameters):

ollama_memory_request = "4Gi"
ollama_memory_limit   = "8Gi"

Medium models (13B parameters):

ollama_memory_request = "8Gi"
ollama_memory_limit   = "16Gi"

Large models (70B+ parameters):

ollama_memory_request = "32Gi"
ollama_memory_limit   = "64Gi"
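
These tiers can be captured in a small helper for choosing terraform.tfvars values by model size. It is a heuristic mirroring the table above, not an official sizing formula: actual usage also depends on quantization and context length.

```python
def memory_tier(params_b):
    """Return (request, limit) for a model of ~params_b billion parameters,
    following the small/medium/large tiers above."""
    if params_b <= 7:
        return ("4Gi", "8Gi")
    if params_b <= 13:
        return ("8Gi", "16Gi")
    return ("32Gi", "64Gi")
```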

Security

  • VPN-Only Access - nginx restricts to 100.64.0.0/10 (Tailscale range)
  • TLS Everywhere - Let's Encrypt certificates via cert-manager
  • Pre-auth Keys - Generated per-pod, short-lived (1 hour expiry)
  • RBAC - Minimal permissions for lifecycle containers
  • Network Isolation - No external ingress, VPN-only access

Lifecycle Pattern Migration

Ollama is the reference implementation for the modern lifecycle pattern:

Old Pattern (deprecated):

  • Terraform data.external calls Python script
  • Script generates pre-auth key via kubectl exec
  • Key stored in Kubernetes secret
  • Race conditions and Windows compatibility issues

New Pattern (current):

  • Init container generates pre-auth key at pod startup
  • Key written to shared emptyDir volume
  • Tailscale reads from file (TS_AUTHKEY=file:///tailscale-auth/authkey)
  • No Terraform external dependencies
  • Portable and race-free

See REFACTOR_CHECKLIST.md for migration status of other services.
