Ollama is a self-hosted inference server that provides OpenAI-compatible API endpoints for running large language models locally with GPU acceleration.
- Purpose: LLM inference server for AI model serving
- Port: 11434 (application), 443 (nginx-tls)
- Storage: Configured via `ollama_storage` (emptyDir)
- Access: VPN-only via HTTPS (Tailscale required)
- GPU Support: Prefers GPU-enabled nodes via node affinity
- OpenAI-Compatible API - Drop-in replacement for OpenAI API endpoints
- GPU Acceleration - Automatic node affinity for GPU workloads
- VPN-Only Access - Secure access via Tailscale mesh network
- TLS Termination - nginx reverse proxy with Let's Encrypt certificates
- Lifecycle Automation - Automatic DNS and node cleanup
- Model Storage - Model storage with configurable size (emptyDir volume, set via `ollama_storage`)
```text
┌─────────────────────────────────────────────┐
│                 Ollama Pod                  │
├─────────────────────────────────────────────┤
│  ┌──────────────┐    ┌──────────────────┐   │
│  │  nginx-tls   │───▶│      Ollama      │   │
│  │    (443)     │    │   (port 11434)   │   │
│  └──────────────┘    └──────────────────┘   │
│                                             │
│  ┌──────────────┐    ┌──────────────────┐   │
│  │  Tailscale   │    │  lifecycle-dns-  │   │
│  │ (VPN sidecar)│    │ create (sidecar) │   │
│  └──────────────┘    └──────────────────┘   │
│                                             │
│  Init Container:                            │
│  ┌───────────────────────────────────────┐  │
│  │ lifecycle-cleanup                     │  │
│  │  - Cleanup old Headscale nodes        │  │
│  │  - Remove stale DNS records           │  │
│  │  - Generate pre-auth key → file       │  │
│  └───────────────────────────────────────┘  │
└─────────────────────────────────────────────┘
```
Ollama uses the modern lifecycle pattern with automated key generation:
- Init Container (lifecycle-cleanup):
  - Cleans up old Headscale registrations
  - Removes stale DNS records
  - Generates a pre-auth key and writes it to `/tailscale-auth/authkey`
  - No more `data.external` Terraform dependencies!
- Application Container (ollama):
  - Runs the Ollama API server on localhost:11434
  - Stores models in `/root/.ollama` (emptyDir volume)
  - Startup probe ensures service readiness
- nginx-tls Container:
  - HTTPS termination with wildcard certificate
  - Proxies requests to localhost:11434
  - IP allowlisting (100.64.0.0/10 VPN range only)
- Tailscale Sidecar:
  - Reads the pre-auth key from a file (generated by the init container)
  - Registers with the Headscale VPN mesh
  - Provides secure network connectivity
- lifecycle-dns-create Sidecar:
  - Waits for Tailscale to register and obtain an IP
  - Creates a DNS A record pointing to the VPN IP
  - Continuously monitors and updates DNS
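The startup probe on the application container is essentially a poll-until-ready loop. A minimal sketch of that behavior (`wait_ready` is a hypothetical helper, not code from this repo; `check` stands in for an HTTP GET of `/api/tags`):

```python
import time

def wait_ready(check, timeout: float = 120.0, interval: float = 2.0) -> bool:
    """Poll a readiness check until it passes or the timeout elapses.

    Mirrors what the startup probe does against the Ollama container;
    `check` returns True once the service answers successfully.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if check():
            return True
        time.sleep(interval)
    return False
```

Kubernetes implements this natively via the startup probe; the sketch only makes the behavior concrete.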
```hcl
# terraform.tfvars
ollama_enabled           = true
ollama_hostname          = "ollama.example.com"
ollama_image             = "ollama/ollama:latest"
ollama_storage           = "50Gi"
ollama_tailscale_enabled = true # Required for external access

# Resource limits
ollama_cpu_request    = "2"
ollama_memory_request = "4Gi"
ollama_cpu_limit      = "8"
ollama_memory_limit   = "16Gi"

# GPU node affinity (optional)
gpu_accelerator_labels = ["nvidia.com/gpu", "amd.com/gpu"]
cpu_node_label_key     = "node-role.kubernetes.io/control-plane"
cpu_node_label_value   = ""
```

Ollama strongly prefers GPU-enabled nodes:
```hcl
affinity {
  node_affinity {
    preferred_during_scheduling_ignored_during_execution {
      weight = 100 # Strongly prefer GPU nodes
      preference {
        match_expressions {
          key      = "accelerator"
          operator = "In"
          values   = var.gpu_accelerator_labels
        }
      }
    }
    preferred_during_scheduling_ignored_during_execution {
      weight = 50 # Avoid CPU-only nodes
      preference {
        match_expressions {
          key      = var.cpu_node_label_key
          operator = "NotIn"
          values   = [var.cpu_node_label_value]
        }
      }
    }
  }
}
```

```bash
# Via VPN (Tailscale required)
curl https://ollama.example.com/api/tags

# Via kubectl port-forward (local development)
kubectl port-forward -n inference ollama-0 11434:11434
curl http://localhost:11434/api/tags
```

```bash
# Chat completion
curl https://ollama.example.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama2",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```
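The same chat-completion call can be made from Python with only the standard library. A sketch; the hostname and model are the examples used throughout this doc, and `chat_request` is a hypothetical helper:

```python
import json
import urllib.request

OLLAMA_URL = "https://ollama.example.com"  # example hostname from this doc

def chat_request(model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-compatible chat-completion request for Ollama."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        f"{OLLAMA_URL}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )

# Sending it requires VPN connectivity to the host:
#   with urllib.request.urlopen(chat_request("llama2", "Hello!")) as r:
#       print(json.load(r)["choices"][0]["message"]["content"])
```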
```bash
# List models
curl https://ollama.example.com/v1/models
```

```bash
# Pull models via kubectl exec
kubectl exec -n inference ollama-0 -c ollama -- \
  ollama pull llama2
kubectl exec -n inference ollama-0 -c ollama -- \
  ollama pull codellama
kubectl exec -n inference ollama-0 -c ollama -- \
  ollama pull mistral

# List installed models
kubectl exec -n inference ollama-0 -c ollama -- ollama list

# Remove a model
kubectl exec -n inference ollama-0 -c ollama -- \
  ollama rm <model-name>

# Show model details
kubectl exec -n inference ollama-0 -c ollama -- \
  ollama show llama2
```

```bash
# Ollama application logs
kubectl logs -n inference ollama-0 -c ollama -f

# nginx TLS logs
kubectl logs -n inference ollama-0 -c nginx-tls -f

# Tailscale VPN logs
kubectl logs -n inference ollama-0 -c tailscale -f

# DNS creation logs
kubectl logs -n inference ollama-0 -c lifecycle-dns-create -f

# Init container logs (cleanup and key generation)
kubectl logs -n inference ollama-0 -c lifecycle-cleanup

# Tailscale status
kubectl exec -n inference ollama-0 -c tailscale -- tailscale status

# Verify Headscale registration
kubectl exec -n core headscale-0 -c headscale -- \
  headscale nodes list | grep ollama
```

```bash
# Health check
curl -I https://ollama.example.com/api/tags

# Generate text
curl https://ollama.example.com/api/generate \
  -d '{
    "model": "llama2",
    "prompt": "Why is the sky blue?"
  }'
```

Open-WebUI can use Ollama as an external backend:
```hcl
# terraform.tfvars
open_webui_ollama_enabled = false # Use external Ollama instead of bundled

# In the Open-WebUI UI:
# Settings → Connections → Ollama API
# URL: https://ollama.example.com
```

Benefits of external Ollama:
- GPU acceleration on dedicated nodes
- Shared across multiple services
- Independent scaling
- Better resource isolation
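A connected client can enumerate installed models by parsing `/api/tags`. A small sketch, assuming the response has the shape `{"models": [{"name": ...}, ...]}` (`model_names` and the sample payload are illustrative, not real server output):

```python
import json

def model_names(tags_json: str) -> list[str]:
    """Extract model names from an /api/tags response
    (assumed shape: {"models": [{"name": ...}, ...]})."""
    return [m["name"] for m in json.loads(tags_json).get("models", [])]

# Example payload in the assumed shape:
sample = '{"models": [{"name": "llama2:latest"}, {"name": "mistral:latest"}]}'
print(model_names(sample))  # ['llama2:latest', 'mistral:latest']
```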
```bash
# Check pod status
kubectl get pods -n inference ollama-0

# Describe pod for events
kubectl describe pod -n inference ollama-0

# Check init container logs
kubectl logs -n inference ollama-0 -c lifecycle-cleanup

# Check storage
kubectl get pvc -n inference
```

```bash
# Check Tailscale logs
kubectl logs -n inference ollama-0 -c tailscale

# Verify the auth key was generated
kubectl exec -n inference ollama-0 -c tailscale -- \
  cat /tailscale-auth/authkey

# Check Headscale is running
kubectl get pods -n core headscale-0
```

```bash
# Check DNS creation logs
kubectl logs -n inference ollama-0 -c lifecycle-dns-create

# Verify the Cloudflare DNS record
dig ollama.example.com

# Check Cloudflare credentials are present
kubectl exec -n inference ollama-0 -c lifecycle-dns-create -- \
  env | grep CLOUDFLARE
```

```bash
# Check storage space
kubectl exec -n inference ollama-0 -c ollama -- df -h /root/.ollama

# Verify model files
kubectl exec -n inference ollama-0 -c ollama -- \
  ls -lh /root/.ollama/models

# Pull a model manually
kubectl exec -n inference ollama-0 -c ollama -- \
  ollama pull llama2
```

Ollama models are memory-intensive. Adjust limits:

```hcl
# terraform.tfvars
ollama_memory_request = "8Gi"
ollama_memory_limit   = "32Gi"
ollama_storage        = "100Gi" # For larger models
```

```bash
# Check nginx is running
kubectl get pod -n inference ollama-0 -o jsonpath='{.status.containerStatuses[?(@.name=="nginx-tls")].ready}'

# Verify the TLS certificate
kubectl get secret -n inference wildcard -o yaml

# Check the VPN connection
tailscale status

# Test from a VPN client
curl -I https://ollama.example.com
```

For GPU-accelerated inference:

```hcl
ollama_cpu_limit    = "16"
ollama_memory_limit = "64Gi"
ollama_storage      = "200Gi"
```

Ensure GPU nodes have appropriate labels:

```bash
kubectl label nodes <gpu-node> accelerator=nvidia.com/gpu
```

Small models (7B parameters):

```hcl
ollama_memory_request = "4Gi"
ollama_memory_limit   = "8Gi"
```

Medium models (13B parameters):

```hcl
ollama_memory_request = "8Gi"
ollama_memory_limit   = "16Gi"
```

Large models (70B+ parameters):

```hcl
ollama_memory_request = "32Gi"
ollama_memory_limit   = "64Gi"
```

- VPN-Only Access - nginx restricts to 100.64.0.0/10 (Tailscale range)
- TLS Everywhere - Let's Encrypt certificates via cert-manager
- Pre-auth Keys - Generated per-pod, short-lived (1 hour expiry)
- RBAC - Minimal permissions for lifecycle containers
- Network Isolation - No external ingress, VPN-only access
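The allowlist check nginx performs can be reproduced for debugging with Python's `ipaddress` module (a sketch; `allowed` is a hypothetical helper):

```python
import ipaddress

# Tailscale assigns clients addresses from the CGNAT range 100.64.0.0/10,
# which is the only range the nginx allowlist admits.
TAILSCALE_RANGE = ipaddress.ip_network("100.64.0.0/10")

def allowed(client_ip: str) -> bool:
    """Would this client IP pass the nginx allowlist?"""
    return ipaddress.ip_address(client_ip) in TAILSCALE_RANGE

print(allowed("100.64.0.7"))   # True  - VPN client
print(allowed("203.0.113.9"))  # False - public internet
```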
Ollama is the reference implementation for the modern lifecycle pattern:
Old Pattern (deprecated):
- Terraform `data.external` calls a Python script
- The script generates a pre-auth key via `kubectl exec`
- The key is stored in a Kubernetes secret
- Race conditions and Windows compatibility issues

New Pattern (current):
- The init container generates a pre-auth key at pod startup
- The key is written to a shared emptyDir volume
- Tailscale reads the key from a file (`TS_AUTHKEY=file:///tailscale-auth/authkey`)
- No Terraform external dependencies
- Portable and race-free
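The new pattern's handoff is plain file I/O on a shared volume. A sketch with a temp directory standing in for the emptyDir volume and a placeholder key (the helper names are illustrative, not code from this repo):

```python
from pathlib import Path
import tempfile

def write_authkey(volume: Path, key: str) -> Path:
    """Init-container step: persist the freshly generated pre-auth key."""
    volume.mkdir(parents=True, exist_ok=True)
    path = volume / "authkey"
    path.write_text(key)
    return path

def read_authkey(path: Path) -> str:
    """Sidecar step: with TS_AUTHKEY=file:///tailscale-auth/authkey,
    Tailscale reads the key from this file instead of the environment."""
    return path.read_text().strip()

# Demo: the key value is a placeholder, not a real Headscale key.
with tempfile.TemporaryDirectory() as tmp:
    p = write_authkey(Path(tmp) / "tailscale-auth", "hskey-placeholder\n")
    print(read_authkey(p))  # hskey-placeholder
```

Because the key is created inside the pod at startup, there is no window in which Terraform state and cluster state can disagree.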
See REFACTOR_CHECKLIST.md for migration status of other services.
- Open-WebUI Service - AI chat interface that uses Ollama
- StatefulSet Pattern - Multi-container architecture
- Terraform Variables - Complete variable reference
- REFACTOR_CHECKLIST.md - Lifecycle pattern migration