This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
This is a K3s homelab repository containing Kubernetes manifests for various self-hosted applications including media servers, home automation, and supporting infrastructure services. The repository currently uses manual kubectl/helm deployments with infrastructure ready for GitOps migration.
/Users/eriksimko/github/homelab/k3s/apps/
├── apps/ # Consumer applications
│ ├── calibre/ # E-book management
│ ├── filebot/ # Media file organization
│ ├── filebrowser/ # Web-based file manager
│ ├── flaresolverr/ # Cloudflare bypass proxy
│ ├── home-assistant/ # Smart home platform
│ ├── iceberg/ # Trading application
│ ├── kometa/ # Plex metadata manager
│ ├── maria-db/ # MySQL database
│ ├── mqtt/ # MQTT broker (Mosquitto)
│ ├── open-webui/ # AI chat interface
│ ├── plex/ # Media server
│ ├── plextratksync/ # Plex-Trakt sync
│ ├── prowlarr/ # Indexer manager
│ ├── radarr/ # Movie management
│ ├── rclone/ # Cloud storage sync
│ ├── rdt-client/ # Real-Debrid client
│ ├── readarr/ # Book management
│ ├── sonarr/ # TV show management
│ ├── truenas-jackettio-ingress/ # TrueNAS ingress
│ ├── truenas-minio-secret/ # TrueNAS secrets
│ ├── truenas-plex-ingress/ # TrueNAS Plex ingress
│ ├── unifi/ # Network controller
│ ├── whoami/ # Test application
│ ├── zigbee2mqtt/ # Zigbee bridge
│ └── zurg/ # Real-Debrid WebDAV
├── infrastructure/ # Cluster infrastructure
│ ├── alert-manager/ # Alert management
│ ├── ansible/ # Ansible inventory and playbooks
│ ├── argocd-apps/ # GitOps applications
│ ├── grafana/ # Dashboards
│ ├── longhorn/ # Distributed storage
│ ├── metallb/ # Load balancer
│ ├── nfs-share/ # NFS storage provisioner
│ ├── prometheus/ # Monitoring and metrics
│ ├── samba-share/ # SMB storage provisioner
│ ├── traefik/ # Ingress controller
│ ├── apply-topology-spread.sh # Pod distribution utility
│ ├── force-delete-terminating.sh # Emergency cleanup script
│ └── metallb_logs.sh # MetalLB diagnostics
└── docs/ # Cluster documentation
├── diagrams/ # D2 architecture diagrams
├── CASCADING_FAILURE_ANALYSIS.md
├── FIXES_APPLIED.md
├── HEALTH_CHECK_ANALYSIS.md
├── POD_DISTRIBUTION_FIX_SUMMARY.md
├── POD_DISTRIBUTION_STRATEGY.md
├── PREVENTION_PLAN.md
├── README.md # Main documentation
├── deploy_with_common.md # Helm chart guide
├── homelab-monitoring-dashboard.json
└── network-diagram.md # Network architecture
./infrastructure/metallb_logs.sh  # Collects comprehensive MetalLB logs and creates metallb_report.tgz

# Apply manifests manually
kubectl apply -f apps/<application-name>/
kubectl apply -f infrastructure/<component-name>/
# Check deployments
kubectl get deployments -n default
kubectl get statefulsets -n default
kubectl get pods -n default -o wide
# Check services and ingresses
kubectl get svc
kubectl get ingress
# View logs
kubectl logs -f <pod-name>
# Check node status
kubectl get nodes
# Describe resources
kubectl describe pod <pod-name>
kubectl describe svc <service-name>
# Execute commands in pods
kubectl exec -it <pod-name> -- /bin/bash

# Deploy using homelab-app chart
helm install <app-name> ./helm/homelab-app -f ./helm/homelab-app/values/<app-name>.yaml
# Upgrade existing deployment
helm upgrade <app-name> ./helm/homelab-app -f ./helm/homelab-app/values/<app-name>.yaml
# List helm releases
helm list
# Check helm values
helm get values <app-name>

# Expose a service via Tailscale
kubectl annotate service <service-name> -n <namespace> tailscale.com/expose=true
# Remove Tailscale exposure
kubectl annotate service <service-name> -n <namespace> tailscale.com/expose-
# List all Tailscale proxies
kubectl get pods -n tailscale
# Check Tailscale status of a proxy
kubectl exec -n tailscale ts-<service>-xxxxx-0 -c tailscale -- tailscale status
# View all exposed services
kubectl get svc -A -o json | jq '.items[] | select(.metadata.annotations."tailscale.com/expose" == "true") | {namespace: .metadata.namespace, name: .metadata.name}'

- homelab-control (192.168.11.11): Control plane (Raspberry Pi)
- homelab-02 (192.168.11.12): Worker node (Currently NotReady)
- homelab-03 (192.168.11.13): Worker node (Hosts Zigbee USB device)
- homelab-04 (192.168.11.14): Worker node (Database workloads)
- K3s: Lightweight Kubernetes distribution
- Rancher: Kubernetes management platform (cattle-* namespaces)
- Traefik: Ingress controller (deployed via Helm)
- MetalLB: Load balancer for bare metal (IP range: 192.168.11.200-250)
- Longhorn: Distributed storage solution with 3-way replication
- Cert-Manager: SSL certificate management (Let's Encrypt + Cloudflare)
- Sealed Secrets: Secret encryption (deployed via Helm)
- Tailscale Operator: Secure remote access to services via Tailscale VPN (zero-trust networking)
- longhorn (default): General purpose distributed storage
- longhorn-static: For applications requiring specific volume binding
- longhorn-db-storage: Optimized for database workloads
- nfs-books-csi: NFS storage for books/media
- nfs-downloads-csi: NFS storage for downloads
- smb: SMB/CIFS network storage
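As an illustration of how these classes are consumed, a claim against the database-optimized class might look like the following (a sketch with a hypothetical claim name and size; adjust to the workload):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mariadb-data          # hypothetical claim name
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce           # Longhorn block volumes are single-node writers
  storageClassName: longhorn-db-storage
  resources:
    requests:
      storage: 10Gi           # hypothetical size
```

Omitting `storageClassName` falls back to the default `longhorn` class.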
Each application typically includes:
- `*-deployment.yaml`: Kubernetes Deployment or StatefulSet
- `*-service.yaml`: Service definition (ClusterIP/LoadBalancer)
- `*-ingress.yaml`: Ingress rules for external access
- `*-pvc.yaml`: PersistentVolumeClaim if stateful
- Plex: Media server (configured but not deployed)
- Radarr: Movie management (configured but not deployed)
- Sonarr: TV show management (configured but not deployed)
- Prowlarr: Indexer manager (running)
- RDT-Client: Real-Debrid torrent client (configured but not deployed)
- Overseerr: Media request management (running at 192.168.11.202:5055)
- Calibre: E-book management (running at 192.168.11.209:8080)
- Calibre-Web: Web interface for Calibre (running at 192.168.11.210:8083)
- Readarr: Book management (running)
- Kometa: Plex metadata manager
- PlexTraktSync: Plex-Trakt synchronization (cronjob)
- Zurg: Real-Debrid WebDAV server (running at 192.168.11.208:9999)
- Home Assistant: Smart home platform (StatefulSet, host networking at 192.168.11.207:8123)
- Zigbee2MQTT: Zigbee device bridge (running at 192.168.11.206:8080, nodeSelector: homelab-03)
- Mosquitto: MQTT broker (running at 192.168.11.230:8883)
- MariaDB: MySQL-compatible database (StatefulSet at 192.168.11.203:3306, nodeSelector: homelab-04)
- MongoDB: NoSQL database (StatefulSet, nodeSelector: homelab-04)
- Pi-hole: Network-wide ad blocking (running at 192.168.11.222:53, NodePort for DHCP)
- Unifi Controller: Network management (configured but not deployed)
- Open-WebUI: AI chat interface (running)
- Flaresolverr: Cloudflare bypass proxy (running)
- Filebrowser: Web-based file manager
- Algo-trader: Trading application
- External Access: Via Traefik ingress with SSL
- LoadBalancer Services: Using MetalLB for direct access
- Host Networking: Used by Home Assistant for device discovery
- NodePort: Used by Pi-hole for DHCP
- Tailscale VPN: Secure remote access without public exposure (annotate services with `tailscale.com/expose: "true"`)
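To make the LoadBalancer pattern concrete, a service pinned to a pool address might look roughly like this (a sketch: the app name and IP are hypothetical, and the `metallb.universe.tf/loadBalancerIPs` annotation assumes MetalLB ≥ 0.13; older setups used `spec.loadBalancerIP` instead):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: whoami                 # hypothetical example app
  annotations:
    metallb.universe.tf/loadBalancerIPs: 192.168.11.240  # must fall inside the 192.168.11.200-250 pool
spec:
  type: LoadBalancer
  selector:
    app: whoami
  ports:
    - port: 80
      targetPort: 80
```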
- Manual kubectl apply: Primary method for applying manifests
- Helm Charts: Used for infrastructure components (Traefik, Longhorn, etc.)
- ArgoCD Ready: Repository structure supports GitOps but not currently active
- Common Helm Chart: Custom chart at `helm/homelab-app/` for standardized deployments
- Edit YAML manifests in appropriate directory (apps/ or infrastructure/)
- Apply changes: `kubectl apply -f apps/<app-name>/` or `kubectl apply -f infrastructure/<component>/`
- Monitor deployment: `kubectl get pods -n default -w`
- Check logs: `kubectl logs -f <pod-name>`
- Verify ingress: `kubectl get ingress`
- Node Placement:
  - Zigbee2MQTT: Must run on homelab-03 (USB device access)
  - MariaDB/MongoDB: Pinned to homelab-04 for database workloads
  - Check nodeSelector in deployments
- Host Devices: Zigbee2MQTT requires USB device access (`/dev/ttyUSB0`)
- IP Reservations: LoadBalancer services use the MetalLB IP pool (192.168.11.200-250)
- Persistent Storage: StatefulSets maintain pod identity for storage consistency
- Sealed Secrets: Use `kubeseal` to encrypt sensitive data before committing
- Host Networking: Home Assistant uses host network mode for device discovery
- Domain: All ingresses use `*.erix-homelab.site` with a wildcard TLS certificate
- Tailscale Access: Services can be exposed securely via Tailscale by adding the annotation `tailscale.com/expose: "true"` (see infrastructure/tailscale/README.md)
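As an illustration of the node-pinning and USB pass-through notes above, the relevant pod-spec fragment for Zigbee2MQTT would look roughly like the following. This is a sketch of one common approach (privileged container plus a `hostPath` character device); check the actual manifest in apps/zigbee2mqtt/ for the real settings:

```yaml
spec:
  nodeSelector:
    kubernetes.io/hostname: homelab-03   # the node with the Zigbee USB stick attached
  containers:
    - name: zigbee2mqtt
      securityContext:
        privileged: true                 # needed for raw access to the hostPath device
      volumeMounts:
        - name: zigbee-usb
          mountPath: /dev/ttyUSB0
  volumes:
    - name: zigbee-usb
      hostPath:
        path: /dev/ttyUSB0
        type: CharDevice                 # fail scheduling if the device is absent
```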
- Pod not starting: Check `kubectl describe pod <pod-name>` for events
- Storage issues: Verify PVC is bound with `kubectl get pvc`
- Network connectivity: Check service endpoints with `kubectl get endpoints`
- MetalLB issues: Run `./infrastructure/metallb_logs.sh` to collect diagnostic information
- Node issues: Check node status with `kubectl describe node <node-name>`
A standardized Helm chart is available at helm/homelab-app/ to reduce YAML duplication across applications.
- Smart Defaults: Single replica, Traefik ingress class, erix-homelab.site domain
- Wildcard TLS: Automatically uses the `erix-homelab-site-tls` secret for all ingresses
- Flexible Storage: Supports PVCs, NFS, hostPath, and existing volumes
- Minimal Config: Apps only need to specify unique values (image, ports, volumes)
# Deploy an application
helm install radarr ./helm/homelab-app -f ./helm/homelab-app/values/radarr.yaml
# Upgrade an application
helm upgrade radarr ./helm/homelab-app -f ./helm/homelab-app/values/radarr.yaml
# Deploy with custom values
helm install myapp ./helm/homelab-app --set name=myapp --set image.repository=myimage

Create a minimal values file focusing only on app-specific settings:
```yaml
name: prowlarr
image:
  repository: linuxserver/prowlarr
service:
  ports:
    - name: http
      port: 9696
      targetPort: 9696
ingress:
  enabled: true  # Automatically creates prowlarr.erix-homelab.site
persistence:
  config:
    enabled: true
    size: 10Gi
    storageClassName: longhorn
```

- Ingress: Disabled by default; when enabled, uses `{app}.erix-homelab.site` with TLS
- Service: ClusterIP by default; supports LoadBalancer with MetalLB annotations
- Storage: Multiple volume types supported in a single deployment
- Environment: Standard PUID/PGID/TZ variables for LinuxServer.io images
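For LinuxServer.io images, that typically means a values fragment along these lines (assuming the chart exposes a flat `env` map — verify against helm/homelab-app/values.yaml, as the exact key name is chart-specific; the IDs and timezone below are hypothetical):

```yaml
env:
  PUID: "1000"              # hypothetical UID; should match ownership of the mounted volumes
  PGID: "1000"              # hypothetical GID
  TZ: "Europe/Bratislava"   # hypothetical timezone
```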
The repository includes a homelab-common library chart to standardize deployments:
# In Chart.yaml
dependencies:
- name: homelab-common
version: "0.1.0"
repository: "file://../homelab-common"
# In templates/deployment.yaml
{{- include "homelab-common.deployment" (dict "root" $ "kind" "Deployment" "values" .Values.deployment) -}}
# In templates/service.yaml
{{- include "homelab-common.service" (dict "root" $ "values" .Values.service) -}}

This reduces template duplication and ensures consistency across all applications.
- CPU: Raspberry Pi 4, 4-core ARM64
- Memory: 4GB RAM per node
- Storage: USB 3.0 flash drives (~150MB/s read, variable write speeds)
- Constraint: Resource-limited hardware requires careful tuning
The cluster experienced recurring cascading failures where:
- Initial trigger (snapshot, probe timeout, or load spike)
- Health probes fail → pods restart → more load → more failures
- Node becomes NotReady → pods migrate to other nodes
- Pods permanently concentrate on one node (e.g., 36 pods on homelab-03)
- Overloaded node at risk of another cascade
Root Causes Identified:
- Aggressive health check timeouts (1s) too strict for ARM hardware under load
- Overlapping Longhorn snapshots at midnight-1 AM causing I/O spikes
- High concurrent Longhorn operations (5 rebuilds) overwhelming nodes
- No resource limits on Prometheus allowing unbounded memory/CPU consumption
- No pod distribution policy - Kubernetes doesn't auto-rebalance pods
- No swap - OOM killer activates under memory pressure
- Slow iptables operations under load (91 seconds for ChainExists)
Critical Fixes (Applied 2025-10-20):
- Health Check Optimization (`prometheus/health-check-values.yaml`, `metallb/values.yaml`)
  - Increased probe timeouts: 1s → 5s
  - Increased failure threshold: 3 → 5 failures
  - Increased period: 10s → 15s
  - Grace period before restart: 3s → 25s
  - Impact: Eliminated false-positive restarts (7,444 restarts → 0)
- Longhorn Snapshot Staggering (via kubectl patch)
  - database-snapshot: `0 0 * * ?` (midnight)
  - app-snapshot: `0 3 * * ?` (3 AM, was 1 AM)
  - database-backup: `0 2 ? * MON` (Monday 2 AM)
  - app-backup: `0 2 ? * WED` (Wednesday 2 AM, was Monday)
  - Impact: Eliminated midnight I/O spike pattern
- Longhorn Concurrent Operations (`longhorn/current-values.yaml`)
  - Reduced concurrent replica rebuilds: 5 → 2 per node
  - Reduced concurrent backup/restore: 5 → 2 per node
  - Increased rebuild wait interval: 600s (10 minutes)
  - Allow degraded volume creation: true
  - Impact: Prevents rebuild cascades
- Prometheus Resource Limits (`prometheus/health-check-values.yaml`)
  - prometheus: 500m-2 CPU, 2-4Gi memory
  - alertmanager: 100m-500m CPU, 256-512Mi memory
  - grafana: 250m-1 CPU, 512Mi-1Gi memory
  - Pod anti-affinity to spread across nodes
  - Impact: Prevents unbounded resource consumption
- Automatic Pod Distribution (Topology Spread Constraints)
  - Applied to all 32 Deployments + 2 StatefulSets
  - Configuration: `maxSkew: 1`, `whenUnsatisfiable: ScheduleAnyway`
  - Tool: `apply-topology-spread.sh` for automation
  - Impact: Future pods automatically spread evenly across nodes
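The constraint the script applies should be equivalent to roughly this pod-spec fragment (a sketch assuming the standard `kubernetes.io/hostname` topology key and an `app` label selector; the script itself is the source of truth):

```yaml
topologySpreadConstraints:
  - maxSkew: 1                           # per-node pod counts may differ by at most 1
    topologyKey: kubernetes.io/hostname  # spread across nodes
    whenUnsatisfiable: ScheduleAnyway    # prefer spreading, but never block scheduling
    labelSelector:
      matchLabels:
        app: <app-name>
```

`ScheduleAnyway` matters on a 3-worker cluster with one node frequently NotReady: a hard `DoNotSchedule` constraint could leave pods Pending.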
How to Trigger Rebalancing:
cd /Users/eriksimko/github/homelab/k3s/apps
./infrastructure/apply-topology-spread.sh restart

Check Pod Distribution:

kubectl get pods -A -o wide --no-headers | awk '{print $8}' | grep -E "homelab" | sort | uniq -c | sort -rn

Expected: ~16 pods per node (±2)
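The same counting pipeline can be sanity-checked without a cluster by feeding it captured output instead of kubectl (the pod names below are hypothetical; with `-o wide`, column 8 is the NODE column):

```shell
# Simulated `kubectl get pods -A -o wide --no-headers` output (hypothetical pods)
sample='default  plex-0      1/1  Running  0  2d  10.42.1.5  homelab-03  <none>  <none>
default  radarr-abc  1/1  Running  0  2d  10.42.2.7  homelab-04  <none>  <none>
default  sonarr-def  1/1  Running  0  2d  10.42.1.9  homelab-03  <none>  <none>'

# Same pipeline as above, reading from the sample instead of kubectl:
# pick the NODE column, keep cluster nodes, count per node, busiest first
counts=$(printf '%s\n' "$sample" | awk '{print $8}' | grep -E "homelab" | sort | uniq -c | sort -rn)
echo "$counts"
```

The busiest node sorts first, which makes concentration (e.g. 36 pods on one node) obvious at a glance.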
Check Node Load:
kubectl top nodes
# Healthy: Load <8 (2x CPU count)
# Warning: Load >8
# Critical: Load >16

Check for Terminating Pods (sign of zombie pod issue):

kubectl get pods -A | grep Terminating

Force Delete Terminating Pods (if node is NotReady):
kubectl get pods -A --field-selector spec.nodeName=homelab-02 -o json | \
jq -r '.items[] | select(.metadata.deletionTimestamp != null) |
"\(.metadata.namespace) \(.metadata.name)"' | \
while read ns pod; do
kubectl delete pod -n $ns $pod --grace-period=0 --force
done

If Node Goes NotReady:
- Don't panic - Let it settle for 5-10 minutes
- Check load: `ssh <node> "uptime"`
- Look for zombie pods: `kubectl get pods -A | grep Terminating`
- If zombies exist: Force-delete them (command above)
- If still stuck after 15 min: Restart k3s-agent on the node
If Pod Concentration Occurs:
- Verify topology spread is applied: `kubectl get deployment <name> -o jsonpath='{.spec.template.spec.topologySpreadConstraints}'`
- Trigger rebalancing: `./infrastructure/apply-topology-spread.sh restart`
- Monitor distribution: `kubectl get pods -A -o wide`
For detailed information about cluster stability:
- `docs/CASCADING_FAILURE_ANALYSIS.md` - Root cause analysis of 7 failure triggers
- `docs/PREVENTION_PLAN.md` - Comprehensive prevention strategy (11 fixes)
- `docs/FIXES_APPLIED.md` - Detailed changelog of applied fixes
- `docs/POD_DISTRIBUTION_STRATEGY.md` - Long-term pod distribution strategy
- `docs/POD_DISTRIBUTION_FIX_SUMMARY.md` - Implementation summary
- `infrastructure/metallb/README.md`, `infrastructure/prometheus/README.md`, `infrastructure/longhorn/README.md` - Component-specific docs
Before (2025-10-20 morning):
- Prometheus node-exporter: 7,444 restarts on homelab-03
- MetalLB speaker: 21 restarts on homelab-02
- homelab-02 NotReady event with load 60.09 (15x normal)
- Pod distribution: 36 pods on homelab-03, 12 on homelab-02, 12 on homelab-04
After (2025-10-20 evening):
- All components stable with 0 restarts
- Cluster recovered from NotReady in <10 minutes
- Topology spread constraints applied to all deployments
- I/O load staggered across different times/days
Target Ongoing:
- Zero NotReady events over 7 days
- Balanced pod distribution (±2 pods per node)
- Node load <8 during normal operations
- No probe timeout restarts