CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Repository Overview

This is a K3s homelab repository containing Kubernetes manifests for various self-hosted applications including media servers, home automation, and supporting infrastructure services. The repository currently uses manual kubectl/helm deployments with infrastructure ready for GitOps migration.

Repository Layout

/Users/eriksimko/github/homelab/k3s/apps/
├── apps/                            # Consumer applications
│   ├── calibre/                    # E-book management
│   ├── filebot/                    # Media file organization
│   ├── filebrowser/                # Web-based file manager
│   ├── flaresolverr/               # CloudFlare bypass proxy
│   ├── home-assistant/             # Smart home platform
│   ├── iceberg/                    # Trading application
│   ├── kometa/                     # Plex metadata manager
│   ├── maria-db/                   # MySQL database
│   ├── mqtt/                       # MQTT broker (Mosquitto)
│   ├── open-webui/                 # AI chat interface
│   ├── plex/                       # Media server
│   ├── plextratksync/              # Plex-Trakt sync
│   ├── prowlarr/                   # Indexer manager
│   ├── radarr/                     # Movie management
│   ├── rclone/                     # Cloud storage sync
│   ├── rdt-client/                 # Real-Debrid client
│   ├── readarr/                    # Book management
│   ├── sonarr/                     # TV show management
│   ├── truenas-jackettio-ingress/  # TrueNAS ingress
│   ├── truenas-minio-secret/       # TrueNAS secrets
│   ├── truenas-plex-ingress/       # TrueNAS Plex ingress
│   ├── unifi/                      # Network controller
│   ├── whoami/                     # Test application
│   ├── zigbee2mqtt/                # Zigbee bridge
│   └── zurg/                       # Real-Debrid WebDAV
├── infrastructure/                  # Cluster infrastructure
│   ├── alert-manager/              # Alert management
│   ├── ansible/                    # Ansible inventory and playbooks
│   ├── argocd-apps/                # GitOps applications
│   ├── grafana/                    # Dashboards
│   ├── longhorn/                   # Distributed storage
│   ├── metallb/                    # Load balancer
│   ├── nfs-share/                  # NFS storage provisioner
│   ├── prometheus/                 # Monitoring and metrics
│   ├── samba-share/                # SMB storage provisioner
│   ├── traefik/                    # Ingress controller
│   ├── apply-topology-spread.sh    # Pod distribution utility
│   ├── force-delete-terminating.sh # Emergency cleanup script
│   └── metallb_logs.sh             # MetalLB diagnostics
└── docs/                           # Cluster documentation
    ├── diagrams/                   # D2 architecture diagrams
    ├── CASCADING_FAILURE_ANALYSIS.md
    ├── FIXES_APPLIED.md
    ├── HEALTH_CHECK_ANALYSIS.md
    ├── POD_DISTRIBUTION_FIX_SUMMARY.md
    ├── POD_DISTRIBUTION_STRATEGY.md
    ├── PREVENTION_PLAN.md
    ├── README.md                   # Main documentation
    ├── deploy_with_common.md       # Helm chart guide
    ├── homelab-monitoring-dashboard.json
    └── network-diagram.md          # Network architecture

Common Commands

MetalLB Troubleshooting

./infrastructure/metallb_logs.sh  # Collects comprehensive MetalLB logs and creates metallb_report.tgz

Kubernetes Management

# Apply manifests manually
kubectl apply -f apps/<application-name>/
kubectl apply -f infrastructure/<component-name>/

# Check deployments
kubectl get deployments -n default
kubectl get statefulsets -n default
kubectl get pods -n default -o wide

# Check services and ingresses
kubectl get svc
kubectl get ingress

# View logs
kubectl logs -f <pod-name>

# Check node status
kubectl get nodes

# Describe resources
kubectl describe pod <pod-name>
kubectl describe svc <service-name>

# Execute commands in pods
kubectl exec -it <pod-name> -- /bin/bash

Helm Operations

# Deploy using homelab-app chart
helm install <app-name> ./helm/homelab-app -f ./helm/homelab-app/values/<app-name>.yaml

# Upgrade existing deployment
helm upgrade <app-name> ./helm/homelab-app -f ./helm/homelab-app/values/<app-name>.yaml

# List helm releases
helm list

# Check helm values
helm get values <app-name>

Tailscale - Secure Remote Access

# Expose a service via Tailscale
kubectl annotate service <service-name> -n <namespace> tailscale.com/expose=true

# Remove Tailscale exposure
kubectl annotate service <service-name> -n <namespace> tailscale.com/expose-

# List all Tailscale proxies
kubectl get pods -n tailscale

# Check Tailscale status of a proxy
kubectl exec -n tailscale ts-<service>-xxxxx-0 -c tailscale -- tailscale status

# View all exposed services
kubectl get svc -A -o json | jq '.items[] | select(.metadata.annotations."tailscale.com/expose" == "true") | {namespace:.metadata.namespace, name:.metadata.name}'

Current Architecture

Cluster Nodes

  • homelab-control (192.168.11.11): Control plane (Raspberry Pi)
  • homelab-02 (192.168.11.12): Worker node (Currently NotReady)
  • homelab-03 (192.168.11.13): Worker node (Hosts Zigbee USB device)
  • homelab-04 (192.168.11.14): Worker node (Database workloads)

Cluster Components

  • K3s: Lightweight Kubernetes distribution
  • Rancher: Kubernetes management platform (cattle-* namespaces)
  • Traefik: Ingress controller (deployed via Helm)
  • MetalLB: Load balancer for bare metal (IP range: 192.168.11.200-250)
  • Longhorn: Distributed storage solution with 3-way replication
  • Cert-Manager: SSL certificate management (Let's Encrypt + Cloudflare)
  • Sealed Secrets: Secret encryption (deployed via Helm)
  • Tailscale Operator: Secure remote access to services via Tailscale VPN (zero-trust networking)

Storage Classes

  • longhorn (default): General purpose distributed storage
  • longhorn-static: For applications requiring specific volume binding
  • longhorn-db-storage: Optimized for database workloads
  • nfs-books-csi: NFS storage for books/media
  • nfs-downloads-csi: NFS storage for downloads
  • smb: SMB/CIFS network storage
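
As a sketch, a PVC requesting the database-optimized class above might look like this (the claim name and size are hypothetical):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-db-data          # hypothetical claim name
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn-db-storage
  resources:
    requests:
      storage: 10Gi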

Application Structure

Each application typically includes:

  • *-deployment.yaml: Kubernetes Deployment or StatefulSet
  • *-service.yaml: Service definition (ClusterIP/LoadBalancer)
  • *-ingress.yaml: Ingress rules for external access
  • *-pvc.yaml: PersistentVolumeClaim if stateful
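
For example, a typical app directory follows the naming convention above (whoami shown purely as an illustration; exact files vary per app):

apps/whoami/
├── whoami-deployment.yaml
├── whoami-service.yaml
└── whoami-ingress.yaml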

Deployed Applications

Media Stack

  • Plex: Media server (configured but not deployed)
  • Radarr: Movie management (configured but not deployed)
  • Sonarr: TV show management (configured but not deployed)
  • Prowlarr: Indexer manager (running)
  • RDT-Client: Real-Debrid torrent client (configured but not deployed)
  • Overseerr: Media request management (running at 192.168.11.202:5055)
  • Calibre: E-book management (running at 192.168.11.209:8080)
  • Calibre-Web: Web interface for Calibre (running at 192.168.11.210:8083)
  • Readarr: Book management (running)
  • Kometa: Plex metadata manager
  • PlexTraktSync: Plex-Trakt synchronization (cronjob)
  • Zurg: Real-Debrid WebDAV server (running at 192.168.11.208:9999)

Home Automation

  • Home Assistant: Smart home platform (StatefulSet, host networking at 192.168.11.207:8123)
  • Zigbee2MQTT: Zigbee device bridge (running at 192.168.11.206:8080, nodeSelector: homelab-03)
  • Mosquitto: MQTT broker (running at 192.168.11.230:8883)

Infrastructure

  • MariaDB: MySQL-compatible database (StatefulSet at 192.168.11.203:3306, nodeSelector: homelab-04)
  • MongoDB: NoSQL database (StatefulSet, nodeSelector: homelab-04)
  • Pi-hole: Network-wide ad blocking (running at 192.168.11.222:53, NodePort for DHCP)
  • Unifi Controller: Network management (configured but not deployed)
  • Open-WebUI: AI chat interface (running)
  • Flaresolverr: CloudFlare bypass proxy (running)
  • Filebrowser: Web-based file manager
  • Algo-trader: Trading application

Networking Patterns

  • External Access: Via Traefik ingress with SSL
  • LoadBalancer Services: Using MetalLB for direct access
  • Host Networking: Used by Home Assistant for device discovery
  • NodePort: Used by Pi-hole for DHCP
  • Tailscale VPN: Secure remote access without public exposure (annotate services with tailscale.com/expose: "true")
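
A minimal sketch of the LoadBalancer pattern above (service name, port, and IP are hypothetical; without the annotation MetalLB assigns the next free address from the pool, and older MetalLB releases use spec.loadBalancerIP instead):

apiVersion: v1
kind: Service
metadata:
  name: example-app                                        # hypothetical service
  annotations:
    metallb.universe.tf/loadBalancerIPs: 192.168.11.240    # optional: pin an address from the pool
spec:
  type: LoadBalancer
  selector:
    app: example-app
  ports:
    - port: 80
      targetPort: 8080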

Deployment Methods

  1. Manual kubectl apply: Primary method for applying manifests
  2. Helm Charts: Used for infrastructure components (Traefik, Longhorn, etc.)
  3. ArgoCD Ready: Repository structure supports GitOps but not currently active
  4. Common Helm Chart: Custom chart at helm/homelab-app/ for standardized deployments

Development Workflow

  1. Edit YAML manifests in appropriate directory (apps/ or infrastructure/)
  2. Apply changes: kubectl apply -f apps/<app-name>/ or kubectl apply -f infrastructure/<component>/
  3. Monitor deployment: kubectl get pods -n default -w
  4. Check logs: kubectl logs -f <pod-name>
  5. Verify ingress: kubectl get ingress
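
A concrete pass through that loop, using radarr as an example (assuming the Deployment and Ingress are both named radarr):

# 1. Edit the manifest
vim apps/radarr/radarr-deployment.yaml

# 2. Apply and watch the rollout
kubectl apply -f apps/radarr/
kubectl get pods -n default -w

# 3. Check logs and confirm the ingress
kubectl logs -f deployment/radarr
kubectl get ingress radarr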

Special Considerations

  • Node Placement:
    • Zigbee2MQTT: Must run on homelab-03 (USB device access)
    • MariaDB/MongoDB: Pinned to homelab-04 for database workloads
    • Check nodeSelector in deployments
  • Host Devices: Zigbee2MQTT requires USB device access (/dev/ttyUSB0)
  • IP Reservations: LoadBalancer services use MetalLB IP pool (192.168.11.200-250)
  • Persistent Storage: StatefulSets maintain pod identity for storage consistency
  • Sealed Secrets: Use kubeseal to encrypt sensitive data before committing (see the example after this list)
  • Host Networking: Home Assistant uses host network mode for device discovery
  • Domain: All ingresses use *.erix-homelab.site with wildcard TLS certificate
  • Tailscale Access: Services can be exposed securely via Tailscale by adding annotation tailscale.com/expose: "true" (see infrastructure/tailscale/README.md)
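
A sketch of the kubeseal step above (secret name, namespace, key, and output path are hypothetical; kubeseal fetches the public key from the sealed-secrets controller in the cluster):

kubectl create secret generic example-credentials \
  --namespace default \
  --from-literal=API_KEY=changeme \
  --dry-run=client -o yaml | \
  kubeseal --format yaml > apps/example/example-sealedsecret.yaml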

Troubleshooting

Common Issues

  1. Pod not starting: Check kubectl describe pod <pod-name> for events
  2. Storage issues: Verify PVC is bound with kubectl get pvc
  3. Network connectivity: Check service endpoints with kubectl get endpoints
  4. MetalLB issues: Run ./infrastructure/metallb_logs.sh to collect diagnostic information
  5. Node issues: Check node status with kubectl describe node <node-name>

Common Helm Chart

A standardized Helm chart is available at helm/homelab-app/ to reduce YAML duplication across applications.

Chart Features

  • Smart Defaults: Single replica, Traefik ingress class, erix-homelab.site domain
  • Wildcard TLS: Automatically uses erix-homelab-site-tls secret for all ingresses
  • Flexible Storage: Supports PVCs, NFS, hostPath, and existing volumes
  • Minimal Config: Apps only need to specify unique values (image, ports, volumes)

Usage Examples

# Deploy an application
helm install radarr ./helm/homelab-app -f ./helm/homelab-app/values/radarr.yaml

# Upgrade an application
helm upgrade radarr ./helm/homelab-app -f ./helm/homelab-app/values/radarr.yaml

# Deploy with custom values
helm install myapp ./helm/homelab-app --set name=myapp --set image.repository=myimage

Creating New App Values

Create a minimal values file focusing only on app-specific settings:

name: prowlarr
image:
  repository: linuxserver/prowlarr
service:
  ports:
    - name: http
      port: 9696
      targetPort: 9696
ingress:
  enabled: true  # Automatically creates prowlarr.erix-homelab.site
persistence:
  config:
    enabled: true
    size: 10Gi
    storageClassName: longhorn

Common Patterns

  • Ingress: Disabled by default; when enabled, uses {app}.erix-homelab.site with TLS
  • Service: ClusterIP by default, supports LoadBalancer with MetalLB annotations
  • Storage: Multiple volume types supported in a single deployment
  • Environment: Standard PUID/PGID/TZ variables for LinuxServer.io images
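
For the LinuxServer.io environment pattern above, a values snippet might look like this (the top-level env key is an assumption about the homelab-app values schema; check helm/homelab-app/values.yaml for the exact structure):

env:
  - name: PUID
    value: "1000"
  - name: PGID
    value: "1000"
  - name: TZ
    value: "Etc/UTC"      # placeholder timezone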

Library Chart Migration

The repository includes a homelab-common library chart to standardize deployments:

Using homelab-common

# In Chart.yaml
dependencies:
  - name: homelab-common
    version: "0.1.0"
    repository: "file://../homelab-common"

# In templates/deployment.yaml
{{- include "homelab-common.deployment" (dict "root" $ "kind" "Deployment" "values" .Values.deployment) -}}

# In templates/service.yaml
{{- include "homelab-common.service" (dict "root" $ "values" .Values.service) -}}

This reduces template duplication and ensures consistency across all applications.
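
A minimal values.yaml to pair with those templates might look like the sketch below (the keys under deployment and service are assumptions about the homelab-common schema, shown only to illustrate how the two blocks map to the two includes):

deployment:
  image:
    repository: linuxserver/prowlarr   # example image
    tag: latest
  ports:
    - containerPort: 9696
service:
  type: ClusterIP
  ports:
    - port: 9696
      targetPort: 9696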

Cluster Stability and Known Issues

Hardware Characteristics

  • CPU: Raspberry Pi 4, 4-core ARM64
  • Memory: 4GB RAM per node
  • Storage: USB 3.0 flash drives (~150MB/s read, variable write speeds)
  • Constraint: Resource-limited hardware requires careful tuning

Known Stability Issues (Resolved 2025-10-20)

Problem: Cascading Failures and Pod Concentration

The cluster experienced recurring cascading failures in which:

  1. An initial trigger occurs (snapshot, probe timeout, or load spike)
  2. Health probes fail → pods restart → more load → more failures
  3. A node becomes NotReady → pods migrate to other nodes
  4. Pods permanently concentrate on one node (e.g., 36 pods on homelab-03)
  5. The overloaded node is then at risk of triggering another cascade

Root Causes Identified:

  1. Aggressive health check timeouts (1s) too strict for ARM hardware under load
  2. Overlapping Longhorn snapshots at midnight-1 AM causing I/O spikes
  3. High concurrent Longhorn operations (5 rebuilds) overwhelming nodes
  4. No resource limits on Prometheus allowing unbounded memory/CPU consumption
  5. No pod distribution policy - Kubernetes doesn't auto-rebalance pods
  6. No swap - OOM killer activates under memory pressure
  7. Slow iptables operations under load (91 seconds for ChainExists)

Solutions Implemented

Critical Fixes (Applied 2025-10-20):

  1. Health Check Optimization (prometheus/health-check-values.yaml, metallb/values.yaml)

    • Increased probe timeouts: 1s → 5s
    • Increased failure threshold: 3 → 5 failures
    • Increased period: 10s → 15s
    • Grace period before restart: 3s → 25s
    • Impact: Eliminated false-positive restarts (7,444 restarts → 0)
  2. Longhorn Snapshot Staggering (via kubectl patch)

    database-snapshot: 0 0 * * ?     (midnight)
    app-snapshot:      0 3 * * ?     (3 AM, was 1 AM)
    database-backup:   0 2 ? * MON   (Monday 2 AM)
    app-backup:        0 2 ? * WED   (Wednesday 2 AM, was Monday)
    
    • Impact: Eliminated midnight I/O spike pattern
  3. Longhorn Concurrent Operations (longhorn/current-values.yaml)

    • Reduced concurrent replica rebuilds: 5 → 2 per node
    • Reduced concurrent backup/restore: 5 → 2 per node
    • Increased rebuild wait interval: 600s (10 minutes)
    • Allow degraded volume creation: true
    • Impact: Prevents rebuild cascades
  4. Prometheus Resource Limits (prometheus/health-check-values.yaml)

    prometheus:     500m-2 CPU, 2-4Gi memory
    alertmanager:   100m-500m CPU, 256-512Mi memory
    grafana:        250m-1 CPU, 512Mi-1Gi memory
    • Pod anti-affinity to spread across nodes
    • Impact: Prevents unbounded resource consumption
  5. Automatic Pod Distribution (Topology Spread Constraints)

    • Applied to ALL 32 deployments + 2 StatefulSets
    • Configuration: maxSkew: 1, whenUnsatisfiable: ScheduleAnyway
    • Tool: apply-topology-spread.sh for automation
    • Impact: Future pods automatically spread evenly across nodes
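
For reference, the constraint from fix 5 looks roughly like this under each workload's spec.template.spec (the label selector is per-app; apply-topology-spread.sh patches it in automatically):

topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: <app-name>        # each workload's own selector labels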

How to Trigger Rebalancing:

cd /Users/eriksimko/github/homelab/k3s/apps
./infrastructure/apply-topology-spread.sh restart

Monitoring Cluster Health

Check Pod Distribution:

kubectl get pods -A -o wide --no-headers | awk '{print $8}' | grep -E "homelab" | sort | uniq -c | sort -rn

Expected: ~16 pods per node (±2)

Check Node Load:

kubectl top nodes                 # CPU and memory usage per node
ssh <node> "uptime"               # load average (kubectl top does not report load)
# Healthy: load <8 (2x CPU count)
# Warning: load >8
# Critical: load >16

Check for Terminating Pods (sign of zombie pod issue):

kubectl get pods -A | grep Terminating

Force Delete Terminating Pods (if node is NotReady):

kubectl get pods -A --field-selector spec.nodeName=homelab-02 -o json | \
  jq -r '.items[] | select(.metadata.deletionTimestamp != null) |
  "\(.metadata.namespace) \(.metadata.name)"' | \
  while read ns pod; do
    kubectl delete pod -n $ns $pod --grace-period=0 --force
  done

Emergency Recovery Procedures

If Node Goes NotReady:

  1. Don't panic - Let it settle for 5-10 minutes
  2. Check load: ssh <node> "uptime"
  3. Look for zombie pods: kubectl get pods -A | grep Terminating
  4. If zombies exist: Force-delete them (command above)
  5. If still stuck after 15 min: Restart k3s-agent on the node
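
Step 5 typically means the following on a worker node (assuming the standard k3s systemd unit names; the control-plane node runs k3s rather than k3s-agent):

ssh <node> "sudo systemctl restart k3s-agent"
kubectl get nodes -w              # wait for the node to return to Ready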

If Pod Concentration Occurs:

  1. Verify topology spread is applied: kubectl get deployment <name> -o jsonpath='{.spec.template.spec.topologySpreadConstraints}'
  2. Trigger rebalancing: ./infrastructure/apply-topology-spread.sh restart
  3. Monitor distribution: kubectl get pods -A -o wide

Documentation References

For detailed information about cluster stability:

  • docs/CASCADING_FAILURE_ANALYSIS.md - Root cause analysis of 7 failure triggers
  • docs/PREVENTION_PLAN.md - Comprehensive prevention strategy (11 fixes)
  • docs/FIXES_APPLIED.md - Detailed changelog of applied fixes
  • docs/POD_DISTRIBUTION_STRATEGY.md - Long-term pod distribution strategy
  • docs/POD_DISTRIBUTION_FIX_SUMMARY.md - Implementation summary
  • infrastructure/metallb/README.md, infrastructure/prometheus/README.md, infrastructure/longhorn/README.md - Component-specific docs

Success Metrics (Post-Fix)

Before (2025-10-20 morning):

  • Prometheus node-exporter: 7,444 restarts on homelab-03
  • MetalLB speaker: 21 restarts on homelab-02
  • homelab-02 NotReady event with load 60.09 (15x normal)
  • Pod distribution: 36 pods on homelab-03, 12 on homelab-02, 12 on homelab-04

After (2025-10-20 evening):

  • All components stable with 0 restarts
  • Cluster recovered from NotReady in <10 minutes
  • Topology spread constraints applied to all deployments
  • I/O load staggered across different times/days

Target Ongoing:

  • Zero NotReady events over 7 days
  • Balanced pod distribution (±2 pods per node)
  • Node load <8 during normal operations
  • No probe timeout restarts