This homelab runs on Raspberry Pi hardware with kube-vip for control plane high availability. The following sections document known stability issues and their resolutions.
Raspberry Pi nodes require adequate power and cooling to prevent thermal throttling and stability issues:
- Power Supply: Use official Raspberry Pi power supplies (5V/3A minimum)
- Cooling: Active cooling (fans/heat sinks) recommended, aim for < 70°C under load
- Monitoring: Temperature and throttling metrics are collected via prometheus-rpi-exporter
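A quick on-node check of temperature and throttling state (the bit meanings below follow the Raspberry Pi firmware documentation):
# Current SoC temperature and throttle flags
vcgencmd measure_temp
vcgencmd get_throttled
# get_throttled bits: 0=under-voltage, 1=frequency capped, 2=currently throttled, 3=soft temp limit
# bits 16-19 are the same conditions but "has occurred since boot"; 0x0 means the node has never throttled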
UniFi network switches require specific configuration for reliable Kubernetes operation:
- Disable EEE: Energy Efficient Ethernet must be disabled on all Pi switch ports
- Link Speed: Ensure 1000/full duplex operation
- Port Configuration: Use standard port settings (no STP modifications due to Sonos compatibility)
- MTU: Use 1500 end-to-end (flannel vxlan default)
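To confirm a node actually negotiated the expected link settings (interface name eth0 assumed):
# Verify negotiated speed/duplex on the Pi's onboard NIC
sudo ethtool eth0 | egrep 'Speed|Duplex'
# Verify MTU is 1500 end-to-end
ip link show eth0 | grep mtu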
On Raspberry Pi nodes:
# EEE is automatically disabled via systemd unit (disable-eth0-eee.service)
sudo systemctl status disable-eth0-eee.service
Storage recommendations for Raspberry Pi nodes:
- SSD Recommended: Use USB SSD storage for better I/O performance and reliability
- Health Monitoring: Monitor for I/O errors in kernel logs
- Space: Ensure adequate free space (>10GB) and sufficient free inodes
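A manual spot check along the lines of what scripts/check-storage-health.sh likely automates (the script's exact contents may differ):
# Look for recent I/O errors in the kernel log
sudo dmesg | egrep -i 'i/o error|blk_update_request' | tail -n 20
# Free space and free inodes on the root filesystem
df -h /
df -i /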
The cluster uses Cilium L2 announcements with the following configuration:
- L2 Announcements: Enabled for LoadBalancer service IPs
- Leader Election: 15s lease duration, 5s renew deadline, 2s retry period
- ARP Response: Nodes respond to ARP requests for virtual IPs
- Metrics: Prometheus metrics enabled via Cilium agent
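These settings map roughly to the following Cilium Helm values; a minimal sketch, assuming Cilium is managed via its Helm chart and that these value names match the installed chart version:
# Sketch of the relevant Cilium Helm values (names assume a recent chart version)
helm upgrade cilium cilium/cilium -n kube-system --reuse-values \
  --set l2announcements.enabled=true \
  --set l2announcements.leaseDuration=15s \
  --set l2announcements.leaseRenewDeadline=5s \
  --set l2announcements.leaseRetryPeriod=2s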
Current Setup: 2 control plane nodes with embedded etcd
Recommended: Upgrade to 3 masters for proper etcd quorum and HA
Option 1: Add Third Master (Recommended)
- Add third Raspberry Pi to inventory as master
- Run make k3s-install to join as the third control plane node
- etcd will automatically rebalance to a 3-node cluster
- Benefits: Full HA, proper quorum, no SPOF
Option 2: Single Master + Agents (Temporary)
- Convert one master to agent:
k3s server --disable-etcd --server https://VIP:6443
- Benefits: Simpler topology, stable single etcd
- Drawbacks: Control plane SPOF
Option 3: External etcd (Advanced)
- Deploy separate 3-node etcd cluster
- Configure k3s to use external etcd
- Benefits: Separates control plane from datastore
- Drawbacks: Complex, requires additional infrastructure
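A sketch of what Option 3 would look like on each k3s server node; the etcd endpoints and certificate paths below are placeholders:
# Point k3s at an external etcd cluster instead of embedded etcd (placeholder endpoints/certs)
k3s server \
  --datastore-endpoint="https://etcd-1:2379,https://etcd-2:2379,https://etcd-3:2379" \
  --datastore-cafile=/etc/etcd/ca.crt \
  --datastore-certfile=/etc/etcd/client.crt \
  --datastore-keyfile=/etc/etcd/client.key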
# Add third master (recommended)
# 1. Update inventory.yml with third master
# 2. Bootstrap third host
make host-bootstrap
# 3. Install k3s on third master (will auto-join etcd cluster)
make k3s-install
# Verify etcd cluster health
kubectl get nodes
kubectl -n kube-system exec -it etcd-pi-homelab.local -- etcdctl member list
The monitoring stack collects metrics through the following exporters:
- prometheus-rpi-exporter: Raspberry Pi specific metrics (temperature, throttling, voltage)
- node-exporter: Standard system metrics
- blackbox-exporter: API server VIP availability probing
Critical alerts configured:
- KubernetesAPIServerDown: API VIP unreachable >5 minutes
- RaspberryPiUnderVoltage: Power supply issues detected
- RaspberryPiThrottling: CPU frequency limiting active
- RaspberryPiHighTemperature: >75°C sustained
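To see which of these alerts are currently firing, one option is to query Prometheus directly; the service name and port below are assumptions, adjust them to the actual release:
# Port-forward to Prometheus (service name/port assumed; adjust as needed)
kubectl -n monitoring-system port-forward svc/prometheus 9090:9090 &
# List currently firing alerts via the built-in ALERTS metric
curl -sG 'http://localhost:9090/api/v1/query' --data-urlencode 'query=ALERTS{alertstate="firing"}'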
Run these on control plane nodes during outages:
# Basic health check
date; uptime; uname -a
k3s -v
# Temperature and power
vcgencmd measure_temp
vcgencmd get_throttled
# Network diagnostics
ip -br a
sudo ethtool -S eth0 | egrep -i 'err|drop|crc|miss'
# Storage health
./scripts/check-storage-health.sh
# k3s service status
sudo journalctl -u k3s -n 100 --no-pager
sudo systemctl status k3s
# Cilium status
cilium status
kubectl get pods -n kube-system -l app=cilium
kubectl get ciliuml2announcementpolicy -A
# Check LoadBalancer services and IP allocation
kubectl get svc -A | grep LoadBalancer
kubectl get ciliumloadbalancerippool -n kube-system
- Check Cilium L2 announcement status:
kubectl get ciliuml2announcementpolicy -n kube-system
kubectl get svc apiserver-lb -n kube-system
- Verify VIP ARP table:
arp -an | grep 192.168.10.110
ping -c3 192.168.10.110
- Test API endpoints:
curl -k https://192.168.10.110:6443/readyz
kubectl get --raw='/readyz?verbose'
If rx_errors > 1000 or mdf_err_cnt > 10000:
- Verify EEE is disabled:
sudo ethtool --show-eee eth0
- Check UniFi port configuration (EEE disabled, verify link speed)
- Replace the Ethernet cable and test a different switch port
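To tell whether the counters are still climbing after a fix (growth matters more than the historical total):
# Watch error counters; values that keep increasing point to an active problem
watch -n 5 "sudo ethtool -S eth0 | egrep -i 'err|drop|crc|miss'"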
If get_throttled shows non-zero values:
- Verify cooling solution and ambient temperature
- Check power supply quality and cable
- Monitor with:
vcgencmd get_throttled
If you encounter "undefined variable" errors when deploying monitoring:
# Test template generation
helm template . --name-template monitoring-system --namespace monitoring-system --dry-run
# If you see "$value" undefined errors, the Prometheus template variables
# in alert annotations need to be properly escaped using: {{ "text" | quote }}If rpi-exporter pods are not running or not collecting metrics:
# Check pod status
kubectl get pods -n monitoring-system -l app=prometheus-rpi-exporter
# Check pod logs
kubectl logs -n monitoring-system -l app=prometheus-rpi-exporter
# Verify node labels match nodeSelector
kubectl get nodes --show-labels | grep node.homelab.io/arch
# Test direct access to metrics
kubectl port-forward -n monitoring-system svc/prometheus-rpi-exporter 9211:9211
# Then visit: http://localhost:9211/metrics
Common Issues:
- Image pull failures: If you see 401 UNAUTHORIZED, the image registry/tag has changed. Current image: edgd1er/rpi_exporter:latest
- NodeSelector mismatch: Ensure nodes have the node.homelab.io/arch: arm64 label
- Missing privileges: Exporter needs privileged access and /sys, /proc mounts
- Hardware access: Some RPi models may need additional configuration
Image Registry Changes:
The original quay.io/prometheuscommunity/rpi-exporter:v0.3.0 returned 401 UNAUTHORIZED. Using edgd1er/rpi_exporter:latest as a working alternative with compatible metrics.
When using Cilium L2 announcements for LoadBalancer services and API server VIP:
L2 Announcement Status:
# Check L2 announcement policy
kubectl get ciliuml2announcementpolicy -A
# Check which nodes are announcing IPs
kubectl get ciliuml2announcementpolicy loadbalancer-ips -n kube-system -o yaml
Service IP Allocation:
# Check LoadBalancer services
kubectl get svc -A -o wide | grep LoadBalancer
# Check IP pool allocation
kubectl get ciliumloadbalancerippool -n kube-system
kubectl describe ciliumloadbalancerippool homelab-lb-pool -n kube-system
Cilium Status:
# Overall Cilium health
cilium status
# Cilium agent logs
kubectl logs -n kube-system -l app=cilium --tail=100
ARP Resolution:
# From external machine, check ARP table for VIP
arp -an | grep 192.168.10.110
# Test connectivity to LoadBalancer IP
curl -k https://192.168.10.110:6443
Common L2 Issues:
- VIP not reachable: Check L2 announcement policy nodeSelector matches control plane nodes
- IP not allocated: Check CiliumLoadBalancerIPPool range and availability
- ARP not responding: Verify l2announcements.enabled: true in Cilium config
- Interface mismatch: Check L2 policy interfaces match node network interfaces (eth*, en*)
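To see which node is currently announcing a given IP, check the Cilium L2 leases; recent Cilium releases name them cilium-l2announce-<namespace>-<service>, so the exact lease name below is an example derived from that convention:
# Each announced service gets a lease; the holder is the node answering ARP for that IP
kubectl -n kube-system get lease | grep cilium-l2announce
# Example: holder of the API server LoadBalancer lease (lease name derived from the convention above)
kubectl -n kube-system get lease cilium-l2announce-kube-system-apiserver-lb -o jsonpath='{.spec.holderIdentity}'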
# Update monitoring system with new exporters and alerts
helm upgrade --install monitoring-system ./the-lab/system/monitoring-system
# Apply EEE disable systemd unit and other system optimizations
make host-bootstrap
# Redeploy kube-vip with updated configuration
make k3s-install
# Deploy monitoring updates with Raspberry Pi and API server monitoring
helm upgrade monitoring-system ./the-lab/system/monitoring-system
In UniFi Controller:
- Navigate to Settings → Networks → LAN
- For each Raspberry Pi switch port:
- Disable "Energy Efficient Ethernet (EEE)"
- Verify "Speed/Duplex" is set to "Auto" or "1000Mbps/Full-Duplex"
- Keep default STP settings (do not enable RSTP due to Sonos compatibility)
All hosts are defined in src/inventory.yml.
Test connectivity to inventory hosts.
Gather and display facts about inventory hosts.
Run Ansible Lint against the playbooks.
Initializes the remote hosts that comprise the cluster.
The cluster uses Cilium L2 announcements for LoadBalancer services and API server VIP:
- L2 Announcements: Nodes respond to ARP requests for LoadBalancer IPs
- LoadBalancer IPs: 192.168.10.96-111 range (16 addresses) managed by Cilium LB IPAM
- API Server VIP: 192.168.10.110 accessible via L2 announcement
- Benefits: Simple single-subnet setup, automatic failover, no router configuration needed
Deploy with:
make k3s-install
If a master node fails, traffic automatically fails over within seconds to maintain control plane availability through Cilium's lease-based leader election.
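A simple way to observe this failover from a workstation (assumes the API server VIP above):
# Poll the API server VIP while rebooting or draining the announcing node;
# expect at most a few seconds of failed requests while the lease moves
while true; do curl -sk -o /dev/null -w '%{http_code}\n' https://192.168.10.110:6443/readyz; sleep 1; done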
The cluster uses a comprehensive node labeling strategy to optimize workload placement based on Raspberry Pi hardware capabilities. This ensures that resource-intensive workloads (databases, storage services) are scheduled on appropriate nodes.
All nodes are automatically labeled during k3s installation based on their configuration in src/inventory.yml:
- node.homelab.io/performance: high|mid|low - Indicates CPU and RAM capacity
  - High: Raspberry Pi 5 with 8GB RAM
  - Mid: Raspberry Pi 4 with 4GB RAM
  - Low: Older models
- node.homelab.io/storage-type: ssd|mmc - Storage backend type
  - SSD: Nodes with SSD or NVMe storage
  - MMC: Nodes with SD card storage
- node.homelab.io/storage-tier: fast|standard - Storage performance classification
  - Fast: SSD-backed nodes suitable for databases
  - Standard: SD card backed nodes
- node.homelab.io/arch: arm64 - CPU architecture identifier
- node.longhorn.io/create-default-disk: "true" - Enables automatic Longhorn disk creation
  - Only nodes with this label will have Longhorn storage configured
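If a node was provisioned before the labeling logic existed, the same labels can be applied by hand; the node name below is a placeholder:
# Manually apply the standard labels to an existing node (example values)
kubectl label node pi-node-01 \
  node.homelab.io/performance=high \
  node.homelab.io/storage-type=ssd \
  node.homelab.io/storage-tier=fast \
  node.homelab.io/arch=arm64 \
  node.longhorn.io/create-default-disk=true \
  --overwrite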
Workloads are configured with node affinity to prefer appropriate nodes:
Database and stateful services prefer high-performance nodes with SSD storage:
- InfluxDB: Prefers high-performance, SSD nodes
- PostgreSQL (CloudNative-PG): Prefers high-performance, SSD nodes
- Valkey/Redis: Prefers high-performance, SSD nodes
Application workloads with moderate resource requirements:
- n8n main application: Prefers high-performance, SSD nodes
- n8n workers: Accepts any SSD node
- Prometheus: Prefers high-performance, SSD nodes (metrics database)
- Grafana: Prefers SSD nodes (dashboard storage)
- Longhorn components: Required to run on SSD nodes
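To see which nodes a given workload's preferences can actually match, filter by the same labels the affinity rules use:
# Nodes eligible for database-class workloads (high performance + SSD)
kubectl get nodes -l node.homelab.io/performance=high,node.homelab.io/storage-type=ssd
# Any SSD-backed node (e.g. for n8n workers)
kubectl get nodes -l node.homelab.io/storage-type=ssd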
When adding a new node to the cluster:
- Update inventory (src/inventory.yml):
new-node.local:
  ansible_host: 192.168.10.XX
  mac: 'xx:xx:xx:xx:xx:xx'
  ansible_connection: ssh
  performance: high|mid|low
  disctype: 'ssd|mmc'
  k8s_node_labels:
    node.homelab.io/performance: high|mid|low
    node.homelab.io/storage-type: ssd|mmc
    node.homelab.io/storage-tier: fast|standard
    node.homelab.io/arch: arm64
    node.longhorn.io/create-default-disk: 'true'
- Run k3s installation:
make k3s-install
Labels will be automatically applied during installation, and existing nodes will be updated.
Check that labels are correctly applied:
kubectl get nodes --show-labels
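A more readable alternative shows the homelab labels as columns instead of the full label dump:
kubectl get nodes -L node.homelab.io/performance,node.homelab.io/storage-type,node.homelab.io/storage-tier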
View labels for a specific node:
kubectl describe node <node-name>
Verify that pods are scheduled on appropriate nodes:
# Show all pods with their assigned nodes
kubectl get pods -A -o wide
# Check specific namespace
kubectl get pods -n influx -o wide
If a pod is pending, check if node affinity requirements can be satisfied:
kubectl describe pod <pod-name> -n <namespace>
Look for events indicating node affinity issues. You may need to:
- Verify node labels are correct
- Adjust affinity rules in the workload's values.yaml
- Add more nodes with required labels
Node affinity uses preferredDuringSchedulingIgnoredDuringExecution, which means:
- Kubernetes will try to place pods on preferred nodes
- If preferred nodes are unavailable/full, pods may run elsewhere
- This ensures availability over strict placement
To force strict placement, change to requiredDuringSchedulingIgnoredDuringExecution in the workload's affinity configuration.
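To confirm which mode a workload currently uses, inspect its rendered affinity; the deployment name and namespace below are placeholders:
# Check whether the deployed manifest uses preferred... or required... scheduling
kubectl get deploy <deployment-name> -n <namespace> -o yaml | grep -A 15 'affinity:'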