This repository contains the bootstrapping procedure needed to get a Kubernetes environment up and running using K3s.
NOTE: This is not production ready at this time.
Table of Contents
- Dynamic provisioning of a Hetzner Cloud VM
- Utility to destroy created resources
- Basic setup of Rocky Linux 9 (SSH, Updates, Firewalld, Fail2Ban)
- Installation of a single, standalone K3s node
- Copying the resulting Kubeconfig over to the local system
- Installation of ArgoCD or FluxCD, whichever requires fewer resources
- Thorough Documentation
- An Ansible Collection Repository people can install independently of this setup
- Template for a simple FluxCD repo that installs an Nginx ingress controller and Cert-Manager
- Setup of Grafana, Prometheus, Loki stack within K3s
- Backup & Restore of the K3s installation (K3s token)
- Backup & Restore of workloads (Persistent Volumes, most notably)
- Hardening against CIS benchmark
- Optional High-Availability setup with up to 3 nodes
Goal:
- Single-node k8s
- Nginx + cert-manager instead of Traefik (to be more compatible with the rest of the world)
- Secure (only ssh, kube api and http(s) open to the world, passing kube security benchmarks)
- Observable (Prometheus, Grafana, Loki Stack)
- "GitOps" setup using ArgoCD or FluxCD, Host setup using "Ol' Reliable" Ansible
Environment:
- Rocky Linux 9
- Hetzner Cloud (VM: cax11, AArch64-based)
- Dual-Stack Networking
- A domain that points to the VM, along with all of its subdomains
Motivation
Any server with sufficient automation (Ansible, Terraform, Chef, Puppet...) would be capable of serving a few websites for "home use", but that is not the point. I understand the complexities that a Kubernetes cluster brings with it and I'm all for it; most of those complexities fall right in line with what I'd want anyway. The last few years have shown me that I really enjoy the affordances of Kubernetes: describing a desired state and having a cluster figure out how to get there for me, declaratively. This is a unique aspect of Kubernetes that no other system or automation method has given me over the past decade. Kubernetes also makes things like rolling deployments without any downtime really simple, where I previously would have had to write a myriad of automation. From storage to IPs, it all works the same and it does so repeatably.
OCI containers have also proven to be the best artifact type for me. They can range from pure data bundles (FROM scratch) to any application type I could ever desire (Java, PHP, Elixir - doesn't matter).
To give a one-line explanation: Why would I do Linux From Scratch when I could use a binary distribution instead? - Kubernetes is the binary distribution of server automation, including a package manager (Helm).
Lastly, abstracting over Kubernetes (sometimes shortened to k8s in this document) gives me the unique opportunity to easily distro hop away from Debian and to potentially re-use the same setup locally as well as in any size of cloud. The Proof-of-Concept would be to run the same manifests (k8s configuration) locally, on a single node somewhere and then scaled up in AWS or the likes.
- Ansible >= 2.15
- (Optional) Hetzner Cloud Account w/ Billing Details
- (Optional) kubectl
$ git clone git@github.com:winfr34k/k3s-mini.git
$ ansible-galaxy install -r requirements.yml

NOTE: This step can be skipped if you don't run the hcloud_server role!
In this case, provide a machine yourself with a DNS name pointing to it.
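If you bring your own machine instead, a minimal static Ansible inventory could look like the following sketch (the host name and file name here are hypothetical; check `inventory/hcloud.yml` and the role defaults for the variables this repository actually expects):

```yaml
# inventory/static.yml (hypothetical file name)
all:
  hosts:
    my-k3s-node:
      ansible_host: k3s.example.com   # the DNS name pointing to your machine
      ansible_user: root
```

You would then pass it to ansible-playbook via `-i inventory/static.yml`.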
- Set up a Project in the Hetzner Cloud Console.
- Go to `Security > SSH keys` and create one named `default`.
- Set the SSH key labeled `default` to be your `Default` SSH key.
- Go to `Security > API tokens` and create one with `Read` and `Write` permissions.
- Place the token inside of `files/hcloud.token` and encrypt it using Ansible Vault:

$ ansible-vault encrypt files/hcloud.token

Choose a secure password, you'll need it later.
Replace the hostname, DNS zone and FQDN with values of your choosing.
Please also take a look at all roles/.../defaults/main.yml properties to discover overridable settings.
$ ansible-playbook -i inventory/hcloud.yml --ask-vault-pass setup.yml

After around 5 minutes, you should have your shiny new VM with an operational K3s setup. Try to use it locally:
$ kubectl get pods -A

TBD
TBD
Here are a few neat tips/tricks/observations I've made along the way to aid in debugging.
kubectl itself is really only an API client. For it to operate, it essentially formats API requests. Which objects are available can be discovered using kubectl api-resources:
kubectl api-resources
#NAME SHORTNAMES APIVERSION NAMESPACED KIND
#bindings v1 true Binding
#componentstatuses cs v1 false ComponentStatus
#configmaps cm v1 true ConfigMap
#...

Then, you can use the get and describe verbs on them if they are to be found in a namespace, like:
kubectl -n kube-system get pod # or pods

for a list and

kubectl -n kube-system get pod coredns-697968c856-hngzx

for a single one. Alternatively, pod/coredns-697968c856-hngzx is also valid.
One cannot simply get all objects inside a namespace, though:
kubectl -n kube-system get all

This command exists and "works", but all is not ... all, it's a misnomer. See also: https://www.baeldung.com/ops/kubernetes-list-all-resources
Still Neat.
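If you do need everything in a namespace, one workaround is to iterate over the output of kubectl api-resources yourself. A minimal sketch (the helper name is mine, not a kubectl feature):

```shell
# List every object of every namespaced, listable resource type in a
# namespace. Works around `get all` not actually returning everything.
list_everything() {
  ns="$1"
  for res in $(kubectl api-resources --verbs=list --namespaced -o name); do
    kubectl -n "$ns" get "$res" --ignore-not-found
  done
}

# Usage: list_everything kube-system
```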
- K3s comes with an included Helm Chart Controller and CRDs.
- Helm Charts can be dropped into `/var/lib/rancher/k3s/server/manifests`; the Helm Controller will pick them up after a while and try installing them (see nginx).
- All of those manifests become the API type `AddOn` in the `kube-system` namespace.
- Therefore, install your "bootstrap" Helm Charts into `kube-system`.
- DO NOT remove files from that folder that aren't yours -- they're also used for core k3s operation!
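As a sketch, such a drop-in file is just a HelmChart manifest; the chart, file name and values below are illustrative, not taken from this repository:

```yaml
# /var/lib/rancher/k3s/server/manifests/ingress-nginx.yaml (example name)
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: ingress-nginx
  namespace: kube-system          # bootstrap charts go into kube-system
spec:
  repo: https://kubernetes.github.io/ingress-nginx
  chart: ingress-nginx
  targetNamespace: ingress-nginx
  valuesContent: |-
    controller:
      service:
        type: ClusterIP
```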
kubectl -n kube-system get addons
#NAME SOURCE CHECKSUM
#aggregated-metrics-reader /var/lib/rancher/k3s/server/manifests/metrics-server/aggregated-metrics-reader.yaml 63b058ebe88b8179b459777318be49b591fefe629d9258b10974aeed224b5530
#auth-delegator /var/lib/rancher/k3s/server/manifests/metrics-server/auth-delegator.yaml f55fee16219349d1d803b7025f9fde2b6fca741e3900bfedcf64f34dc1c23786
#auth-reader /var/lib/rancher/k3s/server/manifests/metrics-server/auth-reader.yaml 97a74f054fe2972fcc6ffb909224d9cb27f136cc152983c9ac96ff45579015c2
#ccm /var/lib/rancher/k3s/server/manifests/ccm.yaml 15c8482702cd79ec145e960ab92791a0e73b6e7577df7729ae8b523483e3cc93
#coredns /var/lib/rancher/k3s/server/manifests/coredns.yaml 0c491e46aaa795ff5d1e8e2e63057d02d0ea3e94905326fe43cf9a1e59efff73
#local-storage /var/lib/rancher/k3s/server/manifests/local-storage.yaml 95a587e641bed9d12223d9204d73fb251b065b5a73f8dc63f30997bd56663850
#metrics-apiservice /var/lib/rancher/k3s/server/manifests/metrics-server/metrics-apiservice.yaml 03266df7891b56dfffebb855a031b524ed20c6736846417ccd491dd91c6c0ec3
#metrics-server-deployment /var/lib/rancher/k3s/server/manifests/metrics-server/metrics-server-deployment.yaml ba9c558aad213754bfb7b14491a091d21891aa22b54343697186eb5db48ee2bd
#metrics-server-service /var/lib/rancher/k3s/server/manifests/metrics-server/metrics-server-service.yaml 0ba8f8a9133f38c226e33384b8019a882440480cb8b7b45b0cd9f62bebb06a6d
#resource-reader /var/lib/rancher/k3s/server/manifests/metrics-server/resource-reader.yaml d7e21b07d9edf7a1d4e169342f470405970621de59864f998dd84c2048fb9e64
#rolebindings /var/lib/rancher/k3s/server/manifests/rolebindings.yaml 83a63e513426fad207199a04a54392846e87923ecf31e3c0f1ff3e1ba7a9d1f4
#runtimes /var/lib/rancher/k3s/server/manifests/runtimes.yaml 4ea3be58c76250c61e65a257c8279b42d11568f913233eda98ac0d77ef2a8e37

If a deployment fails, you'll see the helm-install-<chart-name>-<random-postfix> pod enter the Error state:
kubectl -n kube-system get pods
#NAMESPACE NAME READY STATUS RESTARTS AGE
#...
#kube-system helm-install-cert-manager-6w8vn 0/1 Error 0 8m
#...

You can figure out what happened by reading the logs of that container:
kubectl -n kube-system logs helm-install-cert-manager-6w8vn

Usually, it's down to writing the wrong chart values somewhere or a type error in an applied manifest.
Compare this to a fixed/successful deployment:
kubectl -n kube-system logs helm-install-cert-manager-6w8vn
#...
#+ helm repo add cert-manager https://charts.jetstack.io
#"cert-manager" has been added to your repositories
#+ helm repo update
#Hang tight while we grab the latest from your chart repositories...
#...Successfully got an update from the "cert-manager" chart repository
#Update Complete. ⎈Happy Helming!⎈
#+ helm_update install --namespace cert-manager --create-namespace --version v1.17.2 --set crds.enabled=true --set-string ingressShim.defaultIssuerGroup=cert-manager.io --set-string ingressShim.defaultIssuerKind=ClusterIssuer --set-string ingressShim.defaultIssuerName=letsencrypt-production
#...
#+ exit

kubectl run busybox --image busybox --rm -it -- ping google.com

This looks relatively similar to a regular docker run, but it needs a pod name as the first argument. Once the command is completed, the pod is deleted.
busybox is extremely useful here because it has all important networking commands aboard.
Note: If not otherwise specified, Kubernetes schedules these pods in the default namespace (one of the namespaces that always exist besides kube-system).
This is important for images like CoreDNS, because they don't ship such tools. kubectl debug then allows you to either run commands in the pod's namespaces or inspect the container's filesystem:
kubectl -n kube-system debug -it coredns-697968c856-f9kjz --image=busybox

After leaving the shell, the container exits, but it stays around as an "ephemeral container" and can be restarted if you want to (check kubectl describe on the pod).
To get rid of it, the pod needs to be restarted. In case of CoreDNS, restarting the deployment is the simplest:
kubectl -n kube-system rollout restart deployment/coredns

Gathering system logs is relatively simple in k3s land. Just defer to our trusty friend systemd-journald, they've got you covered:

journalctl -[f]lu k3s.service

Note: On agent (worker) nodes in multi-node clusters, the unit is k3s-agent.service instead, depending on the node's role within the cluster.
When it comes to pod logs, kubectl logs is the right thing to use. Don't forget to specify the resource type:
kubectl -n cert-manager logs deployment/cert-manager --tail 10
#I0525 02:50:41.045762 1 conditions.go:285] "Setting lastTransitionTime for CertificateRequest condition" logger="cert-manager" certificateRequest="default/ingress-simple-web-tls-1" condition="Approved" lastTransitionTime="2025-05-25 02:50:41.045737611 +0000 UTC m=+130.016044477"
#I0525 02:50:41.073937 1 conditions.go:285] "Setting lastTransitionTime for CertificateRequest condition" logger="cert-manager" certificateRequest="default/ingress-simple-web-tls-1" condition="Ready" lastTransitionTime="2025-05-25 02:50:41.073919108 +0000 UTC m=+130.044225974"
#I0525 02:50:41.087083 1 conditions.go:285] "Setting lastTransitionTime for CertificateRequest condition" logger="cert-manager" certificateRequest="default/ingress-simple-web-tls-1" condition="Ready" lastTransitionTime="2025-05-25 02:50:41.087063663 +0000 UTC m=+130.057370529"
#I0525 02:50:41.091371 1 controller.go:152] "re-queuing item due to optimistic locking on resource" logger="cert-manager.controller" error="Operation cannot be fulfilled on certificaterequests.cert-manager.io \"ingress-simple-web-tls-1\": the object has been modified; please apply your changes to the latest version and try again"
#I0525 02:50:43.362918 1 acme.go:236] "certificate issued" logger="cert-manager.controller.sign" resource_name="ingress-simple-web-tls-1" resource_namespace="default" resource_kind="CertificateRequest" resource_version="v1" related_resource_name="ingress-simple-web-tls-1-3482182461" related_resource_namespace="default" related_resource_kind="Order" related_resource_version="v1"
#I0525 02:50:43.363466 1 conditions.go:269] "Found status change for CertificateRequest condition; setting lastTransitionTime" logger="cert-manager" certificateRequest="default/ingress-simple-web-tls-1" condition="Ready" oldStatus="False" status="True" lastTransitionTime="2025-05-25 02:50:43.363449866 +0000 UTC m=+132.333756732"
#I0525 02:50:43.390061 1 conditions.go:201] "Found status change for Certificate condition; setting lastTransitionTime" logger="cert-manager" certificate="default/ingress-simple-web-tls" condition="Ready" oldStatus="False" status="True" lastTransitionTime="2025-05-25 02:50:43.39004782 +0000 UTC m=+132.360354686"
#I0525 02:50:43.401309 1 controller.go:152] "re-queuing item due to optimistic locking on resource" logger="cert-manager.controller" error="Operation cannot be fulfilled on certificates.cert-manager.io \"ingress-simple-web-tls\": the object has been modified; please apply your changes to the latest version and try again"
#I0525 02:50:43.401904 1 conditions.go:201] "Found status change for Certificate condition; setting lastTransitionTime" logger="cert-manager" certificate="default/ingress-simple-web-tls" condition="Ready" oldStatus="False" status="True" lastTransitionTime="2025-05-25 02:50:43.401895635 +0000 UTC m=+132.372202501"
#I0525 02:50:43.413244 1 controller.go:152] "re-queuing item due to optimistic locking on resource" logger="cert-manager.controller" error="Operation cannot be fulfilled on certificates.cert-manager.io \"ingress-simple-web-tls\": the object has been modified; please apply your changes to the latest version and try again"

- `dnf provides <binary>` searches for packages that provide certain binary files, while `dnf search <keyword>` fuzzy-searches through names and descriptions.
- `dnf list --installed` lists installed packages; without `--installed`, it lists all packages in the package cache.
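For illustration, the two search modes can be wrapped into tiny helpers (the function names are mine, not standard dnf tooling):

```shell
# "Which package provides this file?" - dnf accepts path globs here.
pkg_for_file() { dnf -q provides "$1"; }

# Fuzzy keyword search over package names and descriptions.
pkg_search() { dnf -q search "$1"; }

# Usage:
#   pkg_for_file '*/bin/dig'
#   pkg_search container
```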