Warning: This is a proof-of-concept experiment. Use on a production system at your own risk!
Authors: Ferenc Orosi, Ferenc Fejes
TSN metadata proxy is a thin CNI plugin that enables time-sensitive microservices to use a TSN network. Linux provides time-sensitive APIs through the socket API: with socket options or ancillary messages, an application can attach priority or scheduling metadata to its traffic. If the application is deployed as a microservice (e.g. a Kubernetes pod), this metadata is removed by the Linux kernel's network namespace isolation mechanism. This CNI plugin ensures the metadata is preserved even after the packets leave the pod.
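As an illustration (not part of the proxy itself), a minimal C sketch of how an application can tag its traffic with the SO_PRIORITY socket option; per-packet scheduling metadata such as SCM_TXTIME ancillary messages works in a similar way. The destination and port below only mirror the socat test traffic used later in this guide:

#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    /* SO_PRIORITY sets skb->priority for every packet sent on this socket;
     * Qdiscs like taprio or mqprio use it to select the traffic class. */
    int prio = 1;
    setsockopt(fd, SOL_SOCKET, SO_PRIORITY, &prio, sizeof(prio));

    struct sockaddr_in dst = {
        .sin_family = AF_INET,
        .sin_port = htons(1234),
    };
    dst.sin_addr.s_addr = inet_addr("8.8.8.8");

    const char msg[] = "hello";
    sendto(fd, msg, sizeof(msg), 0, (struct sockaddr *)&dst, sizeof(dst));
    close(fd);
    return 0;
}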
Main features:
- Does not require modification of the time-sensitive microservice
- Compatible with various Kubernetes distributions (like normal deployment, Minikube or Kind)
- Acts as a secondary CNI plugin, used together with a primary plugin like Calico, Kindnet, Flannel, etc.
- Single kubectl apply configuration
Requirements:
- Kubernetes with the Docker container engine
- Linux with eBPF enabled (see the quick check below)
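The plugin pins its BPF programs and maps under /sys/fs/bpf, so the BPF filesystem has to be mounted on the node. A quick sanity check, assuming shell access on the node:

mount | grep -w bpf
# if nothing is printed, the BPF filesystem can be mounted manually:
sudo mount -t bpf bpf /sys/fs/bpf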
A minimal example deployment consists of:
- talker: a time-sensitive microservice pod with application(s) inside
- bridge: a bridge created by the primary CNI plugin in the host (node) network namespace
- node: a Linux host Kubernetes node with a physical TSN network interface tsn0
┌─────────────────────────────────────────────────────────────────┐
│ node│
│ ┏━━━━━━━━━━━━━━━━┓ ┏━━━━━━━━━━━━━━━━┓ │
│ ┃talker (pod) ┃ ┃bridge ┃ ┌───────────┤
│ ┃ ┌───────────┨ ┠───────────┐ ┃ │ │
│ ┃ │eth0 (veth)┃ ┃veth_xxxxx │ ┠ ─ ─ ─ ▶│tsn0 (NIC) │
│ ┃ │ ┠───────▶┃(veth) │ ┃ └────●──────┤
│ ┗━━━━┷━━━●━━━━━━━┛ ┗━━━━━━●━━━━┷━━━━┛ ┌──┘ │
│ ┌┘ ┌─┘ TSN compatible │
│ eBPF program (saver) eBPF program network card, │
│ to store TSN metadata (restorer) with TSN Qdisc │
│ to restore TSN │
│ metadata │
└─────────────────────────────────────────────────────────────────┘
In this deployment, talker is a pod.
The application inside talker generates traffic with priority metadata.
The destination address of this traffic is outside of the node.
Therefore the packet leaves the pod network namespace,
goes straight to the CNI bridge in the node network namespace,
where it is NATed, routed, and sent towards the tsn0 NIC.
On the NIC, a traffic control Qdisc such as taprio or mqprio is configured.
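For reference, a minimal mqprio setup (a simpler, purely priority-to-queue mapping qdisc) could look like the sketch below; this guide later configures taprio instead, the interface name tsn0 follows the figure, and, like taprio, this assumes the interface has at least four TX queues:

tc qdisc replace dev tsn0 parent root handle 100 mqprio \
    num_tc 4 \
    map 0 1 2 3 0 0 0 0 0 0 0 0 0 0 0 0 \
    queues 1@0 1@1 1@2 1@3 \
    hw 0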
First, build the proxy container image, which compiles the BPF programs
and includes all the necessary components of the TSN proxy.
This is required for every Kubernetes setup; the same image
can be used for a node deployment or for a Minikube/Kind based test setup.
docker build . -t tsn-metadata-proxy:latest -f tsn-metadata-proxy.Dockerfile
In the rest of the guide, we focus only on Kind and Minikube.
These are popular Kubernetes distributions for quick deployment testing.
Compared to the example deployment diagram above,
if Kind or Minikube is used, the node itself is also a container.
In normal production deployments, kubelet and the other core components
are running within the node's root network namespace.
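Once the cluster is created in the next step, the emulated Kind nodes show up as ordinary Docker containers; assuming the demo cluster name used below, they can be listed with:

docker ps --filter name=demo-
# expected to show the demo-control-plane and demo-worker containers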
Start a Kubernetes cluster with a custom configuration that mounts the BPF file system:
kind create cluster --name=demo --config cluster.yaml
The content of the cluster.yaml is the following:
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
networking:
  disableDefaultCNI: false
nodes:
- role: control-plane
  extraMounts:
  - hostPath: /sys/fs/bpf
    containerPath: /sys/fs/bpf
- role: worker
  extraMounts:
  - hostPath: /sys/fs/bpf
    containerPath: /sys/fs/bpf
There are two nodes, a worker and a control plane.
The BPF filesystem /sys/fs/bpf/ is exposed for both nodes at the default path.
This setup will create a cluster using the default CNI, which is Kindnet.
To replace it with another CNI, set disableDefaultCNI from false to true in the cluster.yaml file.
This change only disables the default CNI. However, for functional networking, basic CNI plugins like bridge must be downloaded separately:
docker exec -it demo-control-plane /bin/bash
curl -LO https://github.com/containernetworking/plugins/releases/download/v1.1.1/cni-plugins-linux-amd64-v1.1.1.tgz
tar -xvzf cni-plugins-linux-amd64-v1.1.1.tgz -C /opt/cni/bin
Once ready, you can deploy the DaemonSet of your desired CNI, for example Flannel:
kubectl apply -f https://github.com/flannel-io/flannel/releases/latest/download/kube-flannel.yml
Similarly for Calico (only choose one, do not install multiple primary CNI plugins):
kubectl apply -f https://docs.projectcalico.org/manifests/calico.yaml
Copy the custom Docker image to the Kubernetes control plane:
kind load docker-image tsn-metadata-proxy:latest --name demo
On older kind versions this method might not work.
If that is the case, try the following command instead:
TMPDIR=<any folder> kind load docker-image tsn-metadata-proxy:latest --name demo
If TMPDIR=/tmp, the metadata proxy CNI will be deleted on node restart.
It is advised to use a different folder!
Check if the Docker image is loaded:
docker exec -it demo-control-plane crictl images
To check the running pods in the Kubernetes cluster, use the following command (in a new terminal):
watch -n 0.1 kubectl get pods
# or kubectl get pods -w
With the Kubernetes setup ready, deploy the TSN metadata proxy:
kubectl apply -f daemonset.yaml
Verify that the necessary files are present on the control plane,
the BPF map is created, and the CNI plugin configuration includes the TSN plugin.
(Note: if you use another CNI, replace 10-kindnet.conflist accordingly):
docker exec -it demo-control-plane ls -la /opt/cni/bin/
docker exec -it demo-control-plane ls -la /sys/fs/bpf
docker exec -it demo-control-plane ls -la /sys/fs/bpf/demo-control-plane
docker exec -it demo-control-plane cat /etc/cni/net.d/10-kindnet.conflist
Normally, on a VM or bare-metal node, testing would require configuring the TSN NIC.
We have a Kind cluster, therefore the nodes are emulated as well.
To continue, configure the taprio qdisc on the tsn0 NIC.
Note: by default tsn0 is likely named eth0; tsn0 is used here to stick with the example figure.
Here this is done on the demo-control-plane node:
docker exec -it demo-control-plane ethtool -L tsn0 tx 4
docker exec -it demo-control-plane tc qdisc replace dev tsn0 parent root handle 100 stab overhead 24 taprio \
num_tc 4 \
map 0 1 2 3 \
queues 1@0 1@1 1@2 1@3 \
base-time 1693996801300000000 \
sched-entry S 03 10000 \
sched-entry S 05 10000 \
sched-entry S 09 20000 \
clockid CLOCK_TAI
For testing purposes, we deploy a pod using the alpine/socat image as the base image.
With this Kind based Kubernetes setup, we need to tell Kubernetes where to schedule the pod.
For this, modify the nodeSelector in talker.yaml to select the control plane.
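The relevant part of talker.yaml could look like the sketch below; the exact manifest in the repository may differ, and the container command is only an assumption to keep the pod running for kubectl exec:

apiVersion: v1
kind: Pod
metadata:
  name: talker
spec:
  nodeSelector:
    kubernetes.io/hostname: demo-control-plane   # schedule onto the control plane
  containers:
  - name: talker
    image: alpine/socat
    command: ["sleep", "100000"]   # assumed; overrides the socat entrypoint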
Then apply the manifest:
kubectl apply -f talker.yaml
When deploying a pod on the control plane in our multi-node setup, we have to remove the NoSchedule taint.
This taint is set by default on nodes with the control-plane role.
To remove this taint, we can use the - (minus) sign after the NoSchedule effect:
kubectl taint nodes demo-control-plane node-role.kubernetes.io/control-plane:NoSchedule-
Now we have to generate traffic and send packets with different priorities.
To do that, execute socat from a terminal in the talker pod:
kubectl exec -it talker -- socat udp:8.8.8.8:1234,so-priority=1 -
To verify that the packet metadata was preserved (restored), we can check the statistics of the taprio qdisc:
docker exec -it demo-control-plane tc -s -d qdisc show dev tsn0
As in the output below, packets with priority 1 are transmitted through the second queue.
In the output, queue numbering starts at 1, while priorities start at 0.
Therefore qdisc pfifo 0: parent 100:2 is the second taprio leaf (priority 1).
qdisc taprio 100: root refcnt 9 tc 4 map 0 1 2 3 0 0 0 0 0 0 0 0 0 0 0 0
queues offset 0 count 1 offset 1 count 1 offset 2 count 1 offset 3 count 1
clockid TAI base-time 1693996801300000000 cycle-time 40000 cycle-time-extension 0
index 0 cmd S gatemask 0x3 interval 10000
index 1 cmd S gatemask 0x5 interval 10000
index 2 cmd S gatemask 0x9 interval 20000
overhead 24
Sent 10717397 bytes 17292 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
qdisc pfifo 0: parent 100:4 limit 1000p
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
qdisc pfifo 0: parent 100:3 limit 1000p
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
qdisc pfifo 0: parent 100:2 limit 1000p
Sent 2010 bytes 30 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
qdisc pfifo 0: parent 100:1 limit 1000p
Sent 10713846 bytes 17239 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
Delete the whole cluster:
kind delete clusters --all
If the BPF programs remain loaded, you can remove them using the following commands:
On a real (not Kind emulated) cluster, the rm commands must be executed on every node.
sudo rm /sys/fs/bpf/restorer
sudo rm /sys/fs/bpf/saver
sudo rm /sys/fs/bpf/tracking
sudo rm /sys/fs/bpf/garbage_collector
sudo rm /sys/fs/bpf/timer_map
sudo rm /sys/fs/bpf/tsn_map
sudo rm -r /sys/fs/bpf/demo-control-plane
sudo rm -r /sys/fs/bpf/demo-worker
Start a Kubernetes cluster, mount the BPF file system, and set the CNI plugin to Flannel.
With Minikube, no cluster manifest is required.
As a limitation, Minikube can only emulate a single-node cluster.
minikube start --mount-string="/sys/fs/bpf:/sys/fs/bpf" --mount --cni=flannel
Add the TSN metadata proxy container image to the Kubernetes local repository:
minikube image load tsn-metadata-proxy:latest
Check if the Docker image is loaded:
docker exec -it minikube crictl images
To watch the running pods in the Kubernetes cluster, we can open a new terminal and run the following command:
watch -n 0.1 kubectl get pods
# or kubectl get pods -w
With the Kubernetes setup ready, deploy the TSN metadata proxy:
kubectl apply -f daemonset.yaml
Verify that the necessary files are present on the control plane,
the BPF map is created, and the CNI plugin configuration includes the TSN plugin.
(Note: if you use another CNI, replace 10-flannel.conflist accordingly):
docker exec -it minikube ls -la /opt/cni/bin/
docker exec -it minikube ls -la /sys/fs/bpf
docker exec -it minikube ls -la /sys/fs/bpf/minikube
docker exec -it minikube cat /etc/cni/net.d/10-flannel.conflist
Normally, on a VM or bare-metal node, testing would require configuring the TSN NIC.
We have a Minikube cluster, therefore the node is emulated as well.
To continue, configure the taprio qdisc on the tsn0 NIC.
Note: by default tsn0 is likely named eth0; tsn0 is used here to stick with the example figure.
Here this is done on the minikube node:
docker exec -it minikube ethtool -L tsn0 tx 4
docker exec -it minikube tc qdisc replace dev tsn0 parent root handle 100 stab overhead 24 taprio \
num_tc 4 \
map 0 1 2 3 \
queues 1@0 1@1 1@2 1@3 \
base-time 1693996801300000000 \
sched-entry S 03 10000 \
sched-entry S 05 10000 \
sched-entry S 09 20000 \
clockid CLOCK_TAI
For testing purposes, we deploy a pod using the alpine/socat image as the base image.
kubectl apply -f talker.yaml
Now we have to generate traffic and send packets with different priorities.
To do that, execute socat from a terminal in the talker pod:
kubectl exec -it talker -- socat udp:8.8.8.8:1234,so-priority=1 -
To verify that the packet metadata was preserved (restored), we can check the statistics of the taprio qdisc:
docker exec -it minikube tc -s -d qdisc show dev tsn0
As in the output below, packets with priority 1 are transmitted through the second queue.
In the output, queue numbering starts at 1, while priorities start at 0.
Therefore qdisc pfifo 0: parent 100:2 is the second taprio leaf (priority 1).
qdisc taprio 100: root refcnt 9 tc 4 map 0 1 2 3 0 0 0 0 0 0 0 0 0 0 0 0
queues offset 0 count 1 offset 1 count 1 offset 2 count 1 offset 3 count 1
clockid TAI base-time 1693996801300000000 cycle-time 40000 cycle-time-extension 0
index 0 cmd S gatemask 0x3 interval 10000
index 1 cmd S gatemask 0x5 interval 10000
index 2 cmd S gatemask 0x9 interval 20000
overhead 24
Sent 3366617 bytes 5240 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
qdisc pfifo 0: parent 100:4 limit 1000p
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
qdisc pfifo 0: parent 100:3 limit 1000p
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
qdisc pfifo 0: parent 100:2 limit 1000p
Sent 2709 bytes 40 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
qdisc pfifo 0: parent 100:1 limit 1000p
Sent 3363908 bytes 5200 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
Delete the whole cluster (sudo is important!):
sudo minikube delete --all
If there is an error and the cleanup cannot complete, we can remove every remnant of the TSN metadata proxy with the following commands. This does not uninstall the TSN metadata proxy CNI plugin itself, only the running configuration.
sudo rm /sys/fs/bpf/restorer
sudo rm /sys/fs/bpf/saver
sudo rm /sys/fs/bpf/tracking
sudo rm /sys/fs/bpf/garbage_collector
sudo rm /sys/fs/bpf/timer_map
sudo rm /sys/fs/bpf/tsn_map
sudo rm -r /sys/fs/bpf/minikube/
This work was supported by the European Union’s Horizon 2020 research and innovation programme through the DETERMINISTIC6G project under Grant Agreement no. 101096504.