Accurate per-workload power metering for bare metal, VMs, and Kubernetes.
Named after the W. M. Keck Observatory, Keck measures energy consumption at every level — from individual CPU cores to fleet-wide ESG reports — with hardware-grounded accuracy and transparent error bounds.
Existing tools attribute power using a single signal:
```
process_energy = node_energy × (process_cpu_time / total_cpu_time)
```
This is inaccurate: two processes with the same CPU time can differ in power draw by 3×, depending on frequency, workload type, and memory behavior. Keck fixes this with bottom-up attribution: per-core energy, weighted by multiple hardware signals, reconciled against PSU ground truth.
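A toy calculation shows the gap. Assuming dynamic power scales roughly with f² (an illustrative simplification, not Keck's full model), two processes with equal CPU time split the measured energy very differently:

```rust
// Sketch (not Keck's actual code): why CPU-time-only attribution misleads.
// Two processes each burn 1.0 s of CPU, but at different core frequencies.

struct Proc {
    cpu_secs: f64,
    freq_ghz: f64, // average core frequency while running
}

/// Naive: energy split by CPU time alone.
fn naive_share(p: &Proc, total_cpu_secs: f64) -> f64 {
    p.cpu_secs / total_cpu_secs
}

/// Frequency-weighted: weight each process by time * f^2.
fn freq_weighted_share(p: &Proc, all: &[Proc]) -> f64 {
    let w = |q: &Proc| q.cpu_secs * q.freq_ghz * q.freq_ghz;
    w(p) / all.iter().map(w).sum::<f64>()
}

fn main() {
    let procs = [
        Proc { cpu_secs: 1.0, freq_ghz: 3.4 }, // turbo-heavy compute
        Proc { cpu_secs: 1.0, freq_ghz: 1.2 }, // memory-bound, downclocked
    ];
    let node_energy_j = 100.0; // e.g. measured via RAPL over the interval

    for p in &procs {
        println!(
            "naive: {:.1} J   freq-weighted: {:.1} J",
            node_energy_j * naive_share(p, 2.0),
            node_energy_j * freq_weighted_share(p, &procs)
        );
    }
}
```

Under this sketch the naive split charges each process 50 J, while the frequency-weighted split charges roughly 89 J vs 11 J.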
| Aspect | Traditional approach | Keck |
|---|---|---|
| Attribution | Node-level CPU time ratio | Per-core, frequency-weighted, multi-signal |
| Data source | /proc/[pid]/stat polling | eBPF sched_switch tracepoint (in-kernel) |
| Frequency awareness | None | Per-core time-at-frequency tracking |
| Hardware counters | None | Instructions, cycles, LLC misses per core |
| Accuracy validation | None | PSU reconciliation with error bounds |
| Architecture | Passive exporter (scrape) | Active agent (aggregate, push, query) |
| VM support | Limited | Host/guest agent coordination |
| Scale | Per-node only | Node → Cluster → Fleet |
| Language | Go | Rust (no GC, eBPF via Aya) |
```
Layer 4: keck-fleet        Fleet manager (multi-cluster governance)
            |
Layer 3: keck-controller   Cluster controller (aggregation, scheduler, carbon)
            | gRPC
Layer 2: keck-agent        Attribution engine, K8s enrichment, store, outputs
            | BPF maps
Layer 1: keck-ebpf         Kernel programs (sched_switch, cpu_frequency)
            | sysfs/MSR
Layer 0: keck-agent        Hardware readers (RAPL, hwmon, GPU, Redfish)
            |
keck-common                Shared types (no_std, eBPF + userspace)
```
Runs as a DaemonSet. Collects and attributes power on each node.
Layer 0 — Hardware signals:
- RAPL energy counters (per-socket CPU + DRAM)
- hwmon power sensors (direct electrical measurements)
- GPU power via NVIDIA DCGM (per-pod, measured)
- Platform power via Redfish/IPMI (PSU ground truth)
- Tiered polling: fast (100ms), medium (500ms), slow (3s), heartbeat (5s)
- Reconciliation: Σ(components) vs PSU_input → error_ratio
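The reconciliation step reduces to one ratio: how much of the PSU-reported input power the component readings fail to explain. A minimal sketch (function and field names are assumptions, not Keck's actual API):

```rust
// Illustrative reconciliation: sum per-component readings (RAPL CPU + DRAM,
// GPU, ...) and compare against PSU input power from Redfish/IPMI.
// The result is the unaccounted fraction, usable as an error bound.

fn reconcile(component_watts: &[f64], psu_input_watts: f64) -> f64 {
    let accounted: f64 = component_watts.iter().sum();
    (psu_input_watts - accounted) / psu_input_watts // error_ratio
}

fn main() {
    // CPU pkg 95 W + DRAM 12 W + GPU 280 W, PSU reports 430 W at the wall.
    let err = reconcile(&[95.0, 12.0, 280.0], 430.0);
    // Remainder covers fans, VRM losses, disks, NICs, etc.
    println!("unaccounted fraction: {:.1}%", err * 100.0);
}
```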
Layer 1 — Kernel observation (eBPF):
- `sched_switch` tracepoint: per-PID per-core CPU time (nanosecond precision)
- `cpu_frequency` tracepoint: per-core time-at-frequency tracking
- `perf_event_open`: per-core hardware counters (instructions, cycles, LLC misses)
- `cgroup_id` capture: pid → container mapping without `/proc` reads
Layer 2 — Attribution engine:
- Splits socket RAPL energy to per-core using frequency-weighted model
- Three attribution models (auto-selected by available data):
- FullModel: time × freq² × (1 + α·IPC + β·cache_miss_rate)
- FrequencyWeighted: time × freq²
- CpuTimeRatio: time only (basic fallback)
- Normalization: Σ(process_energy) = core_energy (energy conservation guaranteed)
- Memory attribution: 60% PSS + 40% LLC miss ratio (with PSS caching)
- Aggregation: process → container → pod → namespace
- Local ring buffer store with drill-down query API
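The attribution steps above can be sketched end to end: pick a weight model from the signals available, normalize so per-process energy sums exactly to the measured core energy, and blend PSS with LLC misses for DRAM. The coefficients `ALPHA`/`BETA` and all names are illustrative assumptions, not Keck's actual values:

```rust
// Hypothetical sketch of the attribution pipeline, not the real engine.

struct Sample {
    cpu_secs: f64,
    freq_ghz: Option<f64>,      // from cpu_frequency tracepoint, if available
    ipc: Option<f64>,           // from perf counters, if available
    llc_miss_rate: Option<f64>, // LLC misses per instruction
}

/// Model auto-selection: FullModel > FrequencyWeighted > CpuTimeRatio.
fn weight(s: &Sample) -> f64 {
    const ALPHA: f64 = 0.3; // illustrative coefficient
    const BETA: f64 = 0.2;  // illustrative coefficient
    match (s.freq_ghz, s.ipc, s.llc_miss_rate) {
        // FullModel: time * f^2 * (1 + a*IPC + b*miss_rate)
        (Some(f), Some(ipc), Some(miss)) => {
            s.cpu_secs * f * f * (1.0 + ALPHA * ipc + BETA * miss)
        }
        // FrequencyWeighted: time * f^2
        (Some(f), _, _) => s.cpu_secs * f * f,
        // CpuTimeRatio: time only (basic fallback)
        _ => s.cpu_secs,
    }
}

/// Normalization: scale weights so Σ(process_energy) == core_energy.
fn attribute(core_energy_j: f64, samples: &[Sample]) -> Vec<f64> {
    let total: f64 = samples.iter().map(weight).sum();
    samples.iter().map(|s| core_energy_j * weight(s) / total).collect()
}

/// Memory attribution: 60% PSS share + 40% LLC-miss share.
fn mem_share(pss_frac: f64, llc_miss_frac: f64) -> f64 {
    0.6 * pss_frac + 0.4 * llc_miss_frac
}

fn main() {
    let samples = [
        Sample { cpu_secs: 1.0, freq_ghz: Some(3.4), ipc: Some(2.0), llc_miss_rate: Some(0.01) },
        Sample { cpu_secs: 1.0, freq_ghz: Some(1.2), ipc: None, llc_miss_rate: None },
    ];
    let per_proc = attribute(50.0, &samples);
    // Energy conservation holds by construction:
    assert!((per_proc.iter().sum::<f64>() - 50.0).abs() < 1e-9);
    println!("per-process energy: {:?} J", per_proc);
    // Process with 25% of PSS but 70% of LLC misses gets 43% of DRAM energy:
    println!("mem share: {:.0}%", mem_share(0.25, 0.70) * 100.0);
}
```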
Runs as a single Deployment. Aggregates power data across all nodes.
- Receives pod-level summaries from node agents via HTTP POST
- Aggregates: pod → namespace → cluster
- Carbon intensity integration (Electricity Maps, WattTime, or static)
- Cost calculation: energy × $/kWh (configurable per region)
- K8s custom metrics API (enables HPA scaling on power)
- Power-aware scheduler extender:
- Filter: reject pods that would exceed namespace power budget
- Prioritize: score nodes by power headroom and metering accuracy
- Strategies: BinPack (reduce idle waste) or Spread (avoid hotspots)
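A minimal sketch of how the two strategies could score a node by power headroom (the function, the 0–1 scale, and the inputs are assumptions, not the extender's actual API):

```rust
// Illustrative prioritize-step scoring for the power-aware extender.

fn score(headroom_watts: f64, capacity_watts: f64, binpack: bool) -> f64 {
    let free = (headroom_watts / capacity_watts).clamp(0.0, 1.0);
    // BinPack favors already-loaded nodes (small headroom) to reduce idle
    // waste; Spread favors the most headroom to avoid hotspots.
    if binpack { 1.0 - free } else { free }
}

fn main() {
    // Node drawing 400 W of a 500 W budget: 100 W headroom.
    println!("binpack: {:.2}", score(100.0, 500.0, true));
    println!("spread:  {:.2}", score(100.0, 500.0, false));
}
```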
Runs standalone. Multi-cluster observability and governance.
- Unified fleet dashboard: power, carbon, cost per cluster/team
- Team views: namespace → team mapping across clusters
- Policy engine: power budgets, carbon budgets, metering quality, staleness alerts
- Carbon-aware routing: recommend lowest-carbon cluster for new workloads
- ESG reporting: daily/monthly reports with energy (kWh), carbon (kgCO2), cost
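The reported quantities are straightforward conversions from measured energy. A worked example, using a placeholder grid intensity and the $0.10/kWh figure that appears elsewhere in this README:

```rust
// Illustrative energy → carbon → cost conversions; real intensity values
// come from Electricity Maps / WattTime, and prices are per-region config.

fn joules_to_kwh(joules: f64) -> f64 {
    joules / 3_600_000.0 // 1 kWh = 3.6 MJ
}

fn main() {
    let energy_j = 7_200_000.0;      // 2.0 kWh measured over the period
    let intensity_g_per_kwh = 250.0; // placeholder grid carbon intensity
    let price_per_kwh = 0.10;        // configurable per region

    let kwh = joules_to_kwh(energy_j);
    println!(
        "{:.1} kWh, {:.0} gCO2, ${:.2}",
        kwh,
        kwh * intensity_g_per_kwh,
        kwh * price_per_kwh
    );
    // → 2.0 kWh, 500 gCO2, $0.20
}
```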
Keck adapts to deployment size — from edge nodes to large datacenters:
| Aspect | Minimal | Standard | Full |
|---|---|---|---|
| Memory | ~10MB | ~50MB | ~200MB |
| eBPF map size | 1K PIDs | 10K PIDs | 100K PIDs |
| Fast poll | 1s | 500ms | 100ms |
| Report upstream | Pod-level, 60s | Pod-level, 10s | Process-level, 5s |
| Attribution model | CpuTimeRatio | FrequencyWeighted | FullModel |
The agent self-monitors and automatically downgrades its profile if it exceeds its resource budget.
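The downgrade logic the text describes could look like the following sketch; the thresholds match the memory budgets in the table above, but the enum, function names, and RSS-based trigger are illustrative assumptions:

```rust
// Hypothetical self-monitoring downgrade: if the agent's own memory use
// exceeds the active profile's budget, step down one profile.

#[derive(Debug, PartialEq, Clone, Copy)]
enum Profile { Minimal, Standard, Full }

fn budget_mb(p: Profile) -> f64 {
    match p {
        Profile::Minimal => 10.0,
        Profile::Standard => 50.0,
        Profile::Full => 200.0,
    }
}

fn maybe_downgrade(current: Profile, rss_mb: f64) -> Profile {
    if rss_mb <= budget_mb(current) {
        return current; // within budget, keep the profile
    }
    match current {
        Profile::Full => Profile::Standard,
        Profile::Standard | Profile::Minimal => Profile::Minimal,
    }
}

fn main() {
    assert_eq!(maybe_downgrade(Profile::Full, 250.0), Profile::Standard);
    assert_eq!(maybe_downgrade(Profile::Standard, 30.0), Profile::Standard);
}
```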
Full detail never leaves the node unless requested. Drill down on demand:
```
Fleet (45kW total)
└── Cluster "prod-east" (18kW)
    └── Namespace "ml-training" (4.2kW)
        └── Pod "trainer-7" (380W)
            └── Container "train" (340W: GPU 280W, CPU 55W, Mem 5W)
                └── PID 4521: python train.py
                    ├── Core 12: 22W (3.4GHz, 1.2B insn, 40K LLC miss)
                    └── Core 13: 18W (3.1GHz, 0.9B insn, 12K LLC miss)
```
```
keck/
├── keck-common/       Shared types (no_std, works in kernel + userspace)
├── keck-ebpf/         eBPF programs (sched_switch, cpu_frequency)
├── keck-agent/        Node agent
│   └── src/
│       ├── hardware/     Layer 0: RAPL, hwmon, GPU, Redfish, tiered collector
│       ├── ebpf/         Layer 1: eBPF loader, map drainer, perf counters
│       ├── attribution/  Layer 2: models, engine, types
│       ├── k8s/          Layer 2: cgroup → container → pod enrichment
│       ├── store/        Layer 2: ring buffer with outbox
│       └── output/       Layer 2: Prometheus, query API
├── keck-controller/   Cluster controller
│   └── src/
│       ├── aggregator/   Cluster-wide state
│       ├── api/          REST API (axum) + bearer token auth
│       ├── carbon/       Carbon intensity tracking
│       └── scheduler/    Power-aware scheduler extender
└── keck-fleet/        Fleet manager
    └── src/
        ├── registry/     Multi-cluster state
        ├── api/          gRPC + REST
        ├── policy/       Budget and carbon policy engine
        └── reporting/    ESG report generation
```
Requires:
- Rust nightly (for eBPF target)
- Linux (eBPF programs target the Linux kernel)
```sh
# Build agent (includes eBPF compilation)
cargo build -p keck-agent

# Build cluster controller
cargo build -p keck-controller

# Build fleet manager
cargo build -p keck-fleet

# Build operator
cd keck-operator && make build
```

Install the Keck operator with one command:
```sh
oc apply -f https://raw.githubusercontent.com/avivgt/keck/main/install.yaml
```

This creates the `keck-system` namespace, adds the Keck catalog to OLM, and installs the operator automatically. After ~60 seconds:
- Go to Operators → Installed Operators (namespace: `keck-system`)
- Click Keck Operator
- Click Create KeckCluster to deploy agents and controller
To remove:
```sh
oc delete sub keck-operator -n keck-system
oc delete csv keck-operator.v0.1.0 -n keck-system
oc delete operatorgroup keck-operator-group -n keck-system
oc delete catalogsource keck-operator-catalog -n openshift-marketplace
oc delete ns keck-system
```

For detailed step-by-step deployment instructions, see docs/openshift-deployment.md.
The Keck operator follows the Red Hat Operator Lifecycle Manager (OLM) standard. Use the quick install above, or build from source:
```sh
# Build and push operator, bundle, and catalog images to quay.io
./scripts/release.sh 0.1.0

# Users then install via: oc apply -f install.yaml
```

After the operator is installed, create a KeckCluster resource to deploy Keck to your cluster:
```yaml
apiVersion: keck.io/v1alpha1
kind: KeckCluster
metadata:
  name: keck
spec:
  agent:
    defaultProfile: standard
    gpuEnabled: false
  controller:
    replicas: 1
    schedulerEnabled: false
    carbonRegion: "US-CAL-CISO"
    energyCostPerKWh: "0.10"
  image:
    repository: quay.io/aguetta/keck
    tag: latest
```

```sh
kubectl apply -f keck-operator/config/samples/keckcluster.yaml
```

The operator will create:
- `keck-system` namespace
- `keck-agent` DaemonSet (one agent per node, privileged)
- `keck-controller` Deployment
- ServiceAccount, ClusterRole, ClusterRoleBinding
- Services for controller gRPC and HTTP endpoints
Verify:
```sh
kubectl get keckclusters
# NAME   AGENTS   CONTROLLER   PHASE     AGE
# keck   12       true         Running   2m

kubectl get pods -n keck-system
# NAME                          READY   STATUS    RESTARTS   AGE
# keck-agent-xxxxx              1/1     Running   0          2m
# keck-agent-yyyyy              1/1     Running   0          2m
# keck-controller-zzzzz-aaaaa   1/1     Running   0          2m
```

For clusters without OLM (vanilla Kubernetes, k3s, etc.):
```sh
cd keck-operator

# Install CRDs
make install

# Deploy operator, RBAC, and manager
make deploy

# Create KeckCluster
kubectl apply -f config/samples/keckcluster.yaml
```

To remove:

```sh
make undeploy
```

Run the operator outside the cluster for development:
```sh
cd keck-operator

# Install CRDs into your dev cluster
make install

# Run operator locally (uses ~/.kube/config)
make run
```

Limit power consumption per namespace:
```yaml
apiVersion: keck.io/v1alpha1
kind: PowerBudget
metadata:
  name: ml-training-budget
  namespace: ml-training
spec:
  maxWatts: 10000
  action: reject  # alert | throttle | reject
```

```sh
kubectl apply -f keck-operator/config/samples/powerbudget.yaml

kubectl get powerbudgets -A
# NAMESPACE     NAME                 BUDGET (W)   CURRENT (W)   USAGE   EXCEEDED
# ml-training   ml-training-budget   10000        7234          72%     false
```

Use PowerProfile to override the agent profile on specific nodes:
```yaml
# Full metering on GPU nodes
apiVersion: keck.io/v1alpha1
kind: PowerProfile
metadata:
  name: gpu-nodes-full
spec:
  profile: full
  nodeSelector:
    nvidia.com/gpu.present: "true"
  gpuEnabled: true
---
# Minimal overhead on edge nodes
apiVersion: keck.io/v1alpha1
kind: PowerProfile
metadata:
  name: edge-minimal
spec:
  profile: minimal
  nodeSelector:
    node-role.kubernetes.io/edge: ""
```

```sh
kubectl apply -f keck-operator/config/samples/powerprofile.yaml

kubectl get powerprofiles
# NAME             PROFILE   NODES   AGE
# gpu-nodes-full   full      4       1m
# edge-minimal     minimal   2       1m
```

For multi-cluster deployments, run the fleet manager separately and point each cluster's controller to it:
```yaml
apiVersion: keck.io/v1alpha1
kind: KeckCluster
metadata:
  name: keck
spec:
  # ... agent and controller config ...
  fleetEndpoint: "fleet-manager.example.com:9091"
```

The fleet manager aggregates data from all clusters and provides:
- Unified dashboard at `http://<fleet-manager>:8090`
- Per-team power/carbon/cost views
- Carbon-aware routing recommendations
- Policy enforcement across clusters
- ESG reporting
On OpenShift, the Keck UI integrates directly into the console as a Dynamic Console Plugin. After deployment, "Power Management" appears in the left navigation — no separate URL needed.
For non-OpenShift clusters, port-forward to the controller REST API:
```sh
kubectl port-forward -n keck-system svc/keck-controller 8080:8080
# API available at http://localhost:8080/api/v1/cluster
```

Deployed and running on OpenShift.
- Kubernetes operator with OLM bundle and finalizer cleanup
- CRDs: KeckCluster, PowerBudget, PowerProfile
- OpenShift console plugin ("Power Management" in left nav)
- GPU power via DCGM (per-pod, measured from hardware)
- Vendor-agnostic Redfish discovery (3-level probing)
- Source priority system (Measured > Estimated, auto-select)
- REST API with bearer token auth and input validation
- Per-process CPU attribution (/proc + eBPF frequency weighting)
- Per-process memory attribution (PSS + LLC misses, cached reads)
- Container images built on OCP (agent, controller, UI)
- 139 unit tests across all components
- Prometheus /metrics endpoint
- Fleet manager deployment
- Carbon tracking connected to external API
- Benchmark: agent overhead measurement
Apache-2.0