Keck

Accurate per-workload power metering for bare metal, VMs, and Kubernetes.

Named after the W. M. Keck Observatory, Keck measures energy consumption at every level — from individual CPU cores to fleet-wide ESG reports — with hardware-grounded accuracy and transparent error bounds.

Why Keck?

Most existing tools attribute power using a single signal, the share of CPU time:

process_energy = node_energy × (process_cpu_time / total_cpu_time)

This is inaccurate. Two processes with the same CPU time can differ in power draw by 3×, depending on core frequency, workload type, and memory behavior. Keck fixes this with bottom-up attribution: per-core energy is weighted by multiple hardware signals and then reconciled against PSU ground truth.
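To make the limitation concrete, here is a minimal Rust sketch of the single-signal formula above. The function name `cpu_time_ratio_energy` and the units are illustrative assumptions, not taken from Keck's code.

```rust
/// Naive single-signal attribution: split node energy by share of CPU time.
/// Energy is in joules and CPU time in nanoseconds (units chosen for illustration).
fn cpu_time_ratio_energy(node_energy_j: f64, process_cpu_ns: u64, total_cpu_ns: u64) -> f64 {
    if total_cpu_ns == 0 {
        return 0.0; // nothing ran, so there is nothing to attribute
    }
    node_energy_j * (process_cpu_ns as f64 / total_cpu_ns as f64)
}
```

Two processes that each ran for 2 s out of a 4 s total both receive 50 J of a 100 J node reading, even if one of them ran at a much higher frequency. The formula has no input that could tell them apart.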

What makes Keck different

Aspect	Traditional approach	Keck
Attribution	Node-level CPU time ratio	Per-core, frequency-weighted, multi-signal
Data source	`/proc/[pid]/stat` polling	eBPF `sched_switch` tracepoint (in-kernel)
Frequency awareness	None	Per-core time-at-frequency tracking
Hardware counters	None	Instructions, cycles, LLC misses per core
Accuracy validation	None	PSU reconciliation with error bounds
Architecture	Passive exporter (scrape)	Active agent (aggregate, push, query)
VM support	Limited	Host/guest agent coordination
Scale	Per-node only	Node → Cluster → Fleet
Language	Go	Rust (no GC, eBPF via Aya)

Architecture

Layer 4:  keck-fleet         Fleet manager (multi-cluster governance)
              |
Layer 3:  keck-controller    Cluster controller (aggregation, scheduler, carbon)
              | gRPC
Layer 2:  keck-agent         Attribution engine, K8s enrichment, store, outputs
              | BPF maps
Layer 1:  keck-ebpf          Kernel programs (sched_switch, cpu_frequency)
              | sysfs/MSR
Layer 0:  keck-agent         Hardware readers (RAPL, hwmon, GPU, Redfish)
              |
          keck-common         Shared types (no_std, eBPF + userspace)

Node Agent (keck-agent)

Runs as a DaemonSet. Collects and attributes power on each node.

Layer 0 — Hardware signals:

RAPL energy counters (per-socket CPU + DRAM)
hwmon power sensors (direct electrical measurements)
GPU power (NVIDIA DCGM per-pod measured)
Platform power via Redfish/IPMI (PSU ground truth)
Tiered polling: fast (100ms), medium (500ms), slow (3s), heartbeat (5s); the fast interval depends on the agent profile (see Agent Profiles)
Reconciliation: Σ(components) vs PSU_input → error_ratio
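The reconciliation step can be sketched as follows. This README does not define `error_ratio` precisely, so the sketch assumes it is the absolute gap between the PSU reading and the component sum, divided by the PSU reading. The `Reconciliation` struct and `reconcile` function are hypothetical names.

```rust
/// Result of comparing summed component power against the PSU reading.
#[derive(Debug, PartialEq)]
struct Reconciliation {
    components_w: f64,
    psu_input_w: f64,
    /// Assumed definition: |psu_input - sum(components)| / psu_input.
    error_ratio: f64,
}

/// Returns None when the PSU reading is missing, NaN, or non-positive,
/// because no ratio can be formed against it.
fn reconcile(component_watts: &[f64], psu_input_w: f64) -> Option<Reconciliation> {
    if !(psu_input_w > 0.0) {
        return None;
    }
    let components_w: f64 = component_watts.iter().sum();
    let error_ratio = (psu_input_w - components_w).abs() / psu_input_w;
    Some(Reconciliation { components_w, psu_input_w, error_ratio })
}
```

With components of 120 W, 30 W, and 250 W against a 500 W PSU reading, 100 W is unaccounted for and the error ratio is 0.2.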

Layer 1 — Kernel observation (eBPF):

sched_switch tracepoint: per-PID per-core CPU time (nanosecond precision)
cpu_frequency tracepoint: per-core time-at-frequency tracking
perf_event_open: per-core hardware counters (instructions, cycles, LLC misses)
cgroup_id capture: pid → container mapping without /proc reads
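The per-PID per-core accounting can be illustrated in plain userspace Rust. This is a simulation of the bookkeeping only, not the eBPF program itself: the `SwitchEvent` fields and the `CpuTimeAccounting` type are assumptions made for the example, and a real handler would keep this state in BPF maps.

```rust
use std::collections::HashMap;

/// One context switch, as a sched_switch handler might record it.
/// Field names are illustrative, not Keck's actual map layout.
struct SwitchEvent {
    ts_ns: u64,
    cpu: u32,
    next_pid: u32,
}

/// Accumulates on-CPU nanoseconds per (pid, cpu).
#[derive(Default)]
struct CpuTimeAccounting {
    /// cpu -> (pid currently running, timestamp it was switched in)
    running: HashMap<u32, (u32, u64)>,
    /// (pid, cpu) -> total on-CPU nanoseconds
    on_cpu_ns: HashMap<(u32, u32), u64>,
}

impl CpuTimeAccounting {
    fn on_switch(&mut self, ev: &SwitchEvent) {
        // Record the incoming task and credit the outgoing one
        // with the time since it was switched in on this core.
        if let Some((pid, since)) = self.running.insert(ev.cpu, (ev.next_pid, ev.ts_ns)) {
            let elapsed = ev.ts_ns.saturating_sub(since);
            *self.on_cpu_ns.entry((pid, ev.cpu)).or_insert(0) += elapsed;
        }
    }

    fn total_ns(&self, pid: u32, cpu: u32) -> u64 {
        self.on_cpu_ns.get(&(pid, cpu)).copied().unwrap_or(0)
    }
}
```

Time is credited only when a task is switched out, so a task that is still running on a core has no recorded time for its current slice yet. The same pattern, keyed by frequency instead of PID, would give time-at-frequency per core.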

Layer 2 — Attribution engine:

Splits socket RAPL energy to per-core using frequency-weighted model
Three attribution models (auto-selected by available data):
- FullModel: time × freq² × (1 + α·IPC + β·cache_miss_rate)
- FrequencyWeighted: time × freq²
- CpuTimeRatio: time only (basic fallback)
Normalization: Σ(process_energy) = core_energy (energy conservation guaranteed)
Memory attribution: 60% PSS + 40% LLC miss ratio (with PSS caching)
Aggregation: process → container → pod → namespace
Local ring buffer store with drill-down query API
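The three attribution models and the normalization step listed above can be sketched like this. The type and function names are hypothetical, the coefficients α and β are passed in as parameters because this README does not give their values, and the selection rule (use the richest model that every sample on the core supports) is an assumption.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum Model { FullModel, FrequencyWeighted, CpuTimeRatio }

/// One process's activity on one core during one interval.
struct Sample {
    time_s: f64,
    freq_ghz: Option<f64>,
    ipc: Option<f64>,             // instructions per cycle, if counters exist
    cache_miss_rate: Option<f64>, // LLC miss rate, if counters exist
}

/// Assumed rule: pick the richest model that all samples can feed.
fn select_model(samples: &[Sample]) -> Model {
    let has_freq = samples.iter().all(|s| s.freq_ghz.is_some());
    let has_counters = samples.iter().all(|s| s.ipc.is_some() && s.cache_miss_rate.is_some());
    if has_freq && has_counters {
        Model::FullModel
    } else if has_freq {
        Model::FrequencyWeighted
    } else {
        Model::CpuTimeRatio
    }
}

/// Raw (unnormalized) weight of one sample under the chosen model.
fn weight(s: &Sample, model: Model, alpha: f64, beta: f64) -> f64 {
    let f2 = s.freq_ghz.map_or(1.0, |f| f * f);
    match model {
        Model::CpuTimeRatio => s.time_s,
        Model::FrequencyWeighted => s.time_s * f2,
        Model::FullModel => {
            let ipc = s.ipc.unwrap_or(0.0);
            let miss = s.cache_miss_rate.unwrap_or(0.0);
            s.time_s * f2 * (1.0 + alpha * ipc + beta * miss)
        }
    }
}

/// Splits `core_energy_j` across samples so the shares sum to `core_energy_j`.
fn attribute(core_energy_j: f64, samples: &[Sample], alpha: f64, beta: f64) -> Vec<f64> {
    let model = select_model(samples);
    let weights: Vec<f64> = samples.iter().map(|s| weight(s, model, alpha, beta)).collect();
    let total: f64 = weights.iter().sum();
    if total <= 0.0 {
        return vec![0.0; samples.len()];
    }
    weights.iter().map(|w| core_energy_j * w / total).collect()
}
```

For two processes with 1 s of CPU time each, one at 3 GHz and one at 1 GHz, the FrequencyWeighted weights are 9 and 1, so a 20 J core reading splits into 18 J and 2 J. Dividing by the weight total is what keeps Σ(process_energy) equal to core_energy.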

Cluster Controller (keck-controller)

Runs as a single Deployment. Aggregates power data across all nodes.

Receives pod-level summaries from node agents via HTTP POST
Aggregates: pod → namespace → cluster
Carbon intensity integration (Electricity Maps, WattTime, or static)
Cost calculation: energy × $/kWh (configurable per region)
K8s custom metrics API (enables HPA scaling on power)
Power-aware scheduler extender:
- Filter: reject pods that would exceed namespace power budget
- Prioritize: score nodes by power headroom and metering accuracy
- Strategies: BinPack (reduce idle waste) or Spread (avoid hotspots)
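A rough sketch of the filter and prioritize steps follows. It covers only power headroom; the metering-accuracy term mentioned above is left out because this README does not say how it is weighted. All names (`NodePower`, `fits_budget`, `score`, `pick`) and the 0 to 100 score range are assumptions for illustration.

```rust
#[derive(Clone, Copy)]
enum Strategy { BinPack, Spread }

struct NodePower {
    name: &'static str,
    capacity_w: f64,
    current_w: f64,
}

/// Filter step: would admitting the pod keep the namespace within its budget?
fn fits_budget(namespace_current_w: f64, pod_estimate_w: f64, budget_w: f64) -> bool {
    namespace_current_w + pod_estimate_w <= budget_w
}

/// Prioritize step: score a node from 0 to 100 based on power utilization
/// after placing the pod. Nodes that would exceed capacity score 0.
fn score(node: &NodePower, pod_estimate_w: f64, strategy: Strategy) -> u32 {
    let after = node.current_w + pod_estimate_w;
    if node.capacity_w <= 0.0 || after > node.capacity_w {
        return 0;
    }
    let utilization = after / node.capacity_w;
    let s = match strategy {
        Strategy::BinPack => utilization,      // prefer fuller nodes
        Strategy::Spread => 1.0 - utilization, // prefer emptier nodes
    };
    (s * 100.0).round() as u32
}

/// Picks the highest-scoring node, skipping nodes that score 0.
fn pick(nodes: &[NodePower], pod_w: f64, strategy: Strategy) -> Option<&NodePower> {
    nodes
        .iter()
        .filter(|n| score(n, pod_w, strategy) > 0)
        .max_by_key(|n| score(n, pod_w, strategy))
}
```

Given a node at 700 W and a node at 100 W, both with 1000 W capacity, a 100 W pod goes to the busier node under BinPack and to the idler node under Spread.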

Fleet Manager (keck-fleet)

Runs standalone. Multi-cluster observability and governance.

Unified fleet dashboard: power, carbon, cost per cluster/team
Team views: namespace → team mapping across clusters
Policy engine: power budgets, carbon budgets, metering quality, staleness alerts
Carbon-aware routing: recommend lowest-carbon cluster for new workloads
ESG reporting: daily/monthly reports with energy (kWh), carbon (kgCO2), cost
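The arithmetic behind an ESG report line and a carbon-aware routing recommendation is simple enough to sketch. The function names are hypothetical, carbon intensity is assumed to be given in gCO2 per kWh, and the cluster name `prod-north` in the usage note is made up.

```rust
const JOULES_PER_KWH: f64 = 3_600_000.0;

/// Energy, carbon, and cost totals for one reporting period.
struct EsgTotals {
    energy_kwh: f64,
    carbon_kg: f64,
    cost: f64,
}

/// Converts metered energy into report units.
/// `carbon_g_per_kwh` is the grid carbon intensity; `price_per_kwh` is the
/// configured energy price in whatever currency the report uses.
fn esg_totals(energy_joules: f64, carbon_g_per_kwh: f64, price_per_kwh: f64) -> EsgTotals {
    let energy_kwh = energy_joules / JOULES_PER_KWH;
    EsgTotals {
        energy_kwh,
        carbon_kg: energy_kwh * carbon_g_per_kwh / 1000.0,
        cost: energy_kwh * price_per_kwh,
    }
}

/// Returns the name of the cluster with the lowest carbon intensity.
fn lowest_carbon<'a>(clusters: &[(&'a str, f64)]) -> Option<&'a str> {
    clusters
        .iter()
        .min_by(|a, b| a.1.total_cmp(&b.1))
        .map(|c| c.0)
}
```

A cluster drawing 18 kW for one hour uses 64,800,000 J, which is 18 kWh. At 250 gCO2/kWh and $0.10/kWh that is 4.5 kg of CO2 and $1.80. Given `prod-east` at 410 g/kWh and `prod-north` at 35 g/kWh, `lowest_carbon` recommends `prod-north`.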

Agent Profiles

Keck adapts to deployment size — from edge nodes to large datacenters:

Aspect	Minimal	Standard	Full
Memory	~10MB	~50MB	~200MB
eBPF map size	1K PIDs	10K PIDs	100K PIDs
Fast poll	1s	500ms	100ms
Report upstream	Pod-level, 60s	Pod-level, 10s	Process-level, 5s
Attribution model	CpuTimeRatio	FrequencyWeighted	FullModel

The agent self-monitors and automatically downgrades its profile if it exceeds its resource budget.
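One way the self-downgrade could work is sketched below. The budget and poll values come from the table above; the policy itself (step down one level whenever resident memory exceeds the profile's budget) and all names are assumptions, since this README does not describe the actual trigger.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum Profile { Minimal, Standard, Full }

impl Profile {
    /// Approximate memory budget in MB, from the profile table.
    fn memory_budget_mb(self) -> u64 {
        match self {
            Profile::Minimal => 10,
            Profile::Standard => 50,
            Profile::Full => 200,
        }
    }

    /// Fast poll interval in milliseconds, from the profile table.
    fn fast_poll_ms(self) -> u64 {
        match self {
            Profile::Minimal => 1000,
            Profile::Standard => 500,
            Profile::Full => 100,
        }
    }

    /// The next cheaper profile; Minimal has nowhere left to go.
    fn downgraded(self) -> Profile {
        match self {
            Profile::Full => Profile::Standard,
            Profile::Standard | Profile::Minimal => Profile::Minimal,
        }
    }
}

/// Assumed policy: step down one level when memory use exceeds the budget.
fn next_profile(current: Profile, rss_mb: u64) -> Profile {
    if rss_mb > current.memory_budget_mb() {
        current.downgraded()
    } else {
        current
    }
}
```

An agent on the Full profile that reaches 250 MB would move to Standard, which also slows its fast poll from 100 ms to 500 ms.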

Zoom Model

Full detail never leaves the node unless requested. Drill down on demand:

Fleet (45kW total)
  └── Cluster "prod-east" (18kW)
        └── Namespace "ml-training" (4.2kW)
              └── Pod "trainer-7" (380W)
                    └── Container "train" (340W: GPU 280W, CPU 55W, Mem 5W)
                          └── PID 4521: python train.py
                                ├── Core 12: 22W (3.4GHz, 1.2B insn, 40K LLC miss)
                                └── Core 13: 18W (3.1GHz, 0.9B insn, 12K LLC miss)
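The hierarchy above maps naturally onto a tree where each level sums its children and a query walks down one name at a time. The `PowerNode` type below is a hypothetical in-memory sketch; in Keck the lower levels stay on the node and are fetched on demand, which this example does not model. The `sidecar` container in the usage note is invented to make the numbers add up.

```rust
/// A node in the zoom hierarchy (fleet, cluster, namespace, pod, ...).
struct PowerNode {
    name: String,
    watts: f64,
    children: Vec<PowerNode>,
}

impl PowerNode {
    /// A level with a directly measured or attributed value.
    fn leaf(name: &str, watts: f64) -> Self {
        PowerNode { name: name.to_string(), watts, children: Vec::new() }
    }

    /// A level whose power is the sum of its children.
    fn parent(name: &str, children: Vec<PowerNode>) -> Self {
        let watts: f64 = children.iter().map(|c| c.watts).sum();
        PowerNode { name: name.to_string(), watts, children }
    }

    /// Follows `path` one level at a time; returns None if a segment is missing.
    fn drill_down(&self, path: &[&str]) -> Option<&PowerNode> {
        match path.split_first() {
            None => Some(self),
            Some((head, rest)) => self
                .children
                .iter()
                .find(|c| c.name == *head)?
                .drill_down(rest),
        }
    }
}
```

Building pod `trainer-7` from a 340 W `train` container and a 40 W `sidecar` container gives a 380 W pod, and `drill_down(&["trainer-7", "train"])` from the namespace returns the 340 W container.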

Project Structure

keck/
├── keck-common/       Shared types (no_std, works in kernel + userspace)
├── keck-ebpf/         eBPF programs (sched_switch, cpu_frequency)
├── keck-agent/        Node agent
│   └── src/
│       ├── hardware/    Layer 0: RAPL, hwmon, GPU, Redfish, tiered collector
│       ├── ebpf/        Layer 1: eBPF loader, map drainer, perf counters
│       ├── attribution/ Layer 2: models, engine, types
│       ├── k8s/         Layer 2: cgroup → container → pod enrichment
│       ├── store/       Layer 2: ring buffer with outbox
│       └── output/      Layer 2: Prometheus, query API
├── keck-controller/   Cluster controller
│   └── src/
│       ├── aggregator/  Cluster-wide state
│       ├── api/         REST API (axum) + bearer token auth
│       ├── carbon/      Carbon intensity tracking
│       └── scheduler/   Power-aware scheduler extender
└── keck-fleet/        Fleet manager
    └── src/
        ├── registry/    Multi-cluster state
        ├── api/         gRPC + REST
        ├── policy/      Budget and carbon policy engine
        └── reporting/   ESG report generation

Building

Requires:

Rust nightly (for eBPF target)
Linux (eBPF programs target the Linux kernel)

# Build agent (includes eBPF compilation)
cargo build -p keck-agent

# Build cluster controller
cargo build -p keck-controller

# Build fleet manager
cargo build -p keck-fleet

# Build operator
cd keck-operator && make build

Quick Install on OpenShift

Install the Keck operator with one command:

oc apply -f https://raw.githubusercontent.com/avivgt/keck/main/install.yaml

This creates the keck-system namespace, adds the Keck catalog to OLM, and installs the operator automatically. After about 60 seconds, finish the setup in the OpenShift console:

Go to Operators → Installed Operators (namespace: keck-system)
Click Keck Operator
Click Create KeckCluster to deploy agents and controller

To remove:

oc delete sub keck-operator -n keck-system
oc delete csv keck-operator.v0.1.0 -n keck-system
oc delete operatorgroup keck-operator-group -n keck-system
oc delete catalogsource keck-operator-catalog -n openshift-marketplace
oc delete ns keck-system

Deployment

For detailed step-by-step deployment instructions, see docs/openshift-deployment.md.

Option 1: OpenShift / OLM (Recommended for Production)

The Keck operator follows the Red Hat Operator Lifecycle Manager (OLM) standard. Use the quick install above, or build from source:

# Build and push operator, bundle, and catalog images to quay.io
./scripts/release.sh 0.1.0

# Users then install via: oc apply -f install.yaml

After the operator is installed, create a KeckCluster resource to deploy Keck to your cluster:

apiVersion: keck.io/v1alpha1
kind: KeckCluster
metadata:
  name: keck
spec:
  agent:
    defaultProfile: standard
    gpuEnabled: false
  controller:
    replicas: 1
    schedulerEnabled: false
    carbonRegion: "US-CAL-CISO"
    energyCostPerKWh: "0.10"
  image:
    repository: quay.io/aguetta/keck
    tag: latest

kubectl apply -f keck-operator/config/samples/keckcluster.yaml

The operator will create:

keck-system namespace
keck-agent DaemonSet (one agent per node, privileged)
keck-controller Deployment
ServiceAccount, ClusterRole, ClusterRoleBinding
Services for controller gRPC and HTTP endpoints

Verify:

kubectl get keckclusters
# NAME   AGENTS   CONTROLLER   PHASE     AGE
# keck   12       true         Running   2m

kubectl get pods -n keck-system
# NAME                               READY   STATUS    RESTARTS   AGE
# keck-agent-xxxxx                   1/1     Running   0          2m
# keck-agent-yyyyy                   1/1     Running   0          2m
# keck-controller-zzzzz-aaaaa        1/1     Running   0          2m

Option 2: Direct Deployment (Without OLM)

For clusters without OLM (vanilla Kubernetes, k3s, etc.):

cd keck-operator

# Install CRDs
make install

# Deploy operator, RBAC, and manager
make deploy

# Create KeckCluster
kubectl apply -f config/samples/keckcluster.yaml

To remove:

make undeploy

Option 3: Local Development

Run the operator outside the cluster for development:

cd keck-operator

# Install CRDs into your dev cluster
make install

# Run operator locally (uses ~/.kube/config)
make run

Setting Power Budgets

Limit power consumption per namespace:

apiVersion: keck.io/v1alpha1
kind: PowerBudget
metadata:
  name: ml-training-budget
  namespace: ml-training
spec:
  maxWatts: 10000
  action: reject   # alert | throttle | reject

kubectl apply -f keck-operator/config/samples/powerbudget.yaml

kubectl get powerbudgets -A
# NAMESPACE      NAME                 BUDGET (W)   CURRENT (W)   USAGE   EXCEEDED
# ml-training    ml-training-budget   10000        7234          72%     false
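The USAGE and EXCEEDED columns can be derived from the budget and the current reading. This sketch assumes the percentage is truncated to a whole number, which matches the 72% shown for 7234 W of 10000 W, and models the three `action` values from the manifest as an enum. The names are hypothetical.

```rust
/// The three actions a PowerBudget can request.
#[derive(Debug, PartialEq)]
enum Action { Alert, Throttle, Reject }

struct BudgetStatus {
    usage_percent: u32,
    exceeded: bool,
}

/// Computes the status columns from the budget and the current draw.
fn budget_status(max_watts: f64, current_watts: f64) -> BudgetStatus {
    let usage_percent = if max_watts > 0.0 {
        (current_watts / max_watts * 100.0).floor() as u32 // truncation is an assumption
    } else {
        0
    };
    BudgetStatus { usage_percent, exceeded: current_watts > max_watts }
}

/// Parses the `action` field; unknown values are rejected rather than defaulted.
fn parse_action(s: &str) -> Option<Action> {
    match s {
        "alert" => Some(Action::Alert),
        "throttle" => Some(Action::Throttle),
        "reject" => Some(Action::Reject),
        _ => None,
    }
}
```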

Customizing Agent Profiles Per Node

Use PowerProfile to override the agent profile on specific nodes:

# Full metering on GPU nodes
apiVersion: keck.io/v1alpha1
kind: PowerProfile
metadata:
  name: gpu-nodes-full
spec:
  profile: full
  nodeSelector:
    nvidia.com/gpu.present: "true"
  gpuEnabled: true
---
# Minimal overhead on edge nodes
apiVersion: keck.io/v1alpha1
kind: PowerProfile
metadata:
  name: edge-minimal
spec:
  profile: minimal
  nodeSelector:
    node-role.kubernetes.io/edge: ""

kubectl apply -f keck-operator/config/samples/powerprofile.yaml

kubectl get powerprofiles
# NAME              PROFILE   NODES   AGE
# gpu-nodes-full    full      4       1m
# edge-minimal      minimal   2       1m
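The `nodeSelector` field presumably follows the usual Kubernetes equality semantics: a node matches when every key/value pair in the selector is present in the node's labels. A minimal sketch of that check, with hypothetical helper names:

```rust
use std::collections::BTreeMap;

type Labels = BTreeMap<String, String>;

/// Kubernetes-style equality selector: every selector pair must be
/// present on the node with the same value. An empty selector matches all nodes.
fn selector_matches(selector: &Labels, node_labels: &Labels) -> bool {
    selector.iter().all(|(k, v)| node_labels.get(k) == Some(v))
}

/// Convenience constructor for the examples.
fn labels(pairs: &[(&str, &str)]) -> Labels {
    pairs.iter().map(|(k, v)| (k.to_string(), v.to_string())).collect()
}
```

A node labeled `nvidia.com/gpu.present: "true"` matches the `gpu-nodes-full` selector, and a node carrying `node-role.kubernetes.io/edge` with an empty value matches `edge-minimal`. How Keck resolves a node that matches more than one PowerProfile is not covered here.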

Multi-Cluster Setup (Fleet Manager)

For multi-cluster deployments, run the fleet manager separately and point each cluster's controller to it:

apiVersion: keck.io/v1alpha1
kind: KeckCluster
metadata:
  name: keck
spec:
  # ... agent and controller config ...
  fleetEndpoint: "fleet-manager.example.com:9091"

The fleet manager aggregates data from all clusters and provides:

Unified dashboard at http://<fleet-manager>:8090
Per-team power/carbon/cost views
Carbon-aware routing recommendations
Policy enforcement across clusters
ESG reporting

Accessing the Dashboard

On OpenShift, the Keck UI integrates directly into the console as a Dynamic Console Plugin. After deployment, "Power Management" appears in the left navigation — no separate URL needed.

For non-OpenShift clusters, port-forward to the controller REST API:

kubectl port-forward -n keck-system svc/keck-controller 8080:8080
# API available at http://localhost:8080/api/v1/cluster

Status

Deployed and running on OpenShift.

License

Apache-2.0
