3 changes: 3 additions & 0 deletions kagenti-operator/.gitignore
@@ -7,6 +7,9 @@
bin/*
Dockerfile.cross

# Local dev (make run webhook TLS)
.run/

# Test binary, built with `go test -c`
*.test

23 changes: 21 additions & 2 deletions kagenti-operator/Makefile
@@ -112,9 +112,28 @@ lint-config: golangci-lint ## Verify golangci-lint linter configuration
build: manifests generate fmt vet ## Build manager binary.
go build -o bin/manager cmd/main.go

# Self-signed TLS for the validating webhook when running outside the cluster (controller-runtime
# otherwise expects tls.crt/tls.key under $TMPDIR/k8s-webhook-server/serving-certs).
WEBHOOK_CERT_DIR ?= $(CURDIR)/.run/webhook-certs

.PHONY: run
run: manifests generate fmt vet ## Run a controller from your host.
go run ./cmd/main.go
run: manifests generate fmt vet ## Run a controller from your host (needs cluster + CRDs: make install).
	@mkdir -p "$(WEBHOOK_CERT_DIR)"
	@if [ ! -f "$(WEBHOOK_CERT_DIR)/tls.crt" ] || [ ! -f "$(WEBHOOK_CERT_DIR)/tls.key" ]; then \
		command -v openssl >/dev/null 2>&1 || { \
			echo "openssl not found: install it, or create tls.crt/tls.key under WEBHOOK_CERT_DIR and re-run"; \
			echo " WEBHOOK_CERT_DIR=$(WEBHOOK_CERT_DIR)"; \
			exit 1; \
		}; \
		openssl req -x509 -newkey rsa:2048 -nodes \
			-keyout "$(WEBHOOK_CERT_DIR)/tls.key" \
			-out "$(WEBHOOK_CERT_DIR)/tls.crt" \
			-days 365 \
			-subj "/CN=localhost"; \
	fi
	go run ./cmd/main.go \
		--webhook-cert-path="$(WEBHOOK_CERT_DIR)" \
		--metrics-secure=false

# If you wish to build the manager image targeting other platforms you can use the --platform flag.
# (i.e. docker build --platform linux/arm64). However, you must enable docker buildKit for it.
18 changes: 18 additions & 0 deletions kagenti-operator/cmd/main.go
@@ -44,6 +44,7 @@ import (
agentv1alpha1 "github.com/kagenti/operator/api/v1alpha1"
"github.com/kagenti/operator/internal/agentcard"
"github.com/kagenti/operator/internal/controller"
"github.com/kagenti/operator/internal/keycloak"
"github.com/kagenti/operator/internal/signature"
"github.com/kagenti/operator/internal/tekton"
webhookv1alpha1 "github.com/kagenti/operator/internal/webhook/v1alpha1"
@@ -76,6 +77,7 @@ func main() {
var requireA2ASignature bool
var signatureAuditMode bool
var enforceNetworkPolicies bool
var enableOperatorClientRegistration bool

var spireTrustDomain string
var spireTrustBundleConfigMapName string
@@ -107,6 +109,8 @@ func main() {
"When true, log signature verification failures but don't block (use for rollout)")
flag.BoolVar(&enforceNetworkPolicies, "enforce-network-policies", false,
"Create NetworkPolicies to restrict traffic for agents with unverified signatures")
flag.BoolVar(&enableOperatorClientRegistration, "enable-operator-client-registration", false,
"Reconcile Keycloak client registration for workloads with kagenti.io/client-registration-inject=false")

flag.StringVar(&spireTrustDomain, "spire-trust-domain", "",
"SPIRE trust domain for identity binding (e.g. 'example.org')")
@@ -325,6 +329,20 @@ func main() {
os.Exit(1)
}

	if enableOperatorClientRegistration {
		if err = (&controller.ClientRegistrationReconciler{
			Client:                  mgr.GetClient(),
			APIReader:               mgr.GetAPIReader(),
			Scheme:                  mgr.GetScheme(),
			SpireTrustDomain:        spireTrustDomain,
			KeycloakAdminTokenCache: &keycloak.CachedAdminTokenProvider{},
		}).SetupWithManager(mgr); err != nil {
			setupLog.Error(err, "unable to create controller", "controller", "ClientRegistration")
			os.Exit(1)
		}
		setupLog.Info("Operator-managed client registration controller enabled")
	}

if controller.TektonConfigCRDExists(mgr.GetConfig()) {
if err = (&controller.TektonConfigReconciler{
Client: mgr.GetClient(),
11 changes: 11 additions & 0 deletions kagenti-operator/config/rbac/role.yaml
@@ -30,6 +30,17 @@ rules:
- patch
- update
- watch
- apiGroups:
- ""
resources:
- secrets
verbs:
- create
- get
- list
Review comment (Collaborator):
suggestion: This grants cluster-wide secrets read access (get/list/watch). The controller only needs secrets in agent namespaces. Consider whether a more restrictive RBAC binding is warranted (ClusterRole + per-namespace RoleBindings), or document why cluster-wide is needed.

- patch
- update
- watch
- apiGroups:
- agent.kagenti.dev
resources:
171 changes: 171 additions & 0 deletions kagenti-operator/docs/operator-managed-client-registration.md
@@ -0,0 +1,171 @@
# Operator-managed Keycloak client registration

This document describes the split responsibility between **kagenti-operator** and **kagenti-webhook** for registering agent workloads as OAuth clients in Keycloak and delivering credentials to AuthBridge sidecars.

The implementation lives in two repositories today (`kagenti-operator`, `kagenti-extensions` / `kagenti-webhook`); the same feature branch name is used in both until the code is consolidated into a single repo.

---

## 1. Why this change

### 1.1 Problem

By default, the mutating webhook injects a **`kagenti-client-registration`** sidecar (or embeds equivalent behavior inside the combined **authbridge** container). That sidecar:

- Runs **inside every pod**, uses workload identity (SPIFFE when SPIRE is enabled), and talks to Keycloak to register or refresh the OAuth client.
- Competes for startup ordering and resources with the application and other sidecars.

Some deployments want **Envoy / SPIFFE / AuthBridge** injection to stay **pod-local**, but prefer **client lifecycle and secrets** to be handled **centrally** by the platform: one registration path, predictable secret names, and no client-registration container in the pod.

### 1.2 Approach

Workloads **opt out** of webhook-injected client registration with a well-known label. The **operator** then:

1. Registers the workload as a Keycloak client using the **Keycloak admin API** (same conceptual contract as the sidecar).
2. Creates a **Secret** in the workload namespace with `client-id.txt` and `client-secret.txt`.
3. Annotates the pod template so the **webhook** can mount that Secret into every container that already uses the **`shared-data`** volume, at the **same paths** the sidecar used (`/shared/client-id.txt`, `/shared/client-secret.txt`).

The webhook continues to inject **proxy-init**, **envoy** / **authbridge**, and **spiffe-helper** according to existing precedence and feature gates; it only skips the **client-registration** sidecar (or the registration portion of combined authbridge) when the label opts out.

### 1.3 Benefits

- **Fewer containers** when the sidecar path is not desired.
- **Centralized registration** using namespace `keycloak-admin-secret` (already provisioned for the sidecar contract).
- **Deterministic secret naming** derived from namespace and workload name (`kagenti-keycloak-client-credentials-<hash>`), with **owner references** to the Deployment or StatefulSet.
- **Safe ordering**: the operator creates the Secret **before** setting the pod-template annotation, so new Pods do not reference a missing Secret.
- **Admission reinvocation**: the webhook uses `reinvocationPolicy: IfNeeded` so a second pass can add Secret volume mounts if the operator annotates the template **after** the first injection.
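
The deterministic naming above can be illustrated with a minimal sketch. This document does not state which hash the operator uses or how many characters it keeps, so the SHA-256 digest, the 8-character truncation, and the function name `credentialsSecretName` are assumptions made only for illustration.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// secretNamePrefix is the prefix documented for operator-created Secrets.
const secretNamePrefix = "kagenti-keycloak-client-credentials-"

// credentialsSecretName derives a stable Secret name from the namespace and
// workload name, so repeated reconciles always target the same Secret.
// ASSUMPTION: SHA-256 truncated to 8 hex characters; the operator's actual
// hash function and length may differ.
func credentialsSecretName(namespace, workload string) string {
	sum := sha256.Sum256([]byte(namespace + "/" + workload))
	return secretNamePrefix + hex.EncodeToString(sum[:])[:8]
}

func main() {
	// Hypothetical namespace and workload names.
	fmt.Println(credentialsSecretName("team1", "weather-agent"))
}
```

Because the name depends only on its inputs, the webhook and the operator can agree on it without any shared state beyond the pod-template annotation.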

---

## 2. How it works

### 2.1 Contract (labels and annotations)

| Key | Value | Meaning |
|-----|--------|---------|
| `kagenti.io/client-registration-inject` | `"false"` | Workload opts **out** of webhook-injected client registration; operator is expected to manage registration **if** other conditions hold. |
| `kagenti.io/keycloak-client-credentials-secret-name` | Secret name | Set by the operator on the **pod template**; webhook reads it from **Pod** annotations at admission time and mounts the Secret. |

The string values for the label key and the annotation key are **duplicated** in both repos and must stay in sync:

- Operator: `LabelClientRegistrationInject`, `AnnotationKeycloakClientSecretName` in `clientregistration_controller.go`.
- Webhook: `LabelClientRegistrationInject` in `constants.go`, `AnnotationKeycloakClientSecretName` in `keycloak_client_credentials.go`.

### 2.2 Which workloads the operator reconciles

The **ClientRegistration** controller watches **Deployments** and **StatefulSets** whose pod template labels satisfy:

- `kagenti.io/client-registration-inject` is **exactly** `"false"`.
- `kagenti.io/type` is **`agent`**, or **`tool`** when the cluster feature gate **`injectTools`** is true (tools are skipped if `injectTools` is false).

Other workloads are ignored by this controller.
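
The selection rule can be sketched as a small predicate over the pod template labels. The label keys and values come from the contract above; the function name `shouldReconcile` and the plain `map[string]string` input are hypothetical simplifications, not the controller's actual signature.

```go
package main

import "fmt"

const (
	labelClientRegistrationInject = "kagenti.io/client-registration-inject"
	labelType                     = "kagenti.io/type"
)

// shouldReconcile reports whether pod-template labels select a workload for
// operator-managed client registration. injectTools mirrors the cluster
// feature gate of the same name.
func shouldReconcile(podTemplateLabels map[string]string, injectTools bool) bool {
	// The opt-out label must be exactly "false"; a missing label or any
	// other value leaves the workload on the sidecar path.
	if podTemplateLabels[labelClientRegistrationInject] != "false" {
		return false
	}
	switch podTemplateLabels[labelType] {
	case "agent":
		return true
	case "tool":
		return injectTools
	default:
		return false
	}
}

func main() {
	labels := map[string]string{
		labelClientRegistrationInject: "false",
		labelType:                     "tool",
	}
	fmt.Println(shouldReconcile(labels, false), shouldReconcile(labels, true))
}
```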

### 2.3 Webhook behavior

1. **Precedence** (unchanged): `kagenti.io/client-registration-inject=false` disables injection of the client-registration sidecar / registration slice in combined authbridge (`precedence.go`).
2. **After** sidecars and volumes are applied, **`ApplyKeycloakClientCredentialsSecretVolumes`** runs for **every** mutation:
   - If the pod (template) annotation `kagenti.io/keycloak-client-credentials-secret-name` is set, the webhook adds a **Secret volume** named after the Secret (`kagenti-keycloak-client-credentials-<hash>`) and **subPath mounts** for `client-id.txt` and `client-secret.txt` into **each container that already has a `shared-data` volume mount**.
3. **Reinvocation**: if the pod is already considered “injected” (e.g. envoy or proxy-init present) but operator mounts are still missing, **`NeedsKeycloakClientCredentialsVolumePatch`** returns true and the webhook applies **only** the operator Secret mounts (`authbridge_webhook.go`).
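
The mount step can be sketched as follows. To stay self-contained, this sketch uses small stand-in structs rather than the real `k8s.io/api/core/v1` types, and `addCredentialMounts` is a hypothetical name, not the webhook's `ApplyKeycloakClientCredentialsSecretVolumes`. It only illustrates the two properties described above: mounts go into containers that already mount `shared-data`, and applying the patch a second time (reinvocation) changes nothing.

```go
package main

import "fmt"

// VolumeMount and Container are simplified stand-ins for the corev1 types.
type VolumeMount struct {
	Name      string
	MountPath string
	SubPath   string
	ReadOnly  bool
}

type Container struct {
	Name         string
	VolumeMounts []VolumeMount
}

const sharedDataVolume = "shared-data"

// hasMount reports whether the container mounts the named volume.
func hasMount(c Container, volume string) bool {
	for _, m := range c.VolumeMounts {
		if m.Name == volume {
			return true
		}
	}
	return false
}

// addCredentialMounts appends subPath mounts for both credential files to
// every container that already mounts shared-data. Containers that already
// carry the Secret mounts are skipped, so a second pass is a no-op.
func addCredentialMounts(containers []Container, secretName string) {
	for i := range containers {
		c := &containers[i]
		if !hasMount(*c, sharedDataVolume) || hasMount(*c, secretName) {
			continue
		}
		for _, file := range []string{"client-id.txt", "client-secret.txt"} {
			c.VolumeMounts = append(c.VolumeMounts, VolumeMount{
				Name:      secretName,
				MountPath: "/shared/" + file,
				SubPath:   file,
				ReadOnly:  true,
			})
		}
	}
}

func main() {
	containers := []Container{
		{Name: "envoy", VolumeMounts: []VolumeMount{{Name: sharedDataVolume, MountPath: "/shared"}}},
		{Name: "app"}, // no shared-data mount, so it is left untouched
	}
	addCredentialMounts(containers, "kagenti-keycloak-client-credentials-abc123")
	for _, c := range containers {
		fmt.Println(c.Name, len(c.VolumeMounts))
	}
}
```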

### 2.4 Operator reconcile flow (simplified)

1. Read **cluster feature gates** (`kagenti-webhook` ConfigMap in the cluster defaults namespace). If `globalEnabled` or `clientRegistration` is false, skip.
2. Read **`authbridge-config`** in the workload namespace (`KEYCLOAK_URL`, `KEYCLOAK_REALM`, `SPIRE_ENABLED`, etc.).
3. Read **`keycloak-admin-secret`** (admin username/password).
4. Compute **Keycloak client ID**:
   - If `SPIRE_ENABLED` is not true: `namespace/workloadName`.
   - If SPIRE is enabled: `spiffe://<trust-domain>/ns/<namespace>/sa/<serviceAccount>` (requires a **non-default** `serviceAccountName` and operator **`--spire-trust-domain`**).
5. **Register or fetch** the client via Keycloak admin API (`internal/keycloak`).
6. **Create or update** the credentials Secret; set **owner** to the Deployment/StatefulSet.
7. **Patch** the pod template annotation `kagenti.io/keycloak-client-credentials-secret-name` to the deterministic secret name.
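
Step 4 of the flow above can be sketched as a pure function. The two ID shapes and the two preconditions are taken from this document; the function name `keycloakClientID`, its parameter list, and the error wording are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// keycloakClientID computes the Keycloak client ID for a workload.
// Without SPIRE the ID is "namespace/workloadName"; with SPIRE it is the
// workload's SPIFFE ID, which needs a trust domain and a dedicated
// (non-default) ServiceAccount.
func keycloakClientID(spireEnabled bool, trustDomain, namespace, workload, serviceAccount string) (string, error) {
	if !spireEnabled {
		return namespace + "/" + workload, nil
	}
	if trustDomain == "" {
		return "", errors.New("--spire-trust-domain is required")
	}
	if serviceAccount == "" || serviceAccount == "default" {
		return "", errors.New("a non-default serviceAccountName is required")
	}
	return fmt.Sprintf("spiffe://%s/ns/%s/sa/%s", trustDomain, namespace, serviceAccount), nil
}

func main() {
	// Hypothetical namespace, workload, and ServiceAccount names.
	id, err := keycloakClientID(true, "example.org", "team1", "weather-agent", "weather-sa")
	fmt.Println(id, err)
}
```

Returning an error instead of falling back to the `namespace/workloadName` shape keeps a misconfigured SPIRE namespace from silently registering a client under the wrong identity.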

### 2.5 Feature flags

| Component | Flag / gate | Role |
|-----------|-------------|------|
| Operator | `--enable-operator-client-registration` (default **false**) | Master switch for the ClientRegistration controller; the controller is only registered when this flag is set. |
| Operator | `--spire-trust-domain` | Required for SPIFFE-shaped client IDs when `authbridge-config` has `SPIRE_ENABLED=true`. |
| Webhook | `--enable-client-registration` | Cluster-wide gate for client-registration **injection** (precedence still applies). |
| Webhook | Feature gates ConfigMap | `clientRegistration`, `injectTools`, `globalEnabled`, etc., same as for injected sidecars. |

---

## 3. Requirements

### 3.1 Platform / namespace

- **`authbridge-config`** ConfigMap in the workload namespace with at least `KEYCLOAK_URL`, `KEYCLOAK_REALM`, and a `SPIRE_ENABLED` value that is consistent with the mesh.
- **`keycloak-admin-secret`** in the same namespace with `KEYCLOAK_ADMIN_USERNAME` and `KEYCLOAK_ADMIN_PASSWORD`.
- **Webhook** and **operator** versions that both implement this contract (deploy together).

### 3.2 Workload

- **Deployment** or **StatefulSet** (not bare Pods for operator ownership of Secrets).
- Pod template labels: `kagenti.io/client-registration-inject: "false"` and `kagenti.io/type: agent` or `tool` (subject to `injectTools`).
- For **SPIRE-enabled** namespaces: `spec.template.spec.serviceAccountName` must be a **dedicated** ServiceAccount (not `default`).

### 3.3 Operator configuration

- When `authbridge-config` sets `SPIRE_ENABLED=true`, configure **`--spire-trust-domain`** to match the SPIRE server trust domain (same value as used for workload SPIFFE IDs).
- Ensure the operator can read **`authbridge-config`** and **`keycloak-admin-secret`** in agent namespaces (RBAC is extended for ConfigMaps and Secrets as needed).

### 3.4 Webhook configuration

- **`reinvocationPolicy: IfNeeded`** on the mutating webhook so late annotations still get mounts.
- Pod template must eventually carry **`kagenti.io/keycloak-client-credentials-secret-name`** once the operator has reconciled; until then, auth consumers on `shared-data` may not see credentials (operator retries with backoff).

---

## 4. Migration strategy

### 4.1 Recommended rollout order

1. **Upgrade operator** (with ClientRegistration controller and Keycloak client package).
2. **Upgrade webhook** (operator Secret mounts + reinvocation path).
3. **Configure** `--spire-trust-domain` on the operator if agent namespaces use SPIRE (`SPIRE_ENABLED=true`).

Rolling out the webhook before the operator can leave workloads with `client-registration-inject=false` **without** registration until the operator is available; rolling out the operator before the webhook can create Secrets and annotations **without** mounts until the new webhook is live. A short overlap is acceptable if you migrate workloads **after** both are deployed.

### 4.2 Adopting operator-managed registration per workload

1. Ensure the namespace has `authbridge-config` and `keycloak-admin-secret`.
2. On the workload pod template, set **`kagenti.io/client-registration-inject: "false"`**.
3. If SPIRE is on, set a **dedicated** `serviceAccountName`.
4. **Restart** or roll the workload so the webhook sees the new template and the operator reconciles.

The operator will create or reuse the Keycloak client and Secret; the webhook will inject mounts on create or on reinvocation.

### 4.3 Rollback

- Remove **`kagenti.io/client-registration-inject: "false"`** (or set client-registration injection back to the default path) and **remove** the operator annotation if present.
- Roll pods so the **client-registration sidecar** (or combined authbridge with registration) runs again.
- Optionally delete operator-created Secrets named `kagenti-keycloak-client-credentials-*` after confirming Keycloak clients are recreated by the sidecar path if needed.

Disabling **`--enable-operator-client-registration`** stops new reconciliation but does not remove existing annotations or Secrets; clean those up if you need a full rollback.

### 4.4 Keycloak client identity

Switching from **sidecar** to **operator** registration may change the **client ID** string (e.g. from SPIFFE-based to `namespace/name` when SPIRE is off, or same SPIFFE shape when SPIRE is on). Plan for **one-time** Keycloak client cleanup or renamed clients if both paths ran for the same logical workload.

### 4.5 Future consolidation

When webhook and operator live in one repository, keep this document as the single **source of truth** for the contract; co-locate constants in one package to avoid drift between annotation/label keys.

---

## 5. Related code

| Area | Location |
|------|-----------|
| Operator reconciler | `internal/controller/clientregistration_controller.go` |
| Keycloak admin client | `internal/keycloak/` |
| Operator entrypoint / flags | `cmd/main.go` |
| Webhook mounts + reinvocation | `internal/webhook/injector/keycloak_client_credentials.go`, `pod_mutator.go`, `internal/webhook/v1alpha1/authbridge_webhook.go` |
| Injection precedence | `internal/webhook/injector/precedence.go` |

---

## 6. Operational notes

- If logs show **`cannot resolve Keycloak client id yet`** with reason **`--spire-trust-domain is required`**, configure the operator trust domain to match SPIRE (see platform docs / `kagenti-deps` `spire.trustDomain` on Kind installs).
- The operator reads **`authbridge-config`** via an **uncached API reader** because ConfigMaps may be excluded from the controller-runtime cache for scalability; this matches how the webhook resolves namespace config.
2 changes: 1 addition & 1 deletion kagenti-operator/go.mod
@@ -17,6 +17,7 @@ require (
k8s.io/client-go v0.32.0
k8s.io/utils v0.0.0-20241104100929-3ea5e8cea738
sigs.k8s.io/controller-runtime v0.20.0
sigs.k8s.io/yaml v1.4.0
)

require (
@@ -103,5 +104,4 @@ require (
sigs.k8s.io/apiserver-network-proxy/konnectivity-client v0.31.0 // indirect
sigs.k8s.io/json v0.0.0-20241010143419-9aa6b5e7a4b3 // indirect
sigs.k8s.io/structured-merge-diff/v4 v4.4.2 // indirect
sigs.k8s.io/yaml v1.4.0 // indirect
)