2 changes: 2 additions & 0 deletions .github/workflows/ci.yaml
@@ -58,6 +58,8 @@ jobs:
           sudo mv ./kind /usr/local/bin/kind
       - name: Create Kind cluster
         run: kind create cluster
+      - name: Set up Helm
+        uses: azure/setup-helm@b9e51907a09c216f16ebe8536097933489208112 # v4.3.0
       - name: Run e2e tests
         run: |
           go mod tidy
219 changes: 219 additions & 0 deletions kagenti-operator/test/e2e/README.md
@@ -0,0 +1,219 @@
# E2E Tests

End-to-end tests for the kagenti-operator. The suite runs 8 specs:

- **Manager tests** (2 specs) — controller pod readiness and Prometheus metrics
- **AgentCard tests** (6 specs) — webhook validation, auto-discovery, duplicate prevention, audit mode, and SPIRE signature verification

## Prerequisites

- [Kind](https://kind.sigs.k8s.io/) — `go install sigs.k8s.io/kind@latest`
- [Helm](https://helm.sh/) — `brew install helm`
- [kubectl](https://kubernetes.io/docs/tasks/tools/) — `brew install kubectl`
- Container runtime: **Docker** or **Podman**

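The test suite auto-detects whether Docker or Podman is available, so no environment variables are required. Set `CONTAINER_TOOL` only if detection picks the wrong runtime (see Troubleshooting).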
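
As a rough illustration of how such detection could work, here is a minimal sketch. It is not the suite's actual `utils.DetectContainerTool`; the function names, the Docker-before-Podman lookup order, and the final fallback are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// pickContainerTool is a hypothetical helper, not the suite's real code.
// override mirrors the CONTAINER_TOOL variable; found reports whether a
// binary with the given name is available.
func pickContainerTool(override string, found func(string) bool) string {
	if override != "" {
		return override // an explicit choice always wins
	}
	// Assumed preference order: Docker first, then Podman.
	for _, tool := range []string{"docker", "podman"} {
		if found(tool) {
			return tool
		}
	}
	return "docker" // assumed default when neither binary is found
}

// onPath reports whether name resolves to an executable on PATH.
func onPath(name string) bool {
	_, err := exec.LookPath(name)
	return err == nil
}

func main() {
	tool := pickContainerTool(os.Getenv("CONTAINER_TOOL"), onPath)
	fmt.Println("Using container tool:", tool)
}
```

Passing the lookup as a function keeps the selection logic testable without Docker or Podman installed.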

## Run

```bash
# Create a fresh Kind cluster
kind delete cluster 2>/dev/null; kind create cluster

# Run all 8 specs (~7 min)
make test-e2e
```

The suite automatically builds and loads the images, installs Prometheus, CertManager, and SPIRE, deploys the controller, runs the tests, and tears everything down.

## Skip pre-installed components

If Prometheus, CertManager, or SPIRE is already installed on your cluster, skip its installation:

```bash
PROMETHEUS_INSTALL_SKIP=true \
CERT_MANAGER_INSTALL_SKIP=true \
SPIRE_INSTALL_SKIP=true \
make test-e2e
```
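
The suite setup in `e2e_suite_test.go` installs a component only when its skip variable is unset and its CRDs are not already present, and it uninstalls only what it installed. The sketch below restates that rule as a small function; `plan` and `planComponent` are hypothetical names introduced for illustration.

```go
package main

import (
	"fmt"
	"os"
)

// plan records what the suite would do for one component.
type plan struct {
	Install   bool
	Uninstall bool
}

// planComponent mirrors the rule used in BeforeSuite/AfterSuite:
// manage a component only if it is neither skipped nor pre-installed.
func planComponent(skip, alreadyInstalled bool) plan {
	manage := !skip && !alreadyInstalled
	return plan{Install: manage, Uninstall: manage}
}

func main() {
	vars := []string{
		"PROMETHEUS_INSTALL_SKIP",
		"CERT_MANAGER_INSTALL_SKIP",
		"SPIRE_INSTALL_SKIP",
	}
	for _, name := range vars {
		skip := os.Getenv(name) == "true"
		// alreadyInstalled is fixed to false here; the real suite probes for CRDs.
		fmt.Printf("%s -> %+v\n", name, planComponent(skip, false))
	}
}
```

A consequence of this rule is that a pre-installed component is left in place after the run, even when no skip variable is set.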

## Run specific scenarios

```bash
# Webhook + auto-discovery tests only (~4 min)
go test ./test/e2e/ -v -ginkgo.v -ginkgo.focus="should reject AgentCard|should not create|should auto-create|should reject duplicate"

# Signature verification tests only (~5 min)
go test ./test/e2e/ -v -ginkgo.v -ginkgo.focus="SignatureInvalidAudit|should verify signed"

# Manager tests only
go test ./test/e2e/ -v -ginkgo.v -ginkgo.focus="Manager"
```

## Cleanup

```bash
kind delete cluster
```

## Test scenarios

| Scenario | Context | What it tests |
|----------|---------|---------------|
| Reject missing targetRef | Without signature | Webhook rejects AgentCard with no `spec.targetRef` |
| No protocol label | Without signature | Workload with `kagenti.io/type=agent` but no `protocol.kagenti.io/*` label gets no auto-created card |
| Auto-discovery | Without signature | Properly labeled workload gets an auto-created AgentCard with correct targetRef, protocol, and Synced=True |
| Duplicate prevention | Without signature | Webhook rejects a second AgentCard targeting the same workload |
| Audit mode | With signature | Unsigned card syncs (Synced=True) but reports SignatureVerified=False with reason SignatureInvalidAudit |
| Signed agent | With signature | SPIRE-signed card gets SignatureVerified=True, correct SPIFFE ID, Synced=True, and Bound=True |

## Architecture

### What gets installed

The test suite sets up the following infrastructure in a Kind cluster:

```
BeforeSuite (once per suite)
├── Build & load operator image into Kind
├── Install Prometheus Operator v0.77.1 (metrics/ServiceMonitor CRDs)
├── Install CertManager v1.16.3 (webhook TLS certificates)
├── Build & load agentcard-signer image into Kind
└── Install SPIRE via Helm (spire-crds v0.5.0 + spire v0.28.3)

BeforeAll (per Describe block)
├── make install → applies AgentCard CRD via kustomize
├── make deploy → creates namespace, RBAC, Deployment, webhook, ServiceMonitor
├── Wait for controller pod Running + webhook endpoint ready
└── Create test namespace e2e-agentcard-test (labeled agentcard=true + PSA restricted)
```

### How the operator is installed

```
make docker-build make install make deploy
│ │ │
▼ ▼ ▼
Build image from kustomize build config/crd kustomize edit set image
Dockerfile │ │
│ kubectl apply --server-side kustomize build config/default
▼ │ │
kind load docker-image ▼ kubectl apply --server-side
(podman fallback) AgentCard CRD created │
kagenti-operator-system:
├── ServiceAccount
├── ClusterRole + Binding
├── Certificate + Issuer (cert-manager)
├── Webhook Service (port 443)
├── Metrics Service (port 8443)
├── Deployment (controller pod)
├── ValidatingWebhookConfiguration
└── ServiceMonitor (Prometheus)
```

### Component interactions

```
┌─ cert-manager ───────────────────────────────────────────────────┐
│ Issues TLS cert for operator webhook │
│ Injects CA into ValidatingWebhookConfiguration │
└───────────────────────────┬──────────────────────────────────────┘
│ TLS cert
┌─ kagenti-operator-system ────────────────────────────────────────┐
│ Controller Manager Pod │
│ ├── Webhook server (validates AgentCard create/update) │
│ ├── Metrics server (HTTPS, scraped by Prometheus) │
│ ├── AgentCardSync controller │
│ │ watches Deployments → auto-creates AgentCards │
│ └── AgentCard controller │
│ fetches card metadata, verifies signatures, evaluates binding│
└───────────────────────────┬──────────────────────────────────────┘
│ fetches /.well-known/agent-card.json
┌─ e2e-agentcard-test ─────────────────────────────────────────────┐
│ Agent Deployments (echo-agent, audit-agent, signed-agent) │
│ Services (expose agents for card fetching) │
│ AgentCard CRs (auto-created or manually applied) │
└───────────────────────────▲──────────────────────────────────────┘
│ SPIRE CSI volume provides SVIDs
┌─ spire-system ───────────┴──────────────────────────────────────┐
│ SPIRE Server → issues SVIDs via ClusterSPIFFEID policies │
│ SPIRE Agent (DaemonSet) → distributes SVIDs via CSI driver │
│ spire-bundle ConfigMap → CA certs for signature verification │
└──────────────────────────────────────────────────────────────────┘
```

### Test scenario details

#### Reject missing targetRef

The test applies an AgentCard with no `spec.targetRef`. The validating webhook requires
`agentcard.Spec.TargetRef != nil`; when it is nil, the request is rejected with `"spec.targetRef is required"`.
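
A minimal sketch of that check follows. The struct shapes and the `validateTargetRef` name are assumptions for illustration; only the nil check and the error message come from the description above.

```go
package main

import (
	"errors"
	"fmt"
)

// TargetRef and AgentCardSpec are simplified stand-ins for the real API types.
type TargetRef struct {
	Kind string
	Name string
}

type AgentCardSpec struct {
	TargetRef *TargetRef
}

// validateTargetRef rejects a spec that has no targetRef.
func validateTargetRef(spec AgentCardSpec) error {
	if spec.TargetRef == nil {
		return errors.New("spec.targetRef is required")
	}
	return nil
}

func main() {
	fmt.Println(validateTargetRef(AgentCardSpec{}))
	fmt.Println(validateTargetRef(AgentCardSpec{
		TargetRef: &TargetRef{Kind: "Deployment", Name: "echo-agent"},
	}))
}
```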

#### No protocol label

Deploys `noproto-agent` with `kagenti.io/type=agent` but no `protocol.kagenti.io/*` label.
The sync controller's `shouldSyncWorkload()` requires both the agent type AND a protocol
label, so it skips this workload. The test uses `Consistently` for 15s to prove no card appears.
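
The label rule can be sketched as below. This is not the operator's `shouldSyncWorkload()` itself; the function name `shouldSync` and the example key `protocol.kagenti.io/a2a` are assumptions, while the `kagenti.io/type=agent` value and the `protocol.kagenti.io/` prefix come from the description above.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	typeLabel      = "kagenti.io/type"
	protocolPrefix = "protocol.kagenti.io/"
)

// shouldSync reports whether a workload carries both the agent type label
// and at least one protocol label.
func shouldSync(labels map[string]string) bool {
	if labels[typeLabel] != "agent" {
		return false
	}
	for key := range labels {
		if strings.HasPrefix(key, protocolPrefix) {
			return true
		}
	}
	return false
}

func main() {
	noProto := map[string]string{typeLabel: "agent"}
	// The a2a key below is a hypothetical protocol label used for illustration.
	labeled := map[string]string{typeLabel: "agent", protocolPrefix + "a2a": ""}
	fmt.Println(shouldSync(noProto), shouldSync(labeled))
}
```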

#### Auto-discovery

Deploys `echo-agent` with both labels plus an inline Python HTTP server serving
`/.well-known/agent-card.json`. The sync controller auto-creates `echo-agent-deployment-card`.
The main controller reconciles it: fetches the card JSON from the Service endpoint, extracts
protocol from labels, and sets `Synced=True`. Test verifies managed-by label, targetRef fields,
protocol, and sync status.
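
The fetch step can be illustrated with the standard library alone. The sketch below stands in for the agent with an `httptest` server and is not the controller's implementation; `fetchCard`, `newStubAgent`, and the `name` field in the stub JSON are assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

const cardPath = "/.well-known/agent-card.json"

// fetchCard downloads and decodes the card document from an agent endpoint.
func fetchCard(client *http.Client, baseURL string) (map[string]any, error) {
	resp, err := client.Get(baseURL + cardPath)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status %d fetching %s", resp.StatusCode, cardPath)
	}
	var card map[string]any
	if err := json.NewDecoder(resp.Body).Decode(&card); err != nil {
		return nil, err
	}
	return card, nil
}

// newStubAgent plays the role of the inline HTTP server in the echo-agent pod.
func newStubAgent() *httptest.Server {
	mux := http.NewServeMux()
	mux.HandleFunc(cardPath, func(w http.ResponseWriter, _ *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		fmt.Fprint(w, `{"name":"echo-agent"}`) // hypothetical card content
	})
	return httptest.NewServer(mux)
}

func main() {
	srv := newStubAgent()
	defer srv.Close()

	card, err := fetchCard(srv.Client(), srv.URL)
	if err != nil {
		panic(err)
	}
	fmt.Println("fetched card for:", card["name"])
}
```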

#### Duplicate prevention

With `echo-agent-deployment-card` still present from the previous test (the specs run in an ordered container),
the test attempts to create `echo-agent-manual-card` targeting the same Deployment. The webhook's
`checkDuplicateTargetRef()` lists all AgentCards in the namespace, finds the existing card
with a matching targetRef, and rejects the request with `"an AgentCard already targets"`.
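
A simplified sketch of the duplicate check is shown below. The types, the `findDuplicate` name, the same-name exemption for updates, and everything in the error text after the documented prefix are assumptions.

```go
package main

import "fmt"

// TargetRef and AgentCard are simplified stand-ins for the real API types.
type TargetRef struct {
	Kind string
	Name string
}

type AgentCard struct {
	Name      string
	TargetRef TargetRef
}

// findDuplicate returns an error if another card in the namespace already
// points at the same workload as candidate.
func findDuplicate(existing []AgentCard, candidate AgentCard) error {
	for _, card := range existing {
		if card.Name == candidate.Name {
			continue // assumed: updating the same card is not a duplicate
		}
		if card.TargetRef == candidate.TargetRef {
			return fmt.Errorf("an AgentCard already targets %s %q: %s",
				card.TargetRef.Kind, card.TargetRef.Name, card.Name)
		}
	}
	return nil
}

func main() {
	target := TargetRef{Kind: "Deployment", Name: "echo-agent"}
	existing := []AgentCard{{Name: "echo-agent-deployment-card", TargetRef: target}}
	manual := AgentCard{Name: "echo-agent-manual-card", TargetRef: target}
	fmt.Println(findDuplicate(existing, manual))
}
```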

#### Audit mode

The controller is patched with `--require-a2a-signature=true --signature-audit-mode=true`,
and the test deploys an unsigned `audit-agent`. Signature verification fails because the card carries no signature,
but audit mode allows the sync to proceed. The status shows `Synced=True` and
`SignatureVerified=False` with reason `SignatureInvalidAudit`.
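
The interplay between the two flags can be sketched as a small decision function for a controller that requires signatures. This is an interpretation of the behaviour described here, not the controller's code: the `evaluate` name, the `SignatureInvalid` reason, and the assumption that strict mode blocks the sync are hypothetical, while `SignatureValid` and `SignatureInvalidAudit` appear in the scenarios above.

```go
package main

import "fmt"

// result models the status conditions the tests inspect.
type result struct {
	Synced            bool
	SignatureVerified bool
	Reason            string
}

// evaluate assumes --require-a2a-signature=true and varies only audit mode
// and the outcome of signature verification.
func evaluate(auditMode, signatureValid bool) result {
	switch {
	case signatureValid:
		return result{Synced: true, SignatureVerified: true, Reason: "SignatureValid"}
	case auditMode:
		// Verification failed, but audit mode only reports the failure.
		return result{Synced: true, SignatureVerified: false, Reason: "SignatureInvalidAudit"}
	default:
		// Assumed strict-mode outcome; the reason string is hypothetical.
		return result{Synced: false, SignatureVerified: false, Reason: "SignatureInvalid"}
	}
}

func main() {
	fmt.Printf("audit, unsigned:  %+v\n", evaluate(true, false))
	fmt.Printf("strict, signed:   %+v\n", evaluate(false, true))
	fmt.Printf("strict, unsigned: %+v\n", evaluate(false, false))
}
```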

#### Signed agent

The most complex scenario. Controller runs with `--require-a2a-signature=true` (no audit mode).

1. **ClusterSPIFFEID** tells SPIRE to issue SVIDs to agent pods
2. **signed-agent** Deployment uses an `agentcard-signer` init-container that:
- Connects to SPIRE agent via CSI-mounted socket
- Signs the unsigned card JSON with the pod's SVID
- Writes signed card to a shared emptyDir volume
3. Main container serves the signed card via HTTP
4. Controller fetches the card, verifies the x5c signature chain against the SPIRE trust
bundle, extracts the SPIFFE ID from the leaf cert SAN
5. Identity binding checks that the SPIFFE ID belongs to the configured trust domain

Test verifies: `SignatureVerified=True` (reason `SignatureValid`),
`signatureSpiffeId = spiffe://example.org/ns/e2e-agentcard-test/sa/signed-agent-sa`,
`Synced=True`, `Bound=True`.
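
The trust-domain check in step 5 can be approximated with the standard library. The sketch below is not the operator's binding logic; `workloadID` and `inTrustDomain` are hypothetical helpers, and the ID layout is inferred from the single SPIFFE ID the test expects.

```go
package main

import (
	"fmt"
	"net/url"
)

// workloadID builds an ID in the ns/sa layout seen in the expected test value.
func workloadID(trustDomain, namespace, serviceAccount string) string {
	return fmt.Sprintf("spiffe://%s/ns/%s/sa/%s", trustDomain, namespace, serviceAccount)
}

// inTrustDomain reports whether spiffeID belongs to trustDomain.
func inTrustDomain(spiffeID, trustDomain string) (bool, error) {
	u, err := url.Parse(spiffeID)
	if err != nil {
		return false, err
	}
	if u.Scheme != "spiffe" || u.Host == "" {
		return false, fmt.Errorf("not a SPIFFE ID: %q", spiffeID)
	}
	// In a SPIFFE ID the trust domain occupies the URI authority position.
	return u.Host == trustDomain, nil
}

func main() {
	id := workloadID("example.org", "e2e-agentcard-test", "signed-agent-sa")
	ok, err := inTrustDomain(id, "example.org")
	fmt.Println(id, ok, err)
}
```

A production implementation would more likely rely on a dedicated SPIFFE library for parsing and validation; plain URL parsing is used here only to keep the example dependency-free.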

## Troubleshooting

**Stale cluster state** — if you see errors about namespaces being terminated or cert-manager TLS failures, delete and recreate the cluster:

```bash
kind delete cluster && kind create cluster
```

**Podman socket errors** — ensure your Podman machine is running:

```bash
podman machine start
```

**Override container tool** — if auto-detection picks the wrong runtime:

```bash
CONTAINER_TOOL=podman make test-e2e
```
55 changes: 45 additions & 10 deletions kagenti-operator/test/e2e/e2e_suite_test.go
@@ -21,6 +21,7 @@ import (
 	"os"
 	"os/exec"
 	"testing"
+	"time"

 	. "github.com/onsi/ginkgo/v2"
 	. "github.com/onsi/gomega"
@@ -32,18 +33,25 @@ var (
 	// Optional Environment Variables:
 	// - PROMETHEUS_INSTALL_SKIP=true: Skips Prometheus Operator installation during test setup.
 	// - CERT_MANAGER_INSTALL_SKIP=true: Skips CertManager installation during test setup.
-	// These variables are useful if Prometheus or CertManager is already installed, avoiding
-	// re-installation and conflicts.
+	// - SPIRE_INSTALL_SKIP=true: Skips SPIRE installation during test setup.
+	// These variables are useful if Prometheus, CertManager, or SPIRE is already installed,
+	// avoiding re-installation and conflicts.
 	skipPrometheusInstall = os.Getenv("PROMETHEUS_INSTALL_SKIP") == "true"
 	skipCertManagerInstall = os.Getenv("CERT_MANAGER_INSTALL_SKIP") == "true"
+	skipSpireInstall = os.Getenv("SPIRE_INSTALL_SKIP") == "true"
 	// isPrometheusOperatorAlreadyInstalled will be set true when prometheus CRDs be found on the cluster
 	isPrometheusOperatorAlreadyInstalled = false
 	// isCertManagerAlreadyInstalled will be set true when CertManager CRDs be found on the cluster
 	isCertManagerAlreadyInstalled = false
+	// isSpireAlreadyInstalled will be set true when SPIRE CRDs are found on the cluster
+	isSpireAlreadyInstalled = false

 	// projectImage is the name of the image which will be build and loaded
 	// with the code source changes to be tested.
 	projectImage = "example.com/kagenti-operator:v0.0.1"
+
+	// signerImage is the agentcard-signer init-container image
+	signerImage = "ghcr.io/kagenti/kagenti-operator/agentcard-signer:e2e-test"
 )

 // TestE2E runs the end-to-end (e2e) test suite for the project. These tests execute in an isolated,
@@ -60,21 +68,20 @@ var _ = BeforeSuite(func() {
 	By("Ensure that Prometheus is enabled")
 	_ = utils.UncommentCode("config/default/kustomization.yaml", "#- ../prometheus", "#")

+	containerTool := utils.DetectContainerTool()
+	_, _ = fmt.Fprintf(GinkgoWriter, "Using container tool: %s\n", containerTool)
+
 	By("building the manager(Operator) image")
-	cmd := exec.Command("make", "docker-build", fmt.Sprintf("IMG=%s", projectImage))
+	cmd := exec.Command("make", "docker-build",
+		fmt.Sprintf("IMG=%s", projectImage),
+		fmt.Sprintf("CONTAINER_TOOL=%s", containerTool))
 	_, err := utils.Run(cmd)
 	ExpectWithOffset(1, err).NotTo(HaveOccurred(), "Failed to build the manager(Operator) image")

-	// TODO(user): If you want to change the e2e test vendor from Kind, ensure the image is
-	// built and available before running the tests. Also, remove the following block.
 	By("loading the manager(Operator) image on Kind")
 	err = utils.LoadImageToKindClusterWithName(projectImage)
 	ExpectWithOffset(1, err).NotTo(HaveOccurred(), "Failed to load the manager(Operator) image into Kind")

-	// The tests-e2e are intended to run on a temporary cluster that is created and destroyed for testing.
-	// To prevent errors when tests run in environments with Prometheus or CertManager already installed,
-	// we check for their presence before execution.
-	// Setup Prometheus and CertManager before the suite if not skipped and if not already installed
 	if !skipPrometheusInstall {
 		By("checking if prometheus is installed already")
 		isPrometheusOperatorAlreadyInstalled = utils.IsPrometheusCRDsInstalled()
@@ -95,10 +102,34 @@ var _ = BeforeSuite(func() {
 			_, _ = fmt.Fprintf(GinkgoWriter, "WARNING: CertManager is already installed. Skipping installation...\n")
 		}
 	}
+
+	By("building the agentcard-signer image")
+	cmd = exec.Command("make", "build-signer",
+		fmt.Sprintf("SIGNER_IMG=%s", signerImage),
+		fmt.Sprintf("CONTAINER_TOOL=%s", containerTool))
+	_, err = utils.Run(cmd)
+	ExpectWithOffset(1, err).NotTo(HaveOccurred(), "Failed to build the agentcard-signer image")
+
+	By("loading the agentcard-signer image on Kind")
+	err = utils.LoadImageToKindClusterWithName(signerImage)
+	ExpectWithOffset(1, err).NotTo(HaveOccurred(), "Failed to load the agentcard-signer image into Kind")
+
+	if !skipSpireInstall {
+		By("checking if SPIRE is installed already")
+		isSpireAlreadyInstalled = utils.IsSpireCRDsInstalled()
+		if !isSpireAlreadyInstalled {
+			_, _ = fmt.Fprintf(GinkgoWriter, "Installing SPIRE...\n")
+			Expect(utils.InstallSpire("example.org")).To(Succeed(), "Failed to install SPIRE")
+			Expect(utils.WaitForSpireReady(5*time.Minute)).To(Succeed(), "SPIRE pods not ready in time")
+		} else {
+			_, _ = fmt.Fprintf(GinkgoWriter, "WARNING: SPIRE is already installed. Skipping installation...\n")
+		}
+	}
 })

 var _ = AfterSuite(func() {
-	// Teardown Prometheus and CertManager after the suite if not skipped and if they were not already installed
+	// Teardown Prometheus, CertManager, and SPIRE after the suite if not skipped
+	// and if they were not already installed
 	if !skipPrometheusInstall && !isPrometheusOperatorAlreadyInstalled {
 		_, _ = fmt.Fprintf(GinkgoWriter, "Uninstalling Prometheus Operator...\n")
 		utils.UninstallPrometheusOperator()
@@ -107,4 +138,8 @@ var _ = AfterSuite(func() {
 		_, _ = fmt.Fprintf(GinkgoWriter, "Uninstalling CertManager...\n")
 		utils.UninstallCertManager()
 	}
+	if !skipSpireInstall && !isSpireAlreadyInstalled {
+		_, _ = fmt.Fprintf(GinkgoWriter, "Uninstalling SPIRE...\n")
+		utils.UninstallSpire()
+	}
 })