
feat: add local integration testing infrastructure (envtest + kind)#54

Merged
ian-flores merged 12 commits into main from issue-51-integration-tests on Feb 24, 2026

ian-flores (Contributor) commented Jan 27, 2026

Summary

Implements #51: a two-tier local integration testing strategy so developers can validate operator behavior without deploying to staging clusters.

Before: the only way to test controller behavior against a real Kubernetes API was deploying to an internal staging cluster.
After: fast local alternatives that work offline, run in CI, and support an iterative dev loop.


What This Adds

Tier 1: Envtest (~5s, no cluster)

Tests CRD schema validation and API storage against a real embedded K8s API server — no kubelet, no containers, no Docker.

  • Enabled the previously-skipped envtest suite in suite_test.go, fixing CRD paths and registering all required schemes (keycloak, traefik, secrets-store-csi-driver)
  • 4 Ginkgo specs covering Site, Connect, Workbench, and PackageManager CRDs (internal/controller/core/site_envtest_test.go)

Tier 2: Kind cluster (minutes, full stack)

Tests the full operator Helm deployment and Site CR reconciliation end-to-end against a real cluster.

  • hack/test-kind.sh — deploys operator via Helm, creates a Site CR, verifies Connect and Workbench child CRs are reconciled, checks operator logs for panics, cleans up
  • make test-kind runs helm-generate before deploying so the chart is always in sync with the current API

Persistent Dev Loop

Instead of creating and destroying a cluster on every run, developers can keep a cluster alive across iterations:

make kind-setup    # create cluster + deploy operator (run once, or after code changes)
make kind-test     # run tests against the live cluster
make kind-teardown # clean up when done

make kind-setup is idempotent — running it after a code change rebuilds the image, reloads it into kind, and does a helm upgrade, so it doubles as the "reload" step.

CI Integration

  • Envtest runs on every PR
  • Kind tests run on PRs that touch relevant paths (api/**, internal/**, cmd/**, Dockerfile, Makefile, go.mod/sum, dist/chart/**, hack/test-kind.sh) via a check-changes job using dorny/paths-filter
  • Kind tests also run on push to main, nightly, and manual dispatch
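The path gating described above can be pictured with a small Go sketch. It is only an illustration of the decision logic, not how dorny/paths-filter is implemented: the pattern list is copied from the bullet above, a trailing `/**` is treated as a plain directory prefix, and the function names `matches` and `needsKindTests` are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// kindPatterns mirrors the paths listed in the PR description.
var kindPatterns = []string{
	"api/**", "internal/**", "cmd/**", "dist/chart/**",
	"Dockerfile", "Makefile", "go.mod", "go.sum", "hack/test-kind.sh",
}

// matches treats "dir/**" as "anything under dir/" and every other pattern
// as an exact path. Real glob matching is richer than this simplification.
func matches(pattern, path string) bool {
	if strings.HasSuffix(pattern, "/**") {
		dir := strings.TrimSuffix(pattern, "/**")
		return strings.HasPrefix(path, dir+"/")
	}
	return pattern == path
}

// needsKindTests reports whether any changed file matches a gated pattern.
func needsKindTests(changed []string) bool {
	for _, path := range changed {
		for _, pattern := range kindPatterns {
			if matches(pattern, path) {
				return true
			}
		}
	}
	return false
}

func main() {
	// Docs-only change: no kind run.
	fmt.Println(needsKindTests([]string{"docs/testing.md", "README.md"})) // false
	// Any controller change triggers the kind job.
	fmt.Println(needsKindTests([]string{"README.md", "internal/controller/core/site_controller.go"})) // true
}
```

A docs-only PR therefore still gets the fast envtest tier, while the slower kind tier is reserved for changes that can affect the deployed operator.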

Bug Fix

cleanupLegacyHomeApp was using a direct type assertion (err.(*meta.NoKindMatchError)) that misses wrapped errors, causing reconciliation to abort on any cluster without the secrets-store CSI driver (e.g. kind). Changed to errors.As.

Documentation

  • README.md: new Testing section surfacing both the one-shot and dev loop workflows as the primary recommended approach
  • docs/testing.md: full reference — prerequisites, all targets, development loop, troubleshooting, best practices

Test Plan

  • Envtest suite passes (4 specs, ~5s)
  • All existing unit tests continue to pass
  • make go-test exits cleanly
  • Kind tests pass end-to-end locally (make test-kind)
  • Dev loop tested: kind-setup → kind-test → code change → kind-setup → kind-test → kind-teardown
  • CI workflow triggers correctly on path-matched PRs
  • Smoke tested API change workflow: added a field to SiteSpec, ran make test-kind, confirmed the new field appeared in the live CRD schema, reverted

- Enable envtest suite: remove Skip(), fix CRD paths, add required schemes
- Add Ginkgo-based envtest tests for Site, Connect, Workbench, PackageManager CRDs
- Add kind integration tests with make targets (test-kind, test-kind-full)
- Add GitHub Actions workflow for CI (envtest on PRs, kind on main/nightly)
- Add comprehensive testing documentation

Closes #51
claude bot commented Jan 27, 2026

Claude finished @ian-flores's task


Reviewing PR #54

  • Read review guidelines
  • Create pending review
  • Analyze code changes
  • Add inline review comments
  • Submit review

Review submitted with feedback on:

  • Test name/behavior mismatch in site_envtest_test.go
  • Missing cleanup in Site creation test
  • Ineffective validation test
  • Unused helper functions in suite_test.go
  • Unused GO_VERSION env var in workflow

claude bot left a comment

Review Summary

Good foundation for integration testing infrastructure. A few items to address:

Test Code Issues

internal/controller/core/site_envtest_test.go

  1. Misleading test name (line 28): Test says "Should create child resources (Connect, Workbench, etc.)" but only verifies the Site CR was created - no child resources are checked. Either rename to match actual behavior or add child resource assertions (requires starting the controller).

  2. Missing cleanup (line 67): The first Site test creates resources but doesn't clean them up, unlike other tests that properly delete their resources. Consider adding DeferCleanup or explicit cleanup.

  3. Ineffective validation test (line 83): The "Should reject invalid Site specs" test accepts both success and failure outcomes. If validation behavior is defined, assert it; otherwise consider removing this test.

internal/controller/core/suite_test.go

  1. Unused helper functions (lines 104-134): setupSiteControllerForEnvtest() and waitFor() are defined but never called. Remove dead code or add tests that use them.

Workflow Considerations

.github/workflows/integration-tests.yml

  1. The GO_VERSION env var (line 43) is set to '1.22' but unused since go-version-file: go.mod is used instead. Consider removing the unused env var.

Documentation

The docs/testing.md is comprehensive and well-structured.

Security Notes

  • The hack/test-kind.sh script uses appropriate practices (set -euo pipefail, proper quoting)
  • Helm installation uses --wait and proper timeouts
  • No credential handling concerns identified

- Rename misleading test "Should create child resources" to accurately
  describe what it tests (Site CR creation and retrieval)
- Add DeferCleanup to first Site test to match cleanup pattern of others
- Remove ineffective validation test that accepted both success/failure
- Remove unused setupSiteControllerForEnvtest and waitFor helpers
- Remove unused GO_VERSION env var from integration-tests.yml

go test ./... with -covermode=atomic fails for packages with no test
files in Go 1.25 due to covdata removal. Filter to only packages with
test files using go list.
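The commit message does not show the exact `go list` invocation, so the following is a sketch under assumptions. `go list -f` evaluates a Go text/template against each package record, and `TestGoFiles` and `XTestGoFiles` are fields of that record; the template string and the sample import paths below are illustrative guesses, not the Makefile's actual contents.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// pkg holds the subset of `go list` package fields the filter inspects.
type pkg struct {
	ImportPath   string
	TestGoFiles  []string
	XTestGoFiles []string
}

// filter is the kind of template one could pass to `go list -f`: it prints
// the import path only when the package has at least one test file.
const filter = `{{if or .TestGoFiles .XTestGoFiles}}{{.ImportPath}}{{end}}`

// testablePackages applies the template to each record and keeps the
// non-empty results, which is what the shell pipeline would feed to go test.
func testablePackages(pkgs []pkg) ([]string, error) {
	tmpl, err := template.New("filter").Parse(filter)
	if err != nil {
		return nil, err
	}
	var out []string
	for _, p := range pkgs {
		var buf bytes.Buffer
		if err := tmpl.Execute(&buf, p); err != nil {
			return nil, err
		}
		if buf.Len() > 0 {
			out = append(out, buf.String())
		}
	}
	return out, nil
}

func main() {
	// Hypothetical package records.
	pkgs := []pkg{
		{ImportPath: "example.com/operator/api/core"},
		{ImportPath: "example.com/operator/internal/controller/core", TestGoFiles: []string{"suite_test.go"}},
	}
	got, err := testablePackages(pkgs)
	if err != nil {
		panic(err)
	}
	fmt.Println(got) // [example.com/operator/internal/controller/core]
}
```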

Also improves kind integration tests:
- Add docker-build and kind-load-image as test-kind prerequisites so
  the operator image is always present before running tests
- Add test_reconciliation to verify the controller creates Connect and
  Workbench child CRs after a Site CR is applied
- Add test_operator_logs to detect panics and confirm reconciliation
  activity in operator logs
ian-flores linked an issue on Feb 23, 2026 that may be closed by this pull request (4 tasks)

ian-flores (Contributor, Author) commented:

@claude review this PR

claude bot commented Feb 23, 2026

Claude finished @ian-flores's task


Reviewing PR #54

  • Read review guidelines and previous review
  • Read all changed files
  • Analyze code changes
  • Submit review feedback

Review

Solid infrastructure addition. The two-tier strategy is well-designed and the errors.As bugfix is correct. Since the last review, test names have been fixed, DeferCleanup was added to the Site test, the unused helper functions and invalid-spec test were removed, and the GO_VERSION env var was cleaned up. All prior review items appear addressed.

A few remaining observations:

Envtest Tests (site_envtest_test.go)

  1. Tests only verify CRD schema, not controller behavior — The four tests create CRs and confirm they can be stored/retrieved, but no controller is started in the envtest suite (no manager, no reconciler registered). This means they test CRD registration, not reconciliation. The docs/testing.md and PR description claim these test "controller reconciliation logic" and "resource creation" — worth aligning the docs or adding a follow-up to wire up the controller. As-is, these are CRD smoke tests, which is still valuable.

  2. Inconsistent cleanup patterns — The Site test uses DeferCleanup (line 68), but Connect/Workbench/PackageManager tests use explicit k8sClient.Delete at the end of the test (e.g. line 118). If the assertion fails mid-test, the explicit cleanup never runs, leaving resources behind. Consider using DeferCleanup consistently across all tests.

  3. Eventually is unnecessary for synchronous creates: k8sClient.Create followed by k8sClient.Get against envtest's embedded API server is synchronous. The Eventually loops (e.g. lines 75-78, 109-112) add 30s timeout overhead on failure with no benefit. A direct Expect(k8sClient.Get(...)).To(Succeed()) would be simpler and fail faster.

Kind Script (hack/test-kind.sh)

  1. test_reconciliation cleanup may leave child CRs — Line 312 deletes the Site, but if reconciliation created Connect/Workbench child CRs, those aren't explicitly cleaned up. If the operator isn't running (already cleaned up by the EXIT trap ordering), the finalizers or owner references won't cascade. This could leave resources behind in some failure paths, though the namespace deletion in cleanup() should catch it.

  2. test_create_site runs without the operator deployed when CHART_DIR is missing — In the main() function, if the chart directory doesn't exist, the script skips operator deployment but still runs test_create_site and test_reconciliation (lines 390-391). test_reconciliation will always fail without an operator, making the fallback path misleading. Consider skipping reconciliation tests when the operator isn't deployed.

CI Workflow (integration-tests.yml)

  1. Redundant image build/load steps — The kind job builds the image (line 117-119) and loads it (line 123), but make test-kind (line 127) has docker-build as a dependency already. This means the image is built twice. Consider either removing the explicit build step or using ./hack/test-kind.sh directly instead of make test-kind.

  2. codecov-action@v4 without a token — The codecov upload step (line 76) uses v4 which requires a token. Since fail_ci_if_error: false is set this won't break CI, but the upload will silently fail on PRs from forks. Either add a Codecov token or use codecov-action@v5.

Makefile

  1. KIND_VERSION variable is unused: KIND_VERSION ?= 1.29.x is defined at line 132 but never referenced by any target. The workflow uses a separate KIND_VERSION env var for the kind binary version (v0.23.0), and ENVTEST_K8S_VERSION is what controls the K8s API version for envtest. Consider removing the Makefile KIND_VERSION to avoid confusion.

Bug Fix (site_controller_home_cleanup.go)

The errors.As change is correct and well-motivated. Direct type assertion misses wrapped errors — this is the right fix.

Documentation (docs/testing.md)

Thorough and well-structured. One minor inaccuracy: the example in the "Envtest Suite" section (line 144) shows It("Should create child resources"...) which doesn't match the actual test names anymore.


Overall this is a good PR. The main feedback is about test fidelity (CRD smoke tests vs. actual controller reconciliation tests) and minor cleanup items. None are blockers.

ian-flores (Contributor, Author) commented:

Addressed all findings in 464ef4d:

  • Item 1 — Updated docs/testing.md to accurately describe envtest tests as CRD schema/storage smoke tests, not controller reconciliation tests. Also fixed the stale example test name.
  • Item 2 — Replaced explicit k8sClient.Delete cleanup at end of Connect, Workbench, and PackageManager tests with DeferCleanup, matching the pattern already used in the Site test.
  • Item 3 — Removed all Eventually polling loops; replaced with direct Expect(k8sClient.Get(...)).To(Succeed()). Removed the now-unused timeout/interval constants and "time" import.
  • Item 4 — Added explicit kubectl delete connect and kubectl delete workbench after Site deletion in test_reconciliation cleanup.
  • Item 5 — Moved test_reconciliation inside the if [[ -d "${CHART_DIR}" ]] block so it only runs when the operator is actually deployed.
  • Item 6 — Removed the explicit "Build operator image" and "Load image into kind" steps from the CI workflow; make test-kind handles both via its docker-build dependency and the tagging/loading in hack/test-kind.sh.
  • Item 7 — Upgraded codecov/codecov-action from v4 to v5.
  • Item 8 — Removed unused KIND_VERSION ?= 1.29.x from the Makefile.
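The motivation for Item 2 can be illustrated with plain Go. This is an analogy rather than Ginkgo code: `defer` plays the role of `DeferCleanup`, a panic plays the role of a failed assertion, and `store`, `run`, and both cleanup functions are hypothetical names.

```go
package main

import "fmt"

// store is an in-memory stand-in for the API server.
type store map[string]bool

// explicitCleanup deletes the object only if the assertion returns normally.
func explicitCleanup(s store, assert func()) {
	s["connect"] = true
	assert()
	delete(s, "connect") // never reached when assert panics
}

// deferredCleanup registers the delete before asserting, so it runs even
// when the assertion fails.
func deferredCleanup(s store, assert func()) {
	s["connect"] = true
	defer func() { delete(s, "connect") }()
	assert()
}

// run executes a spec body and records a failure instead of crashing,
// roughly as a test runner would.
func run(body func()) (failed bool) {
	defer func() {
		if recover() != nil {
			failed = true
		}
	}()
	body()
	return false
}

func main() {
	failing := func() { panic("assertion failed") }

	a, b := store{}, store{}
	run(func() { explicitCleanup(a, failing) })
	run(func() { deferredCleanup(b, failing) })

	fmt.Println("left behind with explicit cleanup:", len(a)) // 1
	fmt.Println("left behind with deferred cleanup:", len(b)) // 0
}
```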

ian-flores (Contributor, Author) commented:

Local dev loop test ✔

Ran the persistent kind workflow end-to-end on the current branch:

$ make kind-setup
No kind clusters found.
Creating kind cluster 'team-operator-test'...
✓ Preparing nodes
✓ Starting control-plane
✓ Ready after 17s
[INFO] Deploying team-operator via Helm...
[INFO] Operator deployed successfully
[INFO] Waiting for operator to be ready...
[INFO] Success: operator deployment ready
[INFO] Kind cluster is ready. Run 'make kind-test' to execute tests.
$ make kind-test
[INFO] Testing: CRDs are installed...
[INFO]   CRD found: sites.core.posit.team
[INFO]   CRD found: connects.core.posit.team
[INFO]   CRD found: workbenches.core.posit.team
[INFO]   CRD found: packagemanagers.core.posit.team
[INFO] Test passed: All CRDs installed
[INFO] Testing: Operator logs...
[INFO] Test passed: Operator logs look healthy
[INFO] Testing: Site reconciliation...
[INFO] Site CR created, waiting for reconciliation...
[INFO] Child CRs created successfully
[INFO]   Connect CR found: test-site-reconcile
[INFO]   Workbench CR found: test-site-reconcile
[INFO] Test passed: Site reconciliation verified
[INFO] Testing: Create Site CR...
[INFO] Test passed: Site CR created
[INFO] ==========================================
[INFO] All integration tests passed!
[INFO] ==========================================
$ make kind-teardown
[INFO] Cleaning up...
release "team-operator" uninstalled
namespace "posit-team-system" deleted
namespace "posit-team" deleted
[INFO] Cleanup completed
Deleting cluster "team-operator-test" ...
Deleted nodes: ["team-operator-test-control-plane"]

Cluster creation and operator deployment only happen in kind-setup. kind-test is pure test execution — no setup overhead. Running kind-setup again after a code change will rebuild the image and helm upgrade in place without recreating the cluster.

ian-flores (Contributor, Author) commented:

How kind tests handle secrets without AWS credentials

The kind tests use type: test on both workloadSecret and mainDatabaseCredentialSecret in every Site CR. FetchSecret in api/product/secret.go switches on that type field and routes to SiteSecretTest, which reads from an in-memory GlobalTestSecretProvider map instead of hitting AWS Secrets Manager.

When the map is empty (as it is in kind tests — nothing pre-populates it), GetSecretWithFallback just returns the key name as the value. The controller gets a non-empty string, reconciliation proceeds, no AWS SDK call is made, no credentials required.
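The fallback behavior described above amounts to a map lookup that defaults to the key. A minimal sketch, assuming a much simpler shape than the real code: the `testSecretProvider` type, the method signature, and the key names are hypothetical, and only the names `GetSecretWithFallback` and the "return the key when unset" behavior come from the description.

```go
package main

import "fmt"

// testSecretProvider is a hypothetical in-memory provider; the operator's
// GlobalTestSecretProvider may be structured differently.
type testSecretProvider struct {
	secrets map[string]string
}

// GetSecretWithFallback returns the stored value for key, or the key itself
// when nothing was registered. Reading from a nil map is safe in Go.
func (p *testSecretProvider) GetSecretWithFallback(key string) string {
	if v, ok := p.secrets[key]; ok {
		return v
	}
	return key
}

func main() {
	// Empty provider, as in the kind tests: the key name comes back.
	empty := &testSecretProvider{}
	fmt.Println(empty.GetSecretWithFallback("db-password")) // db-password

	// Pre-populated provider: the stored value wins.
	seeded := &testSecretProvider{secrets: map[string]string{"db-password": "s3cret"}}
	fmt.Println(seeded.GetSecretWithFallback("db-password")) // s3cret
}
```

Because the fallback is always a non-empty string, callers that only check for presence proceed without noticing that no real secret backend exists.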

type: test is a first-class escape hatch baked into the operator for local/CI use. One thing to be aware of: it ships in the production binary, so it's a surface worth keeping an eye on if production hardening becomes a concern down the line.

ian-flores marked this pull request as ready for review on February 23, 2026, 21:27

stevenolen (Collaborator) left a comment:

massive, massive fan of this.

a few stray notes:

  • thanks for including the comment about secrets manager. ultimately, i think we need to extract the cloud-specific implementation bits, and formalize/document those integrations -- along with some local replacements for first-class kind support.
  • notes/additional pr that shows how to actually interact with the products once they are running in kind would be super cool

ian-flores added this pull request to the merge queue on Feb 24, 2026
Merged via the queue into main with commit a5b5751 on Feb 24, 2026
9 checks passed
ian-flores deleted the issue-51-integration-tests branch on February 24, 2026, 17:54
ian-flores pushed a commit that referenced this pull request on Feb 24, 2026
# [1.15.0](v1.14.0...v1.15.0) (2026-02-24)

### Features

* add local integration testing infrastructure (envtest + kind) ([#54](#54)) ([a5b5751](a5b5751))