OCPCLOUD-3327: Rewrite manifests-gen to support upgrade safety #434
Conversation
📝 Walkthrough

Migrate manifest generation to a typed `client.Object` pipeline, add a Go-based manifest generator and a provider image subsystem that fetches and extracts manifests from container images, remove the legacy provider pipeline, integrate provider images into the controllers, and add RBAC and deployment changes for provider-image handling.
Sequence Diagram

```mermaid
sequenceDiagram
    participant CLI as CLI
    participant Loader as loadProviderImages
    participant PullSecret as PullSecret Reader
    participant Fetcher as Image Fetcher
    participant Registry as Image Registry
    participant Extractor as Manifest Extractor
    participant Disk as Filesystem
    participant Controller as CapiInstallerController
    CLI->>Loader: generateManifests / loadProviderImages(imagesFile)
    Loader->>PullSecret: read pull-secret from openshift-config
    PullSecret-->>Loader: docker keychain
    Loader->>Fetcher: fetch image (with keychain)
    Fetcher->>Registry: request image layers
    Registry-->>Fetcher: layers + manifest blobs
    Fetcher->>Extractor: extract /capi-operator-manifests
    Extractor->>Disk: write files (profiles, metadata.yaml, manifests.yaml)
    Disk-->>Loader: extracted paths / profile list
    Loader->>CLI: return providerProfiles + containerImages
    CLI->>Controller: setup with providerProfiles
    Controller->>Controller: reconcileProviderImages (sorted)
    loop for each provider
        Controller->>Disk: read manifests.yaml
        Controller->>Controller: extractManifests -> resources
        Controller->>Controller: applyProviderComponents (CRDs, Deployments, others)
    end
    Controller-->>CLI: reconciliation complete
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
⚠️ Warning: there were issues while running some tools. golangci-lint (2.5.0) failed with: `can't load config: unsupported version of the configuration: ""`. See https://golangci-lint.run/docs/product/migration-guide for migration instructions.
Skipping CI for Draft Pull Request.
@mdbooth: This pull request references OCPCLOUD-3327, which is a valid Jira issue. Warning: the referenced Jira issue has an invalid target version for the branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.
/test e2e-aws-ovn-techpreview

/testwith openshift/cluster-capi-operator/main/e2e-aws-ovn-techpreview openshift/cluster-api-provider-aws#584
Force-pushed 1ab7f33 to 5087e3b (Compare)
/testwith openshift/cluster-capi-operator/main/e2e-aws-ovn-techpreview openshift/cluster-api-provider-aws#584

/test e2e-aws-ovn-techpreview
Force-pushed dab2c52 to 7c0b761 (Compare)

Force-pushed f3ba06c to a8a93e3 (Compare)
/test e2e-aws-ovn-techpreview

/test e2e-aws-ovn-techpreview

/testwith openshift/cluster-capi-operator/main/e2e-aws-ovn-techpreview openshift/cluster-api-provider-aws#584 openshift/cluster-api#259

(2 similar comments)
The e2e-aws-ovn-techpreview pass shows that the modifications to the capiinstaller controller are backwards compatible with current transport configmaps. openshift/cluster-api-provider-aws#584 and openshift/cluster-api#259 update CAPA and core CAPI to publish manifests only in the provider images; they do not publish transport configmaps. The multi-PR test pass with these two PRs shows that pulling manifests from images works.
[APPROVALNOTIFIER] This PR is APPROVED

This pull request has been approved by: theobarberbany. The full list of commands accepted by this bot can be found here; the pull request process is described here.
/hold |
Debug Logging
- Added detailed logging when reconciling provider images (manifestsPath, type, version, profile, platform)
- Added logging for manifest extraction count
- Added component count summary before applying (CRDs, deployments, static components)
- Added per-resource logging when applying CRDs and deployments
Error Handling Improvements
- Fixed applyProviderImage to use named return so reader.Close() errors are not silently lost
- Fixed type assertion error messages to show actual type received (expected X but got %T)
- Added errUnexpectedResourceType static error for linter compliance
- Updated setDegradedCondition to accept and surface the actual error in ClusterOperator status conditions
- Changed degraded condition logging from Info to Error level
Provider Images Library
- Added error logging when provider image processing fails
- Improved error wrapping with context for:
- Manifest extraction failures (includes output directory)
- Profile discovery failures (includes directory path)
- Manifest write failures (includes profile name)
Force-pushed f85c5a7 to f2238b7 (Compare)
/testwith openshift/cluster-capi-operator/main/e2e-aws-capi-techpreview openshift/cluster-api-provider-aws#584 openshift/cluster-api#259
@mdbooth: This pull request references OCPCLOUD-3327, which is a valid Jira issue.
/lgtm

Scheduling tests matching the

Installs work; the only failures are VAP ones where we have flaky tests, which don't give signal for this sort of change.

/lgtm

@CodeRabbit full

✅ Actions performed: full review triggered.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
pkg/conversion/capi2mapi/aws.go (2)

525-531: ⚠️ Potential issue | 🟡 Minor — Validate negative throughput before narrowing.

Line 526 only guards `> MaxInt32`. A negative throughput will still be encoded into the provider config, which is invalid for AWS and can fail later. Add a lower-bound check and surface a field error.

Suggested fix:

```diff
 if volume.Throughput != nil {
-	if *volume.Throughput > math.MaxInt32 {
+	if *volume.Throughput < 0 || *volume.Throughput > math.MaxInt32 {
 		return mapiv1beta1.BlockDeviceMappingSpec{}, field.Invalid(fldPath.Child("throughput"), *volume.Throughput, "throughput exceeds maximum int32 value")
 	}
 	bdm.EBS.ThroughputMib = ptr.To(int32(*volume.Throughput))
 }
```

548-554: ⚠️ Potential issue | 🟠 Major — Prevent overflow on placement group partition narrowing.

Line 553 narrows `int64` → `int32` with no bounds check, which can overflow and silently corrupt the provider spec. With the nolint removed, this is also likely to re-trigger gosec. Consider adding explicit bounds validation and returning a field error to the caller.

Suggested fix (function):

```diff
-func convertAWSPlacementGroupPartition(in int64) *int32 {
+func convertAWSPlacementGroupPartition(fldPath *field.Path, in int64) (*int32, *field.Error) {
 	if in == 0 {
-		return nil
+		return nil, nil
 	}
-	return ptr.To(int32(in))
+	if in < math.MinInt32 || in > math.MaxInt32 {
+		return nil, field.Invalid(fldPath, in, "placementGroupPartition out of int32 range")
+	}
+	return ptr.To(int32(in)), nil
 }
```

An additional call-site update (outside this hunk) would be needed to append the error to `errors` in `toProviderSpec` when setting `PlacementGroupPartition`.

manifests-gen/customizations.go (1)

161-165: ⚠️ Potential issue | 🟡 Minor — Typo in panic message.

The panic message has a concatenation issue: `"can't find secret from cert:CustomResourceDefinition "` should likely just include the `certNN` value without "CustomResourceDefinition".

Suggested fix:

```diff
-	panic("can't find secret from cert:CustomResourceDefinition " + certNN)
+	panic("can't find secret from cert: " + certNN)
```
🤖 Fix all issues with AI agents
In `@pkg/providerimages/providerimages_test.go`:
- Around line 409-424: The test case "missing manifest image name" is using
legacy metadata keys so it doesn't properly test the "missing selfImageRef"
scenario; update the YAML passed into createCapiManifestsImage in the
setupFetcher closure to use the current metadata schema (same field names as
production tests) but simply omit the selfImageRef line so the test isolates the
missing selfImageRef behavior; locate the setupFetcher in providerimages_test.go
that calls createCapiManifestsImage and replace the legacy metadata block with a
schema-correct block missing only selfImageRef.
🧹 Nitpick comments (4)

pkg/providerimages/pullsecret.go (1)

60-66: Consider a clearer empty-config check.

The current approach mutates `cfg.ServerAddress` before comparison. While safe (`cfg` is a value copy), the intent would be clearer by checking meaningful fields directly.

♻️ Suggested alternative:

```diff
-	// Check if we got an empty config
-	empty := types.AuthConfig{}
-	cfg.ServerAddress = "" // Clear for comparison
-
-	if cfg == empty {
+	// Check if we got an empty config (no credentials)
+	if cfg.Username == "" && cfg.Password == "" && cfg.Auth == "" &&
+		cfg.IdentityToken == "" && cfg.RegistryToken == "" {
 		return authn.Anonymous, nil
 	}
```

manifests/0000_30_cluster-api_11_deployment.yaml (1)

35-61: Consider explicit securityContext hardening for capi-controllers.

With a writable volume mounted, you can still set `readOnlyRootFilesystem: true` and disable privilege escalation while keeping `/var/lib/provider-images` writable.

🔒 Example hardening (optional):

```diff
 - name: capi-controllers
   image: registry.ci.openshift.org/openshift:cluster-capi-operator
+  securityContext:
+    allowPrivilegeEscalation: false
+    readOnlyRootFilesystem: true
```

Also applies to: 94-100

manifests-gen/generate.go (2)

102-102: `O_APPEND` flag is redundant with `O_TRUNC`.

Since `O_TRUNC` empties the file and writes are sequential via the buffered writer, `O_APPEND` has no effect here.

Suggested simplification:

```diff
-	manifestsFile, err := os.OpenFile(manifestsPathname, os.O_CREATE|os.O_TRUNC|os.O_WRONLY|os.O_APPEND, 0600)
+	manifestsFile, err := os.OpenFile(manifestsPathname, os.O_CREATE|os.O_TRUNC|os.O_WRONLY, 0600)
```

137-137: Same redundant `O_APPEND` flag.

Consistent with the earlier comment, `O_APPEND` is unnecessary here.

Suggested simplification:

```diff
-	metadataFile, err := os.OpenFile(metadataPathname, os.O_CREATE|os.O_TRUNC|os.O_WRONLY|os.O_APPEND, 0600)
+	metadataFile, err := os.OpenFile(metadataPathname, os.O_CREATE|os.O_TRUNC|os.O_WRONLY, 0600)
```
Review context — the test-case hunk in `pkg/providerimages/providerimages_test.go` (around lines 409–424):

```go
name: "missing manifest image name",
containerImages: []string{
	"registry.example.com/capi-aws:v1.0.0",
},
setupFetcher: func(t *testing.T) *fakeImageFetcher {
	t.Helper()
	// Create metadata YAML without manifestImageName field
	img, err := createCapiManifestsImage(
		`providerName: aws
providerType: infrastructure
providerVersion: v1.0.0
ocpPlatform: aws
contentID: id
`,
		"image: some-other-image:latest\n",
	)
```
Use current metadata schema when testing “missing selfImageRef”.
The YAML in this case uses legacy keys, so the test no longer isolates “missing selfImageRef” behavior; it may pass for the wrong reason. Prefer a schema-correct metadata block that simply omits the selfImageRef line.
🔧 Suggested adjustment:

```diff
-	// Create metadata YAML without manifestImageName field
-	img, err := createCapiManifestsImage(
-		`providerName: aws
-providerType: infrastructure
-providerVersion: v1.0.0
-ocpPlatform: aws
-contentID: id
-`,
+	// Create metadata YAML without selfImageRef field
+	img, err := createCapiManifestsImage(
+		`name: aws
+ocpPlatform: aws
+installOrder: 20
+attributes:
+  type: infrastructure
+  version: v1.0.0
+`,
 		"image: some-other-image:latest\n",
 	)
```
/verified by e2es, /testwith, and manual testing
@theobarberbany: This PR has been marked as verified.
/test okd-scos-images

/test e2e-openstack-ovn-techpreview
Across the last 3 OpenStack runs every test has gone green; it's just a flaky e2e. :(

/override e2e-openstack-ovn-techpreview
@theobarberbany: Overrode contexts on behalf of theobarberbany: ci/prow/e2e-openstack-ovn-techpreview
/tide refresh
@mdbooth: The following test failed, say

Full PR test history. Your PR dashboard.
Merged b82eff1 into openshift:main
Rewrite manifests-gen to support embedding CAPI installer manifests in the provider image instead of in a transport configmap.
Update the CAPI installer controller to support the new embedded manifests in addition to the existing transport configmaps. This allows us to have a smooth transition period while we update all providers.