Cache KubeVirt Boot Image #1918
Conversation
Skipping CI for Draft Pull Request.
davidvossel left a comment:
I've been debating whether this caching functionality belongs in the nodepool operator, or whether it fits better into capk itself.
I'm leaning towards thinking the current NodePool approach is a better fit. We're essentially doing namespace-scoped golden-image caching that's local to each guest cluster. The context around what image to cache and how long to cache it for is hypershift specific, which I think is generally outside the scope of capk.
```go
func CacheImage(ctx context.Context, cl client.Client, bootImage string, nodePool *hyperv1.NodePool) error {
	kvPlatform := nodePool.Spec.Platform.Kubevirt
	clusterName := nodePool.Spec.ClusterName
	namespace := nodePool.GetNamespace()
```
We need to use the namespace that the VMs will be placed in, not the NodePool namespace.
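For illustration, a minimal sketch of that lookup. The `vmNamespaceFor` helper and the `<nodepool-namespace>-<cluster-name>` convention for the hosted control plane namespace are assumptions; the remote-namespace field is the one read later in this diff.

```go
// vmNamespaceFor is a hypothetical helper: pick the namespace the VMs will be
// created in rather than the NodePool's own namespace.
func vmNamespaceFor(nodePoolNamespace, clusterName, remoteNamespace string) string {
	if remoteNamespace != "" {
		// External infra: VMs are created on the infra cluster in this namespace
		// (surfaced in this PR as nodePool.Status.Platform.KubeVirt.RemoteNamespace).
		return remoteNamespace
	}
	// Default: VMs live in the hosted control plane namespace, which HyperShift
	// derives from the NodePool namespace and the cluster name (assumed format).
	return nodePoolNamespace + "-" + clusterName
}
```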
```go
func getCachedPVCName(clusterName string) string {
	return bootImagePVCPrefix + clusterName
}
```
We'll likely have multiple images associated with a single cluster. Can you think of a way to name the PVCs differently so that that 1:many relationship can be observed?
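One possible shape for that, sketched under the assumption that the boot image hash discussed further down is available when the PVC name is built; the name format is illustrative, not the PR's final scheme.

```go
// getCachedPVCName (hypothetical variant) folds a short prefix of the image hash
// into the name so several cached images can coexist per cluster, e.g.
// "<bootImagePVCPrefix><cluster>-<hash[:8]>".
func getCachedPVCName(clusterName, imageHash string) string {
	shortHash := imageHash
	if len(shortHash) > 8 {
		shortHash = shortHash[:8]
	}
	return bootImagePVCPrefix + clusterName + "-" + shortHash
}
```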
```go
lbls := map[string]string{
	bootImagePVCLabelApp:     bootImagePVCLabelAppValue,
	bootImagePVCLabelType:    bootImagePVCLabelTypeValue,
	bootImagePVCLabelCluster: clusterName,
}
```
We need either a label or an annotation that stores the boot image hash. This hash is how we'll determine whether a new cached DV needs to be created, or whether we can reuse a previously cached image.
This [1] is where that boot image is detected today. It's not immediately obvious, but there's a hash in there as well that we can retrieve: we return the disk.Location URL today, but we also have access to disk.sha256 (example struct [2]).
That sha256 is our key when caching, not the location URL.
[1] `artifact, exists := openStack.Formats["qcow2.gz"]`
[2] `type CoreOSFormat struct { ... }`
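For illustration, one way the PVC metadata could carry that hash. The annotation key is hypothetical, `imageSHA256` stands in for the value read from disk.sha256, and an annotation is used rather than a label because a full sha256 (64 hex characters) exceeds the 63-character limit on label values.

```go
// Hypothetical key; the label keys below come from this PR's diff.
const bootImageHashAnnotation = "hypershift.openshift.io/boot-image-hash"

pvc.Labels = map[string]string{
	bootImagePVCLabelApp:     bootImagePVCLabelAppValue,
	bootImagePVCLabelType:    bootImagePVCLabelTypeValue,
	bootImagePVCLabelCluster: clusterName,
}
pvc.Annotations = map[string]string{
	// Compared against the sha256 of the desired boot image on each reconcile to
	// decide whether the cached DV/PVC can be reused or must be recreated.
	bootImageHashAnnotation: imageSHA256,
}
```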
/test kubevirt-e2e-kubevirt-gcp-ovn
Force-pushed 13d9f7f to a94f5a0.
/retest
Force-pushed a94f5a0 to 5612211.
/hold
Force-pushed 5612211 to 4b893cb.
/retest
```go
imageName := *nodePool.Spec.Platform.Kubevirt.RootVolume.Image.ContainerDiskImage
imageHash, err := getImageDigest(ctx, imageName)
if err != nil {
	return "", "", fmt.Errorf("failed to get the image hash; %w", err)
}

return containerImagePrefix + imageName, imageHash, nil
```
For now, I'd suggest only caching images provided by the release payload, where the SHA is known beforehand.
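A sketch of that guard, assuming the field read in the snippet above; the helper name is illustrative.

```go
// shouldCacheBootImage: cache only when the boot image comes from the release
// payload (a QCOW URL with a published sha256). A user-supplied container disk
// has no digest known ahead of time to use as a cache key, so skip it for now.
// containerDiskImage is the value read above as
// nodePool.Spec.Platform.Kubevirt.RootVolume.Image.ContainerDiskImage.
func shouldCacheBootImage(containerDiskImage *string) bool {
	return containerDiskImage == nil || *containerDiskImage == ""
}
```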
```go
}

// replace the PVC with a new one
err := cl.Delete(ctx, pvc)
```
Does the DV potentially need to be cleaned up as well?
Also, shouldn't we only delete if pvc.DeletionTimestamp == nil?
Reply: From 4.12, the DV is auto-deleted once the PVC is ready.
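A minimal sketch of the guard being asked about, wrapped around the delete shown above (apierrors is k8s.io/apimachinery/pkg/api/errors).

```go
// Only issue the delete once: if the PVC already carries a deletion timestamp,
// the previous request is still being finalized and deleting again adds nothing.
if pvc.DeletionTimestamp == nil {
	if err := cl.Delete(ctx, pvc); err != nil && !apierrors.IsNotFound(err) {
		return fmt.Errorf("failed to delete the boot image cache PVC; %w", err)
	}
}
```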
```go
	return fmt.Errorf("failed to delete the boot image cache PVC; %w", err)
}

return createDVForCache(ctx, cl, pvc.Namespace, bootImage, imageHash, kvPlatform, pvc.Name)
```
We can't assume that just because a PVC was marked for deletion, it has actually been deleted.
I think it would be wise to give every cached PVC a unique name so we aren't colliding during garbage collection like this.
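A sketch of the unique-name idea, assuming the chosen name is then recorded (for example in the NodePool status) rather than recomputed; utilrand is k8s.io/apimachinery/pkg/util/rand, corev1 and metav1 are the usual core/meta API packages.

```go
// Append a random suffix so a new cache PVC never collides with a previous one
// that is still terminating; the reconciler would find the PVC by its labels
// (or by the name stored in status) instead of rebuilding the name from scratch.
pvc := &corev1.PersistentVolumeClaim{
	ObjectMeta: metav1.ObjectMeta{
		Name:      fmt.Sprintf("%s%s-%s", bootImagePVCPrefix, clusterName, utilrand.String(5)),
		Namespace: namespace,
		Labels:    lbls,
	},
}
```

Alternatively, metav1.ObjectMeta's GenerateName field would let the API server pick the suffix, at the cost of having to discover the resulting name afterwards.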
go.mod (outdated)
```
github.com/aws/aws-sdk-go v1.44.84
github.com/blang/semver v3.5.1+incompatible
github.com/clarketm/json v1.14.1
github.com/containers/image/v5 v5.23.1
```
What's causing all these go.mod changes, and can we avoid them? Ideally I'd like this PR to not involve any changes to the vendor directory.
Reply: Added to read the container image digest, to be used as the image hash. Removed, since we don't cache container images for now.
Force-pushed 4b893cb to ef44592.
Force-pushed ef44592 to fd2bebd.
/test kubevirt-e2e-kubevirt-aws-ovn
Force-pushed fd2bebd to 31f064a.
/test kubevirt-e2e-kubevirt-aws-ovn
Force-pushed 31f064a to 0082542.
/test kubevirt-e2e-kubevirt-aws-ovn
Force-pushed 0082542 to b69ae02.
rebased
/test e2e-kubevirt-aws-ovn
davidvossel left a comment:
/lgtm
I made two comments that need to be addressed at some point, but don't have to be fixed right away.
- the nodepool cache-strategy CLI arg needs to match the cluster create one
- we need to close the small possibility of the cache not being deleted during NodePool deletion
I know you're looking at transitioning us to the KubeVirt RHCOS image soon, which will touch all these areas again, so we can fix these issues there. This PR has grown so large that I'd like to avoid delaying it any further if possible. As long as the local and external e2e tests pass, I think this is good to go.
cmd/nodepool/kubevirt/create.go (outdated)
```go
cmd.Flags().Uint32Var(&platformOpts.RootVolumeSize, "root-volume-size", platformOpts.RootVolumeSize, "The size of the root volume for machines in the NodePool in Gi")
cmd.Flags().StringVar(&platformOpts.RootVolumeAccessModes, "root-volume-access-modes", platformOpts.RootVolumeAccessModes, "The access modes of the root volume to use for machines in the NodePool (comma-delimited list)")
cmd.Flags().StringVar(&platformOpts.ContainerDiskImage, "containerdisk", platformOpts.ContainerDiskImage, "A reference to docker image with the embedded disk to be used to create the machines")
cmd.Flags().StringVar(&platformOpts.CacheStrategyType, "cache-strategy-type", platformOpts.CacheStrategyType, "Set the boot image caching strategy; Supported values:\n- \"None\": no caching (default).\n- \"PVC\": Cache into a PVC; only for QCOW image; ignored for container images")
```
This CLI arg doesn't match the one for cluster create. I think they should be the same for consistency; for example, the root-volume-size arg is the same between nodepool and cluster create.
s/cache-strategy-type/root-volume-cache-strategy
Reply: Sure, I guess I renamed one and forgot the other. Fixing.
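For reference, the renamed registration would look roughly like this (same flag variable and help text as above, only the name changes to match cluster create):

```go
cmd.Flags().StringVar(&platformOpts.CacheStrategyType, "root-volume-cache-strategy", platformOpts.CacheStrategyType,
	"Set the boot image caching strategy; Supported values:\n- \"None\": no caching (default).\n- \"PVC\": Cache into a PVC; only for QCOW image; ignored for container images")
```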
```go
	ns = nodePool.Status.Platform.KubeVirt.RemoteNamespace
}

if cl := r.KubevirtInfraClients.GetClient(string(nodePool.GetUID())); cl != nil {
```
There's a chance that the cache won't get cleaned up here if we get unlucky and the hypershift-operator either restarts or is updated while a nodepool is being deleted.
What can happen is that the kubevirt infra client cache map might not have an entry for this nodepool if the current invocation of the hypershift-operator was unable to reconcile it before the nodepool was deleted. This is because the kubevirt infra client is only cached during the normal reconcile, not during deletion.
This could be solved by storing a reference to the kubevirt infra secret on the nodepool status, similar to how the cache name is stored.
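A rough sketch of that idea; the type and field names below are hypothetical, and the actual API change would need its own review.

```go
// Hypothetical addition to the KubeVirt section of the NodePool status: remember
// which infra-cluster kubeconfig secret was used, so the deletion path can
// rebuild the infra client even after a hypershift-operator restart, the same
// way the cache name and remote namespace are already recorded there.
type KubevirtNodePoolStatus struct {
	// ...existing fields such as CacheName and RemoteNamespace...

	// CredentialsSecretName references the infra-cluster kubeconfig secret that
	// was used when the cached boot image was created.
	CredentialsSecretName string `json:"credentialsSecretName,omitempty"`
}
```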
@davidvossel - fixed a bug in the new e2e nodepool test + the create nodepool CLI flag name.
davidvossel left a comment:
/lgtm
/test e2e-kubevirt-aws-ovn
1 similar comment
/test e2e-kubevirt-aws-ovn
Create a new PVC (if missing) to pull the boot image, then modify the KubeVirt NodePool template to clone from the new PVC instead of pulling the image on each node.
Signed-off-by: Nahshon Unna-Tsameret <nunnatsa@redhat.com>
Signed-off-by: Nahshon Unna-Tsameret <nunnatsa@redhat.com>
Force-pushed 2ce12df to 3a2d4b0.
/test e2e-kubevirt-aws-ovn
Signed-off-by: Nahshon Unna-Tsameret <nunnatsa@redhat.com>
Signed-off-by: Nahshon Unna-Tsameret <nunnatsa@redhat.com>
@davidvossel, I think it is ready now.
davidvossel left a comment:
/lgtm
/approve
/hold cancel
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: davidvossel, nunnatsa.
What this PR does / why we need it:
Create a new PVC (if missing) to pull the boot image, then modify the KubeVirt NodePool template to clone from the new PVC instead of pulling the image on each node.
Signed-off-by: Nahshon Unna-Tsameret nunnatsa@redhat.com