machine_core: Retry VM boot in nested KVM environments #8642
Don't repeat the "120" number so much. This disambiguates it from the boot timeout, and makes the timeouts easier to change.
Similar to the previous commit. Avoids magic numbers in the API declaration, and the next commits will reuse it.
jelly
left a comment
I'm fine with the retries, however annoying they are, but one note: this means it will take a bit longer to figure out legitimate boot failures on image refreshes (especially if they are flaky).
As image refreshes happen during the night, we'll probably not run into this.
Yeah, I thought about this as well. This is a "damned if you do, damned if you don't" situation. But wait, how about we don't retry this in
Since 2023, rawhide does not have a slow debug kernel any more: [1] https://gitlab.com/cki-project/kernel-ark/-/merge_requests/2263
The common pattern with VirtMachine usage is to call `.start()` followed by `.wait_boot()`. This also happens in Cockpit's test machinery and some of its tests. Abstract that into a `boot()` method. This is simpler to use, and we can tweak its behaviour in the next step.
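The refactoring described in this commit message can be illustrated with a minimal sketch. `SketchMachine` is a hypothetical stand-in and not the real `VirtMachine` class; the `timeout_sec` parameter and its default are assumptions made only for this illustration.

```python
class SketchMachine:
    """Hypothetical stand-in for a VirtMachine-like class (not the real API)."""

    def __init__(self):
        # Record calls so the sequence can be inspected in this sketch.
        self.calls = []

    def start(self):
        # The real method would launch the QEMU process.
        self.calls.append("start")

    def wait_boot(self, timeout_sec=120):
        # The real method would block until the guest is reachable.
        self.calls.append(f"wait_boot({timeout_sec})")

    def boot(self, timeout_sec=120):
        # Convenience wrapper: the usual start-then-wait sequence in one call.
        self.start()
        self.wait_boot(timeout_sec=timeout_sec)


machine = SketchMachine()
machine.boot()
print(machine.calls)
```

Having a single entry point means that later behaviour changes, such as retrying, only need to be made in one place rather than at every call site.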
We often get failed test runs because `image-customize`, `image-create`, etc. run into QEMU VM boot failures in our nested KVM environment. Detect this situation and retry up to 3 times. However, we want to avoid image refreshes that introduce actual boot trouble in the guest, as opposed to failures caused by nested KVM. Thus disable auto-retry in image-create's `boot_system()` canary check.
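A rough sketch of the retry idea follows. It is not the code from this PR: `BootFailure`, `boot_with_retry`, and the `retry` flag are hypothetical names, "up to 3 times" is interpreted here as three attempts in total, and every boot failure is treated as retryable, whereas the commit only retries after detecting the nested KVM situation.

```python
class BootFailure(Exception):
    """Hypothetical error raised when the guest does not come up."""


MAX_BOOT_ATTEMPTS = 3  # assumption: "up to 3 times" means 3 attempts in total


def boot_with_retry(start, wait_boot, stop, retry=True):
    """Start a VM and wait for boot, retrying on failure unless retry=False.

    Returns the number of attempts that were needed.
    """
    attempts = MAX_BOOT_ATTEMPTS if retry else 1
    for attempt in range(1, attempts + 1):
        start()
        try:
            wait_boot()
            return attempt
        except BootFailure:
            # Clean up the failed VM before the next attempt.
            stop()
            if attempt == attempts:
                # Out of attempts: let the caller see the failure.
                raise


# Simulated guest that fails to boot twice and then succeeds.
outcomes = iter([False, False, True])


def flaky_wait():
    if not next(outcomes):
        raise BootFailure("simulated boot failure")


attempts_used = boot_with_retry(lambda: None, flaky_wait, lambda: None)
print(attempts_used)
```

A canary check like the one mentioned for `boot_system()` would correspond to calling this with `retry=False`, so that a guest that is actually broken fails on the first attempt instead of being masked by retries.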
Yes, agreed. Maybe I should have stated that more clearly 👍
c883a5c to 429d11f
@jelly Done. PTAL?
image-refresh cirros done: https://github.com/cockpit-project/bots/commits/image-refresh-cirros-20260120-102749
We often get failed test runs because `image-customize` or `image-create` etc. run into QEMU VM boot failures in our nested KVM environment.
Detect this situation, and retry up to 3 times.
recent example from cockpit-files. I saw this a lot in cockpit as well.
After this, I'll also port a few more `.start()` + `.wait_boot()` patterns in testlib.py to the new `.boot()`.