

@martinpitt martinpitt commented Jan 20, 2026

We often get failed test runs because image-customize or
image-create etc. run into QEMU VM boot failures in our nested KVM
environment.

Detect this situation and retry the boot up to 3 times.


Recent example from cockpit-files. I saw this a lot in cockpit as well.

After this, I'll also port a few more .start() + .wait_boot() patterns in testlib.py to the new .boot().

  • image-refresh cirros

Don't repeat the "120" number so much. This disambiguates it from the
boot timeout, and makes the timeouts easier to change.

Similar to the previous commit. Avoids magic numbers in the API
declaration, and the next commits will reuse it.
@martinpitt martinpitt requested review from jelly and mvollmer January 20, 2026 07:54
jelly previously approved these changes Jan 20, 2026
@jelly jelly left a comment

I'm fine with the retries, however annoying they are, but one note: this means it will take a bit longer to figure out legit boot failures on image refreshes (especially if they are flaky).

As image refreshes happen during the night, we'll probably not run into this.

@martinpitt
Member Author

> this means it will take a bit longer to figure out legit boot failures on image refreshes. (Especially if flaky)

Yeah, I thought about this as well. This is a "damned if you do, damned if you don't" situation. But wait, how about we don't retry this in image-refresh's `boot_system()`? I.e. we'd rather accept the occasional "nested KVM" boot failure than a flake? I think given how rarely that boot happens, that'd be acceptable?

The common pattern with VirtMachine usage is to call `.start()` followed
by `.wait_boot()`. This also happens in Cockpit's test machinery and
some of its tests.

Abstract that into a `boot()` method. This is simpler to use, and we can
tweak its behaviour in the next step.
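
The abstraction described in this commit message can be sketched roughly as follows. This is not the actual VirtMachine code: the `Machine` class and its `calls` list are hypothetical stand-ins that only record the call order, so the example runs without QEMU.

```python
class Machine:
    """Hypothetical stand-in for VirtMachine; only the call order is modelled."""

    def __init__(self):
        self.calls = []

    def start(self):
        # The real method would launch the VM process.
        self.calls.append("start")

    def wait_boot(self):
        # The real method would block until the guest has booted.
        self.calls.append("wait_boot")

    def boot(self):
        """Start the machine and wait for it to finish booting."""
        self.start()
        self.wait_boot()


machine = Machine()
machine.boot()  # replaces the separate machine.start() + machine.wait_boot() calls
```

Having a single entry point means later behaviour changes only need to touch `boot()`, not every caller.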

We often get failed test runs because `image-customize` or
`image-create` etc. run into QEMU VM boot failures in our nested KVM
environment.

Detect this situation and retry the boot up to 3 times.

However, we want to avoid landing image refreshes that introduce actual
boot trouble in the guest, as opposed to failures caused by nested KVM.
Thus disable auto-retry in image-create's `boot_system()` canary check.
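
A minimal sketch of the retry idea, under several assumptions: `FlakyMachine`, `BootFailure`, `stop()` and the `retries` parameter are hypothetical names, "up to 3 times" is read as one initial attempt plus up to three retries, and every boot failure is treated as retryable. The source does not show how the nested KVM failure is actually detected.

```python
class BootFailure(Exception):
    """Hypothetical error raised when the VM never comes up."""


class FlakyMachine:
    """Hypothetical stand-in for VirtMachine whose first boots fail."""

    def __init__(self, failures):
        self.failures = failures  # number of boot attempts that should fail
        self.attempts = 0

    def start(self):
        self.attempts += 1

    def wait_boot(self):
        if self.attempts <= self.failures:
            raise BootFailure(f"attempt {self.attempts} did not boot")

    def stop(self):
        pass  # a real implementation would tear down the failed VM here

    def boot(self, retries=3):
        """Boot the VM, retrying up to `retries` times after a failed boot."""
        for remaining in range(retries, -1, -1):
            self.start()
            try:
                self.wait_boot()
                return
            except BootFailure:
                if remaining == 0:
                    raise  # out of retries: report the failure to the caller
                self.stop()


# Regular callers tolerate a few failed boots.
worker = FlakyMachine(failures=2)
worker.boot()

# A canary check passes retries=0, so the first failure is reported at once.
canary = FlakyMachine(failures=1)
try:
    canary.boot(retries=0)
    canary_failed = False
except BootFailure:
    canary_failed = True
```

Disabling retries for the canary keeps a genuinely broken image from being hidden behind a lucky later attempt.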
jelly commented Jan 20, 2026

> > this means it will take a bit longer to figure out legit boot failures on image refreshes. (Especially if flaky)

> Yeah, I thought about this as well. This is a "damned if you do, damned if you don't" situation. But wait, how about we don't retry this in image-refresh's `boot_system()`? I.e. we'd rather accept the occasional "nested KVM" boot failure than a flake? I think given how rarely that boot happens, that'd be acceptable?

Yes, agreed. Maybe I should have stated that more clearly 👍

@martinpitt
Member Author

@jelly Done. PTAL?

@martinpitt martinpitt requested a review from jelly January 20, 2026 10:12
@martinpitt martinpitt added the bot label Jan 20, 2026
@cockpituous cockpituous changed the title machine_core: Retry VM boot in nested KVM environments WIP: a406b0999dca: [no-test] machine_core: Retry VM boot in nested KVM environments Jan 20, 2026
@cockpituous cockpituous changed the title WIP: a406b0999dca: [no-test] machine_core: Retry VM boot in nested KVM environments machine_core: Retry VM boot in nested KVM environments Jan 20, 2026
@martinpitt martinpitt merged commit 2c695f9 into main Jan 20, 2026
12 checks passed
@martinpitt martinpitt deleted the failed-vm-boots branch January 20, 2026 11:19

4 participants