Make ephemeral more production-like #22

@cgwalters

Making Ephemeral VMs Production-Like

The bcvk ephemeral command runs bootc container images as lightweight VMs using QEMU with virtiofs for the root filesystem. This is incredibly useful for testing and development, but the current implementation differs from how bootc actually runs in production in ways that can mask real issues or cause false failures.

The goal is to make ephemeral VMs behave as closely as possible to a real bootc deployment. When someone tests their container image with bcvk ephemeral run, they should have confidence that what works there will work when installed to disk. This means matching bootc's filesystem layout semantics: an immutable root, a transient /etc overlay, and a properly initialized /var.

Why This Matters

The original implementation used systemd.volatile=overlay, which overlays the entire root filesystem with a single tmpfs-backed overlayfs. While simple, this creates several problems:

First, it breaks container runtimes. Podman uses overlayfs for container storage in /var/lib/containers, but you can't nest overlayfs on top of overlayfs. Anyone trying to run containers inside their ephemeral VM would hit cryptic mount errors.

Second, it doesn't match production semantics. In a real bootc system, / is immutable (ideally composefs), /etc is an overlay that allows transient writes, and /var is a real filesystem with persistent state. The volatile overlay approach makes everything writable everywhere, which can hide bugs where software incorrectly tries to write to immutable locations.

Third, SELinux is currently disabled entirely via selinux=0 because the virtiofs root gets labeled as virtiofs_t, causing widespread policy violations. This means SELinux-related issues won't be caught during ephemeral testing.

What's Implemented (Phase 1)

The first phase replaces systemd.volatile=overlay with fine-grained mount management:

  • / (root) is mounted read-only via virtiofs directly from the container image
  • /etc uses overlayfs with a tmpfs upper directory, providing transient writes like bootc's etc.transient mode
  • /var is a real tmpfs with the container's /var content copied into it at boot, not an overlayfs

This structure allows podman to work inside the VM since /var/lib/containers is now on a real tmpfs rather than nested overlayfs. The implementation injects systemd units into the initramfs that set up these mounts before switch-root.
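To make the layout above concrete, here is a minimal sketch of the three mounts expressed as mount(8) argument lists. It is not the actual bcvk code, which uses injected systemd units. The paths `/sysroot` and `/run/etc-overlay` and the virtiofs tag `rootfs` are assumptions made for illustration; only the overlayfs option names (`lowerdir`, `upperdir`, `workdir`) are standard.

```python
# Assumed names, not taken from bcvk: adjust to the real initramfs layout.
SYSROOT = "/sysroot"            # where the image root is mounted before switch-root
SCRATCH = "/run/etc-overlay"    # directory on a tmpfs holding the overlay's upper/work dirs
VIRTIOFS_TAG = "rootfs"         # hypothetical virtiofs tag exported by the host


def etc_overlay_options(sysroot: str = SYSROOT, scratch: str = SCRATCH) -> str:
    """Build the overlayfs option string for a transient /etc."""
    return ",".join([
        f"lowerdir={sysroot}/etc",    # read-only content from the image
        f"upperdir={scratch}/upper",  # transient writes land here
        f"workdir={scratch}/work",    # must be on the same filesystem as upperdir
    ])


def mount_plan(sysroot: str = SYSROOT, scratch: str = SCRATCH) -> list[list[str]]:
    """Return the mount invocations, in order, as argv lists."""
    return [
        # / : read-only virtiofs straight from the container image.
        ["mount", "-t", "virtiofs", "-o", "ro", VIRTIOFS_TAG, sysroot],
        # /etc : overlay with a tmpfs-backed upper directory.
        ["mount", "-t", "overlay", "-o", etc_overlay_options(sysroot, scratch),
         "overlay", f"{sysroot}/etc"],
        # /var : plain tmpfs, so /var/lib/containers is not on overlayfs.
        # Copying the image's /var content into it is omitted from this sketch.
        ["mount", "-t", "tmpfs", "tmpfs", f"{sysroot}/var"],
    ]
```

Keeping /var as a plain tmpfs rather than a second overlay is what lets podman create its own overlay mounts underneath it.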

Supporting older systemd versions (like CentOS Stream 9 with systemd 252) required some care. Modern systemd 256+ can create units from SMBIOS credentials, but older versions need the units embedded directly in the initramfs. The implementation handles both cases, with a service that copies the journal-stream unit to /run/systemd/system/ on older systems.

One subtle issue discovered during implementation: when appending CPIO archives to an initramfs, proper padding is required between archives. Some kernel decompressors, particularly LZ4, need at least 4 bytes of NUL padding to detect EOF and process the next archive. Without this padding, appended units were silently ignored.
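The padding rule can be sketched as follows. This is an illustration rather than the bcvk implementation: the function name is hypothetical, and rounding the start of the appended archive up to a 4-byte boundary is an extra assumption on top of the "at least 4 bytes" stated above.

```python
MIN_PAD = 4  # minimum number of NUL bytes between archives, per the note above


def append_archive(initramfs: bytes, extra: bytes) -> bytes:
    """Append a CPIO archive to an initramfs with NUL padding in between.

    The padding is at least MIN_PAD bytes and is sized so that the
    appended archive starts on a 4-byte boundary (an assumption of
    this sketch).
    """
    pad = MIN_PAD + (-len(initramfs)) % 4
    return initramfs + b"\0" * pad + extra
```

For a 10-byte input this inserts 6 NUL bytes, so the appended archive begins at offset 16.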

Phase 2: Composefs and SELinux

The next phase will make the root filesystem truly immutable using composefs, and re-enable SELinux with proper labeling.

Composefs provides a content-addressed, immutable filesystem layer that's exactly what bootc uses in production. Rather than mounting virtiofs directly as root, the implementation would generate a composefs image from the container's root filesystem, with virtiofs providing the backing object store.

The key advantage of composefs for this use case is SELinux labeling. Each file in the composefs can have its correct SELinux context baked in, rather than everything inheriting virtiofs_t from the virtiofs mount. This should allow dropping selinux=0 and catching SELinux policy issues during ephemeral testing.

There's an open question about whether fsverity should be required. Composefs can work without fsverity, at the cost of losing integrity verification. For ephemeral testing this tradeoff seems acceptable, since the content comes from a trusted container image anyway.

The composefs logic could eventually live in bootc itself, but implementing it in bcvk first allows for faster iteration. The composefs-rs crate provides the necessary Rust bindings.

Technical Notes

The initramfs injection mechanism from phase 1 provides a foundation for phase 2. Additional systemd units can be added to handle composefs setup. The CPIO generation code already handles directories, files, and symlinks with proper padding.
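For readers unfamiliar with the archive format, the following is a minimal sketch of writing "newc" CPIO entries for a directory, a file, and a symlink. It is not the bcvk generator (which is written in Rust); the function names and the example paths are hypothetical. The header layout (magic `070701` followed by thirteen 8-digit hex fields, with name and data each padded to 4 bytes) follows the standard newc format.

```python
import stat


def _pad4(n: int) -> bytes:
    """NUL bytes needed to round a length of n up to a multiple of 4."""
    return b"\0" * ((-n) % 4)


def newc_entry(name: str, mode: int, data: bytes = b"", ino: int = 0) -> bytes:
    """Encode one newc CPIO entry: 110-byte header, name, then data."""
    name_b = name.encode() + b"\0"  # namesize includes the trailing NUL
    fields = [
        ino,                              # c_ino
        mode,                             # c_mode (file type and permissions)
        0, 0,                             # c_uid, c_gid
        2 if stat.S_ISDIR(mode) else 1,   # c_nlink
        0,                                # c_mtime
        len(data),                        # c_filesize
        0, 0, 0, 0,                       # c_devmajor, c_devminor, c_rdevmajor, c_rdevminor
        len(name_b),                      # c_namesize
        0,                                # c_check (unused for magic 070701)
    ]
    header = b"070701" + b"".join(b"%08X" % f for f in fields)
    out = header + name_b
    out += _pad4(len(out))           # header + name padded to 4 bytes
    out += data + _pad4(len(data))   # data padded to 4 bytes
    return out


def build_archive() -> bytes:
    """Build a tiny archive with a directory, a file, and a symlink."""
    return b"".join([
        newc_entry("etc", stat.S_IFDIR | 0o755),
        newc_entry("etc/example.conf", stat.S_IFREG | 0o644, b"key=value\n"),
        # For symlinks, the data is the link target (no trailing NUL).
        newc_entry("etc/link.conf", stat.S_IFLNK | 0o777, b"example.conf"),
        newc_entry("TRAILER!!!", 0),  # end-of-archive marker
    ])
```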

Mount point verification uses findmnt -J with JSON parsing rather than string matching on command output, making it more robust against format changes.
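A minimal sketch of that approach, assuming the usual `findmnt -J` shape (a top-level `filesystems` array whose entries carry `target`, `source`, `fstype`, `options`, and optional nested `children`). The helper names and the sample JSON are hand-written for illustration, not captured from a real VM.

```python
import json
from typing import Optional


def find_mount(findmnt_json: str, target: str) -> Optional[dict]:
    """Return the findmnt entry whose mount point equals target, or None."""
    def walk(nodes: list) -> Optional[dict]:
        for node in nodes:
            if node.get("target") == target:
                return node
            found = walk(node.get("children", []))
            if found is not None:
                return found
        return None

    return walk(json.loads(findmnt_json).get("filesystems", []))


def is_read_only(node: dict) -> bool:
    """Check for the 'ro' flag as a whole option, not a substring."""
    return "ro" in node.get("options", "").split(",")


# Hand-written sample resembling the layout described in this issue.
SAMPLE = """
{"filesystems": [
  {"target": "/", "source": "rootfs", "fstype": "virtiofs", "options": "ro,relatime",
   "children": [
     {"target": "/etc", "source": "overlay", "fstype": "overlay", "options": "rw,relatime"},
     {"target": "/var", "source": "tmpfs", "fstype": "tmpfs", "options": "rw,relatime"}
   ]}
]}
"""
```

With the sample above, `find_mount(SAMPLE, "/var")["fstype"]` is `"tmpfs"` and `is_read_only(find_mount(SAMPLE, "/"))` is `True`. Splitting the options on commas avoids false matches that a plain substring search for "ro" would produce.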

The implementation maintains compatibility with both traditional kernel+initramfs images (like CentOS Stream 9) and UKI-based images (like CentOS Stream 10), extracting and modifying the initramfs appropriately for each.
