Try to join the cgroup of the init process of the parent container when apply_cgroup for a tenant container fails due to a "Device or resource busy" error #3347

Draft
logica0419 wants to merge 3 commits into youki-dev:main from logica0419:retyry-systemd-cgroup-EBUSY

@logica0419
Contributor

Description

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Refactoring (no functional changes)
  • Performance improvement
  • Test updates
  • CI/CD related changes
  • Other (please describe):

Testing

Related Issues

Fixes #3342

Additional Context

@tommady
Collaborator

tommady commented Jan 3, 2026

Thanks for opening this PR — I ran into the same issue while working on
#3210 and also needed a retry to re-join the cgroup for exec.

This problem is not about the init process, but about exec processes under cgroup v2 when domain controllers are enabled. Once a controller is turned on, the container’s configured cgroup may no longer be joinable (kernel returns EBUSY / EPERM), and exec is expected to fall back to joining the init process’s cgroup.

This behavior is explicitly documented by runc:

Note for cgroup v2: in case the process can’t join the top level cgroup, runc exec fallback is to try joining the cgroup of container’s init.
https://github.com/opencontainers/runc/blob/main/man/runc-exec.8.md

Importantly, this fallback is exec-only:

  • init process cgroup placement must still fail hard
  • only exec processes may retry using the init process’s leaf cgroup

Because this is policy, not cgroup mechanism, runc implements it in the container execution path, not inside the cgroup manager itself. This avoids:

  • accidentally applying fallback to init
  • duplicating logic across systemd vs cgroupfs managers
  • diverging behavior depending on the cgroup backend

For youki, the correct place to implement, I think, is here:

// crates/libcontainer/src/process/container_intermediate_process.rs
fn apply_cgroups<
    C: CgroupManager<Error = E> + ?Sized,
    E: std::error::Error + Send + Sync + 'static,
>(
    cmanager: &C,
    resources: Option<&LinuxResources>,
    init: bool,
) -> Result<()> { ... }

where we know:

  • whether the process is init or exec
  • the init PID
  • and can enforce exec-only fallback semantics

Handling this inside libcgroups (or only for systemd) is insufficient and environment-dependent. The expected behavior should be:

  • init process: no fallback, fail on cgroup join error
  • exec process + cgroup v2 + EBUSY/EPERM: retry by joining init’s cgroup
  • all other errors: fail as before
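
The three rules above can be sketched as a small decision function. This is only an illustration of the policy, not code from youki or runc: `JoinAction` and `on_join_failure` are hypothetical names, and the numeric errno values are the Linux ones for EPERM (1) and EBUSY (16).

```rust
// Linux errno values (assumed target: Linux).
const EPERM: i32 = 1;
const EBUSY: i32 = 16;

/// Hypothetical outcome of a failed attempt to join the configured cgroup.
#[derive(Debug, PartialEq, Eq)]
enum JoinAction {
    /// Propagate the original error unchanged.
    Fail,
    /// Retry by joining the cgroup that the container's init process is in.
    RetryInInitCgroup,
}

/// Decides what to do after joining the configured cgroup failed.
/// `errno` is `None` when the error carries no OS error code.
fn on_join_failure(is_init: bool, is_cgroup_v2: bool, errno: Option<i32>) -> JoinAction {
    match (is_init, is_cgroup_v2, errno) {
        // Only exec processes on cgroup v2 may fall back, and only on EBUSY/EPERM.
        (false, true, Some(EBUSY | EPERM)) => JoinAction::RetryInInitCgroup,
        // The init process and every other error keep failing hard.
        _ => JoinAction::Fail,
    }
}
```

Keeping the decision separate from the cgroup manager means the same rule would apply to both the systemd and the cgroupfs backends.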

Without implementing this retry at the libcontainer level (as runc does), exec under cgroup v2 with domain controllers enabled will continue to fail for cgroupfs users.

WDYT? Thanks again.

@utam0k utam0k requested a review from tommady January 3, 2026 21:43
Comment on lines 56 to 57
/// The init process PID of the parent container if the container is created as a tenant.
parent_init_pid: Option<Pid>,
Member

ContainerType should have parent_init_pid.

Err(e) => {
    // If adding the process to the cgroup fails due to a "Device or resource busy" error,
    // manager tries to join the cgroup of the init process of the tenant container.
    if e.to_string().contains("Device or resource busy")
Member

How about getting the error (EBUSY) from the dbus client instead of parsing the error message?

Contributor Author

I really wanted to, but I couldn't achieve that just by putting the following code here.

impl From<nix::Error> for SystemdClientError {
    fn from(err: nix::Error) -> SystemdClientError {
        match err {
            nix::Error::EBUSY => DbusError::DeviceOrResourceBusy(err.to_string()).into(),
            _ => DbusError::ConnectionError(err.to_string()).into(),
        }
    }
}

It seems that socket::sendmsg in dbus_native::DbusConnection::send_message() doesn't return nix::Error::EBUSY. Rather, it produces an error message without returning an error in the Result.

Could you give me some advice on what I should do here?
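
For illustration, the difference between matching on an errno and matching on message text can be shown with the standard library alone. This is a minimal sketch, not the youki or dbus_native API: `is_ebusy` is a hypothetical helper, and 16 is the Linux errno value for EBUSY.

```rust
use std::io;

// Linux errno for "Device or resource busy" (assumed target: Linux).
const EBUSY: i32 = 16;

/// Hypothetical helper: true only when the error carries the EBUSY errno.
fn is_ebusy(err: &io::Error) -> bool {
    err.raw_os_error() == Some(EBUSY)
}
```

An error built only from a message, such as `io::Error::new(io::ErrorKind::Other, "Device or resource busy")`, has no errno attached, so `is_ebusy` returns false for it. That matches the situation described above, where the failure is visible only as text.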

Member

To clarify, the goal of this PR’s initial implementation is to serve as a conceptual demo, providing a starting point for discussing a more suitable implementation.

Since you mentioned it, I'll stop focusing on the detailed code of this PR for now. I'll review the detailed code once we've clarified and implemented the non-demo aspects.

// is empty string ("") and the value is the cgroup path the <pid> is in.
//
// ref: https://github.com/opencontainers/cgroups/blob/main/utils.go#L171-L219
pub fn parse_proc_cgroup_file(path: &str) -> Result<HashMap<String, String>, ParseProcCgroupError> {
Member

Could we use the procfs crate? Be careful: if it reads inside the container, please use ProcfsHandle for safety.
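
For reference, each line of /proc/<pid>/cgroup has the form hierarchy-ID:controller-list:cgroup-path, and the cgroup v2 entry looks like 0::/path. A std-only sketch of that parsing might look like the following. It takes the file contents rather than a path, and `parse_proc_cgroup` is a hypothetical name, not the function in this PR or anything from the procfs crate.

```rust
use std::collections::HashMap;

/// Parses the text of a `/proc/<pid>/cgroup` file.
/// Returns a map from controller name to cgroup path; the cgroup v2 entry
/// (`0::/path`) is stored under the empty-string key.
fn parse_proc_cgroup(content: &str) -> Result<HashMap<String, String>, String> {
    let mut map = HashMap::new();
    for line in content.lines().filter(|l| !l.is_empty()) {
        // Split into at most three fields so that ':' inside the path is preserved.
        let mut fields = line.splitn(3, ':');
        let (Some(_id), Some(controllers), Some(path)) =
            (fields.next(), fields.next(), fields.next())
        else {
            return Err(format!("malformed line: {line:?}"));
        };
        // A cgroup v1 line may list several co-mounted controllers, e.g. "cpu,cpuacct".
        for controller in controllers.split(',') {
            map.insert(controller.to_string(), path.to_string());
        }
    }
    Ok(map)
}
```

Looking up the empty-string key in the returned map gives the cgroup v2 path of the process.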

@logica0419 logica0419 changed the title from 'Try to join the cgroup of the init process of the tenant container when add_process_to_unit fails due to a "Device or resource busy" error' to 'Try to join the cgroup of the init process of the parent container when apply_cgroup for tenant container fails due to a "Device or resource busy" error' on Jan 4, 2026
@logica0419 logica0419 changed the title from 'Try to join the cgroup of the init process of the parent container when apply_cgroup for tenant container fails due to a "Device or resource busy" error' to 'Try to join the cgroup of the init process of the parent container when apply_cgroup for a tenant container fails due to a "Device or resource busy" error' on Jan 4, 2026
@logica0419
Contributor Author

logica0419 commented Jan 6, 2026

@utam0k @tommady
Thanks for the quick feedback! I didn’t expect comments to come in so fast 😅
I was planning to write the explanation today (I was pretty exhausted last night), so this was a nice surprise.

To clarify, the goal of this PR’s initial implementation is to serve as a conceptual demo, providing a starting point for discussing a more suitable implementation.

I’m still not very familiar with youki, or even Rust itself.
Please feel free to point out any issues, including basic ones or anything related to “Rust-ish” coding style.


Also, this may be out of context, but I want to clarify the wording here. I made a table of the wording I imagine.

| Perspective | Container A | Container B | A's init process | B's init process |
| --- | --- | --- | --- | --- |
| Container A | self (InitContainer) | child | init_process | child_init_process |
| Container B | parent | self (TenantContainer) | parent_init_process | init_process |
| tommady's comment | - | - | init process | exec process |
| runc | initProcess (containerProcess) | setnsProcess (containerProcess) | linuxStandardInit | linuxSetnsInit |

What confused me here is that the word init process used in Container B's context can mean Container A's init process or B's init process. That's why I used the name parent_init_process for Container A's init process in the implementation.

FYI: in runc, Container A's init process is called initProcessPid even in the context of Container B.
https://github.com/opencontainers/runc/blob/main/libcontainer/process_linux.go#L175

WDYT about this? Should I use the name init process as runc does?

@logica0419
Contributor Author

@tommady
Thank you too for finding this PR! I'm happy that I can help you solve the issue.
And, thanks again for the precise explanation of what's happening. I managed to get an abstract understanding, but your explanation helped me strengthen it so much.

For youki, the correct place to implement, I think, is here:

I strongly agree with that. I'll re-implement the logic there.
Thank you so much for the advice.

Signed-off-by: Takuto Nagami <logica0419@gmail.com>
…en add_process_to_unit fails

Signed-off-by: Takuto Nagami <logica0419@gmail.com>
@logica0419 logica0419 force-pushed the retyry-systemd-cgroup-EBUSY branch from 2b24f7f to e1e62b0 Compare January 6, 2026 08:44
@logica0419
Contributor Author

Just to clarify: I forgot to add sign-offs to the previous commits and have now pushed a complete re-implementation, so I force-pushed the branch.

Signed-off-by: Takuto Nagami <logica0419@gmail.com>
@utam0k utam0k marked this pull request as draft January 6, 2026 10:34
@utam0k
Member

utam0k commented Jan 6, 2026

Please mark it as "ready for review" once the discussion has settled and the detailed code is ready to be reviewed.

@tommady
Collaborator

tommady commented Jan 6, 2026

Thanks for the table — that actually helped me realize part of the confusion is on my side too 😅 I think I’ve been a bit sloppy with naming.

Referring to your table, when I said “init process” I meant Container B’s init process (the TenantContainer being exec’d into), not Container A’s init. In your terms, this is the exec case for Container B: if joining the configured cgroup fails under cgroup v2, exec should fall back to B’s init process cgroup, not the parent’s.

Sorry about the naming confusion 🤪 that’s on me. I’d really appreciate hearing others’ opinions on whether to use runc-style naming.

@utam0k
Member

utam0k commented Jan 6, 2026

This isn't a separate “Container B”; it's an exec/tenant process joining the existing container's cgroup. So calling it “parent” is confusing. How about landlord_init_pid (landlord = parent init in your context)?

Development

Successfully merging this pull request may close these issues.

[Bug]: Docker(moby) + youki cannot launch Dev Container with DinD