Try to join the cgroup of the init process of the parent container when apply_cgroup for a tenant container fails due to a "Device or resource busy" error #3347
Conversation
|
Thanks for opening this PR — I ran into the same issue. This problem is not about the init process, but about exec processes under cgroup v2 when domain controllers are enabled. Once a controller is turned on, the container’s configured cgroup may no longer be joinable (the kernel returns EBUSY / EPERM), and exec is expected to fall back to joining the init process’s cgroup. This behavior is explicitly documented by runc: “Note for cgroup v2: in case the process can’t join the top level cgroup, runc exec fallback is to try joining the cgroup of container’s init.” Importantly, this fallback is exec-only.
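To make the failure mode concrete: under cgroup v2, writing a PID into cgroup.procs of a cgroup whose domain controllers have been enabled through cgroup.subtree_control is rejected by the kernel. A minimal sketch (not youki code; the path handling and function name are illustrative):

```rust
use std::fs::OpenOptions;
use std::io::Write;

// Illustrative only: attach a process to a cgroup v2 cgroup by writing
// its PID into cgroup.procs. Once controllers such as "+memory +pids"
// are enabled in this cgroup's cgroup.subtree_control, the kernel
// rejects the write with EBUSY ("Device or resource busy").
fn try_attach(cgroup_path: &str, pid: u32) -> std::io::Result<()> {
    let mut procs = OpenOptions::new()
        .write(true)
        .open(format!("{cgroup_path}/cgroup.procs"))?;
    write!(procs, "{pid}")
}
```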
Because this is policy, not cgroup mechanism, runc implements it in the container execution path, not inside the cgroup manager itself. This avoids baking exec-only behavior into the cgroup manager and keeps the fallback consistent across cgroup drivers.
For youki, the correct place to implement this, I think, is here:

```rust
// crates/libcontainer/src/process/container_intermediate_process.rs
fn apply_cgroups<
    C: CgroupManager<Error = E> + ?Sized,
    E: std::error::Error + Send + Sync + 'static,
>(
    cmanager: &C,
    resources: Option<&LinuxResources>,
    init: bool,
) -> Result<()> { ... }
```

where we know, via the init flag, whether the process being placed into the cgroup is the container’s init or an exec (tenant) process.
Handling this inside libcgroups (or only for systemd) is insufficient and environment-dependent. The expected behavior should be: fail as before for init processes, but for exec processes retry on EBUSY by joining the cgroup of the container’s init process, regardless of the cgroup driver; see the sketch below.
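Roughly what I have in mind, as a minimal sketch: the CgroupJoin trait and both method names below are hypothetical stand-ins rather than youki’s or libcgroups’ actual API, and the errno check assumes the EBUSY is surfaced as a raw OS error (using the libc crate for the constant):

```rust
use std::io;

// Hypothetical stand-in for the cgroup manager: one method joins the
// configured cgroup, the other joins the cgroup another PID lives in.
trait CgroupJoin {
    fn add_task(&self, pid: i32) -> io::Result<()>;
    fn add_task_to_cgroup_of(&self, pid: i32, other_pid: i32) -> io::Result<()>;
}

fn apply_cgroup_with_fallback(
    manager: &dyn CgroupJoin,
    pid: i32,
    init: bool,            // true when placing the container's init process
    init_pid: Option<i32>, // PID of the container's init, known for exec
) -> io::Result<()> {
    match manager.add_task(pid) {
        Ok(()) => Ok(()),
        // Exec-only fallback: on EBUSY (cgroup v2 with domain controllers
        // enabled), retry by joining the init process's cgroup.
        Err(e) if !init && e.raw_os_error() == Some(libc::EBUSY) => match init_pid {
            Some(ip) => manager.add_task_to_cgroup_of(pid, ip),
            None => Err(e),
        },
        Err(e) => Err(e),
    }
}
```

Note that init processes still fail hard; only exec takes the fallback path, matching runc’s documented behavior.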
Without implementing this retry at the libcontainer level (as runc does), exec under cgroup v2 with domain controllers enabled will continue to fail for cgroupfs users. WDYT? Thanks again. |
| /// The init process PID of the parent container if the container is created as a tenant.
| parent_init_pid: Option<Pid>,
ContainerType should have parent_init_pid.
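If I read the suggestion right, the intended shape is something like this (a hypothetical sketch, not youki’s actual definition; it assumes the nix crate’s Pid type):

```rust
use nix::unistd::Pid;

// Hypothetical shape: carry the parent's init PID inside the tenant
// variant instead of a separate parent_init_pid field next to it.
enum ContainerType {
    InitContainer,
    TenantContainer { parent_init_pid: Pid },
}
```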
| Err(e) => {
|     // If adding the process to the cgroup fails due to a "Device or resource busy" error,
|     // manager tries to join the cgroup of the init process of the tenant container.
|     if e.to_string().contains("Device or resource busy")
How about getting the error (EBUSY) from the dbus client instead of parsing the error message?
I really wanted to, but I couldn't achieve that just by putting the following code here.
```rust
impl From<nix::Error> for SystemdClientError {
    fn from(err: nix::Error) -> SystemdClientError {
        match err {
            nix::Error::EBUSY => DbusError::DeviceOrResourceBusy(err.to_string()).into(),
            _ => DbusError::ConnectionError(err.to_string()).into(),
        }
    }
}
```
Seems like socket::sendmsg in dbus_native::DbusConnection::send_message() doesn't emit nix::Error::EBUSY. Rather, the failure only surfaces as an error message, with no typed error in the Result.
Could you give me some advice on what I should do here?
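For comparison, on the cgroupfs side a typed check could look roughly like this (a sketch; it assumes the error chain eventually wraps a std::io::Error carrying the raw errno, which, per the above, the dbus_native path currently does not provide):

```rust
use std::error::Error;

// Sketch: walk the source() chain looking for an io::Error whose raw
// errno is EBUSY, instead of substring-matching the display message.
fn is_ebusy(err: &(dyn Error + 'static)) -> bool {
    let mut cur: Option<&(dyn Error + 'static)> = Some(err);
    while let Some(e) = cur {
        if let Some(io_err) = e.downcast_ref::<std::io::Error>() {
            if io_err.raw_os_error() == Some(libc::EBUSY) {
                return true;
            }
        }
        cur = e.source();
    }
    false
}
```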
To clarify, the goal of this PR’s initial implementation is to serve as a conceptual demo, providing a starting point for discussing a more suitable implementation.
Since you mentioned it, I'll stop focusing on the detailed code of this PR for now. I'll review the detailed code once we've clarified and implemented the non-demo aspects.
crates/libcgroups/src/common.rs (Outdated)
| // is empty string ("") and the value is the cgroup path the <pid> is in.
| //
| // ref: https://github.com/opencontainers/cgroups/blob/main/utils.go#L171-L219
| pub fn parse_proc_cgroup_file(path: &str) -> Result<HashMap<String, String>, ParseProcCgroupError> {
Could we use the procfs crate? Be careful: if it reads inside the container, please use ProcfsHandle for safety.
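For reference, /proc/<pid>/cgroup has lines of the form hierarchy-ID:controller-list:cgroup-path; on cgroup v2 there is a single 0::<path> line, hence the empty-string key mentioned in the comment above. A minimal hand-rolled sketch of the parse (independent of the procfs crate, with a hypothetical function name):

```rust
use std::collections::HashMap;

// Sketch: map each line's controller list to its cgroup path. On cgroup
// v2 the only line is "0::<path>", so the key is the empty string "".
fn parse_proc_cgroup(content: &str) -> HashMap<String, String> {
    let mut map = HashMap::new();
    for line in content.lines() {
        // Format: hierarchy-ID:controller-list:cgroup-path
        let mut parts = line.splitn(3, ':');
        let _hierarchy_id = parts.next();
        if let (Some(controllers), Some(path)) = (parts.next(), parts.next()) {
            map.insert(controllers.to_string(), path.to_string());
        }
    }
    map
}
```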
|
@utam0k @tommady To clarify, the goal of this PR’s initial implementation is to serve as a conceptual demo, providing a starting point for discussing a more suitable implementation. I’m still not very familiar with youki, or even Rust itself. Also, this may be out of context, but I want to clarify the wording here. I made a table of the wording as I imagine it.
What confused me here is that the word “init process” can refer to a different process depending on which container it is used for.
WDYT about this? Which naming should I use? |
|
@tommady
I strongly agree with that. I'll re-implement the logic there. |
Signed-off-by: Takuto Nagami <logica0419@gmail.com>
…en add_process_to_unit fails
Signed-off-by: Takuto Nagami <logica0419@gmail.com>
force-pushed from 2b24f7f to e1e62b0
|
Just to clarify: since I forgot to add sign-offs to the previous commits and have now pushed a complete re-implementation, I force-pushed the branch. |
|
Please set it to "ready for review" when you are ready to have the detailed code reviewed after the discussion. |
Thanks for the table — that actually helped me realize part of the confusion is on my side too 😅 I think I’ve been a bit sloppy with naming. Referring to your table, when I said “init process” I meant Container B’s init process (the TenantContainer being exec’d into), not Container A’s init. In your terms, this is the exec case for Container B: if joining the configured cgroup fails under cgroup v2, exec should fall back to B’s init process cgroup, not the parent’s. Sorry about the naming confusion 🤪 that’s on me. I’d really appreciate hearing others’ opinions on whether to use runc-style naming. |
|
This isn't a separate “Container B”; it's an exec/tenant process joining the existing container's cgroup. So calling it “parent” is confusing. How about |


Description
Type of Change
Testing
Followed the Steps to Reproduce of [Bug]: Docker(moby) + youki cannot launch Dev Container with DinD #3342 and got the expected result.
Related Issues
Fixes #3342
Additional Context