Description
I found this behavior by accidentally blocking the write syscall with a seccomp profile. This is an uncommon execution path, but the issue is reproducible and the root cause is identified, so I am reporting it here.
The parent runc process hangs indefinitely in ReadPacket if the child process is abruptly killed (e.g., by a seccomp SIGSYS) immediately after the parent sends the procRun packet. This is a self-deadlock: the parent process inadvertently retains a write reference to the synchronization socket. Since the kernel will not deliver an io.EOF as long as any write reference remains, the parent blocks forever, unaware that the child has exited. As a result, the child is never reaped and lingers as a zombie process.
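The EOF behavior described above can be reproduced outside runc. The following is a minimal sketch, not runc code: it assumes the retained reference behaves like a duplicate of the child's end of a SOCK_SEQPACKET socketpair, and the names `sawEOF` and `demo` are hypothetical. It only uses the standard `syscall` package on Linux.

```go
package main

import (
	"fmt"
	"syscall"
)

// sawEOF does a non-blocking peek on fd and reports whether the kernel
// signals end-of-stream (a zero-length read) rather than "no data yet".
func sawEOF(fd int) (bool, error) {
	buf := make([]byte, 1)
	n, _, err := syscall.Recvfrom(fd, buf, syscall.MSG_PEEK|syscall.MSG_DONTWAIT)
	if err == syscall.EAGAIN {
		return false, nil // nothing queued, and the peer end is still referenced
	}
	if err != nil {
		return false, err
	}
	return n == 0, nil
}

// demo reports the EOF state seen by the "parent" end while a stray
// duplicate of the "child" end is still open, and again after closing it.
func demo() (withLeak, afterClose bool, err error) {
	fds, err := syscall.Socketpair(syscall.AF_UNIX, syscall.SOCK_SEQPACKET, 0)
	if err != nil {
		return false, false, err
	}
	parentEnd, childEnd := fds[0], fds[1]
	defer syscall.Close(parentEnd)

	// Assumption for illustration: the parent keeps an extra reference
	// to the child's end of the pair.
	leaked, err := syscall.Dup(childEnd)
	if err != nil {
		syscall.Close(childEnd)
		return false, false, err
	}
	syscall.Close(childEnd) // stands in for the child being killed

	withLeak, err = sawEOF(parentEnd)
	syscall.Close(leaked)
	if err != nil {
		return false, false, err
	}
	afterClose, err = sawEOF(parentEnd)
	return withLeak, afterClose, err
}

func main() {
	withLeak, afterClose, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("EOF while the stray reference is open:", withLeak)
	fmt.Println("EOF after closing the stray reference:", afterClose)
}
```

With a blocking read instead of the non-blocking peek used here, the first check would simply never return, which matches the hang reported in this issue.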
Steps to reproduce the issue
- Add this seccomp configuration to config.json:
"seccomp": {
    "defaultAction": "SCMP_ACT_ALLOW",
    "architectures": [
        "SCMP_ARCH_X86_64"
    ],
    "syscalls": [
        {
            "names": ["write"],
            "action": "SCMP_ACT_KILL"
        }
    ]
}
The hang occurs with SCMP_ACT_KILL on the write, close, or getdents64 syscall.
- Run a container:
sudo runc run test
- The parent runc process will hang
- top result
In the process tree, 84927 (the parent runc process) is in the Sleep state and 84944 (the stage-2 child) is a Zombie.
- debug log
╰─$ sudo ~/runc-zombie-reap/runc/runc --debug run test
DEBU[0000]libcontainer/exeseal/cloned_binary_linux.go:207 libcontainer/exeseal.IsCloned() F_GET_SEALS on /proc/self/exe failed: invalid argument
DEBU[0000]libcontainer/exeseal/cloned_binary_linux.go:232 libcontainer/exeseal.CloneSelfExe() runc exeseal: using overlayfs for sealed /proc/self/exe
DEBU[0000]libcontainer/container_linux.go:527 libcontainer.(*Container).newParentProcess() runc exeseal: using /proc/self/exe clone
DEBU[0000] nsexec[84938]: => nsexec container setup
DEBU[0000] nsexec[84938]: affinity: 0xfff
DEBU[0000] nsexec-0[84938]: ~> nsexec stage-0
DEBU[0000] nsexec-0[84938]: spawn stage-1
DEBU[0000] nsexec-0[84938]: -> stage-1 synchronisation loop
DEBU[0000] nsexec-1[84943]: ~> nsexec stage-1
DEBU[0000] nsexec-1[84943]: unshare remaining namespaces
DEBU[0000] nsexec-1[84943]: spawn stage-2
DEBU[0000] nsexec-1[84943]: request stage-0 to forward stage-2 pid (84944)
DEBU[0000] nsexec-0[84938]: stage-1 requested pid to be forwarded
DEBU[0000] nsexec-0[84938]: forward stage-1 (84943) and stage-2 (84944) pids to runc
DEBU[0000] nsexec-1[84943]: signal completion to stage-0
DEBU[0000] nsexec-1[84943]: <~ nsexec stage-1
DEBU[0000] nsexec-2[1]: ~> nsexec stage-2
DEBU[0000] nsexec-0[84938]: stage-1 complete
DEBU[0000] nsexec-0[84938]: <- stage-1 synchronisation loop
DEBU[0000] nsexec-0[84938]: -> stage-2 synchronisation loop
DEBU[0000] nsexec-0[84938]: signalling stage-2 to run
DEBU[0000] nsexec-2[1]: signal completion to stage-0
DEBU[0000] nsexec-2[1]: <= nsexec container setup
DEBU[0000] nsexec-2[1]: booting up go runtime ...
DEBU[0000] nsexec-0[84938]: stage-2 complete
DEBU[0000] nsexec-0[84938]: <- stage-2 synchronisation loop
DEBU[0000] nsexec-0[84938]: <~ nsexec stage-0
DEBU[0000]libcontainer/process_linux.go:660 libcontainer.(*initProcess).goCreateMountSources.func2() mount source thread: successfully running in container mntns
DEBU[0000] reading sync
DEBU[0000] child process in init()
DEBU[0000] writing sync type:procHooks
DEBU[0000] reading sync
DEBU[0000] read sync type:procHooks
DEBU[0000] writing sync type:procHooksDone
DEBU[0000] reading sync
DEBU[0000] read sync type:procHooksDone
DEBU[0000] read sync type:procReady
DEBU[0000] writing sync type:procReady
DEBU[0000] reading sync
DEBU[0000] writing sync type:procRun
DEBU[0000] reading sync
DEBU[0000] read sync type:procRun
DEBU[0000] HOME not set in process.env, and getting UID 0 homedir failed
DEBU[0000] seccomp: skipping -ENOSYS stub filter generation
DEBU[0000] seccomp: prepending -ENOSYS stub filter to user filter...
DEBU[0000] [....] --- original filter ---
DEBU[0000] seccomp filter flags: 4
DEBU[0060] mount source thread: closing thread: context deadline exceeded
- strace result: 84927 is blocked on recvfrom
╰─$ sudo strace -p 84927
strace: Process 84927 attached
recvfrom(13,
- Check the socket status and open fds
- The sync socket (fd 13) is open in u (read/write) mode
╰─$ sudo lsof -p 84927 | grep 13u
lsof: WARNING: can't stat() fuse.portal file system /run/user/1001/doc
Output information may be incomplete.
lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /run/user/1001/gvfs
Output information may be incomplete.
runc 84927 root 13u unix 0xffff95120cd0f800 0t0 444992 type=SEQPACKET (CONNECTED)
- The socket reference count is 3
╰─$ grep 444992 /proc/net/unix
0000000000000000: 00000003 00000000 00000000 0005 03 444992
Describe the results you received and expected
- Received Results: The parent runc process hangs indefinitely in a blocking recvfrom call. It never reaches the waitpid stage, leaving the child as a zombie process and leaking host resources.
- Expected Results: runc should detect the child's exit (via EOF or by checking the child's PID status) and exit gracefully with an error message instead of hanging.
- Suggested Fix: Consider adding a poll-based fail-safe before recvfrom, or ensuring the parent closes its write file descriptor before reading.
Below is a polling-based workaround. However, if the parent performs no further writes after procRun, explicitly closing the parent's write file descriptor would be the cleaner fix.
diff --git a/libcontainer/sync_unix.go b/libcontainer/sync_unix.go
index 2c6e4387..900a0a5e 100644
--- a/libcontainer/sync_unix.go
+++ b/libcontainer/sync_unix.go
@@ -43,6 +43,22 @@ func (s *syncSocket) WritePacket(b []byte) (int, error) {
 }

 func (s *syncSocket) ReadPacket() ([]byte, error) {
+	fds := []unix.PollFd{{Fd: int32(s.f.Fd()), Events: unix.POLLIN}}
+
+	for {
+		n, err := unix.Poll(fds, 100)
+		if err != nil {
+			if err == unix.EINTR { continue }
+			return nil, fmt.Errorf("poll sync socket: %w", err)
+		}
+
+		if n == 0 {
+			return nil, fmt.Errorf("sync pipe timeout: child process may have exited")
+		}
+
+		break
+	}
+
 	size, _, err := linux.Recvfrom(int(s.f.Fd()), nil, unix.MSG_TRUNC|unix.MSG_PEEK)
 	if err != nil {
 		return nil, fmt.Errorf("fetch packet length from socket: %w", err)
What version of runc are you using?
╰─$ ./runc --version
runc version 1.4.0-rc.1+dev
commit: v1.4.0-rc.1-236-g9b40f6af-dirty
spec: 1.3.0
go: go1.24.0
libseccomp: 2.5.5
Host OS information
Linux 6.8.12
Host kernel information
No response