Indefinite parent hang in ReadPacket upon abnormal child exit after procRun #5087

@ilp-sys

Description

I found this behavior by accidentally blocking the write syscall with a seccomp profile. This is an uncommon execution path, but the issue is reproducible and the root cause is identified, so I am reporting it here.

The parent runc process hangs indefinitely in ReadPacket if the child process is abruptly killed (e.g., by a seccomp SIGSYS) immediately after the parent sends the procRun packet. The parent deadlocks on itself because it inadvertently retains a write reference to the synchronization socket. The kernel does not report end-of-file (io.EOF in Go) while any write reference remains open, so the parent blocks forever, unaware that the child has exited, and the child lingers as a zombie process.
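
The EOF behavior described above can be reproduced in isolation. The following is a minimal sketch, not runc code: it assumes a Linux host, uses a plain SOCK_SEQPACKET socketpair, and models the retained reference with a dup() of the peer end. The names peekState and demo are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"syscall"
)

// peekState does a non-blocking receive on fd and reports what a blocking
// reader would run into: "would block", "eof", or "data".
func peekState(fd int) (string, error) {
	buf := make([]byte, 16)
	n, _, err := syscall.Recvfrom(fd, buf, syscall.MSG_DONTWAIT|syscall.MSG_PEEK)
	switch {
	case errors.Is(err, syscall.EAGAIN):
		return "would block", nil
	case err != nil:
		return "", err
	case n == 0:
		return "eof", nil
	default:
		return "data", nil
	}
}

// demo returns the reader's view before and after the last reference to the
// peer end of the socketpair is closed.
func demo() (before, after string, err error) {
	fds, err := syscall.Socketpair(syscall.AF_UNIX, syscall.SOCK_SEQPACKET, 0)
	if err != nil {
		return "", "", err
	}
	parentEnd, childEnd := fds[0], fds[1]
	defer syscall.Close(parentEnd)

	// Assumption: stands in for the reference the parent keeps to the child's end.
	retained, err := syscall.Dup(childEnd)
	if err != nil {
		syscall.Close(childEnd)
		return "", "", err
	}

	// Stands in for the child being killed: its own descriptor goes away.
	syscall.Close(childEnd)
	before, err = peekState(parentEnd)
	if err != nil {
		syscall.Close(retained)
		return "", "", err
	}

	// Only once the retained duplicate is closed does the reader see EOF.
	syscall.Close(retained)
	after, err = peekState(parentEnd)
	return before, after, err
}

func main() {
	before, after, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("peer end still referenced:", before)
	fmt.Println("all references closed:", after)
}
```

In this sketch the first probe reports "would block" and the second reports "eof": the reader only sees end-of-file once the last descriptor for the peer end has been closed.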

Steps to reproduce the issue

  1. Add this seccomp configuration to config.json:
     "seccomp": {
         "defaultAction": "SCMP_ACT_ALLOW",
         "architectures": [
             "SCMP_ARCH_X86_64"
         ],
         "syscalls": [
             {
                 "names": ["write"],
                 "action": "SCMP_ACT_KILL"
             }
         ]
     }

SCMP_ACT_KILL when the syscall is write, close, or getdents64

  2. Run a container: sudo runc run test
  3. The parent runc process hangs
  • top result
    In the process tree below, 84927 is in the Sleep state and 84944 is a Zombie.
Image
  • debug log
╰─$ sudo ~/runc-zombie-reap/runc/runc --debug run test
DEBU[0000]libcontainer/exeseal/cloned_binary_linux.go:207 libcontainer/exeseal.IsCloned() F_GET_SEALS on /proc/self/exe failed: invalid argument
DEBU[0000]libcontainer/exeseal/cloned_binary_linux.go:232 libcontainer/exeseal.CloneSelfExe() runc exeseal: using overlayfs for sealed /proc/self/exe
DEBU[0000]libcontainer/container_linux.go:527 libcontainer.(*Container).newParentProcess() runc exeseal: using /proc/self/exe clone
DEBU[0000] nsexec[84938]: => nsexec container setup
DEBU[0000] nsexec[84938]: affinity: 0xfff
DEBU[0000] nsexec-0[84938]: ~> nsexec stage-0
DEBU[0000] nsexec-0[84938]: spawn stage-1
DEBU[0000] nsexec-0[84938]: -> stage-1 synchronisation loop
DEBU[0000] nsexec-1[84943]: ~> nsexec stage-1
DEBU[0000] nsexec-1[84943]: unshare remaining namespaces
DEBU[0000] nsexec-1[84943]: spawn stage-2
DEBU[0000] nsexec-1[84943]: request stage-0 to forward stage-2 pid (84944)
DEBU[0000] nsexec-0[84938]: stage-1 requested pid to be forwarded
DEBU[0000] nsexec-0[84938]: forward stage-1 (84943) and stage-2 (84944) pids to runc
DEBU[0000] nsexec-1[84943]: signal completion to stage-0
DEBU[0000] nsexec-1[84943]: <~ nsexec stage-1
DEBU[0000] nsexec-2[1]: ~> nsexec stage-2
DEBU[0000] nsexec-0[84938]: stage-1 complete
DEBU[0000] nsexec-0[84938]: <- stage-1 synchronisation loop
DEBU[0000] nsexec-0[84938]: -> stage-2 synchronisation loop
DEBU[0000] nsexec-0[84938]: signalling stage-2 to run
DEBU[0000] nsexec-2[1]: signal completion to stage-0
DEBU[0000] nsexec-2[1]: <= nsexec container setup
DEBU[0000] nsexec-2[1]: booting up go runtime ...
DEBU[0000] nsexec-0[84938]: stage-2 complete
DEBU[0000] nsexec-0[84938]: <- stage-2 synchronisation loop
DEBU[0000] nsexec-0[84938]: <~ nsexec stage-0
DEBU[0000]libcontainer/process_linux.go:660 libcontainer.(*initProcess).goCreateMountSources.func2() mount source thread: successfully running in container mntns
DEBU[0000] reading sync
DEBU[0000] child process in init()
DEBU[0000] writing sync type:procHooks
DEBU[0000] reading sync
DEBU[0000] read sync type:procHooks
DEBU[0000] writing sync type:procHooksDone
DEBU[0000] reading sync
DEBU[0000] read sync type:procHooksDone
DEBU[0000] read sync type:procReady
DEBU[0000] writing sync type:procReady
DEBU[0000] reading sync
DEBU[0000] writing sync type:procRun
DEBU[0000] reading sync
DEBU[0000] read sync type:procRun
DEBU[0000] HOME not set in process.env, and getting UID 0 homedir failed
DEBU[0000] seccomp: skipping -ENOSYS stub filter generation
DEBU[0000] seccomp: prepending -ENOSYS stub filter to user filter...
DEBU[0000]   [....] --- original filter ---
DEBU[0000] seccomp filter flags: 4
DEBU[0060] mount source thread: closing thread: context deadline exceeded

  • strace result: 84927 is blocked on recvfrom
╰─$ sudo strace -p 84927                                                                                            
strace: Process 84927 attached
recvfrom(13,
  4. Check the socket status and open fds
  • The sync socket (fd 13) is open in u (read/write) mode
╰─$ sudo lsof -p 84927 | grep 13u
lsof: WARNING: can't stat() fuse.portal file system /run/user/1001/doc
      Output information may be incomplete.
lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /run/user/1001/gvfs
      Output information may be incomplete.
runc    84927 root   13u     unix 0xffff95120cd0f800      0t0   444992 type=SEQPACKET (CONNECTED)
  • The socket reference count is 3
╰─$ grep 444992  /proc/net/unix
0000000000000000: 00000003 00000000 00000000 0005 03 444992
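
For reference, the /proc/net/unix line above can be decoded field by field. This is a minimal sketch, assuming the usual column order (Num, RefCount, Protocol, Flags, Type, St, Inode, Path); parseProcNetUnixLine and unixSockEntry are hypothetical names. It only parses the line and does not establish how RefCount relates to the number of open descriptors.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// unixSockEntry holds the numeric columns of one /proc/net/unix row that are
// relevant here.
type unixSockEntry struct {
	RefCount, Type, State, Inode uint64
}

// parseProcNetUnixLine parses a single data row of /proc/net/unix.
// RefCount, Type and St are hexadecimal; Inode is decimal.
func parseProcNetUnixLine(line string) (unixSockEntry, error) {
	f := strings.Fields(line)
	if len(f) < 7 {
		return unixSockEntry{}, fmt.Errorf("want at least 7 fields, got %d", len(f))
	}
	var e unixSockEntry
	cols := []struct {
		dst  *uint64
		text string
		base int
	}{
		{&e.RefCount, f[1], 16},
		{&e.Type, f[4], 16},
		{&e.State, f[5], 16},
		{&e.Inode, f[6], 10},
	}
	for _, c := range cols {
		v, err := strconv.ParseUint(c.text, c.base, 64)
		if err != nil {
			return unixSockEntry{}, fmt.Errorf("parse %q: %w", c.text, err)
		}
		*c.dst = v
	}
	return e, nil
}

func main() {
	e, err := parseProcNetUnixLine("0000000000000000: 00000003 00000000 00000000 0005 03 444992")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("refcount=%d type=%d state=%d inode=%d\n", e.RefCount, e.Type, e.State, e.Inode)
}
```

For the line above this yields a reference count of 3, type 5 (SOCK_SEQPACKET), state 3 (connected), and inode 444992, which matches the lsof output.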

Describe the results you received and expected

  • Received Results: The parent runc process hangs indefinitely in a blocking Recvfrom call. It never reaches the waitpid stage, leaving the child as a zombie process and leaking host resources.

  • Expected Results: runc should detect the child's exit (via EOF or by checking the child's PID status) and exit gracefully with an error message instead of hanging.

  • Suggested Fix: Consider adding a poll-based fail-safe before recvfrom or ensuring the parent closes its write file descriptor before reading.

Below is a polling-based workaround. If the parent performs no further writes after procRun, however, explicitly closing the parent's write file descriptor would be preferable.

diff --git a/libcontainer/sync_unix.go b/libcontainer/sync_unix.go
index 2c6e4387..900a0a5e 100644
--- a/libcontainer/sync_unix.go
+++ b/libcontainer/sync_unix.go
@@ -43,6 +43,22 @@ func (s *syncSocket) WritePacket(b []byte) (int, error) {
 }

 func (s *syncSocket) ReadPacket() ([]byte, error) {
+       fds := []unix.PollFd{{Fd: int32(s.f.Fd()), Events: unix.POLLIN}}
+
+       for {
+               n, err := unix.Poll(fds, 100)
+               if err != nil {
+                       if err == unix.EINTR { continue }
+                       return nil, fmt.Errorf("poll sync socket: %w", err)
+               }
+
+               if n == 0 {
+                       return nil, fmt.Errorf("sync pipe timeout: child process may have exited")
+               }
+
+               break
+       }
+
        size, _, err := linux.Recvfrom(int(s.f.Fd()), nil, unix.MSG_TRUNC|unix.MSG_PEEK)
        if err != nil {
                return nil, fmt.Errorf("fetch packet length from socket: %w", err)
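
As an alternative to a fixed poll timeout, the other option mentioned under Expected Results (checking the child's status) could look roughly like the sketch below. This is an illustration, not a patch against runc: readPacketOrDetectExit, errPeerGone, and the alive callback are hypothetical, and how alive() would be implemented inside runc (for example via a non-blocking wait on the init process) is left as an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"syscall"
	"time"
)

var errPeerGone = errors.New("sync socket: peer exited without sending a packet")

// readPacketOrDetectExit polls fd with non-blocking receives and, between
// attempts, asks alive() whether the peer process still exists.
func readPacketOrDetectExit(fd int, alive func() bool, interval time.Duration) ([]byte, error) {
	buf := make([]byte, 4096)
	peerGone := false
	for {
		n, _, err := syscall.Recvfrom(fd, buf, syscall.MSG_DONTWAIT)
		switch {
		case errors.Is(err, syscall.EINTR):
			continue
		case errors.Is(err, syscall.EAGAIN):
			// Nothing queued yet; fall through to the liveness check.
		case err != nil:
			return nil, fmt.Errorf("recv from sync socket: %w", err)
		case n == 0:
			return nil, io.EOF
		default:
			return buf[:n], nil
		}
		if peerGone {
			return nil, errPeerGone
		}
		if alive() {
			time.Sleep(interval)
		} else {
			// Try one last receive so a packet sent just before exit is not lost.
			peerGone = true
		}
	}
}

// demo exercises both outcomes on a local socketpair, with alive() stubbed to
// report a dead peer.
func demo() (emptyErr error, packet string, err error) {
	fds, err := syscall.Socketpair(syscall.AF_UNIX, syscall.SOCK_SEQPACKET, 0)
	if err != nil {
		return nil, "", err
	}
	defer syscall.Close(fds[0])
	defer syscall.Close(fds[1])
	dead := func() bool { return false }

	// Nothing queued and the peer is reported dead: the read gives up.
	_, emptyErr = readPacketOrDetectExit(fds[0], dead, time.Millisecond)

	// A packet queued before the peer died is still delivered.
	if _, err = syscall.Write(fds[1], []byte("procRun")); err != nil {
		return emptyErr, "", err
	}
	pkt, err := readPacketOrDetectExit(fds[0], dead, time.Millisecond)
	return emptyErr, string(pkt), err
}

func main() {
	emptyErr, packet, err := demo()
	fmt.Println("empty socket, dead peer:", emptyErr)
	fmt.Println("queued packet:", packet, "err:", err)
}
```

The extra receive after alive() reports false is deliberate: without it, a packet written just before the child exited could be dropped in favor of the error.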

What version of runc are you using?

╰─$ ./runc --version                                                                                                     
runc version 1.4.0-rc.1+dev
commit: v1.4.0-rc.1-236-g9b40f6af-dirty
spec: 1.3.0
go: go1.24.0
libseccomp: 2.5.5

Host OS information

Linux 6.8.12

Host kernel information

No response
