Skip to content

Message #waitForExit doesn’t always return when the process has terminated #73

@Rinzwind

Description

@Rinzwind

We’re having a problem on our Jenkins server: test runs often hang. It seems to be related to OSSubprocess. I tried to capture it in a small example:

https://github.com/Rinzwind/OSSubprocessTest

The example:

Below is a sample from the output we get on our Jenkins server. It shows that the test hangs:

  • The line Did #run (m: 2542, pid: 6480) shows that a ‘curl’ process was spawned with PID 6480
  • The lines 6480 Z 00:00:00 \_ curl <defunct> from the ‘ps’ command show that that process has terminated but that its status has not been consumed yet (it’s a zombie process)
  • There is no corresponding line Did #waitForExit (m: 2542) showing a return from #waitForExit
  • The only remaining output is from the ‘ps’ command

Note there’s also no more line Will #checkFinishedChildren after the last Did #run. As far as I understand, the child watcher in OSSVMProcess should be sending #checkFinishedChildren at least once a second:

childWatcher := [[
"OSProcess authors suspected that there were various ways in which OS signals
could be missed under conditions of heavy load. For that reason, we use
#waitTimeoutMSecs: with the semaphore, so that if a signal is missed,
we time out and rescan occasionally anyway (#updateActiveChildrenAndNotifyDead
sends queryExitStatus which sends waitpid() )
"
self sigChldSemaphore waitTimeoutMSecs: 1000.
processSynchronizationDelay wait. "Avoids lost signals in heavy process switching"
self checkFinishedChildren.
] repeat] newProcess priority: Processor highIOPriority.

So far I’ve only been able to reproduce this on our Jenkins server though, making it a little hard to debug.

Sample output:

[…]
+ /home/packer/smalltalkCI-master/bin/smalltalkci --smalltalk Pharo64-9.0 .smalltalk.ston
[…]
Downloading the latest 90 Image:
    http://files.pharo.org/get-files/90/pharoImage-x86_64.zip
[…]
Downloading the latest pharoVM:
	http://files.pharo.org/get-files/90/pharo-vm-Linux-x86_64-stable.zip
[…]
Build information: Pharo-9.0.0+build.1575.sha.9bb5f998e8a6d016ec7abde3ed09c4a60c0b4551 (64 Bit)
[…]
Running suite "Pharo64-9.0" with 1 tests.
[…]
Did #run (m: 2541, pid: 6478)
HTTP/1.1 200 OK
Content-Length: 5
Content-Type: text/plain
Server: Zinc HTTP Components 1.0 (Pharo/9.0)
Date: Wed, 03 Aug 2022 19:16:13 GMT

Test
Will #checkFinishedChildren
Did #checkFinishedChildren
Did #waitForExit (m: 2541)
Did #run (m: 2542, pid: 6480)
HTTP/1.1 200 OK
Content-Length: 5
Content-Type: text/plain
Server: Zinc HTTP Components 1.0 (Pharo/9.0)
Date: Wed, 03 Aug 2022 19:16:13 GMT

Test
Did sleep (n: 26)
    PID STAT     TIME COMMAND
   1230 Sl   00:00:10 pharo
   1245 S    00:00:00  \_ bash
   6480 Z    00:00:00  \_ curl <defunct>
Did sleep (n: 27)
    PID STAT     TIME COMMAND
   1230 Sl   00:00:10 pharo
   1245 S    00:00:00  \_ bash
   6480 Z    00:00:00  \_ curl <defunct>
[…]

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions