-
Notifications
You must be signed in to change notification settings - Fork 27
Open
Description
We’re having a problem on our Jenkins server: test runs often hang. It seems to be related to OSSubprocess. I tried to capture it in a small example:
https://github.com/Rinzwind/OSSubprocessTest
The example:
- Sets up a Seaside request handler at ‘/test’ (see:
TestRequestHandler>>#handleFiltered:) - Patches OSSVMProcess to log sends of
#checkFinishedChildren(see:BaselineOfTest>>#postLoad) - Runs a test that uses ‘curl’ to request ‘/test’, repeated 10000 times while another ‘bash’ process executes ‘ps’ every 5 seconds (see:
TestTestCase>>#test)
Below is a sample from the output we get on our Jenkins server. It shows that the test hangs:
- The line
Did #run (m: 2542, pid: 6480)shows that a ‘curl’ process was spawned with PID 6480 - The lines
6480 Z 00:00:00 \_ curl <defunct>from the ‘ps’ command show that that process has terminated but that its status has not been consumed yet (it’s a zombie process) - There is no corresponding line
Did #waitForExit (m: 2542)showing a return from#waitForExit - The only remaining output is from the ‘ps’ command
Note there’s also no more line Will #checkFinishedChildren after the last Did #run. As far as I understand, the child watcher in OSSVMProcess should be sending #checkFinishedChildren at least once a second:
OSSubprocess/repository/OSSubprocess/OSSVMProcess.class.st
Lines 234 to 244 in 67ded10
| childWatcher := [[ | |
| "OSProcess authors suspected that there were various ways in which OS signals | |
| could be missed under conditions of heavy load. For that reason, we use | |
| #waitTimeoutMSecs: with the semaphore, so that if a signal is missed, | |
| we time out and rescan occasionally anyway (#updateActiveChildrenAndNotifyDead | |
| sends queryExitStatus which sends waitpid() ) | |
| " | |
| self sigChldSemaphore waitTimeoutMSecs: 1000. | |
| processSynchronizationDelay wait. "Avoids lost signals in heavy process switching" | |
| self checkFinishedChildren. | |
| ] repeat] newProcess priority: Processor highIOPriority. |
So far I’ve only been able to reproduce this on our Jenkins server though, making it a little hard to debug.
Sample output:
[…]
+ /home/packer/smalltalkCI-master/bin/smalltalkci --smalltalk Pharo64-9.0 .smalltalk.ston
[…]
Downloading the latest 90 Image:
http://files.pharo.org/get-files/90/pharoImage-x86_64.zip
[…]
Downloading the latest pharoVM:
http://files.pharo.org/get-files/90/pharo-vm-Linux-x86_64-stable.zip
[…]
Build information: Pharo-9.0.0+build.1575.sha.9bb5f998e8a6d016ec7abde3ed09c4a60c0b4551 (64 Bit)
[…]
Running suite "Pharo64-9.0" with 1 tests.
[…]
Did #run (m: 2541, pid: 6478)
HTTP/1.1 200 OK
Content-Length: 5
Content-Type: text/plain
Server: Zinc HTTP Components 1.0 (Pharo/9.0)
Date: Wed, 03 Aug 2022 19:16:13 GMT
Test
Will #checkFinishedChildren
Did #checkFinishedChildren
Did #waitForExit (m: 2541)
Did #run (m: 2542, pid: 6480)
HTTP/1.1 200 OK
Content-Length: 5
Content-Type: text/plain
Server: Zinc HTTP Components 1.0 (Pharo/9.0)
Date: Wed, 03 Aug 2022 19:16:13 GMT
Test
Did sleep (n: 26)
PID STAT TIME COMMAND
1230 Sl 00:00:10 pharo
1245 S 00:00:00 \_ bash
6480 Z 00:00:00 \_ curl <defunct>
Did sleep (n: 27)
PID STAT TIME COMMAND
1230 Sl 00:00:10 pharo
1245 S 00:00:00 \_ bash
6480 Z 00:00:00 \_ curl <defunct>
[…]
Metadata
Metadata
Assignees
Labels
No labels