Fix BlobStoreImpl non-daemon threads causing CI test hangs#6179
Closed
joewiz wants to merge 1 commit intoeXist-db:developfrom
Closed
Fix BlobStoreImpl non-daemon threads causing CI test hangs#6179joewiz wants to merge 1 commit intoeXist-db:developfrom
joewiz wants to merge 1 commit intoeXist-db:developfrom
Conversation
BlobStoreImpl's PersistentWriter and BlobVacuum threads were created as non-daemon threads, which prevents JVM exit if BrokerPool shutdown fails to join them cleanly. This caused ~20% of CI runs to hang indefinitely after tests completed. Mark both threads as daemon threads so they cannot block JVM shutdown. The threads still participate in normal shutdown via poison pill (PersistentWriter) and interrupt (BlobVacuum), but now the JVM can exit even if those shutdown paths fail. Also add a forkedProcessExitTimeoutInSeconds safety net to the surefire configuration, and a regression test that verifies no non-daemon eXist threads survive BrokerPool shutdown. Follow-up to PR eXist-db#6167 which fixed the same issue for StatusReporter. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Contributor
|
@joewiz As the author of BlobStore, I just want to point out that these two threads have to be non-daemon threads. The problem needs to be fixed elsewhere in a different fashion otherwise with this PR you risk introducing database corruptions. |
joewiz
added a commit
to joewiz/exist
that referenced
this pull request
Mar 26, 2026
BlobStoreImpl.normalClose() and abnormalPersistentWriterShutdown() called Thread.join() with no timeout on the PersistentWriter and BlobVacuum threads. If either thread failed to terminate promptly, the shutdown would hang indefinitely — causing CI test suite timeouts. Add 30-second bounded timeouts to all join() calls. If the PersistentWriter doesn't terminate within 30s, it is interrupted and given 5 more seconds. The threads remain non-daemon to ensure pending blob writes complete during normal shutdown. This supersedes eXist-db#6179, which incorrectly made BlobStore threads daemon threads. As @adamretter (author of BlobStore) pointed out, these threads must remain non-daemon to ensure pending blob writes complete. The fix is bounded join() timeouts during shutdown — giving threads 30s to drain before proceeding. Verified: 3/3 full test suite trials pass with clean exit (3:26-4:23), 27/27 BlobStore-specific tests pass. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Member
Author
|
[This response was co-authored with Claude Code. -Joe] Closing in favor of a revised approach. Making BlobStore threads daemon was incorrect — as @adamretter pointed out, these threads must remain non-daemon to ensure pending blob writes complete. The replacement PR uses bounded |
3 tasks
joewiz
added a commit
to joewiz/exist
that referenced
this pull request
Mar 27, 2026
BlobStoreImpl.normalClose() and abnormalPersistentWriterShutdown() called Thread.join() with no timeout on the PersistentWriter and BlobVacuum threads. If either thread failed to terminate promptly, the shutdown would hang indefinitely — causing CI test suite timeouts. Add 30-second bounded timeouts to all join() calls. If the PersistentWriter doesn't terminate within 30s, it is interrupted and given 5 more seconds. The threads remain non-daemon to ensure pending blob writes complete during normal shutdown. This supersedes eXist-db#6179, which incorrectly made BlobStore threads daemon threads. As @adamretter (author of BlobStore) pointed out, these threads must remain non-daemon to ensure pending blob writes complete. The fix is bounded join() timeouts during shutdown — giving threads 30s to drain before proceeding. Verified: 3/3 full test suite trials pass with clean exit (3:26-4:23), 27/27 BlobStore-specific tests pass. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
What Changed
BlobStoreImpl.java: AddedsetDaemon(true)to bothpersistentWriterThreadandblobVacuumThreadbefore starting them. These threads still participate in normal shutdown via poison pill / interrupt, but the JVM can now exit even if those shutdown paths fail.exist-parent/pom.xml: AddedforkedProcessExitTimeoutInSeconds=60to the surefire configuration as a safety net — surefire will kill forked test JVMs that fail to exit within 60 seconds.BrokerPoolShutdownTest.java(new): Regression test that starts a BrokerPool, shuts it down, and asserts that no non-daemon eXist-related threads remain alive. Catches future regressions where new non-daemon threads are introduced.Evidence
A controlled hang experiment (5 serial trials × 4 configs) identified BlobStore threads as the remaining hang cause after #6167. jstack captures showed
BlobStoreImpl$PersistentWriterandBlobStoreImpl$BlobVacuumparked on blocking queue operations after all tests completed.Test Plan
BrokerPoolShutdownTestpasses🤖 Generated with Claude Code