Add bounded timeouts to BlobStore shutdown join() calls#6183
Closed
joewiz wants to merge 1 commit intoeXist-db:developfrom
Closed
Add bounded timeouts to BlobStore shutdown join() calls#6183joewiz wants to merge 1 commit intoeXist-db:developfrom
joewiz wants to merge 1 commit intoeXist-db:developfrom
Conversation
BlobStoreImpl.normalClose() and abnormalPersistentWriterShutdown() called Thread.join() with no timeout on the PersistentWriter and BlobVacuum threads. If either thread failed to terminate promptly, the shutdown would hang indefinitely — causing CI test suite timeouts. Add 30-second bounded timeouts to all join() calls. If the PersistentWriter doesn't terminate within 30s, it is interrupted and given 5 more seconds. The threads remain non-daemon to ensure pending blob writes complete during normal shutdown. This supersedes eXist-db#6179, which incorrectly made BlobStore threads daemon threads. As @adamretter (author of BlobStore) pointed out, these threads must remain non-daemon to ensure pending blob writes complete. The fix is bounded join() timeouts during shutdown — giving threads 30s to drain before proceeding. Verified: 3/3 full test suite trials pass with clean exit (3:26-4:23), 27/27 BlobStore-specific tests pass. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Contributor
3 tasks
Member
Author
|
[This response was co-authored with Claude Code. -Joe] Closing this PR. Adam is right that the bounded timeout has the same fundamental risk as the daemon thread approach — if the timeout fires and The CI hang symptom that motivated this fix is now addressed through other means:
The underlying BlobStore shutdown behavior (non-daemon threads that can prevent JVM exit if they hang) remains as-is. It's the correct design for data integrity — these threads must complete their work before the JVM exits. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
BlobStoreImpl.normalClose()andabnormalPersistentWriterShutdown()calledThread.join()with no timeout on the PersistentWriter and BlobVacuum threads. If either thread failed to terminate promptly, the shutdown would hang indefinitely — causing CI test suite timeouts when the surefire fork couldn't exit.This supersedes #6179, which incorrectly made BlobStore threads daemon threads. As @adamretter (author of BlobStore) correctly pointed out, these threads must remain non-daemon to ensure pending blob writes complete. The fix is bounded
join()timeouts during shutdown — giving threads 30s to drain before proceeding.What changed
BlobStoreImpl.java— Added 30-second bounded timeouts to all threeThread.join()calls:normalClose()— PersistentWriter:join(30_000), theninterrupt()+join(5_000)if still alivenormalClose()— BlobVacuum:join(30_000)with warning if still aliveabnormalPersistentWriterShutdown()— BlobVacuum:join(30_000)with warning if still aliveWhy this is safe
The existing shutdown mechanisms are correct and unchanged:
POISON_PILLvia the persist queue → drains remaining items and exitsinterrupt()→ wakes fromPriorityBlockingQueue.take()and exitsIn practice, both threads respond within milliseconds. The 30s timeout is a safety net — it only fires if something goes wrong, ensuring shutdown completes instead of hanging forever.
Test results
Test plan
BlobStoreImplTest— 27 tests, 0 failuresmvn test -pl exist-core— 3 trials, all pass, no hangs🤖 Generated with Claude Code