Fix BlobStoreImpl non-daemon threads causing CI test hangs #6179

Closed
joewiz wants to merge 1 commit into eXist-db:develop from joewiz:bugfix/blobstore-daemon-threads
joewiz (Member) commented Mar 25, 2026

Summary

  • BlobStoreImpl's PersistentWriter and BlobVacuum threads were non-daemon threads that blocked JVM exit when BrokerPool shutdown failed to join them cleanly
  • This caused ~20% of CI runs to hang indefinitely after all tests had completed
  • Follow-up to #6167 (Fix BrokerPool shutdown hang causing CI timeouts), which fixed the same issue for StatusReporter

What Changed

BlobStoreImpl.java: Added setDaemon(true) to both persistentWriterThread and blobVacuumThread before starting them. These threads still participate in normal shutdown via poison pill / interrupt, but the JVM can now exit even if those shutdown paths fail.
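
For readers unfamiliar with the daemon flag, here is a minimal sketch of the mechanism this change relies on. It is not the BlobStoreImpl code: the class name `DaemonFlagSketch`, the helper `newWorker`, and the thread name are made up for illustration. The worker blocks on a queue, much like the parked threads seen in the jstack captures. Note that the JVM abandons daemon threads at exit without letting them finish, so any work still in flight is dropped.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class DaemonFlagSketch {

    /**
     * Creates (but does not start) a worker that blocks forever on an empty queue.
     * Hypothetical helper; it only illustrates the effect of the daemon flag.
     */
    public static Thread newWorker(final String name, final boolean daemon) {
        final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        final Thread worker = new Thread(() -> {
            try {
                queue.take(); // parks here, nothing is ever offered
            } catch (final InterruptedException e) {
                Thread.currentThread().interrupt(); // restore the flag and exit
            }
        }, name);
        worker.setDaemon(daemon); // must be called before start()
        return worker;
    }

    public static void main(final String[] args) {
        final Thread worker = newWorker("sketch-writer", true);
        worker.start();
        System.out.println(worker.getName() + " daemon=" + worker.isDaemon());
        // main returns while the worker is still parked. With daemon=true the
        // JVM exits anyway; with daemon=false the process would never exit.
    }
}
```

Passing `false` to `newWorker` in `main` reproduces the kind of hang described in the summary: the process stays alive after `main` returns.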

exist-parent/pom.xml: Added forkedProcessExitTimeoutInSeconds=60 to the surefire configuration as a safety net — surefire will kill forked test JVMs that fail to exit within 60 seconds.

BrokerPoolShutdownTest.java (new): Regression test that starts a BrokerPool, shuts it down, and asserts that no non-daemon eXist-related threads remain alive. Catches future regressions where new non-daemon threads are introduced.
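
The core check in such a regression test could look roughly like the sketch below. This is not the code of BrokerPoolShutdownTest: the class `LingeringThreadCheck`, the method `liveNonDaemonThreads`, and the `sketch-db` name prefix are assumptions for illustration, and the real test may identify eXist threads by a different rule.

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.stream.Collectors;

public class LingeringThreadCheck {

    /**
     * Returns the sorted names of all live, non-daemon threads whose name
     * starts with the given prefix. The prefix filter is a hypothetical
     * stand-in for "eXist-related".
     */
    public static List<String> liveNonDaemonThreads(final String namePrefix) {
        return Thread.getAllStackTraces().keySet().stream()
                .filter(Thread::isAlive)
                .filter(t -> !t.isDaemon())
                .filter(t -> t.getName().startsWith(namePrefix))
                .map(Thread::getName)
                .sorted()
                .collect(Collectors.toList());
    }

    public static void main(final String[] args) throws InterruptedException {
        final CountDownLatch release = new CountDownLatch(1);
        final Thread worker = new Thread(() -> {
            try {
                release.await(); // stays alive until released
            } catch (final InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "sketch-db.worker");
        worker.start();

        System.out.println("before shutdown: " + liveNonDaemonThreads("sketch-db"));
        release.countDown(); // stand-in for a clean shutdown
        worker.join();
        System.out.println("after shutdown:  " + liveNonDaemonThreads("sketch-db"));
    }
}
```

In a test, the assertion would be that the returned list is empty once shutdown has completed.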

Evidence

A controlled hang experiment (5 serial trials × 4 configs) identified BlobStore threads as the remaining hang cause after #6167. jstack captures showed BlobStoreImpl$PersistentWriter and BlobStoreImpl$BlobVacuum parked on blocking queue operations after all tests completed.

Test Plan

  • BrokerPoolShutdownTest passes
  • Full exist-core test suite: 6,543 tests, 0 failures, 0 errors

🤖 Generated with Claude Code

BlobStoreImpl's PersistentWriter and BlobVacuum threads were created as
non-daemon threads, which prevents JVM exit if BrokerPool shutdown fails
to join them cleanly. This caused ~20% of CI runs to hang indefinitely
after tests completed.

Mark both threads as daemon threads so they cannot block JVM shutdown.
The threads still participate in normal shutdown via poison pill
(PersistentWriter) and interrupt (BlobVacuum), but now the JVM can exit
even if those shutdown paths fail.

Also add a forkedProcessExitTimeoutInSeconds safety net to the surefire
configuration, and a regression test that verifies no non-daemon eXist
threads survive BrokerPool shutdown.

Follow-up to PR eXist-db#6167 which fixed the same issue for StatusReporter.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@joewiz joewiz requested a review from a team as a code owner March 25, 2026 19:14
adamretter (Contributor) commented Mar 25, 2026

@joewiz As the author of BlobStore, I just want to point out that these two threads have to be non-daemon threads. The problem needs to be fixed elsewhere, in a different fashion; otherwise, with this PR, you risk introducing database corruption.

joewiz added a commit to joewiz/exist that referenced this pull request Mar 26, 2026
BlobStoreImpl.normalClose() and abnormalPersistentWriterShutdown()
called Thread.join() with no timeout on the PersistentWriter and
BlobVacuum threads. If either thread failed to terminate promptly,
the shutdown would hang indefinitely — causing CI test suite timeouts.

Add 30-second bounded timeouts to all join() calls. If the
PersistentWriter doesn't terminate within 30s, it is interrupted
and given 5 more seconds. The threads remain non-daemon to ensure
pending blob writes complete during normal shutdown.

This supersedes eXist-db#6179, which incorrectly made BlobStore threads
daemon threads. As @adamretter (author of BlobStore) pointed out,
these threads must remain non-daemon to ensure pending blob writes
complete. The fix is bounded join() timeouts during shutdown —
giving threads 30s to drain before proceeding.

Verified: 3/3 full test suite trials pass with clean exit (3:26-4:23),
27/27 BlobStore-specific tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
joewiz (Member, Author) commented Mar 26, 2026

[This response was co-authored with Claude Code. -Joe]

Closing in favor of a revised approach. Making BlobStore threads daemon was incorrect — as @adamretter pointed out, these threads must remain non-daemon to ensure pending blob writes complete. The replacement PR uses bounded join() timeouts instead.
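
A minimal sketch of the bounded-join pattern, assuming a helper of my own naming (`joinWithTimeout` in `BoundedJoinSketch`); it is not the code from the replacement PR. Per the commit message, the real values would be 30 seconds for the first wait and 5 seconds after the interrupt; the demo uses much shorter ones.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class BoundedJoinSketch {

    /**
     * Waits up to joinMillis for the thread to finish. If it is still alive,
     * interrupts it and waits up to graceMillis more.
     *
     * @return true if the thread has terminated, false if it is still running
     */
    public static boolean joinWithTimeout(final Thread thread, final long joinMillis, final long graceMillis)
            throws InterruptedException {
        if (joinMillis <= 0 || graceMillis <= 0) {
            // Thread.join(0) means "wait forever", which is what we want to avoid
            throw new IllegalArgumentException("timeouts must be positive");
        }
        thread.join(joinMillis);
        if (thread.isAlive()) {
            thread.interrupt();
            thread.join(graceMillis);
        }
        return !thread.isAlive();
    }

    public static void main(final String[] args) throws InterruptedException {
        final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        // Non-daemon worker parked on an empty queue; it exits when interrupted.
        final Thread writer = new Thread(() -> {
            try {
                queue.take();
            } catch (final InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "sketch-writer");
        writer.start();

        final boolean terminated = joinWithTimeout(writer, 200, 1_000);
        System.out.println("terminated=" + terminated);
    }
}
```

The thread stays non-daemon, so a healthy writer still gets time to drain its queue, while shutdown can no longer block forever on a stuck one. A thread that ignores the interrupt will still be alive when the method returns `false`, and being non-daemon it can still keep the JVM from exiting.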

@joewiz joewiz closed this Mar 26, 2026
joewiz added a commit to joewiz/exist that referenced this pull request Mar 27, 2026