Make cache-backed resource materialization concurrency-safe#216
Make cache-backed resource materialization concurrency-safe#216Danielalnajjar wants to merge 5 commits intodavis7dotsh:mainfrom
Conversation
[Feature: none] Move BTCA resource loading from path-based collection imports to resource-owned VFS materialization, then put git and npm cache work behind a shared lock and layout model. Key files: - apps/server/src/resources/lock.ts: adds heartbeat-backed filesystem locks, stale recovery, and clear-aware lock coordination. - apps/server/src/resources/service.ts: owns cache identity, clear orchestration, and resource loading dependencies. - apps/server/src/resources/impls/git.ts: materializes named mirrors and anonymous clones directly into the VFS under git locks. - apps/server/src/resources/impls/npm.ts: hydrates npm caches under per-cache locks and materializes metadata-driven VFS imports. - apps/server/src/collections/service.ts: stops reading raw cache paths and consumes resource materialization results instead. Design decisions: - Keep named git mirrors isolated per resource key under .git-mirrors instead of sharing one working tree across names. - Let each resource implementation own VFS import plus citation metadata so collections stay generic. - Centralize cache layout and clear-lock coordination inside resources rather than config. - Preserve CollectionError context with explicit load/materialize codes while relying on built-in Error.cause semantics. Validation: bun run check:server; bun run test:server. Unlocks deterministic cross-process cache clearing, subprocess race coverage, and user-facing /clear docs that describe the real cache model. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
[Feature: none] Add direct unit and subprocess coverage for the cache-backed materialization model so cross-process locking, clear coordination, and collection error propagation stay regression-proof. Key files: - apps/server/src/resources/lock.test.ts: exercises stale recovery, clear-aware retries, and lock-name parsing directly. - apps/server/src/resources/impls/git.process.test.ts: proves named and anonymous git materialization coordinates safely with /clear across processes. - apps/server/src/resources/impls/npm.process.test.ts: proves npm cache queueing, cleanup, and stale clear-lock recovery across processes. - apps/server/src/resources/impls/test-fixtures/npm-materialize-worker.ts: provides deterministic worker IPC so npm races can be reproduced without flaky timing. - apps/server/src/collections/service.test.ts: verifies metadata propagation and preserved resource hints through the new materialization seam. Design decisions: - Use subprocess workers for the real race boundaries instead of mocking lock ownership inside one process. - Add direct lock helper tests alongside the process suites so stale-lock recovery failures are localized quickly. - Keep worker fixtures small and purpose-built so git and npm tests can coordinate via IPC without duplicating harness code. - Assert observable error and citation behavior rather than internal call sequencing. Validation: bun run test:server; bun test apps/server/src/resources/lock.test.ts; bun test apps/server/src/resources/impls/git.process.test.ts; bun test apps/server/src/resources/impls/npm.process.test.ts. Unlocks maintainer-friendly review of the concurrency model because the failure modes now have focused, reproducible coverage. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
[Feature: none] Update the BTCA CLI and API docs so /clear is documented as a lock-aware cache operation over git mirrors, npm caches, anonymous temporary caches, and legacy leftovers rather than only cloned repos. Key files: - apps/docs/btca.spec.md: updates the canonical CLI and local API behavior descriptions for btca clear and POST /clear. - apps/docs/guides/cli-reference.mdx: explains the new anonymous git temp-dir behavior and the lock-aware clear flow. - apps/docs/api-reference/local/clear.mdx: refreshes the endpoint page title and semantics for local /clear. - apps/docs/api-reference/openapi.local.json: aligns the OpenAPI summary, description, and cleared field docs with the server behavior. Design decisions: - Describe /clear in terms of cache-backed resource data, not implementation details tied only to git clones. - Call out lock awareness explicitly so users understand why clear waits instead of racing active work. - Keep the docs focused on observable behavior while naming the cache categories that users may need to reason about. Validation: bun --cwd apps/docs run check; bunx mint validate. Unlocks a PR that stays self-explanatory for reviewers who start from the docs or OpenAPI surface. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
[Feature: none] Make the docs validation path hermetic in a Bun-only environment by invoking Mint through bunx instead of assuming a separate global Mint install is already present. Key files: - apps/docs/package.json: switches dev and check scripts to bunx mint invocations. - bun.lock: records the workspace metadata change for the docs tooling update. Design decisions: - Keep the repo aligned with the Bun-only tooling contract instead of relying on external global binaries. - Separate the tooling tweak from behavior/docs commits so reviewers can scan it as pure maintenance. Validation: bun --cwd apps/docs run check; bunx mint validate. Unlocks reproducible docs checks for contributors and CI in clean Bun environments. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
[Feature: none] Follow the post-review fixes through the cache-backed resource layer by making npm cache publication atomic, restoring anonymous git cleanup to BTCA-managed storage, narrowing stale pre-heartbeat recovery to pure age-based reclaim, and tightening the clear-service contract. Key files: - apps/server/src/resources/lock.ts: hardens bootstrap cleanup, uses the async heartbeat loop, and removes PID-based liveness from stale pre-heartbeat recovery. - apps/server/src/resources/service.ts: restores anonymous git clear ownership and returns ResourceError from clearCaches. - apps/server/src/resources/impls/npm.ts: publishes cache metadata atomically and prefers resolved installed versions for npm citation metadata. - apps/server/src/resources/lock.test.ts: covers stale pre-heartbeat reclaim, stale heartbeat reclaim, and bootstrap/heartbeat failure behavior directly. - apps/server/src/resources/impls/npm.test.ts: verifies resolved-version metadata and that failed payload writes do not publish reusable cache metadata. - apps/docs/btca.spec.md: updates btca clear and local clear docs to include BTCA-managed anonymous git caches again. Design decisions: - Keep anonymous git temp directories inside BTCA-managed .tmp so btca clear remains the recovery path for orphaned clones. - Make stale owner-without-heartbeat locks age out without PID checks to avoid false-live wedges from PID reuse. - Publish npm cache metadata only after payload files succeed so partial hydrations never look reusable. - Keep the heartbeat loop hardening and clear error normalization in the same follow-up because they change the same shared resource-management surface. Validation: bun test apps/server/src/resources/lock.test.ts; bun test apps/server/src/resources/service.test.ts apps/server/src/resources/impls/git.process.test.ts apps/server/src/resources/impls/npm.process.test.ts; bun test apps/server/src/resources/impls/npm.test.ts apps/server/src/collections/service.test.ts; bun run check:server; bun run format && bun run check (apps/docs); git diff --check. Unlocks a reviewable follow-up commit that brings the post-PR fixes, docs, and lock semantics back into sync without rewriting the main cache-backed materialization commit. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
|
@Danielalnajjar is attempting to deploy a commit to the davis7dotsh Team on Vercel. A member of the Team first needs to authorize it. |
|
Gemini Deep Think Review Report Here is a prompt you can copy/paste to recreate the manual test mentioned in the PR notes locally: Note: Codex and Claude left review comments on the original draft PR on my fork, they reported no findings: Danielalnajjar#1 |
|
You have reached your Codex usage limits. You can see your limits in the Codex usage dashboard. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 73d1bc77ac
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| : {}) | ||
| }; | ||
|
|
||
| await Bun.write(getClaimFilePath(args.lockPath), JSON.stringify(claim, null, 2)); |
There was a problem hiding this comment.
Prevent stale-break from recreating a released lock directory
In the stale-break path, if the original owner releases lockPath right after the claim directory is created, Bun.write(getClaimFilePath(...)) recreates the deleted parent directories, and releaseClaimIfOwned then removes only .stale-break-claim (not the recreated lockPath). That leaves an empty lock directory that observeLock treats as live by mtime, so contenders block for a full staleMs window (30s by default) before proceeding. This race is triggered during normal lock handoff and causes unexpected long stalls in resource materialization/clear flows.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Confirmed the race exists. Bun.write automatically recreates deleted parent directories, so if the original lock owner releases at the exact moment a stale-breaker is writing its claim file, an empty ghost lock directory gets left behind. The ghost has no owner.json or heartbeat, so the next process treats it as a newly created lock and waits up to 30 seconds for it to age out. After 30 seconds, the stale-break logic deletes the ghost and everything proceeds normally.
The fix is one line in tryBreakStaleLock — after removing the claim, also remove the ghost lock directory if it's empty:
// lock.ts, inside tryBreakStaleLock, after the inode mismatch branch (line ~376):
if (!sameInstance(currentInstance, observation.instance)) {
await releaseClaimIfOwned(args, claimToken);
await fs.rmdir(args.lockPath).catch(() => undefined); // ← add this
return true;
}fs.rmdir (non-recursive) only succeeds on empty directories, so if a new legitimate owner already wrote owner.json inside, it fails harmlessly. If it's the empty ghost from Bun.write, it gets cleaned up.
Not fixing in this batch because the race requires the owner to release in the milliseconds between claim mkdir and claim file write, and the 30-second stall is self-healing with no user intervention needed.
There was a problem hiding this comment.
Just to follow up on this, there are some edge cases like this, some file system-related edge cases that can arise, but they can all be resolved by running btca clear. It still makes it much less likely that btca clear will have to be used in normal oepration than it would have previously.
I figured it's not necessary to bake in redundancy for things like people's hard drive being full or their hard drives failing and things like that.
I did consider trying to write this in effect beta v4, but I figured since you didn't have it written in effect, I won't just migrate a portion of your codebase to effect. If you'd like me to redo the PR in effect, I'd be more than happy to.
I just want to say thank you so much for creating this as well. It's really helped me out a lot!
| } catch (error) { | ||
| lastBranchError = error; | ||
| await cleanupDirectory(tempDir.path); | ||
| throw error; | ||
| } |
There was a problem hiding this comment.
Temp dir leaks on clone failure
When gitClone or ensureResolvedRepoSubPathsExist throws, cleanupDirectory(tempDir.path) removes the directory contents, but the temp dir's parent entry (created by mkdtemp) is not removed via removeDisposableTempDir(tempDir). The throw error here then re-throws, and since the temp dir was created before the loop at line 744, it never gets a matching removeDisposableTempDir call in the error path.
Consider cleaning up the full temp dir before re-throwing:
| } catch (error) { | |
| lastBranchError = error; | |
| await cleanupDirectory(tempDir.path); | |
| throw error; | |
| } | |
| } catch (error) { | |
| lastBranchError = error; | |
| await removeDisposableTempDir(tempDir); | |
| throw error; | |
| } |
Prompt To Fix With AI
This is a comment left during a code review.
Path: apps/server/src/resources/impls/git.ts
Line: 778-782
Comment:
**Temp dir leaks on clone failure**
When `gitClone` or `ensureResolvedRepoSubPathsExist` throws, `cleanupDirectory(tempDir.path)` removes the directory contents, but the temp dir's parent entry (created by `mkdtemp`) is not removed via `removeDisposableTempDir(tempDir)`. The `throw error` here then re-throws, and since the temp dir was created before the loop at line 744, it never gets a matching `removeDisposableTempDir` call in the error path.
Consider cleaning up the full temp dir before re-throwing:
```suggestion
} catch (error) {
lastBranchError = error;
await removeDisposableTempDir(tempDir);
throw error;
}
```
How can I resolve this? If you propose a fix, please make it concise.There was a problem hiding this comment.
I investigated this one locally with both Claude and Codex and they both said it is not a real leak on the current branch.
" On the failure path here, materializeAnonymousGitResource(...) catches the clone/search-path error, calls await cleanupDirectory(tempDir.path), and then rethrows. cleanupDirectory(...) is rm(pathToRemove, { recursive: true, force: true }), so it removes the temp dir itself, not just its contents.
I also reproduced the path with a mocked anonymous clone failure and then inspected getTmpCacheRoot(...); there were no leftover btca-anon-git-* entries after the error (entries: []).
So I am going to leave this as-is unless we find a concrete repro where the temp dir entry survives the current cleanupDirectory(tempDir.path) call. "
Claude said "Greptile assumed cleanupDirectory meant "clean the contents" based on the name, but the implementation is a full recursive delete."
Summary
This makes BTCA's cache-backed resource materialization safe across concurrent CLI and server processes by moving resources onto a shared
materializeIntoVirtualFs(...)seam and coordinating cache mutation through filesystem locks. BTCA needs this because separate processes currently share one on-disk cache root, so git updates, npm hydration, andbtca clearcan collide and leave caches in a broken or partially cleared state. A little over half of the added lines are test coverage (3,696of5,765)Motivation
I first ran into this while working across multiple worktrees of the same project. Those worktrees could end up invoking BTCA against the same named resource at the same time. In the old implementation, separate BTCA processes could share one on-disk cache and run in-place git update operations on that same checkout without cross-process coordination, which could leave the cached resource in a bad or unusable state from BTCA's point of view. In practice, the recovery path was often to clear the cached resource and let BTCA fetch it again. No tracking issue, this came from real concurrent-worktree usage, not a filed bug.
Review request
Please focus review on:
apps/server/src/resources/lock.tsclearflow preserves the right cache boundaries and lock identitiesLower-priority feedback for this pass:
This PR is larger than ideal in file count, but the seam change, lock layer, and subprocess race coverage are tightly coupled. A little over half of the added lines are test coverage (
3,696of5,765), mostly around subprocess race cases and lock behavior. I split it into 5 commits so it can be reviewed in order instead of as one 37-file diff.Suggested review order:
cecfa6afeat(resources): make materialization cache-backed and lock-awareapps/server/src/resources/lock.ts,apps/server/src/resources/layout.ts,apps/server/src/resources/service.tsapps/server/src/resources/impls/git.ts,apps/server/src/resources/impls/npm.ts,apps/server/src/resources/impls/local.tsapps/server/src/collections/service.ts,apps/server/src/collections/types.ts,apps/server/src/errors.ts0e37cb0test(resources): cover lock-aware materialization races6c5857fdocs(btca): describe lock-aware cache clearingf7ff1c4chore(docs): run Mint through bunx73d1bc7fix(resources): tighten cache recovery follow-upsIf more implementation detail is useful, I attached the living ExecPlan used during the work in a PR comment. It includes the decision log, accepted review feedback, and validation checkpoints.
Approach
materializeIntoVirtualFs(...)seam so collections stop reading raw cache paths directly.git-mirrors/<key>/repo, switched anonymous git to unique temp dirs, and usedgit ls-remote --exit-code --headsto decide anonymous fallback branches/clearlock-aware for git mirrors, npm caches, anonymous tmp caches, and legacy leftovers/clearin terms of cache-backed resource data instead of only cloned reposbunx mint ...so docs checks work in a clean Bun-only environmentWhy this shape:
mkdiris atomic on POSIX and works cross-process without additional runtime support.getAbsoluteDirectoryPath) returned a raw cache path and let collections read from it later. That left a gap where another process could mutate the checkout between path resolution and VFS import.materializeIntoVirtualFscloses that gap — the lock covers reconcile through import as one unit..git-mirrors/to stop sharing the flat<resourcesDir>/<key>/namespace with npm. Anonymous git switched from one deterministic shared path per key tomkdtemp, so two concurrent requests for the same URL can't wipe each other mid-clone.Scope boundaries:
Upgrade and compatibility
/clearresponse schema are unchanged.BtcaFsResource.getAbsoluteDirectoryPathis replaced bymaterializeIntoVirtualFs,ConfigService.clearResources()moved toResourcesService.clearCaches(), andServerLayerDependenciesgains a requiredresourcesfield.<resourcesDir>/<key>/to.git-mirrors/<key>/. The server re-fetches into the new layout on first use, no manual step needed. Old directories sit unused untilbtca clear, which detects and removes legacy layouts lock-safely. Anonymous git now uses unique temp dirs under.tmp/instead of one shared path per key. All new internal directories (.git-mirrors/,.resource-locks/,.clear-trash/) live inside the existing resources directory.Verification
Ran:
bun test apps/server/src/errors.test.tsbun test apps/server/src/resources/lock.test.tsbun test apps/server/src/resources/impls/git.test.tsbun test apps/server/src/resources/impls/npm.test.tsbun test apps/server/src/resources/service.test.tsbun test apps/server/src/collections/service.test.tsbun run format:serverbun run check:serverbun run test:serverbun --cwd apps/docs run formatbun --cwd apps/docs run checkbunx mint validatebun run test:serverfinished with104 pass / 4 skip / 0 fail.Isolated CLI smoke also passed using the repo-local CLI only with a temp
HOME, tempXDG_DATA_HOME, temp project config, and temp.btcadata dir so the installed BTCA setup was not touched:statuspassedresourcespassedaskagainst the named git resource passedaskprocesses against the same named git resource passedaskpassedbtca clearpassed (Cleared 1 resource(s).)askpassed*see comments for a copy/paste prompt to recreate this test locally
Provenance
AI assistance was substantial in implementation and review iteration. The branch was reviewed by:
The full branch, the implementation plan, and the PR description were also uploaded to ChatGPT 5.4 Pro and Gemini DeepThink for deep review.
I am not claiming a personal line-by-line final diff review here. The confidence for this draft comes from the executed test/docs checks above plus the external AI review passes over the full branch and plan. A human is still opening and maintaining the PR and will handle review follow-up.
Supporting materials
Attached as PR comments:
Greptile Summary
This PR makes BTCA's cache-backed resource materialization concurrency-safe by introducing a
materializeIntoVirtualFs(...)seam on resources and coordinating cache mutation through filesystem locks. The key architectural change replaces the oldgetAbsoluteDirectoryPath()with a lock-protected materialization call that covers reconcile-through-VFS-import as one atomic unit, closing the race window where another process could mutate a checkout between path resolution and import.lock.ts): New heartbeat-backed filesystem locking using atomicmkdir, with stale lock recovery via claim directories and clear-aware retry logiclayout.ts): Named git caches moved under.git-mirrors/<key>/repo, anonymous git uses unique temp dirs viamkdtemp, new internal directories (.resource-locks/,.clear-trash/,.tmp/)ls-remoteto probe branches before cloningConfigServicetoResourcesServicewith lock-aware sweep logic that drains active resource work before removing caches, handles legacy layoutsCollectionsServiceno longer needsConfigService— delegates all materialization and metadata to resources directlymintcommands tobunx mintfor Bun-only environmentsNo plan
.mdfiles were found in the repository. No CLI flags, endpoint shapes, or response schemas changed. Over half the added lines (3,696 of 5,765) are test coverage.Confidence Score: 4/5
apps/server/src/resources/impls/git.ts(anonymous git temp dir cleanup on error) andapps/server/src/resources/lock.ts(heartbeat loop error handling readability).Important Files Changed
.git-mirrors/, anonymous git usesmkdtemp, branch probing vials-remote. Temp dir leak on clone failure in anonymous path.materializeIntoVirtualFson each resource. ConfigService dependency removed. No issues found.getAbsoluteDirectoryPathwithmaterializeIntoVirtualFsseam. Type-safe per-resource metadata via mapped types./clearroutes through ResourcesService.Prompt To Fix All With AI
Last reviewed commit: 73d1bc7
Context used: