
Make cache-backed resource materialization concurrency-safe#216

Open
Danielalnajjar wants to merge 5 commits into davis7dotsh:main from
Danielalnajjar:daniel/cache-backed-materialization

Conversation

@Danielalnajjar
Contributor

@Danielalnajjar Danielalnajjar commented Mar 15, 2026

Summary

This makes BTCA's cache-backed resource materialization safe across concurrent CLI and server processes by moving resources onto a shared materializeIntoVirtualFs(...) seam and coordinating cache mutation through filesystem locks. BTCA needs this because separate processes currently share one on-disk cache root, so git updates, npm hydration, and btca clear can collide and leave caches in a broken or partially cleared state. A little over half of the added lines are test coverage (3,696 of 5,765).

Motivation

I first ran into this while working across multiple worktrees of the same project. Those worktrees could end up invoking BTCA against the same named resource at the same time. In the old implementation, separate BTCA processes could share one on-disk cache and run in-place git update operations on that same checkout without cross-process coordination, which could leave the cached resource in a bad or unusable state from BTCA's point of view. In practice, the recovery path was often to clear the cached resource and let BTCA fetch it again. There is no tracking issue; this came from real concurrent-worktree usage, not a filed bug.

Review request

Please focus review on:

  • correctness of the lock lifecycle and stale-lock recovery in apps/server/src/resources/lock.ts
  • whether the git/npm materialization behavior still matches BTCA's current resource model
  • whether the new clear flow preserves the right cache boundaries and lock identities

Lower-priority feedback for this pass:

  • fixture naming/style
  • wording polish in docs unless something reads incorrectly

This PR is larger than ideal in file count, but the seam change, lock layer, and subprocess race coverage are tightly coupled. A little over half of the added lines are test coverage (3,696 of 5,765), mostly around subprocess race cases and lock behavior. I split it into 5 commits so it can be reviewed in order instead of as one 37-file diff.

Suggested review order:

  1. cecfa6a feat(resources): make materialization cache-backed and lock-aware
    • start with apps/server/src/resources/lock.ts, apps/server/src/resources/layout.ts, apps/server/src/resources/service.ts
    • then apps/server/src/resources/impls/git.ts, apps/server/src/resources/impls/npm.ts, apps/server/src/resources/impls/local.ts
    • finish with apps/server/src/collections/service.ts, apps/server/src/collections/types.ts, apps/server/src/errors.ts
  2. 0e37cb0 test(resources): cover lock-aware materialization races
  3. 6c5857f docs(btca): describe lock-aware cache clearing
  4. f7ff1c4 chore(docs): run Mint through bunx
  5. 73d1bc7 fix(resources): tighten cache recovery follow-ups

If more implementation detail is useful, I attached the living ExecPlan used during the work in a PR comment. It includes the decision log, accepted review feedback, and validation checkpoints.

Approach

  • introduced a resource-owned materializeIntoVirtualFs(...) seam so collections stop reading raw cache paths directly
  • added heartbeat-backed filesystem locking with stale recovery, clear-aware retrying, and namespace-safe lock parsing
  • moved named git caches under .git-mirrors/<key>/repo, switched anonymous git to unique temp dirs, and used git ls-remote --exit-code --heads to decide anonymous fallback branches
  • made npm hydration/import/cleanup coordinate on per-cache locks and return metadata directly from resource materialization
  • moved cache clearing out of config into the resources service and made /clear lock-aware for git mirrors, npm caches, anonymous tmp caches, and legacy leftovers
  • updated the docs/OpenAPI text to describe /clear in terms of cache-backed resource data instead of only cloned repos
  • switched docs validation scripts to bunx mint ... so docs checks work in a clean Bun-only environment
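
A minimal sketch of the branch probe mentioned in the list above, not the code in git.ts. It relies on one documented git behavior: with `--exit-code`, `git ls-remote` exits with status 2 when no matching refs are found. The helper names, the `probe` callback, and the idea of trying candidates in order are assumptions made for illustration.

```typescript
// Hypothetical helper names; only the git flags come from the PR description.

type BranchProbeResult = 'exists' | 'missing' | 'error';

/** Arguments for probing a single branch head on a remote. */
function buildLsRemoteArgs(url: string, branch: string): string[] {
	return ['ls-remote', '--exit-code', '--heads', url, branch];
}

/**
 * With --exit-code, git exits 0 when at least one ref matched and 2 when
 * none matched. Any other status (auth, network, bad URL) is a real failure.
 */
function classifyLsRemoteExit(exitCode: number): BranchProbeResult {
	if (exitCode === 0) return 'exists';
	if (exitCode === 2) return 'missing';
	return 'error';
}

/** Try candidate branches in order and return the first one that exists. */
async function pickFallbackBranch(
	candidates: readonly string[],
	probe: (branch: string) => Promise<number> // resolves to git's exit code
): Promise<string | undefined> {
	for (const branch of candidates) {
		const result = classifyLsRemoteExit(await probe(branch));
		if (result === 'exists') return branch;
		if (result === 'error') {
			throw new Error(`ls-remote failed for branch "${branch}"`);
		}
	}
	return undefined; // no candidate exists on the remote
}
```

Keeping the exit-code interpretation separate from the process spawn means "branch is missing" and "remote is unreachable" stay distinguishable, and the decision logic can be tested without a network.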

Why this shape:

  • BTCA is multi-process by default: each CLI invocation starts its own in-process server. In-process coordination can't reach across those boundaries, so the lock needs to work cross-process without a broker. Creating a directory with mkdir is atomic on POSIX and works cross-process without additional runtime support.
  • The old resource interface (getAbsoluteDirectoryPath) returned a raw cache path and let collections read from it later. That left a gap where another process could mutate the checkout between path resolution and VFS import. materializeIntoVirtualFs closes that gap — the lock covers reconcile through import as one unit.
  • Named git moved under .git-mirrors/ to stop sharing the flat <resourcesDir>/<key>/ namespace with npm. Anonymous git switched from one deterministic shared path per key to mkdtemp, so two concurrent requests for the same URL can't wipe each other mid-clone.
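
The mkdir-based locking idea above can be sketched as follows. This is an illustration under stated assumptions, not the code in lock.ts: the function names are hypothetical, and heartbeats, stale recovery, and clear-awareness are left out. The only property it relies on is that a non-recursive mkdir fails with EEXIST when the directory already exists.

```typescript
import { mkdir, rmdir } from 'node:fs/promises';

/**
 * Try to take a lock by creating a directory. A non-recursive mkdir fails
 * with EEXIST if the directory already exists, so exactly one contender wins.
 */
async function tryAcquireDirLock(lockPath: string): Promise<boolean> {
	try {
		await mkdir(lockPath); // no { recursive: true }: that would hide EEXIST
		return true;
	} catch (error) {
		if ((error as NodeJS.ErrnoException).code === 'EEXIST') return false;
		throw error; // e.g. missing parent directory or a permission problem
	}
}

/** Release by removing the (empty) lock directory. */
async function releaseDirLock(lockPath: string): Promise<void> {
	await rmdir(lockPath);
}

/** Run `work` while holding the lock, polling until the lock is free. */
async function withDirLock<T>(
	lockPath: string,
	work: () => Promise<T>,
	pollMs = 25
): Promise<T> {
	while (!(await tryAcquireDirLock(lockPath))) {
		await new Promise((resolve) => setTimeout(resolve, pollMs));
	}
	try {
		return await work();
	} finally {
		await releaseDirLock(lockPath);
	}
}
```

Wrapping reconcile and import in a single `withDirLock(...)` call is what makes them one unit from another process's point of view. A real lock also has to handle an owner that crashes before releasing, which is the job of the heartbeat and stale-recovery logic described in this PR.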

Scope boundaries:

  • no CLI flags or endpoint shapes changed
  • no new external dependencies were added
  • same-resource requests still queue through materialization; this keeps the v1 correctness-first tradeoff instead of trying to parallelize cache mutation
  • anonymous git fallback remains branch-only in this batch
  • no UI changes

Upgrade and compatibility

  • No user-facing breaking changes: CLI flags, HTTP endpoints, and the /clear response schema are unchanged.
  • Internal TypeScript API changes: BtcaFsResource.getAbsoluteDirectoryPath is replaced by materializeIntoVirtualFs, ConfigService.clearResources() moved to ResourcesService.clearCaches(), and ServerLayerDependencies gains a required resources field.
  • Cache layout on upgrade: named git caches move from <resourcesDir>/<key>/ to .git-mirrors/<key>/. The server re-fetches into the new layout on first use, so no manual step is needed. Old directories sit unused until btca clear, which detects and removes legacy layouts lock-safely. Anonymous git now uses unique temp dirs under .tmp/ instead of one shared path per key. All new internal directories (.git-mirrors/, .resource-locks/, .clear-trash/) live inside the existing resources directory.
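
For orientation, the layout described above could be expressed as pure path helpers along these lines. The directory names come from this description. The function names and signatures are assumptions and may not match layout.ts; getTmpCacheRoot is named later in this thread, but its real signature is not shown.

```typescript
import { join } from 'node:path';

// Hypothetical helpers; each one only joins path segments and touches no disk.

/** New home of a named git cache: <resourcesDir>/.git-mirrors/<key>/repo */
function getGitMirrorRepoPath(resourcesDir: string, key: string): string {
	return join(resourcesDir, '.git-mirrors', key, 'repo');
}

/** Old flat location that btca clear treats as a legacy leftover. */
function getLegacyGitCachePath(resourcesDir: string, key: string): string {
	return join(resourcesDir, key);
}

/** Root for lock directories. */
function getResourceLocksRoot(resourcesDir: string): string {
	return join(resourcesDir, '.resource-locks');
}

/** Root for data that has been moved aside while being cleared. */
function getClearTrashRoot(resourcesDir: string): string {
	return join(resourcesDir, '.clear-trash');
}

/** Root for anonymous git temp dirs created with mkdtemp. */
function getTmpCacheRoot(resourcesDir: string): string {
	return join(resourcesDir, '.tmp');
}
```

Because every internal directory starts with a dot and sits inside the existing resources directory, it cannot collide with a resource key that still uses the flat legacy layout unless that key itself starts with a dot.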

Verification

Ran:

  • bun test apps/server/src/errors.test.ts
  • bun test apps/server/src/resources/lock.test.ts
  • bun test apps/server/src/resources/impls/git.test.ts
  • bun test apps/server/src/resources/impls/npm.test.ts
  • bun test apps/server/src/resources/service.test.ts
  • bun test apps/server/src/collections/service.test.ts
  • bun run format:server
  • bun run check:server
  • bun run test:server
  • bun --cwd apps/docs run format
  • bun --cwd apps/docs run check
  • bunx mint validate

bun run test:server finished with 104 pass / 4 skip / 0 fail.

An isolated CLI smoke test also passed. It used only the repo-local CLI, with a temp HOME, temp XDG_DATA_HOME, temp project config, and temp .btca data dir, so the installed BTCA setup was not touched:

  • status passed
  • resources passed
  • warm-up ask against the named git resource passed
  • 4 parallel ask processes against the same named git resource passed
  • post-parallel ask passed
  • btca clear passed (Cleared 1 resource(s).)
  • post-clear ask passed
  • smoke logs showed no lock/git contention signatures

* See the comments for a copy/paste prompt to recreate this test locally.

Provenance

AI assistance was substantial in implementation and review iteration. The branch was reviewed by:

  • Codex CLI (local review)
  • Codex GitHub PR bot
  • Claude Code's code review skill
  • ChatGPT 5.4 Pro
  • Gemini Deep Think

The full branch, the implementation plan, and the PR description were also uploaded to ChatGPT 5.4 Pro and Gemini Deep Think for deep review.

I am not claiming a personal line-by-line final diff review here. The confidence for this draft comes from the executed test/docs checks above plus the external AI review passes over the full branch and plan. A human is still opening and maintaining the PR and will handle review follow-up.

Supporting materials

Attached as PR comments:

  1. ChatGPT 5.4 Pro review prompt — structured audit methodology requiring evidence-anchored findings, distrust of prior agent claims, phased extraction before evaluation, and a reverse-check step that verifies the premises behind the merge verdict
  2. ChatGPT 5.4 Pro review report — spec compliance matrix, architecture assessment, edge case audit, and merge verdict with quality grade
  3. Gemini Deep Think review report — spec compliance matrix, architecture assessment, edge case audit, and merge verdict with quality grade
  4. ExecPlan — the living implementation plan with decision log, accepted review feedback, and validation checkpoints

Greptile Summary

This PR makes BTCA's cache-backed resource materialization concurrency-safe by introducing a materializeIntoVirtualFs(...) seam on resources and coordinating cache mutation through filesystem locks. The key architectural change replaces the old getAbsoluteDirectoryPath() with a lock-protected materialization call that covers reconcile-through-VFS-import as one atomic unit, closing the race window where another process could mutate a checkout between path resolution and import.

  • Lock layer (lock.ts): New heartbeat-backed filesystem locking using atomic mkdir, with stale lock recovery via claim directories and clear-aware retry logic
  • Cache layout (layout.ts): Named git caches moved under .git-mirrors/<key>/repo, anonymous git uses unique temp dirs via mkdtemp, new internal directories (.resource-locks/, .clear-trash/, .tmp/)
  • Resource implementations: Git and npm resources now acquire clear-aware locks before reconciling caches and importing into VFS; anonymous git uses ls-remote to probe branches before cloning
  • Cache clearing: Moved from ConfigService to ResourcesService with lock-aware sweep logic that drains active resource work before removing caches, handles legacy layouts
  • Collections simplification: CollectionsService no longer needs ConfigService — delegates all materialization and metadata to resources directly
  • Docs and scripts: Updated spec, CLI reference, and OpenAPI docs to describe lock-aware clearing; switched mint commands to bunx mint for Bun-only environments

No plan .md files were found in the repository. No CLI flags, endpoint shapes, or response schemas changed. Over half the added lines (3,696 of 5,765) are test coverage.

Confidence Score: 4/5

  • This PR is well-structured with strong test coverage and no user-facing breaking changes; one temp dir cleanup issue in anonymous git error path should be addressed.
  • Score reflects thorough test coverage (3,696 lines), clean architectural separation, no external dependency additions, and passing type checks/tests. Deducted one point for the temp dir leak in the anonymous git clone error path which could leave orphaned directories on disk during transient failures.
  • Pay close attention to apps/server/src/resources/impls/git.ts (anonymous git temp dir cleanup on error) and apps/server/src/resources/lock.ts (heartbeat loop error handling readability).

Important Files Changed

| Filename | Overview |
| --- | --- |
| apps/server/src/resources/lock.ts | New filesystem lock layer with heartbeat, stale recovery, and clear-aware locking. Well-structured; minor readability note on heartbeat error handling. |
| apps/server/src/resources/layout.ts | Pure path helpers for the new cache layout directories. No logic concerns. |
| apps/server/src/resources/service.ts | Moved cache clearing from ConfigService to ResourcesService with lock-aware sweep logic. Thorough handling of legacy caches and different namespace types. |
| apps/server/src/resources/impls/git.ts | Major refactor: named mirrors under .git-mirrors/, anonymous git uses mkdtemp, branch probing via ls-remote. Temp dir leak on clone failure in anonymous path. |
| apps/server/src/resources/impls/npm.ts | Npm resources now coordinate through clear-aware locks, atomic cache meta writes, and dependency injection for spawn. VFS import not injectable unlike git. |
| apps/server/src/resources/impls/local.ts | Extracted local resource materialization from collections service. Clean and straightforward VFS import with git-aware path listing. |
| apps/server/src/collections/service.ts | Significantly simplified by delegating to materializeIntoVirtualFs on each resource. ConfigService dependency removed. No issues found. |
| apps/server/src/resources/types.ts | Replaced getAbsoluteDirectoryPath with materializeIntoVirtualFs seam. Type-safe per-resource metadata via mapped types. |
| apps/server/src/effect/layers.ts | Added ResourcesService to the Effect layer. Required field in ServerLayerDependencies. |
| apps/server/src/index.ts | Wiring changes: resources passed to runtime, collections no longer needs config, /clear routes through ResourcesService. |
Prompt To Fix All With AI
This is a comment left during a code review.
Path: apps/server/src/resources/impls/git.ts
Line: 778-782

Comment:
**Temp dir leaks on clone failure**

When `gitClone` or `ensureResolvedRepoSubPathsExist` throws, `cleanupDirectory(tempDir.path)` removes the directory contents, but the temp dir's parent entry (created by `mkdtemp`) is not removed via `removeDisposableTempDir(tempDir)`. The `throw error` here then re-throws, and since the temp dir was created before the loop at line 744, it never gets a matching `removeDisposableTempDir` call in the error path.

Consider cleaning up the full temp dir before re-throwing:

```suggestion
		} catch (error) {
			lastBranchError = error;
			await removeDisposableTempDir(tempDir);
			throw error;
		}
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: apps/server/src/resources/lock.ts
Line: 446-454

Comment:
**Heartbeat swallows `AbortError`, exits on others**

The condition `if (name !== 'AbortError') { return; }` silently returns for all non-`AbortError` exceptions, and falls through (doing nothing extra) for `AbortError` itself. This means `AbortError` — the expected case when `stop()` is called — also silently falls through to the end of the IIFE with no issue. Since the heartbeat is best-effort, swallowing unexpected errors is defensible, but the inverted-looking condition could confuse future readers. A brief comment clarifying that both paths intentionally end the loop silently would help.

How can I resolve this? If you propose a fix, please make it concise.
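
One way to make the intent of that heartbeat error handling explicit is sketched below. This is a hypothetical best-effort loop, not the code at lock.ts lines 446-454: `startHeartbeat`, the mtime-refresh strategy, and the interval are assumptions. It uses Node's promise-based `setTimeout`, which rejects with an error named `AbortError` when its signal is aborted.

```typescript
import { utimes, writeFile } from 'node:fs/promises';
import { setTimeout as sleep } from 'node:timers/promises';

/** Start a best-effort heartbeat that refreshes a file's mtime on an interval. */
function startHeartbeat(heartbeatPath: string, intervalMs: number) {
	const controller = new AbortController();

	const done = (async () => {
		try {
			await writeFile(heartbeatPath, '');
			while (true) {
				// Rejects with an AbortError once stop() aborts the signal.
				await sleep(intervalMs, undefined, { signal: controller.signal });
				const now = new Date();
				await utimes(heartbeatPath, now, now); // refresh mtime
			}
		} catch (error) {
			const name = error instanceof Error ? error.name : '';
			if (name === 'AbortError') {
				// Expected: stop() was called. End the loop quietly.
				return;
			}
			// Unexpected I/O failure. The heartbeat is best-effort, so the loop
			// also ends quietly here and the lock ages out via stale recovery.
			return;
		}
	})();

	return {
		/** Stop the loop and wait for it to finish. Never rejects. */
		async stop(): Promise<void> {
			controller.abort();
			await done;
		}
	};
}
```

Both branches return on purpose. Spelling them out with one comment each addresses the readability concern without changing behavior.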

Last reviewed commit: 73d1bc7

Greptile also left 1 inline comment on this PR.

Context used:

  • Rule used - ALWAYS follow this rule. (source)

Danielalnajjar and others added 5 commits March 15, 2026 10:44
[Feature: none]

Move BTCA resource loading from path-based collection imports to
resource-owned VFS materialization, then put git and npm cache work
behind a shared lock and layout model.

Key files:
- apps/server/src/resources/lock.ts: adds heartbeat-backed filesystem locks, stale recovery, and clear-aware lock coordination.
- apps/server/src/resources/service.ts: owns cache identity, clear orchestration, and resource loading dependencies.
- apps/server/src/resources/impls/git.ts: materializes named mirrors and anonymous clones directly into the VFS under git locks.
- apps/server/src/resources/impls/npm.ts: hydrates npm caches under per-cache locks and materializes metadata-driven VFS imports.
- apps/server/src/collections/service.ts: stops reading raw cache paths and consumes resource materialization results instead.

Design decisions:
- Keep named git mirrors isolated per resource key under .git-mirrors instead of sharing one working tree across names.
- Let each resource implementation own VFS import plus citation metadata so collections stay generic.
- Centralize cache layout and clear-lock coordination inside resources rather than config.
- Preserve CollectionError context with explicit load/materialize codes while relying on built-in Error.cause semantics.

Validation: bun run check:server; bun run test:server.

Unlocks deterministic cross-process cache clearing, subprocess race coverage, and user-facing /clear docs that describe the real cache model.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
[Feature: none]

Add direct unit and subprocess coverage for the cache-backed
materialization model so cross-process locking, clear coordination,
and collection error propagation stay regression-proof.

Key files:
- apps/server/src/resources/lock.test.ts: exercises stale recovery, clear-aware retries, and lock-name parsing directly.
- apps/server/src/resources/impls/git.process.test.ts: proves named and anonymous git materialization coordinates safely with /clear across processes.
- apps/server/src/resources/impls/npm.process.test.ts: proves npm cache queueing, cleanup, and stale clear-lock recovery across processes.
- apps/server/src/resources/impls/test-fixtures/npm-materialize-worker.ts: provides deterministic worker IPC so npm races can be reproduced without flaky timing.
- apps/server/src/collections/service.test.ts: verifies metadata propagation and preserved resource hints through the new materialization seam.

Design decisions:
- Use subprocess workers for the real race boundaries instead of mocking lock ownership inside one process.
- Add direct lock helper tests alongside the process suites so stale-lock recovery failures are localized quickly.
- Keep worker fixtures small and purpose-built so git and npm tests can coordinate via IPC without duplicating harness code.
- Assert observable error and citation behavior rather than internal call sequencing.

Validation: bun run test:server; bun test apps/server/src/resources/lock.test.ts; bun test apps/server/src/resources/impls/git.process.test.ts; bun test apps/server/src/resources/impls/npm.process.test.ts.

Unlocks maintainer-friendly review of the concurrency model because the failure modes now have focused, reproducible coverage.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
[Feature: none]

Update the BTCA CLI and API docs so /clear is documented as a
lock-aware cache operation over git mirrors, npm caches, anonymous
temporary caches, and legacy leftovers rather than only cloned repos.

Key files:
- apps/docs/btca.spec.md: updates the canonical CLI and local API behavior descriptions for btca clear and POST /clear.
- apps/docs/guides/cli-reference.mdx: explains the new anonymous git temp-dir behavior and the lock-aware clear flow.
- apps/docs/api-reference/local/clear.mdx: refreshes the endpoint page title and semantics for local /clear.
- apps/docs/api-reference/openapi.local.json: aligns the OpenAPI summary, description, and cleared field docs with the server behavior.

Design decisions:
- Describe /clear in terms of cache-backed resource data, not implementation details tied only to git clones.
- Call out lock awareness explicitly so users understand why clear waits instead of racing active work.
- Keep the docs focused on observable behavior while naming the cache categories that users may need to reason about.

Validation: bun --cwd apps/docs run check; bunx mint validate.

Unlocks a PR that stays self-explanatory for reviewers who start from the docs or OpenAPI surface.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
[Feature: none]

Make the docs validation path hermetic in a Bun-only environment by
invoking Mint through bunx instead of assuming a separate global Mint
install is already present.

Key files:
- apps/docs/package.json: switches dev and check scripts to bunx mint invocations.
- bun.lock: records the workspace metadata change for the docs tooling update.

Design decisions:
- Keep the repo aligned with the Bun-only tooling contract instead of relying on external global binaries.
- Separate the tooling tweak from behavior/docs commits so reviewers can scan it as pure maintenance.

Validation: bun --cwd apps/docs run check; bunx mint validate.

Unlocks reproducible docs checks for contributors and CI in clean Bun environments.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
[Feature: none]

Follow the post-review fixes through the cache-backed resource layer by
making npm cache publication atomic, restoring anonymous git cleanup to
BTCA-managed storage, narrowing stale pre-heartbeat recovery to pure
age-based reclaim, and tightening the clear-service contract.

Key files:
- apps/server/src/resources/lock.ts: hardens bootstrap cleanup, uses the async heartbeat loop, and removes PID-based liveness from stale pre-heartbeat recovery.
- apps/server/src/resources/service.ts: restores anonymous git clear ownership and returns ResourceError from clearCaches.
- apps/server/src/resources/impls/npm.ts: publishes cache metadata atomically and prefers resolved installed versions for npm citation metadata.
- apps/server/src/resources/lock.test.ts: covers stale pre-heartbeat reclaim, stale heartbeat reclaim, and bootstrap/heartbeat failure behavior directly.
- apps/server/src/resources/impls/npm.test.ts: verifies resolved-version metadata and that failed payload writes do not publish reusable cache metadata.
- apps/docs/btca.spec.md: updates btca clear and local clear docs to include BTCA-managed anonymous git caches again.

Design decisions:
- Keep anonymous git temp directories inside BTCA-managed .tmp so btca clear remains the recovery path for orphaned clones.
- Make stale owner-without-heartbeat locks age out without PID checks to avoid false-live wedges from PID reuse.
- Publish npm cache metadata only after payload files succeed so partial hydrations never look reusable.
- Keep the heartbeat loop hardening and clear error normalization in the same follow-up because they change the same shared resource-management surface.

Validation: bun test apps/server/src/resources/lock.test.ts; bun test apps/server/src/resources/service.test.ts apps/server/src/resources/impls/git.process.test.ts apps/server/src/resources/impls/npm.process.test.ts; bun test apps/server/src/resources/impls/npm.test.ts apps/server/src/collections/service.test.ts; bun run check:server; bun run format && bun run check (apps/docs); git diff --check.

Unlocks a reviewable follow-up commit that brings the post-PR fixes, docs, and lock semantics back into sync without rewriting the main cache-backed materialization commit.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@vercel

vercel bot commented Mar 15, 2026

@Danielalnajjar is attempting to deploy a commit to the davis7dotsh Team on Vercel.

A member of the Team first needs to authorize it.

@Danielalnajjar
Contributor Author

Danielalnajjar commented Mar 15, 2026

GPT Pro Review Prompt

GPT Pro Review Report

Gemini Deep Think Review Report

BTCA Cache Backed ExecPlan

Here is a prompt you can copy/paste to recreate the manual test mentioned in the PR notes locally:

Please validate PR #216 by reproducing the same isolated manual BTCA smoke test used in PR verification.

 Check out the PR locally with `gh pr checkout 216` and work from that branch.

 Do not use my installed `btca` binary. Use the repo-local CLI entrypoint only: `bun <repo>/apps/cli/src/index.ts ...`

 Do not touch my real BTCA state. Write and run a bash smoke harness that uses:
 - a temp `HOME`
 - a temp `XDG_DATA_HOME`
 - a temp project dir
 - a temp `btca.config.jsonc`
 - a temp `.btca` data dir
 - a copied temp auth file from `~/.local/share/opencode/auth.json`

 Do not improvise the scenario. Recreate these exact steps:
 1. Run:
    - `bun run check:server`
    - `bun test apps/server/src/resources/lock.test.ts apps/server/src/resources/service.test.ts apps/server/src/resources/impls/git.process.test.ts apps/server/src/resources/impls/npm.process.test.ts apps/server/src/resources/impls/git.test.ts apps/server/src/resources/impls/npm.test.ts apps/server/src/collections/service.test.ts`
 2. In the temp sandbox, create a config with one named git resource:
    - `btca-upstream`
    - `https://github.com/bmdavis419/better-context.git`
    - branch `main`
 3. Use a currently authenticated provider/model that works in my environment.
 4. Run, in order:
    - isolated `status`
    - isolated `resources`
    - one warm-up `ask`
    - four parallel isolated `ask` processes against the same named git resource
    - one post-parallel `ask`
    - isolated `btca clear`
    - one post-clear `ask`
 5. Use this exact question for every `ask`:
    - `What does btca clear do, and what cache directories does it affect?`
 6. Verify:
    - my real `~/.config/btca/btca.config.jsonc` mtime is unchanged
    - my real `~/.local/share/opencode/auth.json` mtime is unchanged
    - only the temp project `.btca` was used
    - smoke logs do not contain `index.lock`, `could not lock`, `Unable to create`, `Another git process`, `git fetch failed`, `git reset failed`, or `RESOURCE_MATERIALIZE_FAILED`

 If provider quota/auth fails, separate that from BTCA behavior and retry with another already-authenticated provider/model if available.

 At the end, report:
 - `Regression gate: PASS/FAIL`
 - `Warm-up ask: PASS/FAIL`
 - `Parallel asks: PASS/FAIL`
 - `Post-parallel ask: PASS/FAIL`
 - `Clear: PASS/FAIL`
 - `Post-clear ask: PASS/FAIL`
 - `Isolation preserved: PASS/FAIL`
 - `Collision scan: clean/not clean`
 - provider/model used
 - exact commands run
 - any deviations from the requested scenario

Note: Codex and Claude left review comments on the original draft PR on my fork and reported no findings: Danielalnajjar#1

@chatgpt-codex-connector

You have reached your Codex usage limits. You can see your limits in the Codex usage dashboard.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 73d1bc77ac


: {})
};

await Bun.write(getClaimFilePath(args.lockPath), JSON.stringify(claim, null, 2));


P1 Badge Prevent stale-break from recreating a released lock directory

In the stale-break path, if the original owner releases lockPath right after the claim directory is created, Bun.write(getClaimFilePath(...)) recreates the deleted parent directories, and releaseClaimIfOwned then removes only .stale-break-claim (not the recreated lockPath). That leaves an empty lock directory that observeLock treats as live by mtime, so contenders block for a full staleMs window (30s by default) before proceeding. This race is triggered during normal lock handoff and causes unexpected long stalls in resource materialization/clear flows.

Contributor Author

@Danielalnajjar Danielalnajjar Mar 15, 2026


Confirmed the race exists. Bun.write automatically recreates deleted parent directories, so if the original lock owner releases at the exact moment a stale-breaker is writing its claim file, an empty ghost lock directory gets left behind. The ghost has no owner.json or heartbeat, so the next process treats it as a newly created lock and waits up to 30 seconds for it to age out. After 30 seconds, the stale-break logic deletes the ghost and everything proceeds normally.

The fix is one line in tryBreakStaleLock — after removing the claim, also remove the ghost lock directory if it's empty:

// lock.ts, inside tryBreakStaleLock, after the inode mismatch branch (line ~376):
if (!sameInstance(currentInstance, observation.instance)) {
    await releaseClaimIfOwned(args, claimToken);
    await fs.rmdir(args.lockPath).catch(() => undefined);  // ← add this
    return true;
}

fs.rmdir (non-recursive) only succeeds on empty directories, so if a new legitimate owner already wrote owner.json inside, it fails harmlessly. If it's the empty ghost from Bun.write, it gets cleaned up.

Not fixing in this batch because the race requires the owner to release in the milliseconds between claim mkdir and claim file write, and the 30-second stall is self-healing with no user intervention needed.
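
The empty-directory guard described above can be checked in isolation. The sketch below is a standalone illustration, not part of lock.ts: `removeIfEmpty` is a hypothetical name. It relies only on the fact that a non-recursive rmdir refuses to delete a directory that still has entries.

```typescript
import { rmdir } from 'node:fs/promises';

/**
 * Remove `dir` only if it is empty and report whether it was removed.
 * A non-recursive rmdir fails (ENOTEMPTY on Linux) when entries exist,
 * so a lock directory that already holds owner.json is left untouched.
 */
async function removeIfEmpty(dir: string): Promise<boolean> {
	try {
		await rmdir(dir);
		return true;
	} catch {
		return false; // occupied, or already gone: nothing to clean up
	}
}
```

With this shape, an empty ghost directory is reclaimed immediately, while a directory that a new owner has already populated survives the call.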

Contributor Author


Just to follow up on this: there are some file-system-related edge cases like this one that can arise, but they can all be resolved by running btca clear. This PR still makes it much less likely that btca clear will have to be used in normal operation than it was previously.

I figured it's not necessary to bake in redundancy for things like a full or failing hard drive.

I did consider trying to write this in effect beta v4, but I figured since you didn't have it written in effect, I won't just migrate a portion of your codebase to effect. If you'd like me to redo the PR in effect, I'd be more than happy to.

I just want to say thank you so much for creating this as well. It's really helped me out a lot!

@Danielalnajjar Danielalnajjar marked this pull request as ready for review March 15, 2026 23:41
Comment on lines +778 to +782
} catch (error) {
lastBranchError = error;
await cleanupDirectory(tempDir.path);
throw error;
}
Contributor


Temp dir leaks on clone failure

When gitClone or ensureResolvedRepoSubPathsExist throws, cleanupDirectory(tempDir.path) removes the directory contents, but the temp dir's parent entry (created by mkdtemp) is not removed via removeDisposableTempDir(tempDir). The throw error here then re-throws, and since the temp dir was created before the loop at line 744, it never gets a matching removeDisposableTempDir call in the error path.

Consider cleaning up the full temp dir before re-throwing:

Suggested change
- } catch (error) {
- 	lastBranchError = error;
- 	await cleanupDirectory(tempDir.path);
- 	throw error;
- }
+ } catch (error) {
+ 	lastBranchError = error;
+ 	await removeDisposableTempDir(tempDir);
+ 	throw error;
+ }

Contributor Author

@Danielalnajjar Danielalnajjar Mar 16, 2026


I investigated this one locally with both Claude and Codex, and they both said it is not a real leak on the current branch:

"On the failure path here, materializeAnonymousGitResource(...) catches the clone/search-path error, calls await cleanupDirectory(tempDir.path), and then rethrows. cleanupDirectory(...) is rm(pathToRemove, { recursive: true, force: true }), so it removes the temp dir itself, not just its contents.

"I also reproduced the path with a mocked anonymous clone failure and then inspected getTmpCacheRoot(...); there were no leftover btca-anon-git-* entries after the error (entries: []).

"So I am going to leave this as-is unless we find a concrete repro where the temp dir entry survives the current cleanupDirectory(tempDir.path) call."

Claude said: "Greptile assumed cleanupDirectory meant 'clean the contents' based on the name, but the implementation is a full recursive delete."
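
The claim is easy to check outside BTCA. The sketch below assumes cleanupDirectory has the one-line body quoted above; `pathExists` and `tempDirSurvivesCleanup` are hypothetical names added for the demonstration, and the temp dir is created under the OS temp directory rather than BTCA's .tmp root.

```typescript
import { access, mkdtemp, rm, writeFile } from 'node:fs/promises';
import { tmpdir } from 'node:os';
import { join } from 'node:path';

/** Same shape as quoted in the reply above: a full recursive delete. */
async function cleanupDirectory(pathToRemove: string): Promise<void> {
	await rm(pathToRemove, { recursive: true, force: true });
}

/** True when the path still exists on disk. */
async function pathExists(target: string): Promise<boolean> {
	try {
		await access(target);
		return true;
	} catch {
		return false;
	}
}

/** Create a populated temp dir, clean it up, and report whether it survived. */
async function tempDirSurvivesCleanup(): Promise<boolean> {
	const dir = await mkdtemp(join(tmpdir(), 'btca-anon-git-'));
	await writeFile(join(dir, 'partial-clone.txt'), 'leftover');
	await cleanupDirectory(dir);
	return pathExists(dir);
}
```

`rm` with `recursive: true` deletes the directory entry created by mkdtemp along with its contents, so `tempDirSurvivesCleanup()` resolves to `false`.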
