fix(indexer): use negative mtime sentinel for gitignored files#40
Merged
fix(indexer): use negative mtime sentinel for gitignored files#40
Conversation
Gitignored files now return -stat.mtimeMs instead of null so callers store a sentinel in the mtime cache. This prevents re-running the gitignore check on every startup for every gitignored file (fixing the re-index-from-zero regression introduced in PR #39), while ensuring the mtime early-return never fires for gitignored files — so un-gitignoring a file without touching its mtime still triggers a proper re-index on the next run. No file content is read and no vectors are stored for gitignored files; the gitignore safety property from PR #39 is fully preserved.
Contributor
There was a problem hiding this comment.
Pull request overview
Fixes an indexer restart regression where gitignored files weren’t persisted in the mtime cache, causing repeated gitignore checks across restarts. The PR introduces a negative-mtime sentinel to persist “gitignored” state without ever allowing the unchanged-mtime fast path to bypass re-checking ignore status.
Changes:
- Persist gitignored files in the mtime cache using a negative mtime sentinel instead of returning
null. - Document the mtime-cache sentinel conventions and the three
indexFilereturn cases. - Update batch processing comments to reflect the new sentinel semantics.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
Comment on lines
+746
to
+748
| // Check if gitignored BEFORE reading the file | ||
| const ignored = isIgnored ? isIgnored(filePath) : await isGitignored(filePath) | ||
| if (ignored) return null | ||
| if (ignored) return -stat.mtimeMs |
| } | ||
|
|
||
| /** Returns the file's mtimeMs if it was indexed, null if skipped or errored. */ | ||
| /** Returns the file's mtimeMs if indexed, -mtimeMs (negative) if gitignored (cached as sentinel so callers skip re-stat but always re-check gitignore), null if skipped for other reasons or errored. */ |
Comment on lines
+170
to
+171
| // null (absent key) = never seen. Negative sentinels are persisted to disk so the | ||
| // mtime early-return never fires for gitignored files across restarts. |
| // null — error or file otherwise unindexable (do not update cache) | ||
| if (mtime !== null) { | ||
| cache.set(file, mtime) | ||
| cache.set(file, mtime) // <-- direct Map.set on s.mtimeCache ref |
| // Check if gitignored BEFORE reading the file | ||
| const ignored = isIgnored ? isIgnored(filePath) : await isGitignored(filePath) | ||
| if (ignored) return null | ||
| if (ignored) return -stat.mtimeMs |
- Fix -0 sentinel: use -(mtimeMs+1) instead of -mtimeMs so mtime 0 never produces -0 (which === 0 in JS and would defeat the invariant) - Check gitignore before mtime early-return in indexFile() so files newly added to .gitignore are pruned even without mtime change - Thread AbortSignal through redis.getAllMtimes() to allow cancellation of slow-path Redis scroll on abort - Change buildIgnoreChecker catch fallback to () => true (conservative, consistent with isGitignored utility) to prevent accidental indexing of sensitive files when git check-ignore fails - Fix indexFile JSDoc: sentinel skips content reads/embedding, not re-stat - Clarify cache semantics comment: absent key = unindexable/errored, not strictly "never seen" - Replace debug-style cache.set comment with intent explanation - Update docs/CLI_INDEX_DELETE.md: opencode index delete → indexer delete - Add mtime sentinel unit tests covering: always-negative invariant, never-equals-positive-mtime, -0 JSON round-trip hazard, persistence
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes the re-index regression introduced by PR feat: enforce gitignore protection across bash tool and indexer #39 where gitignored files returned
nullfromindexFile, causing them to never be stored in the mtime cache. On every restart,isGitignored()was re-run for every gitignored file in the project (potentially thousands innode_modulesetc.).Uses a negative mtime sentinel (
-stat.mtimeMs) instead ofnullfor gitignored files. The mtime early-return check (=== stat.mtimeMs, always positive) can never match a negative value, so the gitignore check always re-runs — meaning un-gitignoring a file without touching its mtime still correctly triggers re-indexing on the next run.No file content is read and no vectors are stored for gitignored files. The gitignore safety property from PR feat: enforce gitignore protection across bash tool and indexer #39 is fully preserved.
Changes
packages/opencode/src/indexer/index.tsindexFileline 745:if (ignored) return -stat.mtimeMs(wasreturn null)indexFileto document the negative sentinel semanticsloadMtimeCacheFromDiskprocessBatchcomment to correctly document all three return cases (positive / negative / null)Notes
Investigation revealed the "3,697 files indexed on every restart" symptom is the
totalcounter (all files the glob scanner finds), not the number actually being re-embedded. The mtime cache does skip already-indexed files correctly. The remaining ~1,955 files not in cache are ones whose embedding batches never completed (indexer interrupted before finishing). That is a separate issue.