perf: cache metadata directory scans for large workflows #58

yfarjoun wants to merge 7 commits into fg-labs:main from
Conversation
…f files

Each poll cycle (default 1s) was recursively scanning and stat-ing every file in `.snakemake/metadata/`. With 19k+ metadata files this caused significant CPU and I/O overhead. The new `_ScanCache` tracks directory mtimes (~256 dirs) and only rescans directories that have actually changed, reducing steady-state polling from ~19k stat calls to ~256.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
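The directory-mtime idea in the commit message can be sketched as a standalone helper. This is an illustrative reduction of the technique, not the actual `_ScanCache` code; the `changed_dirs` name and the `dir_mtimes` dict shape are assumptions:

```python
import os
from pathlib import Path

def changed_dirs(root: Path, dir_mtimes: dict[str, float]) -> list[str]:
    """Return directories under root whose mtime differs from the cached value.

    Only directories are stat-ed (~hundreds), not individual files (tens of
    thousands). A directory's mtime changes when entries are added, removed,
    or renamed inside it, so an unchanged mtime means its direct file listing
    is still valid. The cache dict is updated in place.
    """
    changed = []
    for dirpath, _dirnames, _filenames in os.walk(root):
        mtime = os.stat(dirpath).st_mtime
        if dir_mtimes.get(dirpath) != mtime:
            changed.append(dirpath)
            dir_mtimes[dirpath] = mtime
    return changed
```

On the first call every directory reports as changed (full scan); on steady-state polls with no filesystem activity the returned list is empty and no file is stat-ed.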
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the settings.
📝 Walkthrough

Introduces a directory-mtime-backed incremental scan cache in `snakesee/utils.py`.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Caller
    participant ScanCache as _ScanCache
    participant FS as Filesystem
    Caller->>ScanCache: request scan for root (use_cache?)
    alt cache miss or use_cache == False
        ScanCache->>FS: perform full recursive scan
        FS-->>ScanCache: return files + dir mtimes
        ScanCache-->>Caller: return full file list
    else cache hit
        ScanCache->>FS: non-recursive collect dir mtimes
        FS-->>ScanCache: current dir mtimes
        ScanCache->>ScanCache: compare mtimes -> changed/removed dirs
        alt changed or removed dirs found
            ScanCache->>FS: scan changed dirs (non-recursive)
            FS-->>ScanCache: updated file metadata
            ScanCache-->>Caller: return merged updated + cached file list
        else no changes detected
            ScanCache-->>Caller: return cached file list
        end
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 inconclusive)
Actionable comments posted: 3
🧹 Nitpick comments (1)

snakesee/utils.py (1)

Line 210: Sort `__slots__` alphabetically. Static analysis flags that `__slots__` is not sorted.

```diff
- __slots__ = ("_files", "_dir_mtimes", "_lock")
+ __slots__ = ("_dir_mtimes", "_files", "_lock")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@snakesee/utils.py` at line 210, The __slots__ tuple in the class is not alphabetically ordered; update the __slots__ declaration to list its members alphabetically (e.g., order "_dir_mtimes", "_files", "_lock") so the __slots__ = ("_dir_mtimes", "_files", "_lock") ordering is used; modify the __slots__ line accordingly where __slots__ is defined to satisfy the static analysis rule.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@snakesee/utils.py`:
- Around line 344-357: _scandir_files currently always uses the global
_scan_cache which prevents callers of iterate_metadata_files from bypassing the
directory-scan cache when they pass use_cache=False; change _scandir_files to
accept a parameter like use_scan_cache: bool = True and, when False, perform a
fresh recursive scan instead of calling _scan_cache.get_files (or call the
existing lower-level scanner directly), then update iterate_metadata_files to
pass its use_cache value as use_scan_cache to _scandir_files so use_cache=False
truly forces a fresh directory listing; keep the public behavior unchanged when
use_scan_cache is True to use _scan_cache.
- Around line 205-207: The caching logic that "Tracks directory mtimes to detect
when new files are added" must be changed to not assume metadata files are
append-only: update the code that builds/uses _MetadataFileInfo so it records
and compares each metadata file's stat (mtime, size, inode) rather than relying
solely on directory mtime; on rescans compare per-file stats and treat modified,
added, or removed files as cache-invalidating changes (missing files should mark
the cache stale), and ensure any cache entries for directories are invalidated
when any per-file stat differs. Address this in the module code that currently
uses directory mtime checks and in the _MetadataFileInfo handling so the cache
reflects file-level modifications and deletions.
- Around line 202-255: The get_files method in _ScanCache uses an unsafe prefix
check (str(f.path).startswith(d)) and accepts any directory but only stores one
global cache; update get_files to (1) replace the startswith check with a safe
comparison that treats directory boundaries correctly (e.g., check str(f.path)
== d or str(f.path).startswith(d + os.sep) or equivalent using Path methods)
when building kept, and (2) add a simple cached-root guard: store the initial
directory (e.g., self._root_dir) on first full scan and on subsequent calls
validate that the incoming directory equals that root (raise or document error)
so the cache cannot be silently corrupted when callers pass different
directories (symbols: class _ScanCache, method get_files, attributes _files and
_dir_mtimes, and the caller _scandir_files).
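The unsafe-prefix pitfall flagged in the last comment can be shown with a small standalone helper (illustrative name, not the actual snakesee code):

```python
import os

def is_within(path: str, directory: str) -> bool:
    """True if `path` is `directory` itself or lies underneath it.

    A bare startswith() check would wrongly match sibling names that share
    a prefix, e.g. "/foo/bar_baz/x" when testing against "/foo/bar".
    Appending os.sep makes the comparison respect directory boundaries.
    """
    return path == directory or path.startswith(directory + os.sep)
```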
---
- Fix unsafe path-prefix check: use `d + os.sep` to avoid matching `/foo/bar_baz` when checking for `/foo/bar`
- Track cached root directory and reset on directory change
- Fix docstring: remove incorrect "append-only" claim; directory mtime correctly detects modifications and deletions too
- Add use_scan_cache bypass: when iterate_metadata_files is called with use_cache=False, also bypass the scan cache for a fresh listing
- Sort __slots__ alphabetically

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@snakesee/utils.py`:
- Around line 249-264: The current eviction logic removes entire subtrees by
checking startswith on changed_prefixes; change it so we only evict cached files
whose immediate parent directory equals a changed directory (i.e., don't drop
files in deeper subdirectories). Replace the kept comprehension that uses
changed_prefixes with one that keeps f in self._files if not
any(Path(f.path).parent == Path(dir_path) for dir_path in changed_dirs), so
_scan_single_dir (which only repopulates direct files) won't cause unrelated
subtrees to vanish; leave updating self._dir_mtimes/current_mtimes and new_files
logic unchanged.
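The parent-equality eviction described above can be sketched in isolation (the `files`/`changed_dirs` shapes here are illustrative, not the real `_ScanCache` internals):

```python
from pathlib import Path

def evict_direct_children(files: list[str], changed_dirs: set[str]) -> list[str]:
    """Keep cached entries whose immediate parent is not a changed directory.

    Files in deeper subdirectories of a changed dir survive, because the
    follow-up rescan only repopulates each changed dir's direct children;
    a startswith-style subtree eviction would make those files vanish.
    """
    changed = {Path(d) for d in changed_dirs}
    return [f for f in files if Path(f).parent not in changed]
```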
def _scandir_files(directory: Path) -> list[_MetadataFileInfo]:
    """Recursively scan directory for files using os.scandir (faster than rglob).

class _ScanCache:
issue: can you add unit tests:
- First call performs full scan
- No-change call returns cached results
- Adding/removing a file triggers rescan of that directory only
- Removing an entire directory removes its files
- Different root directory triggers full rescan
- clear() resets state
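A couple of the requested cases could look like the following pytest sketch. Since the branch's `_ScanCache` isn't reproduced here, the tests run against a minimal stand-in that implements the same `get_files`/`clear` contract discussed in this thread; with the real class imported from `snakesee.utils`, the test bodies would be the same (pytest's `tmp_path` fixture supplies the directory):

```python
import os
import time
from pathlib import Path

class ToyScanCache:
    """Minimal stand-in for the _ScanCache under review (get_files + clear).

    It rescans fully whenever any directory mtime changes, which is enough
    to express the test cases; it is not the real implementation."""

    def __init__(self) -> None:
        self._root: str | None = None
        self._files: tuple[str, ...] = ()
        self._dir_mtimes: dict[str, float] = {}

    def clear(self) -> None:
        self.__init__()

    def get_files(self, root: Path) -> tuple[str, ...]:
        mtimes = {d: os.stat(d).st_mtime for d, _, _ in os.walk(root)}
        if str(root) != self._root or mtimes != self._dir_mtimes:
            files: list[str] = []
            for dirpath, _, names in os.walk(root):
                files.extend(str(Path(dirpath) / n) for n in names)
            self._root = str(root)
            self._files = tuple(sorted(files))
            self._dir_mtimes = mtimes
        return self._files

def test_no_change_returns_cached(tmp_path: Path) -> None:
    cache = ToyScanCache()
    (tmp_path / "a.json").write_text("{}")
    first = cache.get_files(tmp_path)
    # Same tuple object back: nothing was rescanned or copied.
    assert cache.get_files(tmp_path) is first

def test_added_file_triggers_rescan(tmp_path: Path) -> None:
    cache = ToyScanCache()
    (tmp_path / "a.json").write_text("{}")
    cache.get_files(tmp_path)
    time.sleep(0.05)  # ensure the directory mtime actually advances
    (tmp_path / "b.json").write_text("{}")
    assert any(f.endswith("b.json") for f in cache.get_files(tmp_path))

def test_clear_resets_state(tmp_path: Path) -> None:
    cache = ToyScanCache()
    (tmp_path / "a.json").write_text("{}")
    cache.get_files(tmp_path)
    cache.clear()
    assert cache.get_files(tmp_path) != ()
```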
        self._dir_mtimes = current_mtimes
        return list(self._files)

    def clear(self) -> None:
issue: needs to be called in the test infrastructure once unit tests are added, to avoid flaky tests
snakesee/utils.py (Outdated)

        # First call or different directory — full scan
        self._root = dir_str
        self._files, self._dir_mtimes = _full_scandir(directory)
        return list(self._files)
issue (blocking):
There's a list() copy on every hot-path return here and below. A tuple return or a generation-counter approach could avoid this allocation on the hot path.
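The tuple alternative the reviewer mentions can be sketched as follows (illustrative class, not the branch's actual code): an immutable tuple can be handed out directly because callers cannot mutate it, so the per-call copy disappears.

```python
class TupleCache:
    """Cache that stores an immutable tuple and returns it directly.

    The single conversion cost is paid once at write time; get_files()
    then avoids the O(n) list() copy on every hot-path call. Callers
    that genuinely need mutability convert explicitly on their side.
    """

    def __init__(self) -> None:
        self._files: tuple[str, ...] = ()

    def set_files(self, files) -> None:
        self._files = tuple(files)  # one copy, at write time

    def get_files(self) -> tuple[str, ...]:
        return self._files  # no per-call allocation
```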
- Return tuples from _ScanCache.get_files() to avoid list() copy on every hot-path return (callers that need mutability convert explicitly)
- Fix subtree eviction: only evict direct children of changed dirs, preserving files in unchanged subdirectories
- Add _ScanCache unit tests: first call, cached results, add/remove files, remove directory, different root, clear, subdirectory preservation
- Clear global scan cache in test fixtures to prevent flaky tests

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
♻️ Duplicate comments (2)
snakesee/utils.py (2)
Lines 358-375: ⚠️ Potential issue | 🟠 Major

Line 375 puts the full `list()` copy back on the steady-state path. The cache now stores a tuple, but `list(_scan_cache.get_files(directory))` still allocates and copies every cached result. That restores an O(n) hot-path cost for large metadata trees, which is exactly what this PR is trying to remove.

♻️ One way to keep the cached path allocation-free:

```diff
-from collections.abc import Iterator
+from collections.abc import Iterator, Sequence
 ...
-def _scandir_files(directory: Path, *, use_scan_cache: bool = True) -> list[_MetadataFileInfo]:
+def _scandir_files(directory: Path, *, use_scan_cache: bool = True) -> Sequence[_MetadataFileInfo]:
 ...
     if not use_scan_cache:
         files, _ = _full_scandir(directory)
         return files
-    return list(_scan_cache.get_files(directory))
+    return _scan_cache.get_files(directory)
 ...
-    if sort_by_mtime:
-        files.sort(key=lambda f: f.mtime, reverse=newest_first)
+    if sort_by_mtime:
+        files = sorted(files, key=lambda f: f.mtime, reverse=newest_first)
```

The sequential/parallel helper annotations should be widened to `Sequence[_MetadataFileInfo]` too.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@snakesee/utils.py` around lines 358 - 375, _scandir_files currently forces an unnecessary allocation by returning list(_scan_cache.get_files(directory)), undoing the cache's zero-copy intent; change the cached-path return to return the cached sequence directly (i.e., return _scan_cache.get_files(directory)) and update any helper/annotation types from list[_MetadataFileInfo] to Sequence[_MetadataFileInfo] (and widen signatures/usages that reference _scandir_files and helpers to Sequence[_MetadataFileInfo]) so the steady-state path remains allocation-free while the non-cached path still calls _full_scandir and returns a concrete list.
Lines 232-249: ⚠️ Potential issue | 🔴 Critical

Directory mtimes won't catch in-place metadata rewrites. When `current_mtimes` matches `_dir_mtimes`, this reuses the old `_MetadataFileInfo` tuple without re-stat-ing any files. Parent directory mtimes change for add/remove/rename, not for rewriting an existing file, so a cached call can keep returning stale JSON after `a.json` is overwritten in place, because `_read_metadata_file()` will also see the stale `file_info.mtime`/inode.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@snakesee/utils.py` around lines 232 - 249, The code only compares directory mtimes (current_mtimes vs self._dir_mtimes) so when they match it returns the cached self._files, which misses in-place file rewrites; instead, when no changed_dirs and no removed_dirs, iterate the existing cached entries in self._files, re-stat each file (e.g., os.stat) and compare st_mtime and st_ino against the stored values in the _MetadataFileInfo for that file; if any file's stat differs, call _read_metadata_file(path) to refresh that file's entry and update both self._files and the underlying per-file metadata (and optionally self._dir_mtimes if needed). Ensure you reference and update the same _MetadataFileInfo structure and keep using _read_metadata_file for consistency.
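The re-stat validation suggested in that prompt can be sketched standalone. The `FileInfo` fields mirror the mtime/inode comparison described; the names are illustrative, not the branch's `_MetadataFileInfo`:

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class FileInfo:
    path: str
    mtime: float
    inode: int

def stale_entries(cached: list[FileInfo]) -> list[FileInfo]:
    """Return cached entries whose on-disk stat no longer matches.

    A file rewritten in place changes its mtime (and usually its inode,
    if replaced atomically via rename) without touching the parent
    directory's mtime, so directory-level checks alone would miss it.
    A missing file also invalidates its entry.
    """
    stale = []
    for info in cached:
        try:
            st = os.stat(info.path)
        except FileNotFoundError:
            stale.append(info)
            continue
        if st.st_mtime != info.mtime or st.st_ino != info.inode:
            stale.append(info)
    return stale
```

The trade-off the later commits settle on: this loop is O(files) per poll, so it belongs behind an opt-in freshness mode rather than on the steady-state fast path.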
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 41043795-db10-42d7-a8fd-7c95a451ca5c
📒 Files selected for processing (2)
- snakesee/utils.py
- tests/test_utils.py
- Return Sequence from _scandir_files instead of list, avoiding O(n) copy on every cached call. Widen helper signatures to match.
- Use sorted() instead of .sort() at the call site.
- Detect in-place file modifications by re-stat-ing cached files when no directory changes are found (catches mtime/inode shifts).
- Add test for in-place rewrite detection.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@snakesee/utils.py`:
- Around line 249-287: The fast-path currently still stats every cached file in
the block starting at "if not changed_dirs and not removed_dirs:" (using
self._files, stat(), and _MetadataFileInfo) which makes steady-state O(files)
and also inconsistently leaves in-place rewrites stale when any directory change
occurs; fix this by making the fast path directory-only: remove the per-file
stat loop and any_changed handling in that block so it simply returns
self._files when no directory mtimes changed, and extract the file-level
revalidation into a separate method (e.g., _revalidate_files or a configurable
"fresh_scan" mode) that callers can opt into when full file-level freshness is
required; ensure _dir_mtimes/current_mtimes logic and the existing eviction
logic (changed_dirs/removed_dirs, _scan_single_dir, kept/new_files) remain
unchanged and update any tests that assumed implicit per-file stats.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 44d349a1-5d7b-41d1-9c2f-fb1d541d5021
📒 Files selected for processing (2)
- snakesee/utils.py
- tests/test_utils.py
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/test_utils.py
Remove per-file stat() loop from the no-change fast path. Snakemake metadata files are write-once, so in-place rewrites don't need detection. Callers needing file-level freshness can bypass the cache via use_scan_cache=False. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary

- Each poll cycle recursively scanned and stat-ed every file in `.snakemake/metadata/`. With 19k+ files this caused significant sluggishness.
- Adds `_ScanCache`, which tracks directory mtimes (~256 dirs) and only rescans directories that have actually changed.

Test plan
🤖 Generated with Claude Code
Summary by CodeRabbit