
perf: cache metadata directory scans for large workflows#58

Open
yfarjoun wants to merge 7 commits into fg-labs:main from yfarjoun:yf_performance-thread-investigation

Conversation

@yfarjoun
Contributor

@yfarjoun yfarjoun commented Mar 31, 2026

Summary

  • Each poll cycle (default 1s) was recursively stat-ing every file in .snakemake/metadata/. With 19k+ files this caused significant sluggishness.
  • Added _ScanCache that tracks directory mtimes (~256 dirs) and only rescans directories that have actually changed.
  • Reduces steady-state polling from ~19k stat calls to ~256 per cycle (~75x reduction).
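The scheme these bullets describe can be sketched as follows. This is a minimal illustration of the directory-mtime idea only; `ScanCache` and its methods are hypothetical names, not snakesee's actual `_ScanCache`:

```python
import os
from pathlib import Path


class ScanCache:
    """Illustrative directory-mtime scan cache (not snakesee's _ScanCache)."""

    def __init__(self) -> None:
        self._dir_mtimes: dict[str, float] = {}
        self._files_by_dir: dict[str, list[str]] = {}

    def get_files(self, root: Path) -> list[str]:
        # Cheap pass: one stat per directory (~hundreds) instead of one
        # stat per file (~tens of thousands).
        current: dict[str, float] = {}
        for dirpath, _dirnames, _filenames in os.walk(root):
            current[dirpath] = os.stat(dirpath).st_mtime
        # Rescan only new directories or those whose mtime moved
        # (adding, removing, or renaming an entry updates the dir mtime).
        for d, mtime in current.items():
            if self._dir_mtimes.get(d) != mtime:
                with os.scandir(d) as entries:
                    self._files_by_dir[d] = [e.path for e in entries if e.is_file()]
        # Evict directories that disappeared since the last scan.
        for gone in set(self._dir_mtimes) - set(current):
            self._files_by_dir.pop(gone, None)
        self._dir_mtimes = current
        return [f for files in self._files_by_dir.values() for f in files]
```

The key property: steady-state cost is one stat per directory, and file-level scandir work happens only in directories whose mtime changed.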

Test plan

  • All 978 tests pass
  • Manual testing with a large workflow directory (19k+ metadata files) to confirm responsiveness improvement

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Refactor
    • Improved directory-scan performance with an incremental scan cache for much faster subsequent scans.
    • Added an option to bypass the scan cache so scans can run uncached when fresh results are required.
  • Tests
    • Added comprehensive tests validating cache behavior, invalidation, rescan scenarios, and cache clearing.

…f files

Each poll cycle (default 1s) was recursively scanning and stat-ing every
file in .snakemake/metadata/. With 19k+ metadata files this caused
significant CPU and I/O overhead.

The new _ScanCache tracks directory mtimes (~256 dirs) and only rescans
directories that have actually changed, reducing steady-state polling
from ~19k stat calls to ~256.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@yfarjoun yfarjoun requested a review from nh13 as a code owner March 31, 2026 15:40
@coderabbitai

coderabbitai bot commented Mar 31, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Introduces a directory-mtime-backed incremental scan cache in snakesee/utils.py that performs a full recursive scan once and thereafter rescans only directories that changed, evicting those that were removed; the test suite in tests/test_utils.py adds fixtures and cases validating cache behavior, eviction, and edge cases.

Changes

  • Directory scan & cache — snakesee/utils.py: Added _ScanCache and a public get_scan_cache(); new helpers _collect_dir_mtimes, _scan_single_dir, and _full_scandir; _scandir_files(directory: Path, *, use_scan_cache: bool = True) -> Sequence[_MetadataFileInfo] replaces the recursive scanner; caching logic for directory mtimes, removed directories, per-file restat, and selective rescans; iterate_metadata_files() now passes use_scan_cache; switched to sorted(...) and widened internal type hints.
  • Tests for scan cache — tests/test_utils.py: Added a TestScanCache suite and an autouse fixture clearing the global scan/metadata caches; expanded tests assert cache reuse, the initial full scan, mtime-driven rescans, deleted-subdir eviction, root-directory switches forcing a full rescan, clear() behavior, in-place file rewrite detection, and preservation of unchanged subdir entries.

Sequence Diagram(s)

sequenceDiagram
    participant Caller
    participant ScanCache as _ScanCache
    participant FS as Filesystem

    Caller->>ScanCache: request scan for root (use_cache?)
    alt cache miss or use_cache == False
        ScanCache->>FS: perform full recursive scan
        FS-->>ScanCache: return files + dir mtimes
        ScanCache-->>Caller: return full file list
    else cache hit
        ScanCache->>FS: non-recursive collect dir mtimes
        FS-->>ScanCache: current dir mtimes
        ScanCache->>ScanCache: compare mtimes -> changed/removed dirs
        alt changed or removed dirs found
            ScanCache->>FS: scan changed dirs (non-recursive)
            FS-->>ScanCache: updated file metadata
            ScanCache-->>Caller: return merged updated + cached file list
        else no changes detected
            ScanCache-->>Caller: return cached file list
        end
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇
I hopped through folders, noted each chime,
Saved tiny ticks to spare the climb,
When mtimes shifted, I dug once more,
Otherwise I lent the cached shore —
Nibble, bound, and scan some more! 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

  • Description check — ❓ Inconclusive. The description covers the problem, solution, and test status, but lacks the explicit sections required by the template (Type of Change, Checklist, Testing, Related Issues). Resolution: fill in the missing sections — specify the Type of Change (New feature), complete the Checklist, detail the testing approach, and list any Related Issues or Additional Notes.
✅ Passed checks (2 passed)
  • Title check — ✅ Passed. The title accurately summarizes the main performance improvement: caching metadata directory scans to handle large workflows efficiently.
  • Docstring coverage — ✅ Passed. Docstring coverage is 92.86%, above the required threshold of 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (1)
snakesee/utils.py (1)

210-210: Sort __slots__ alphabetically.

Static analysis flags that __slots__ is not sorted.

-    __slots__ = ("_files", "_dir_mtimes", "_lock")
+    __slots__ = ("_dir_mtimes", "_files", "_lock")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@snakesee/utils.py` at line 210, The __slots__ tuple in the class is not
alphabetically ordered; update the __slots__ declaration to list its members
alphabetically (e.g., order "_dir_mtimes", "_files", "_lock") so the __slots__ =
("_dir_mtimes", "_files", "_lock") ordering is used; modify the __slots__ line
accordingly where __slots__ is defined to satisfy the static analysis rule.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@snakesee/utils.py`:
- Around line 344-357: _scandir_files currently always uses the global
_scan_cache which prevents callers of iterate_metadata_files from bypassing the
directory-scan cache when they pass use_cache=False; change _scandir_files to
accept a parameter like use_scan_cache: bool = True and, when False, perform a
fresh recursive scan instead of calling _scan_cache.get_files (or call the
existing lower-level scanner directly), then update iterate_metadata_files to
pass its use_cache value as use_scan_cache to _scandir_files so use_cache=False
truly forces a fresh directory listing; keep the public behavior unchanged when
use_scan_cache is True to use _scan_cache.
- Around line 205-207: The caching logic that "Tracks directory mtimes to detect
when new files are added" must be changed to not assume metadata files are
append-only: update the code that builds/uses _MetadataFileInfo so it records
and compares each metadata file's stat (mtime, size, inode) rather than relying
solely on directory mtime; on rescans compare per-file stats and treat modified,
added, or removed files as cache-invalidating changes (missing files should mark
the cache stale), and ensure any cache entries for directories are invalidated
when any per-file stat differs. Address this in the module code that currently
uses directory mtime checks and in the _MetadataFileInfo handling so the cache
reflects file-level modifications and deletions.
- Around line 202-255: The get_files method in _ScanCache uses an unsafe prefix
check (str(f.path).startswith(d)) and accepts any directory but only stores one
global cache; update get_files to (1) replace the startswith check with a safe
comparison that treats directory boundaries correctly (e.g., check str(f.path)
== d or str(f.path).startswith(d + os.sep) or equivalent using Path methods)
when building kept, and (2) add a simple cached-root guard: store the initial
directory (e.g., self._root_dir) on first full scan and on subsequent calls
validate that the incoming directory equals that root (raise or document error)
so the cache cannot be silently corrupted when callers pass different
directories (symbols: class _ScanCache, method get_files, attributes _files and
_dir_mtimes, and the caller _scandir_files).
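The boundary bug called out here is easy to demonstrate with POSIX-style paths. A hedged sketch of the safe check (`is_within` is an illustrative helper, not a snakesee function):

```python
import os


def is_within(path: str, directory: str) -> bool:
    """True if path equals directory or lies strictly under it."""
    return path == directory or path.startswith(directory + os.sep)


# A bare startswith matches sibling directories that merely share a prefix:
assert "/foo/bar_baz/x.json".startswith("/foo/bar")      # false positive
assert not is_within("/foo/bar_baz/x.json", "/foo/bar")  # boundary-safe
assert is_within("/foo/bar/x.json", "/foo/bar")
```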

---

Nitpick comments:
In `@snakesee/utils.py`:
- Line 210: The __slots__ tuple in the class is not alphabetically ordered;
update the __slots__ declaration to list its members alphabetically (e.g., order
"_dir_mtimes", "_files", "_lock") so the __slots__ = ("_dir_mtimes", "_files",
"_lock") ordering is used; modify the __slots__ line accordingly where __slots__
is defined to satisfy the static analysis rule.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: bf1a5995-a9a4-47f3-8463-2429400aeb9e

📥 Commits

Reviewing files that changed from the base of the PR and between 453d660 and aab0451.

📒 Files selected for processing (1)
  • snakesee/utils.py

- Fix unsafe path-prefix check: use `d + os.sep` to avoid matching
  `/foo/bar_baz` when checking for `/foo/bar`
- Track cached root directory and reset on directory change
- Fix docstring: remove incorrect "append-only" claim; directory mtime
  correctly detects modifications and deletions too
- Add use_scan_cache bypass: when iterate_metadata_files is called with
  use_cache=False, also bypass the scan cache for a fresh listing
- Sort __slots__ alphabetically

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@snakesee/utils.py`:
- Around line 249-264: The current eviction logic removes entire subtrees by
checking startswith on changed_prefixes; change it so we only evict cached files
whose immediate parent directory equals a changed directory (i.e., don't drop
files in deeper subdirectories). Replace the kept comprehension that uses
changed_prefixes with one that keeps f in self._files if not
any(Path(f.path).parent == Path(dir_path) for dir_path in changed_dirs), so
_scan_single_dir (which only repopulates direct files) won't cause unrelated
subtrees to vanish; leave updating self._dir_mtimes/current_mtimes and new_files
logic unchanged.
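The parent-equality rule this comment suggests can be illustrated in isolation. `evict_direct_children` is a hypothetical helper mirroring the suggested comprehension:

```python
from pathlib import Path


def evict_direct_children(files: list[Path], changed_dirs: set[Path]) -> list[Path]:
    """Keep every cached file except those directly inside a changed directory."""
    return [f for f in files if f.parent not in changed_dirs]


files = [Path("/m/a/x.json"), Path("/m/a/deep/y.json"), Path("/m/b/z.json")]
kept = evict_direct_children(files, {Path("/m/a")})
# Only /m/a/x.json is evicted; /m/a/deep/y.json survives because a
# non-recursive rescan of /m/a would never repopulate the deep subtree.
```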
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c972074c-5728-477d-a9c8-82569ad47e98

📥 Commits

Reviewing files that changed from the base of the PR and between aab0451 and ef2d51b.

📒 Files selected for processing (1)
  • snakesee/utils.py

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

def _scandir_files(directory: Path) -> list[_MetadataFileInfo]:
"""Recursively scan directory for files using os.scandir (faster than rglob).
class _ScanCache:
Collaborator


issue: can you add unit tests:

  • First call performs full scan
  • No-change call returns cached results
  • Adding/removing a file triggers rescan of that directory only
  • Removing an entire directory removes its files
  • Different root directory triggers full rescan
  • clear() resets state
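Two of the requested cases, sketched as runnable pytest-style tests. The StubScanCache below is a deliberately simplified stand-in so the shapes are self-contained; the real _ScanCache also tracks per-directory mtimes:

```python
from pathlib import Path


class StubScanCache:
    """Minimal stand-in for test-shape purposes only: caches forever per
    root and never checks mtimes, unlike the real _ScanCache."""

    def __init__(self) -> None:
        self._root: str | None = None
        self._files: tuple[str, ...] = ()
        self.full_scans = 0

    def get_files(self, root: Path) -> tuple[str, ...]:
        if self._root != str(root):  # first call or new root: full scan
            self._root = str(root)
            self.full_scans += 1
            self._files = tuple(str(p) for p in sorted(root.rglob("*")) if p.is_file())
        return self._files

    def clear(self) -> None:
        self._root, self._files = None, ()


def test_first_call_full_scan_then_cached(tmp_path: Path) -> None:
    (tmp_path / "a.json").write_text("{}")
    cache = StubScanCache()
    cache.get_files(tmp_path)
    cache.get_files(tmp_path)
    assert cache.full_scans == 1  # second call served from cache


def test_clear_resets_state(tmp_path: Path) -> None:
    (tmp_path / "a.json").write_text("{}")
    cache = StubScanCache()
    cache.get_files(tmp_path)
    cache.clear()
    cache.get_files(tmp_path)
    assert cache.full_scans == 2  # clear() forces a fresh full scan
```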

self._dir_mtimes = current_mtimes
return list(self._files)

def clear(self) -> None:
Collaborator


issue: needs to be called in the test infrastructure once unit tests are added, to avoid flaky tests

# First call or different directory — full scan
self._root = dir_str
self._files, self._dir_mtimes = _full_scandir(directory)
return list(self._files)
Collaborator


issue (blocking):

There's a list() copy on every hot-path return here and below. A tuple return or a generation-counter approach could avoid this allocation on the hot path.
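The allocation-free alternative hinted at here can be as simple as storing a tuple and typing the return as Sequence. A generic sketch, not snakesee code:

```python
from collections.abc import Sequence


class Cache:
    def __init__(self, items: tuple[str, ...]) -> None:
        self._items = items

    def get(self) -> Sequence[str]:
        # Return the stored tuple directly: no O(n) copy per call, and the
        # tuple is immutable, so callers cannot mutate the cache's state.
        return self._items


c = Cache(("a", "b"))
assert c.get() is c.get()  # identical object every call: allocation-free
```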

- Return tuples from _ScanCache.get_files() to avoid list() copy on
  every hot-path return (callers that need mutability convert explicitly)
- Fix subtree eviction: only evict direct children of changed dirs,
  preserving files in unchanged subdirectories
- Add _ScanCache unit tests: first call, cached results, add/remove
  files, remove directory, different root, clear, subdirectory
  preservation
- Clear global scan cache in test fixtures to prevent flaky tests

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (2)
snakesee/utils.py (2)

358-375: ⚠️ Potential issue | 🟠 Major

Line 375 puts the full list() copy back on the steady-state path.

The cache now stores a tuple, but list(_scan_cache.get_files(directory)) still allocates and copies every cached result. That restores an O(n) hot-path cost for large metadata trees, which is exactly what this PR is trying to remove.

♻️ One way to keep the cached path allocation-free
-from collections.abc import Iterator
+from collections.abc import Iterator, Sequence
...
-def _scandir_files(directory: Path, *, use_scan_cache: bool = True) -> list[_MetadataFileInfo]:
+def _scandir_files(directory: Path, *, use_scan_cache: bool = True) -> Sequence[_MetadataFileInfo]:
...
     if not use_scan_cache:
         files, _ = _full_scandir(directory)
         return files
-    return list(_scan_cache.get_files(directory))
+    return _scan_cache.get_files(directory)
...
-    if sort_by_mtime:
-        files.sort(key=lambda f: f.mtime, reverse=newest_first)
+    if sort_by_mtime:
+        files = sorted(files, key=lambda f: f.mtime, reverse=newest_first)

The sequential/parallel helper annotations should be widened to Sequence[_MetadataFileInfo] too.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@snakesee/utils.py` around lines 358 - 375, _scandir_files currently forces an
unnecessary allocation by returning list(_scan_cache.get_files(directory)),
undoing the cache's zero-copy intent; change the cached-path return to return
the cached sequence directly (i.e., return _scan_cache.get_files(directory)) and
update any helper/annotation types from list[_MetadataFileInfo] to
Sequence[_MetadataFileInfo] (and widen signatures/usages that reference
_scandir_files and helpers to Sequence[_MetadataFileInfo]) so the steady-state
path remains allocation-free while the non-cached path still calls _full_scandir
and returns a concrete list.

232-249: ⚠️ Potential issue | 🔴 Critical

Directory mtimes won't catch in-place metadata rewrites.

When current_mtimes matches _dir_mtimes, this reuses the old _MetadataFileInfo tuple without re-stat-ing any files. Parent directory mtimes change for add/remove/rename, not for rewriting an existing file, so a cached call can keep returning stale JSON after a.json is overwritten in place because _read_metadata_file() will also see the stale file_info.mtime/inode.
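A per-file freshness check of the kind this comment calls for can be sketched as follows. `file_changed` is an illustrative helper, not snakesee code:

```python
import os


def file_changed(path: str, cached_mtime: float, cached_inode: int) -> bool:
    """Detect an in-place rewrite that the parent directory's mtime misses."""
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return True  # a vanished file also counts as changed
    # An in-place overwrite bumps st_mtime; an atomic replace (write to a
    # temp file, then rename) changes st_ino. Either signals staleness.
    return st.st_mtime != cached_mtime or st.st_ino != cached_inode
```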

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@snakesee/utils.py` around lines 232 - 249, The code only compares directory
mtimes (current_mtimes vs self._dir_mtimes) so when they match it returns the
cached self._files, which misses in-place file rewrites; instead, when no
changed_dirs and no removed_dirs, iterate the existing cached entries in
self._files, re-stat each file (e.g., os.stat) and compare st_mtime and st_ino
against the stored values in the _MetadataFileInfo for that file; if any file's
stat differs, call _read_metadata_file(path) to refresh that file's entry and
update both self._files and the underlying per-file metadata (and optionally
self._dir_mtimes if needed). Ensure you reference and update the same
_MetadataFileInfo structure and keep using _read_metadata_file for consistency.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@snakesee/utils.py`:
- Around line 358-375: _scandir_files currently forces an unnecessary allocation
by returning list(_scan_cache.get_files(directory)), undoing the cache's
zero-copy intent; change the cached-path return to return the cached sequence
directly (i.e., return _scan_cache.get_files(directory)) and update any
helper/annotation types from list[_MetadataFileInfo] to
Sequence[_MetadataFileInfo] (and widen signatures/usages that reference
_scandir_files and helpers to Sequence[_MetadataFileInfo]) so the steady-state
path remains allocation-free while the non-cached path still calls _full_scandir
and returns a concrete list.
- Around line 232-249: The code only compares directory mtimes (current_mtimes
vs self._dir_mtimes) so when they match it returns the cached self._files, which
misses in-place file rewrites; instead, when no changed_dirs and no
removed_dirs, iterate the existing cached entries in self._files, re-stat each
file (e.g., os.stat) and compare st_mtime and st_ino against the stored values
in the _MetadataFileInfo for that file; if any file's stat differs, call
_read_metadata_file(path) to refresh that file's entry and update both
self._files and the underlying per-file metadata (and optionally
self._dir_mtimes if needed). Ensure you reference and update the same
_MetadataFileInfo structure and keep using _read_metadata_file for consistency.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 41043795-db10-42d7-a8fd-7c95a451ca5c

📥 Commits

Reviewing files that changed from the base of the PR and between ef2d51b and 7e4b8a9.

📒 Files selected for processing (2)
  • snakesee/utils.py
  • tests/test_utils.py

yfarjoun and others added 2 commits April 14, 2026 16:54
- Return Sequence from _scandir_files instead of list, avoiding O(n)
  copy on every cached call. Widen helper signatures to match.
- Use sorted() instead of .sort() at the call site.
- Detect in-place file modifications by re-stat-ing cached files when
  no directory changes are found (catches mtime/inode shifts).
- Add test for in-place rewrite detection.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@snakesee/utils.py`:
- Around line 249-287: The fast-path currently still stats every cached file in
the block starting at "if not changed_dirs and not removed_dirs:" (using
self._files, stat(), and _MetadataFileInfo) which makes steady-state O(files)
and also inconsistently leaves in-place rewrites stale when any directory change
occurs; fix this by making the fast path directory-only: remove the per-file
stat loop and any_changed handling in that block so it simply returns
self._files when no directory mtimes changed, and extract the file-level
revalidation into a separate method (e.g., _revalidate_files or a configurable
"fresh_scan" mode) that callers can opt into when full file-level freshness is
required; ensure _dir_mtimes/current_mtimes logic and the existing eviction
logic (changed_dirs/removed_dirs, _scan_single_dir, kept/new_files) remain
unchanged and update any tests that assumed implicit per-file stats.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 44d349a1-5d7b-41d1-9c2f-fb1d541d5021

📥 Commits

Reviewing files that changed from the base of the PR and between 3c3e015 and f8e48e2.

📒 Files selected for processing (2)
  • snakesee/utils.py
  • tests/test_utils.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/test_utils.py

Remove per-file stat() loop from the no-change fast path. Snakemake
metadata files are write-once, so in-place rewrites don't need
detection. Callers needing file-level freshness can bypass the cache
via use_scan_cache=False.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>