feat: support Snakemake SQLite DB persistence backend#51
Conversation
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID:
📒 Files selected for processing (18)
✅ Files skipped from review due to trivial changes (7)
🚧 Files skipped from review as they are similar to previous changes (6)
📝 Walkthrough

Adds a pluggable persistence layer to read Snakemake metadata from either the filesystem or the SQLite database.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant TUI as TUI Monitor
    participant Detector as detect_backend()
    participant Backend as PersistenceBackend
    participant Loader as HistoricalDataLoader
    participant Estimator as TimeEstimator
    participant Registry as RuleRegistry
    TUI->>Detector: detect_backend(workflow_dir)
    alt metadata.db exists
        Detector->>Backend: DbPersistence(paths)
    else
        Detector->>Backend: FsPersistence(paths)
    end
    Detector-->>TUI: backend
    TUI->>Estimator: load_from_backend(backend)
    Estimator->>Loader: loader.load_from_backend(backend, progress_cb)
    Loader->>Backend: iterate_metadata(progress_cb)
    Backend-->>Loader: MetadataRecord stream
    Loader->>Registry: record completions & build code_hash_to_rules
    Loader-->>Estimator: code_hash_to_rules assigned
    Estimator-->>TUI: metadata loaded
```
```mermaid
sequenceDiagram
    participant Parser as parse_workflow_state()
    participant Detector as detect_backend()
    participant Backend as PersistenceBackend
    participant IsRunning as is_workflow_running()
    Parser->>Detector: detect_backend(workflow_dir)
    Detector-->>Parser: backend
    Parser->>IsRunning: is_workflow_running(snakemake_dir, backend)
    IsRunning->>Backend: has_locks()
    Backend-->>IsRunning: bool
    IsRunning->>Backend: has_incomplete_jobs()
    Backend-->>IsRunning: bool
    IsRunning-->>Parser: running status
    alt is_latest_log and backend exists
        Parser->>Backend: iterate_incomplete_jobs(min_start_time)
        Backend-->>Parser: IncompleteJob stream
        Parser->>Parser: map to JobInfo objects
    else
        Parser->>Parser: fallback filesystem checks
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs
🚥 Pre-merge checks: ✅ Passed checks (5 passed)
Codecov Report

❌ Patch coverage is
Additional details and impacted files

```
@@ Coverage Diff @@
##             main      #51      +/-   ##
==========================================
+ Coverage   88.47%   88.59%   +0.11%
==========================================
  Files          48       52       +4
  Lines        4684     4926     +242
==========================================
+ Hits         4144     4364     +220
- Misses        540      562      +22
```
Actionable comments posted: 5
🧹 Nitpick comments (3)
tests/test_state_paths.py (1)
453-456: Consider adding `clear_exists_cache()` for test isolation.

This test doesn't clear the existence cache before checking `has_metadata_db`. If a previous test (in the same pytest session) cached a `True` result for a path that happens to match this `tmp_path`, the test could produce incorrect results. While `tmp_path` provides unique directories, adding the cache clear would be consistent with `test_has_metadata_db_true_when_exists` and ensure isolation.

♻️ Suggested fix
```diff
 def test_has_metadata_db_false_when_missing(self, tmp_path: Path) -> None:
     """Test has_metadata_db is False when file doesn't exist."""
+    from snakesee.state.paths import clear_exists_cache
+
+    clear_exists_cache()
     paths = WorkflowPaths(tmp_path)
     assert paths.has_metadata_db is False
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_state_paths.py` around lines 453-456: The test `test_has_metadata_db_false_when_missing` should clear the existence cache to ensure isolation: call `clear_exists_cache()` before constructing `WorkflowPaths(tmp_path)` so the cached result from other tests won't affect `paths.has_metadata_db`; locate the test function and add a `clear_exists_cache()` invocation prior to creating the `WorkflowPaths` instance (reference: `clear_exists_cache`, `WorkflowPaths`, `has_metadata_db`, `test_has_metadata_db_false_when_missing`).

tests/test_parser.py (1)
2241-2251: Consider extracting shared schema creation into a helper.

The `CREATE TABLE snakemake_metadata` and `CREATE TABLE snakemake_locks` DDL is duplicated in both tests here and in `tests/conftest.py::metadata_db` and `tests/test_persistence_db.py`. Consider extracting this to a shared helper or constant to reduce duplication and ensure schema consistency.

Also applies to: 2282-2292
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_parser.py` around lines 2241-2251: Duplicate DDL for tables `snakemake_metadata` and `snakemake_locks` should be extracted into a single test helper or constant (e.g., `METADATA_SCHEMA_SQL` or a function `create_metadata_schema(conn)`) and referenced from all locations (tests/test_parser.py, tests/conftest.py::metadata_db, tests/test_persistence_db.py) so schema is defined once; implement the helper that returns or executes the two CREATE TABLE statements and replace the inline SQL in functions/fixtures that currently call `conn.execute(...)` with a call to that helper.

snakesee/estimation/data_loader.py (1)
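A minimal sketch of the suggested sharing pattern. The column lists below are illustrative placeholders only; a real helper would move the duplicated DDL verbatim out of `tests/conftest.py` rather than redefine it:

```python
import sqlite3

# NOTE: placeholder columns -- the real schema should be copied verbatim
# from the existing test fixtures so there is exactly one definition.
METADATA_SCHEMA_SQL = """
CREATE TABLE snakemake_metadata (path TEXT PRIMARY KEY, record TEXT);
CREATE TABLE snakemake_locks (name TEXT PRIMARY KEY);
"""


def create_metadata_schema(conn: sqlite3.Connection) -> None:
    """Create both Snakemake tables from the single shared definition."""
    conn.executescript(METADATA_SCHEMA_SQL)


conn = sqlite3.connect(":memory:")
create_metadata_schema(conn)
tables = sorted(
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type='table'")
)
print(tables)  # ['snakemake_locks', 'snakemake_metadata']
```

Each test or fixture then calls `create_metadata_schema(conn)` instead of repeating the `conn.execute(...)` statements.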
80-115: Extract the shared metadata-record loop.

`load_from_backend()` now mirrors `load_from_metadata()` almost line-for-line. Any future change to duration gating, wildcard handling, or code-hash grouping will need to be kept in sync manually. A small private helper that consumes an iterator of `MetadataRecord`s would keep both code paths aligned.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@snakesee/estimation/data_loader.py` around lines 80 - 115, The load_from_backend implementation duplicates the metadata-record loop logic present in load_from_metadata; extract that shared logic into a private helper (e.g., _consume_metadata_records or _process_metadata_iterator) that accepts an iterator of MetadataRecord and an optional progress_callback, moves the duration/end_time gating and wildcard handling, calls self._registry.record_completion(rule, duration, timestamp, wildcards, input_size) and accumulates code_hash -> set(rule) mapping, and then have both load_from_backend and load_from_metadata call this helper and assign its returned code_hash_to_rules to self.code_hash_to_rules.
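The extraction could look roughly like the sketch below. The `MetadataRecord` shape, the loader class, and the registry call site are simplified stand-ins for the real snakesee types; only the sharing pattern is the point:

```python
from dataclasses import dataclass
from typing import Callable, Iterator, Optional


@dataclass
class MetadataRecord:
    """Assumed minimal record shape; the real type lives in snakesee."""

    rule: str
    duration: Optional[float]
    code_hash: Optional[str]


class HistoricalDataLoader:
    def __init__(self) -> None:
        self.code_hash_to_rules: dict[str, set[str]] = {}

    def _consume_metadata_records(
        self,
        records: Iterator[MetadataRecord],
        progress_callback: Optional[Callable[[int], None]] = None,
    ) -> None:
        """Single loop shared by load_from_metadata and load_from_backend."""
        mapping: dict[str, set[str]] = {}
        for count, rec in enumerate(records, start=1):
            if rec.duration is None:  # duration gating lives in one place
                continue
            # ...self._registry.record_completion(...) would be called here...
            if rec.code_hash is not None:
                mapping.setdefault(rec.code_hash, set()).add(rec.rule)
            if progress_callback is not None:
                progress_callback(count)
        self.code_hash_to_rules = mapping


loader = HistoricalDataLoader()
loader._consume_metadata_records(
    iter(
        [
            MetadataRecord("align", 12.5, "abc"),
            MetadataRecord("align", None, "abc"),  # skipped: no duration
            MetadataRecord("sort", 3.0, "abc"),
        ]
    )
)
print(sorted(loader.code_hash_to_rules["abc"]))  # ['align', 'sort']
```

Both public entry points then reduce to building the right record iterator and delegating to the helper.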
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@snakesee/persistence/backend.py`:
- Around line 92-97: detect_backend currently relies on
WorkflowPaths.has_metadata_db which uses a global exists cache and can return
stale results; replace that cached check with an immediate filesystem probe so
the function prefers the DB if it appears during a live session. Concretely, in
detect_backend (where you create paths = WorkflowPaths(workflow_dir)) stop using
paths.has_metadata_db and instead perform a direct existence check (e.g.
os.path.exists or pathlib.Path(...) .exists()) against the actual metadata.db
file path exposed by WorkflowPaths (use the attribute holding the metadata DB
path on paths), and return DbPersistence(paths) when that immediate check is
true; otherwise fall back to FsPersistence as before. Ensure you import os or
pathlib as needed.
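The direct-probe variant could look like this sketch; `WorkflowPaths` is a minimal stand-in for the real class in `snakesee.state.paths`, and the string return values stand in for `DbPersistence`/`FsPersistence` instances:

```python
import tempfile
from pathlib import Path


class WorkflowPaths:
    """Minimal stand-in; the real class lives in snakesee.state.paths."""

    def __init__(self, workflow_dir: Path) -> None:
        self.workflow_dir = Path(workflow_dir)
        self.metadata_db = self.workflow_dir / ".snakemake" / "metadata.db"


def detect_backend(workflow_dir: Path) -> str:
    paths = WorkflowPaths(workflow_dir)
    # Probe the filesystem directly instead of a cached has_metadata_db
    # property, so a DB that appears during a live session is picked up.
    if paths.metadata_db.exists():
        return "db"  # DbPersistence(paths) in the real code
    return "fs"  # FsPersistence(paths)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    print(detect_backend(root))  # fs
    (root / ".snakemake").mkdir()
    (root / ".snakemake" / "metadata.db").touch()
    print(detect_backend(root))  # db -- no stale cached result
```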
In `@snakesee/persistence/db.py`:
- Around line 229-238: Replace the bare "except Exception" around the
orjson.loads/calculate_input_size block with targeted exception handlers: catch
orjson.JSONDecodeError for malformed JSON and TypeError (or ValueError if
applicable) for bad input to calculate_input_size, and log the errors using the
project's standard logging pattern (see snakesee/events.py and
snakesee/utils.py) before continuing; update the code around input_json,
input_files, calculate_input_size and the orjson import to reflect these
specific catches and preserve existing control flow.
- Around line 50-52: The code builds the SQLite URI by interpolating db_path
directly (uri = f"file:{db_path}?mode=ro"), which fails for paths containing
characters like ? or #; change it to percent-encode the filesystem path using
pathlib.Path.as_uri(), e.g. construct uri = Path(db_path).resolve().as_uri() +
"?mode=ro" and then call sqlite3.connect(uri, uri=True, timeout=10.0); update
the import to include pathlib.Path if needed and keep the existing variable
names (db_path, uri) and sqlite3.connect call.
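A combined sketch of the two `db.py` suggestions, using stdlib `json` as a stand-in for `orjson` (the real code would catch `orjson.JSONDecodeError` instead). `Path.as_uri()` percent-encodes characters like `?` and `#` so they are not parsed as URI query/fragment delimiters:

```python
import json
import sqlite3
import tempfile
from pathlib import Path


def connect_readonly(db_path: Path) -> sqlite3.Connection:
    """Open the DB read-only; as_uri() percent-encodes '?' and '#' in the path."""
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True, timeout=10.0)


with tempfile.TemporaryDirectory() as tmp:
    db_path = Path(tmp) / "odd?name#metadata.db"  # hostile but legal filename
    conn = sqlite3.connect(db_path)  # plain path, no URI parsing
    conn.execute("CREATE TABLE snakemake_metadata (path TEXT, record TEXT)")
    conn.execute("INSERT INTO snakemake_metadata VALUES ('a.txt', 'not json')")
    conn.commit()
    conn.close()

    ro = connect_readonly(db_path)
    for path, record in ro.execute("SELECT path, record FROM snakemake_metadata"):
        try:
            # Catch the decode error specifically instead of a bare
            # `except Exception`, and log/report before continuing.
            json.loads(record)
        except json.JSONDecodeError:
            print(f"skipping malformed record for {path}")

    try:
        ro.execute("INSERT INTO snakemake_metadata VALUES ('b', '{}')")
    except sqlite3.OperationalError as exc:
        print("read-only enforced:", "readonly" in str(exc))
    ro.close()
```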
In `@snakesee/persistence/fs.py`:
- Around line 55-63: The has_incomplete_jobs method currently treats any entry
in self._paths.incomplete_dir as an incomplete job; update it to mirror
iterate_incomplete_jobs() filtering: on OSError return False as before, but
otherwise iterate inc_dir.iterdir(), skip entries that are directories and skip
the special marker filename (e.g. "migration_underway" or the same constant used
by iterate_incomplete_jobs()), and return True only if any remaining entry
qualifies as an incomplete job; keep the same function name has_incomplete_jobs
and use self._paths.incomplete_dir and the same marker identifier as
iterate_incomplete_jobs().
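A sketch of the aligned filtering, assuming `migration_underway` is the marker name shared with `iterate_incomplete_jobs()`:

```python
import tempfile
from pathlib import Path

MIGRATION_MARKER = "migration_underway"  # same marker iterate_incomplete_jobs skips


def has_incomplete_jobs(incomplete_dir: Path) -> bool:
    """True only for entries iterate_incomplete_jobs() would actually yield."""
    try:
        entries = list(incomplete_dir.iterdir())
    except OSError:
        return False
    return any(not p.is_dir() and p.name != MIGRATION_MARKER for p in entries)


with tempfile.TemporaryDirectory() as tmp:
    inc = Path(tmp)
    print(has_incomplete_jobs(inc / "missing"))  # False (OSError path)
    (inc / "subdir").mkdir()
    (inc / MIGRATION_MARKER).touch()
    print(has_incomplete_jobs(inc))  # False: only a directory and the marker
    (inc / "job_stub").touch()
    print(has_incomplete_jobs(inc))  # True
```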
In `@snakesee/tui/monitor.py`:
- Around line 396-407: The progress bar total is currently set from the
filesystem even when the DB backend (has_metadata_db) is used, which can make
the bar finish too early or overshoot; update the logic around progress.add_task
so the total is only provided when the filesystem backend is being used (i.e.,
has_metadata_fs is true AND has_metadata_db is false), otherwise pass None to
keep the progress bar indeterminate; adjust the total calculation that currently
uses metadata_dir.rglob()/file_count and the call to progress.add_task("Loading
metadata...", total=...) accordingly (refer to has_metadata, has_metadata_fs,
has_metadata_db, metadata_dir and the progress.add_task invocation).
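The total-selection logic can be isolated in a small pure helper (the function name is illustrative); the result feeds straight into `progress.add_task(..., total=...)`, where `None` keeps a rich progress bar indeterminate:

```python
from typing import Optional


def metadata_progress_total(
    has_metadata_fs: bool, has_metadata_db: bool, file_count: int
) -> Optional[int]:
    """Return a determinate total only for the filesystem backend.

    The DB backend's record count is unknown up front, so returning None
    keeps the bar indeterminate instead of finishing early or overshooting.
    """
    if has_metadata_fs and not has_metadata_db:
        return file_count
    return None


print(metadata_progress_total(True, False, 120))  # 120
print(metadata_progress_total(True, True, 120))  # None -- DB backend wins
print(metadata_progress_total(False, True, 0))  # None
```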
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 712a47a8-7270-4fd4-9d65-f74897672464
📒 Files selected for processing (17)
- snakesee/estimation/data_loader.py
- snakesee/estimation/estimator.py
- snakesee/parser/core.py
- snakesee/parser/metadata.py
- snakesee/persistence/__init__.py
- snakesee/persistence/backend.py
- snakesee/persistence/db.py
- snakesee/persistence/fs.py
- snakesee/state/paths.py
- snakesee/tui/monitor.py
- tests/conftest.py
- tests/test_estimation_components.py
- tests/test_parser.py
- tests/test_persistence_db.py
- tests/test_persistence_detection.py
- tests/test_persistence_fs.py
- tests/test_state_paths.py
Force-pushed 1c08545 to 7794905
Actionable comments posted: 1
🧹 Nitpick comments (1)
snakesee/parser/metadata.py (1)
67-85: Keep this helper name metadata-specific.

This rename creates two public `calculate_input_size()` helpers with different contracts: `snakesee.parser.utils.calculate_input_size(list[Path])` is already re-exported from `snakesee.parser`, while this one accepts raw metadata strings. That makes the wrong import easy to pick and harder to type-check. Keeping this private or renaming it to something like `calculate_metadata_input_size()` would avoid the collision.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@snakesee/parser/metadata.py` around lines 67 - 85, The helper calculate_input_size in snakesee.parser.metadata conflicts with the public utils.calculate_input_size (different contract), so rename this function to calculate_metadata_input_size (or prefix with an underscore) and update all internal callers in this module to use the new name; ensure it is not exported from snakesee.parser (remove from any __all__ or re-exports) so imports won’t accidentally pick the wrong helper and type-checking remains correct.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@snakesee/persistence/db.py`:
- Around line 179-186: The query in the db code that runs when min_start_time is
not None is treating rows with starttime IS NULL as always recent, which lets
stale incomplete stubs mark the workflow INCOMPLETE; update the SQL in the
conn.execute call that queries snakemake_metadata (the branch guarded by
min_start_time) to exclude NULL starttimes (e.g., replace the `(starttime IS
NULL OR starttime >= ?)` predicate with `starttime >= ?`) so only entries with a
real starttime >= min_start_time are considered; ensure this change is applied
where the code uses self._namespace and the same query that feeds
_determine_final_workflow_status().
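The predicate change matters because SQL three-valued logic makes `NULL >= ?` evaluate to NULL (falsy), so dropping the `IS NULL` branch silently excludes stubs with no recorded start. A self-contained illustration with a simplified table (column names assumed from the review text):

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE snakemake_metadata (path TEXT, starttime REAL)")
now = time.time()
conn.executemany(
    "INSERT INTO snakemake_metadata VALUES (?, ?)",
    [
        ("fresh.txt", now),       # started after min_start_time
        ("old.txt", now - 7200),  # stale run
        ("stub.txt", None),       # incomplete stub with no recorded start
    ],
)

min_start_time = now - 3600
# `starttime >= ?` rather than `(starttime IS NULL OR starttime >= ?)`:
# NULL comparisons are not true in SQL, so stale NULL stubs no longer
# count as recent and cannot force an INCOMPLETE status.
rows = conn.execute(
    "SELECT path FROM snakemake_metadata WHERE starttime >= ?",
    (min_start_time,),
).fetchall()
print([r[0] for r in rows])  # ['fresh.txt']
```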
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 386a2c03-ea9e-4c80-ad0d-2fac07df01c9
📒 Files selected for processing (15)
- snakesee/estimation/data_loader.py
- snakesee/estimation/estimator.py
- snakesee/parser/core.py
- snakesee/parser/metadata.py
- snakesee/persistence/__init__.py
- snakesee/persistence/backend.py
- snakesee/persistence/db.py
- snakesee/persistence/fs.py
- snakesee/tui/monitor.py
- tests/conftest.py
- tests/test_estimation_components.py
- tests/test_parser.py
- tests/test_persistence_db.py
- tests/test_persistence_detection.py
- tests/test_persistence_fs.py
✅ Files skipped from review due to trivial changes (3)
- tests/test_parser.py
- snakesee/persistence/backend.py
- tests/test_estimation_components.py
🚧 Files skipped from review as they are similar to previous changes (4)
- snakesee/persistence/__init__.py
- snakesee/estimation/estimator.py
- tests/test_persistence_detection.py
- tests/test_persistence_fs.py
Force-pushed 7794905 to be89b00
Adds support for reading workflow metadata from Snakemake's new SQLite persistence backend (.snakemake/metadata.db) in addition to the existing filesystem-based backend.

- New snakesee/persistence/ module with PersistenceBackend protocol
- FsPersistence wraps existing filesystem metadata/incomplete/lock reading
- DbPersistence reads from SQLite with read-only access and namespace filtering
- detect_backend() auto-selects based on whether metadata.db exists (prefers DB)
- Wired into HistoricalDataLoader, parse_workflow_state, is_workflow_running, and TUI metadata loading
- 42 new tests across 4 test files

Closes #40
Force-pushed be89b00 to 3cce606
Summary
Adds support for reading workflow metadata from Snakemake's new SQLite persistence backend (`.snakemake/metadata.db`) in addition to the existing filesystem-based backend.

- New `snakesee/persistence/` module with `PersistenceBackend` protocol
- `FsPersistence` wraps existing filesystem metadata/incomplete/lock reading
- `DbPersistence` reads from SQLite with read-only access and namespace filtering
- `detect_backend()` auto-selects based on whether `metadata.db` exists (prefers DB)
- Wired into `HistoricalDataLoader`, `parse_workflow_state`, `is_workflow_running`, and TUI

Closes #40
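The protocol-based design above means the two backends need no common base class; structural typing is enough. A toy sketch (method names taken from the diagrams earlier, exact signatures are assumptions):

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class PersistenceBackend(Protocol):
    """Illustrative protocol shape; the real one lives in snakesee/persistence/."""

    def has_locks(self) -> bool: ...

    def has_incomplete_jobs(self) -> bool: ...


class FsPersistence:
    """Toy conforming implementation -- no inheritance required."""

    def has_locks(self) -> bool:
        return False

    def has_incomplete_jobs(self) -> bool:
        return False


backend = FsPersistence()
# runtime_checkable lets isinstance() verify the structural contract.
print(isinstance(backend, PersistenceBackend))  # True
```

Callers such as `is_workflow_running()` can then accept any object satisfying the protocol, which keeps the DB and filesystem paths interchangeable.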
Test plan

- `uv run pytest tests/ -v`
- `uv run poe check-all`

Summary by CodeRabbit
New Features
Bug Fixes / Improvements
Tests