
fix: parse group/pipe job logs correctly #52

Merged
nh13 merged 1 commit into main from nh/fix-group-job-parsing on Mar 28, 2026

Conversation

Collaborator

@nh13 nh13 commented Mar 28, 2026

Summary

  • Snakemake indents log output for jobs within pipe()/group: blocks by 4 spaces, causing the parser to miss rule names and timestamps for grouped jobs
  • Fix: switch rule detection from RULE_START_PATTERN.match(line) to RULE_START_PATTERN.match(line.lstrip()) across all 8 call sites in core.py, and add indented-line handling to LogLineParser in line_parser.py
  • Fix: switch timestamp detection from TIMESTAMP_PATTERN.match(line) to TIMESTAMP_PATTERN.search(line), and remove the line.startswith("[") guards that blocked indented timestamps

Test plan

  • 13 new tests covering group job parsing across parse_running_jobs_from_log, parse_failed_jobs_from_log, parse_completed_jobs_from_log, parse_all_jobs_from_log, and LogLineParser
  • All 148 existing parser tests still pass
  • Full suite (1033 tests) passes
  • ruff, mypy clean

Closes #42

Summary by CodeRabbit

  • Bug Fixes

    • Log parsing is now whitespace-tolerant for indented rule/group blocks and timestamps; indented timestamps and rule starts are correctly recognized. Job completion counting now attributes completions to the correct individual jobs for accurate rule counts.
  • Tests

    • Added extensive tests covering indented/group job parsing (running, failed, completed, scheduled) and timestamp regressions; relaxed an integration assertion to allow CI variability in reported total job counts.


coderabbitai bot commented Mar 28, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3bff1483-0fbe-454e-bb39-2e832b5f142f

📥 Commits

Reviewing files that changed from the base of the PR and between 41b0ffa and 3c42206.

📒 Files selected for processing (4)
  • snakesee/parser/core.py
  • snakesee/parser/line_parser.py
  • tests/integration/test_workflows.py
  • tests/test_parser.py
✅ Files skipped from review due to trivial changes (1)
  • tests/integration/test_workflows.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • snakesee/parser/core.py

📝 Walkthrough

Walkthrough

Rule-start and timestamp detection were made whitespace-tolerant (matching on left-stripped lines). Indented/grouped log lines are routed to a new indented-line handler that recognizes indented timestamps and rule starts and flushes pending ERROR state. Completed-job attribution now uses a JOBID→rule mapping. Tests for group/pipe log parsing were added.

Changes

Cohort / File(s) Summary
Core parsing logic
snakesee/parser/core.py
Added job_rules: dict[str, str] to map JOBID→rule for completion attribution; switched rule-start and timestamp checks to use PATTERN.match(line.lstrip()) across multiple parsers to be whitespace-tolerant.
Indented line parsing
snakesee/parser/line_parser.py
Added LogLineParser._parse_indented_or_group_line(...) and routed all indented lines to it. Handler left-strips lines, recognizes indented timestamps and rule/checkpoint starts, flushes pending ERROR events, updates context, and delegates property parsing as needed.
Tests / Fixtures
tests/test_parser.py, tests/integration/test_workflows.py
Added extensive fixtures and unit/integration tests for grouped/pipe job logs (running/failed/completed/scheduled), regression tests for timestamp handling, and relaxed an integration assertion around PROGRESS-derived total_jobs.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I hopped through indents, sniffed each line,

JobIDs led me to rules that brightly shine,
Timestamps unmasked where spaces used to hide,
Grouped jobs stood up, no longer brushed aside,
🥕📜 Hop, parse, and find!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title 'fix: parse group/pipe job logs correctly' clearly and concisely describes the main change: fixing the parser to handle group/pipe job logs.
Description check ✅ Passed The description is well-structured with a clear summary of the problem, solution, and comprehensive test plan, exceeding the template requirements.
Linked Issues check ✅ Passed The PR fully addresses issue #42 by fixing the parser to detect group/pipe jobs through handling indented log lines and correctly extracting rule names and timestamps.
Out of Scope Changes check ✅ Passed All changes directly address the group/pipe job parsing problem, with parser logic updates, line parsing enhancements, comprehensive tests, and a focused integration test adjustment.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.



codecov bot commented Mar 28, 2026

Codecov Report

❌ Patch coverage is 90.90909% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 88.92%. Comparing base (4053236) to head (3c42206).
⚠️ Report is 1 commit behind head on main.

Files with missing lines Patch % Lines
snakesee/parser/core.py 88.88% 2 Missing ⚠️
snakesee/parser/line_parser.py 92.30% 2 Missing ⚠️
Additional details and impacted files


@@            Coverage Diff             @@
##             main      #52      +/-   ##
==========================================
+ Coverage   88.59%   88.92%   +0.32%     
==========================================
  Files          52       52              
  Lines        4928     4955      +27     
==========================================
+ Hits         4366     4406      +40     
+ Misses        562      549      -13     
Files with missing lines Coverage Δ
snakesee/parser/core.py 85.23% <88.88%> (+2.79%) ⬆️
snakesee/parser/line_parser.py 97.74% <92.30%> (-0.97%) ⬇️


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
snakesee/parser/core.py (1)

266-270: ⚠️ Potential issue | 🟠 Major

parse_rules_from_log() still miscounts grouped completions.

This function still increments the last seen rule for every finish line. In grouped output, multiple rules can be opened before any of them finish, so the last rule block wins. With the grouped fixture added in this PR, both consumer completions would still be counted as producer, leaving rule-level historical stats wrong.

🔧 Suggested direction
 def parse_rules_from_log(log_path: Path) -> dict[str, int]:
     rule_counts: dict[str, int] = {}
     current_rule: str | None = None
+    job_rules: dict[str, str] = {}

     try:
         for line in log_path.read_text().splitlines():
             # Track current rule being executed
             if match := RULE_START_PATTERN.match(line.lstrip()):
                 current_rule = match.group(1)
+            elif match := JOBID_PATTERN.match(line):
+                if current_rule is not None:
+                    job_rules[match.group(1)] = current_rule
             # Count "Finished job" as rule completion
-            elif "Finished job" in line and current_rule is not None:
-                rule_counts[current_rule] = rule_counts.get(current_rule, 0) + 1
+            elif match := FINISHED_JOB_PATTERN.search(line):
+                if rule := job_rules.get(match.group(1), current_rule):
+                    rule_counts[rule] = rule_counts.get(rule, 0) + 1
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@snakesee/parser/core.py` around lines 266 - 270, parse_rules_from_log()
misattributes "Finished job" lines to the last seen rule by using current_rule;
instead maintain a stack of open rules (e.g., rules_stack) where on
RULE_START_PATTERN.match(line.lstrip()) you push match.group(1), and on seeing
"Finished job" you pop from rules_stack (if non-empty) and increment rule_counts
for the popped rule; update references to current_rule (remove or keep only for
convenience) and ensure you guard against popping an empty stack.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@snakesee/parser/core.py`:
- Around line 332-333: TIMESTAMP_PATTERN is currently used with .search(), which
finds timestamps anywhere in a line and produces false positives; update every
check to use TIMESTAMP_PATTERN.match(line.lstrip()) instead (i.e., strip leading
whitespace and anchor to the start) in the functions
parse_running_jobs_from_log(), parse_failed_jobs_from_log(), and
_get_first_log_timestamp(), and where record_pending_error() is gated by a
TIMESTAMP_PATTERN check—replace the .search(...) calls at those sites so
timestamp detection only matches at the start of the trimmed line.
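The distinction the comment draws can be illustrated with a small sketch (the timestamp regex below is illustrative, not the project's actual TIMESTAMP_PATTERN):

```python
import re

# Illustrative timestamp pattern for lines like "[Mon Jan  6 10:00:00 2026]".
TIMESTAMP = re.compile(r"\[\w{3} \w{3} [ \d]\d \d{2}:\d{2}:\d{2} \d{4}\]")

indented = "    [Mon Jan  6 10:00:00 2026]"
midline = "log message quoting [Mon Jan  6 10:00:01 2026] inline"

# .search() accepts both lines, including the mid-line false positive:
print(bool(TIMESTAMP.search(indented)), bool(TIMESTAMP.search(midline)))

# Anchoring on the left-stripped line accepts only a leading timestamp:
print(bool(TIMESTAMP.match(indented.lstrip())),
      bool(TIMESTAMP.match(midline.lstrip())))
```

The first print shows `True True` (the mid-line fragment slips through), the second `True False`, which is why the comment recommends `.match(line.lstrip())` over `.search()` for timestamp detection.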


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b74cfa2c-513d-4f1b-9708-1d0353eb8458

📥 Commits

Reviewing files that changed from the base of the PR and between 684359e and 227ea32.

📒 Files selected for processing (3)
  • snakesee/parser/core.py
  • snakesee/parser/line_parser.py
  • tests/test_parser.py

@nh13 nh13 force-pushed the nh/fix-group-job-parsing branch from 227ea32 to 3f2475d on March 28, 2026 16:01

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
snakesee/parser/line_parser.py (1)

276-285: Missing checkpoint/localcheckpoint handling for indented lines.

The non-indented rule start check (line 188-189) handles checkpoint and localcheckpoint prefixes, but this indented handler only checks for rule and localrule. If a checkpoint rule ever appears inside a group block, it won't be recognized as a rule start.

♻️ Proposed fix to add checkpoint support
         # Indented rule start: "    rule X:" or "    localrule X:"
-        if (first_stripped == "r" and stripped.startswith("rule ")) or (
-            first_stripped == "l" and stripped.startswith("localrule ")
+        if (
+            (first_stripped == "r" and stripped.startswith("rule "))
+            or (first_stripped == "l" and stripped.startswith("localrule "))
+            or (first_stripped == "c" and stripped.startswith("checkpoint "))
+            or (first_stripped == "l" and stripped.startswith("localcheckpoint "))
         ):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@snakesee/parser/line_parser.py` around lines 276 - 285, The indented-line
rule-start branch only checks for "r"/"rule" and "l"/"localrule" and thus misses
"checkpoint"/"localcheckpoint"; update the conditional that examines
first_stripped and stripped (the block using RULE_START_PATTERN.match(stripped))
to also accept the checkpoint prefixes the same way the non-indented handler
does, so that when RULE_START_PATTERN matches you still call
self.context.get_pending_error(), self.context.reset_for_new_rule(rule) and
append ParseEvent(ParseEventType.RULE_START, {"rule": rule}) for
checkpoint/localcheckpoint names as well (keep the existing use of
RULE_START_PATTERN, match.group(1), pending handling, reset_for_new_rule, and
ParseEvent creation).

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a4f515dc-65cb-4dc2-9fc7-d4075c8b4df5

📥 Commits

Reviewing files that changed from the base of the PR and between 227ea32 and 3f2475d.

📒 Files selected for processing (3)
  • snakesee/parser/core.py
  • snakesee/parser/line_parser.py
  • tests/test_parser.py

@nh13 nh13 force-pushed the nh/fix-group-job-parsing branch 2 times, most recently from 46b52ac to 41b0ffa on March 28, 2026 22:34

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/integration/test_workflows.py`:
- Around line 52-56: The assertion in tests/integration/test_workflows.py is too
strict given EventState.add_event (in snakesee/state_comparison.py) and the
optional Event.total_jobs (in snakesee/types.py) which can leave total_jobs as
0,1,2,3,4; update the assertion that checks result.total_jobs so it allows any
integer in the inclusive range 0–4 (rather than only 0 or 4), i.e. validate
result.total_jobs is >= 0 and <= 4 (or use membership in range(0,5)) to cover
partial PROGRESS snapshots.

In `@tests/test_parser.py`:
- Around line 2822-2855: The tests never exercise a mid-line timestamp fragment,
so they don't catch regressions that use TIMESTAMP_PATTERN.search(); update the
fixtures so a non-timestamp line contains an embedded "[Mon ...]" fragment and
ensure the parser is driven into a pending-error state so the mid-line fragment
would matter — e.g., in test_midline_timestamp_does_not_corrupt_running_jobs (or
add a new test) change the log_content to start an error block (use
parse_failed_jobs_from_log or construct a pending error with the same shape as
other tests) then include a line like "some message [Mon Jan  6 10:00:01 2026]
continued text" to verify parse_running_jobs_from_log /
parse_failed_jobs_from_log does not prematurely close the block; this will
ensure the code path that would be broken by TIMESTAMP_PATTERN.search() is
actually exercised.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 74b448f7-c1a7-4328-8474-9a19b6dd14b2

📥 Commits

Reviewing files that changed from the base of the PR and between 3f2475d and 41b0ffa.

📒 Files selected for processing (4)
  • snakesee/parser/core.py
  • snakesee/parser/line_parser.py
  • tests/integration/test_workflows.py
  • tests/test_parser.py

Snakemake indents log output for jobs within group/pipe blocks by
4 spaces. The parser used RULE_START_PATTERN.match() anchored at
position 0 and line.startswith("[") checks that both fail on indented
lines, causing group jobs to be invisible or assigned wrong rule names.

Fix by using RULE_START_PATTERN.match(line.lstrip()) for rule detection
and TIMESTAMP_PATTERN.search() for timestamp detection across all
parser functions. Add _parse_indented_or_group_line() to LogLineParser
for the same handling in the streaming path.

Closes #42
@nh13 nh13 force-pushed the nh/fix-group-job-parsing branch from 41b0ffa to 3c42206 on March 28, 2026 22:52
@nh13 nh13 merged commit 75c9b27 into main Mar 28, 2026
8 checks passed
@nh13 nh13 deleted the nh/fix-group-job-parsing branch March 28, 2026 23:03


Development

Successfully merging this pull request may close these issues.

Is snakesee blind to job groups?

1 participant