fix(#235): tolerate non-JSON QA verdict responses #236
ouroboros-agent[bot] wants to merge 2 commits into main from issue-235/ouroborosqa-fails-with-failed-to-parse-q
Conversation
Review — ouroboros-agent[bot]
Verdict: REQUEST_CHANGES
Reviewing commit 0d8e738 | Triggered by: PR opened
Branch: issue-235/ouroborosqa-fails-with-failed-to-parse-q | 2 files, +84/-7
CI: Ruff Lint fail 12s https://github.com/Q00/ouroboros/actions/runs/23644472843/job/68873058775
MyPy Type Check pass 46s https://github.com/Q00/ouroboros/actions/runs/23644472843/job/68873058792
Test Python 3.12 pass 1m9s https://github.com/Q00/ouroboros/actions/runs/23644472797/job/68873058755
Test Python 3.13 pass 1m9s https://github.com/Q00/ouroboros/actions/runs/23644472797/job/68873058760
Test Python 3.14 pass 1m15s https://github.com/Q00/ouroboros/actions/runs/23644472797/job/68873058780
Issue Requirements
No linked issue detected in the PR body.
Previous Review Follow-up
First review — no previous findings.
Blocking Findings
| # | File:Line | Severity | Confidence | Finding |
|---|---|---|---|---|
| 1 | src/ouroboros/mcp/tools/qa.py:182 | Medium | Medium | The new non-JSON fallback only matches bare `Score:` / `Verdict:` labels at the start of a line, so it still rejects common LLM plain-text variants such as `- Score: 0.91`, `**Score**: 0.91`, or `Score - 0.91`. That means the PR only fixes one narrow response shape, and the QA tool can still fail with "Could not find JSON in QA response" for many realistic non-JSON outputs, which leaves issue #235 only partially addressed. |
Test Coverage
tests/unit/mcp/tools/test_qa_parser.py:7 only covers two minimal prose formats. It does not exercise bulleted or Markdown-formatted verdicts, which are the most likely non-JSON variants from chat-style providers and are currently still unhandled by the regex fallback in src/ouroboros/mcp/tools/qa.py:182-195.
Design Notes
The direction is reasonable: keeping _parse_qa_response() tolerant of non-JSON provider output is the right place to harden this tool. The main issue is that the fallback parser is regex-fragile and currently recognizes only a very small subset of “plain text” responses, so the implementation does not yet provide a robust degradation path across providers.
Follow-up Recommendations
- Re-run review after any substantive push if the PR head changes.
- Add or update targeted tests for each changed behavior path noted above.
Files Reviewed
src/ouroboros/mcp/tools/qa.py
tests/unit/mcp/tools/test_qa_parser.py
Reviewed by ouroboros-agent[bot] via Codex deep analysis
- Split parsing into 3 layers: strip decorations → extract KV → parse fields
- KV regex uses only unambiguous separators (: and =) to avoid false positives
- Score-specific fallback handles "score is/- X" with word boundary
- Add tests for bulleted, markdown bold, and dash-separated LLM outputs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Review — ouroboros-agent[bot]
Verdict: REQUEST_CHANGES
Reviewing commit 092de31 | Triggered by: new push
Branch: issue-235/ouroborosqa-fails-with-failed-to-parse-q | 2 files, +149/-7 | CI: Ruff Lint failing; MyPy + tests passing
What improved
- The previous blocker was partially addressed: the fallback parser now correctly handles bulleted labels, bolded Markdown labels, and ASCII `Score - ...` score lines in addition to `Score:` / `Score = ...`.
- Regression tests were added for the newly supported plain-text shapes.
Issue #235 Requirements
| Requirement | Status |
|---|---|
| `ouroboros_qa` should degrade gracefully when the provider returns plain-text verdicts instead of JSON | NOT MET — `Verdict - fail` is still ignored because `_KV_RE` only accepts `:` or `=` separators, so `_parse_qa_response()` silently re-derives the verdict from score and can return the opposite result (src/ouroboros/mcp/tools/qa.py:176, src/ouroboros/mcp/tools/qa.py:232, src/ouroboros/mcp/tools/qa.py:270) |
| Preserve useful structured QA information from fallback prose rather than failing or discarding it | NOT MET — all bullet lines are currently funneled into `differences`, while `suggestions` and `dimensions` stay empty even when the prose includes those sections (src/ouroboros/mcp/tools/qa.py:225-236) |
Previous Review Follow-up
| Previous Finding | Status |
|---|---|
| Fallback parser did not handle common non-JSON variants such as bullets, bold Markdown labels, or `Score - ...` | MODIFIED — those formats are now accepted, but the updated parser still mishandles dash-separated verdicts and remains lossy for other structured fields (src/ouroboros/mcp/tools/qa.py:174-178, src/ouroboros/mcp/tools/qa.py:208-236) |
Blocking Findings
| # | File:Line | Severity | Confidence | Finding |
|---|---|---|---|---|
| 1 | src/ouroboros/mcp/tools/qa.py:176 | High | High | The new `Score - 0.88` fallback does not also parse `Verdict - fail/revise`. `_KV_RE` only accepts `:` or `=`, so `fields.get("verdict")` stays empty and `_parse_qa_response()` silently re-derives the verdict from score. That means an explicit textual verdict can now be ignored and inverted, e.g. `Score - 0.88` with `Verdict - fail` returns pass. |
| 2 | src/ouroboros/mcp/tools/qa.py:225 | Medium | High | The fallback now treats every bullet item as a difference, regardless of section. Plain-text outputs like `Suggestions:` followed by bullet points are parsed as differences and `suggestions` is always returned empty, which materially degrades the structured QA result the tool advertises. |
| 3 | src/ouroboros/mcp/tools/qa.py:225 | Medium | Medium | Bulleted dimension lines such as `- accuracy: 0.9` / `- clarity: 0.8` are also captured as generic differences and never populate `dimensions`. Since this PR broadens acceptance of non-JSON prose, it drops useful scoring detail from a common fallback shape instead of preserving it. |
Test Coverage
tests/unit/mcp/tools/test_qa_parser.py:62-73 does not actually verify dash-separated verdict parsing. `Verdict - pass` is currently ignored, and the test still passes only because the code derives `pass` from the score. tests/unit/mcp/tools/test_qa_parser.py:6-73 has no coverage for preservation of `suggestions` or `dimensions` from non-JSON prose, even though this PR now routes more responses through that fallback path.
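The coverage gap is easiest to see with an adversarial fixture where the explicit verdict disagrees with the score-derived one. The parser below is a self-contained stand-in, not the real `_parse_qa_response()`, and the 0.85 threshold is an assumption for illustration:

```python
import re

# Stand-in parser demonstrating the precedence rule a test should pin down:
# an explicit textual verdict must win over a score-derived verdict.
def parse_verdict(text: str, threshold: float = 0.85) -> str:
    explicit = re.search(r"^verdict\s*[-:=]\s*(\w+)", text, re.IGNORECASE | re.MULTILINE)
    if explicit:
        return explicit.group(1).lower()
    score = re.search(r"^score\s*[-:=]\s*([\d.]+)", text, re.IGNORECASE | re.MULTILINE)
    return "pass" if score and float(score.group(1)) >= threshold else "fail"

# The adversarial case: score clears the threshold, verdict says fail.
assert parse_verdict("Score - 0.88\nVerdict - fail") == "fail"
assert parse_verdict("Score - 0.88") == "pass"
```

A test built this way fails against the current code (which ignores `Verdict - fail`), unlike the existing fixture where both paths happen to agree.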
Design Notes
The direction is right, and the re-review confirms meaningful progress on the original blocker. The remaining problem is that the fallback is still label-fragile and lossy: it accepts more prose shapes, but one of the newly advertised shapes (Verdict - ...) is only “supported” accidentally via score-based verdict derivation, and richer structured sections from prose are flattened away. For a QA loop tool, that is merge-blocking because it can change loop control and user-visible feedback while appearing to parse successfully.
Follow-up Recommendations
- Once the blocking correctness issue is fixed, consider extending `_strip_line_decorations()` to tolerate quoted Markdown output (`> Score: ...`) and typographic dashes (`–`/`—`) as follow-up hardening.
- The failing Ruff check should be resolved before merge as part of normal CI hygiene.
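The recommended hardening amounts to two more normalization steps in the decoration stripper. `_strip_line_decorations()` is the name the review uses; the body below is a hedged sketch of what the extension could look like, not the PR's code:

```python
import re

# Follow-up hardening sketch: also tolerate blockquoted output and
# typographic dashes. Function body is illustrative only.
def strip_line_decorations(line: str) -> str:
    # Drop leading blockquote markers, bullets, and whitespace: "> - **Score**"
    cleaned = re.sub(r"^[\s>*-]+", "", line)
    # Remove bold markers; normalize en/em dashes to the ASCII "-" the
    # downstream label regexes already understand.
    cleaned = cleaned.replace("**", "")
    cleaned = cleaned.replace("\u2013", "-").replace("\u2014", "-")
    return cleaned.strip()
```

Normalizing dashes here keeps the separator logic itself untouched, so the label regexes stay ASCII-only.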
Files Reviewed
src/ouroboros/mcp/tools/qa.py
tests/unit/mcp/tools/test_qa_parser.py
Reviewed by ouroboros-agent[bot] via Codex deep analysis
Summary
Make `ouroboros_qa` fall back to plain-text score/verdict parsing when providers ignore JSON schema output.
Testing
uv run python -m pytest -q /tmp/issue_repo/tests/unit/mcp/tools/test_qa_parser.py /tmp/issue_repo/tests/unit/mcp/tools/test_qa_integration.py -q