
Feature/#184 #334

Open
vkehfdl1 wants to merge 7 commits into dev from feature/#184

Conversation


@vkehfdl1 vkehfdl1 commented Apr 6, 2026

Summary

  • add a repo-idiomatic RAGCriticPipeline and config that model a critic → planner → executor refinement loop over the existing generation pipeline abstraction
  • support rewrite, retrieval, decomposition, refinement, and answer-regeneration actions with structured metadata for auditability
  • add focused RAG-Critic tests covering JSON parsing, config persistence, multi-action correction flow, working-query retrieval defaults, and end-to-end pipeline persistence/token aggregation
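As a rough illustration, the critic → planner → executor refinement loop described above can be sketched as follows. This is a hedged sketch only: the function names, the verdict/action shapes, and `MAX_ROUNDS` are illustrative assumptions, not the actual `RAGCriticPipeline` API.

```python
MAX_ROUNDS = 3  # illustrative bound on corrective rounds, not the repo's actual config field


def run_critic_loop(query, retrieve, generate, critique, plan):
    """Bounded corrective loop: draft an answer, critique it, plan and execute fixes.

    Sketch of the refinement pattern; the real pipeline builds on the repo's
    BaseGenerationPipeline abstraction and supports more action types.
    """
    docs = retrieve(query)
    answer = generate(query, docs)
    for _ in range(MAX_ROUNDS):
        verdict = critique(query, docs, answer)  # e.g. {"correct": bool, "recommended_actions": [...]}
        if verdict.get("correct"):
            break
        for action in plan(verdict):  # e.g. [{"action": "retrieval"}, {"action": "generate_answer"}]
            kind = action.get("action")
            if kind == "retrieval":
                docs = retrieve(query)
            elif kind == "generate_answer":
                answer = generate(query, docs)
    return answer
```

The loop terminates either when the critic accepts the answer or when the round budget is exhausted, which is what keeps the corrective process bounded.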

Verification

  • make check
  • uv build
  • make test-only

Add a repo-idiomatic RAG-Critic generation pipeline with a critic, planner,
and bounded corrective action loop over the existing retrieval/generation
abstractions. The test suite locks helper parsing, multi-action execution,
and service-layer persistence so the new pipeline behaves like the rest of
the framework.

Constraint: Must fit the existing BaseGenerationPipeline + retrieval composition model
Constraint: No new dependencies or config-system changes
Rejected: Full paper-faithful multi-agent reproduction | too broad for the current pipeline abstraction and issue scope
Rejected: A single-pass critic-only wrapper | would miss the paper's planner/executor correction loop
Confidence: medium
Scope-risk: moderate
Reversibility: clean
Directive: Keep planner/critic outputs JSON-based unless you also harden parser fallbacks and tests together
Tested: uv run pytest -q tests/autorag_research/pipelines/generation/test_rag_critic_pipeline.py; make check; uv build; make test-only
Not-tested: Live LLM compliance with the structured prompt contracts
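One hedged way to implement the kind of parser fallback the directive above refers to — extracting a JSON payload from free-form LLM text — is sketched below. This is illustrative only; the repo's actual `_parse_json_payload` may be shaped differently.

```python
import json
import re


def parse_json_payload(text):
    """Best-effort extraction of the first JSON object or array from LLM output.

    Sketch only. Returns the parsed payload, or None when no valid JSON
    object/array can be found.
    """
    # Fast path: the whole response is already valid JSON.
    try:
        return json.loads(text)
    except (json.JSONDecodeError, TypeError):
        pass
    # Fallback: grab the first {...} or [...] span, e.g. inside markdown fences.
    match = re.search(r"(\{.*\}|\[.*\])", text or "", flags=re.DOTALL)
    if match:
        try:
            return json.loads(match.group(1))
        except json.JSONDecodeError:
            return None
    return None
```

Note that a parser like this can legitimately return a top-level list, so every caller must be prepared for list-or-dict results.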

vkehfdl1 commented Apr 6, 2026

I ran the $code-review workflow and verified this locally in a PR worktree.

Findings

  1. [HIGH] The new pipeline is not actually discoverable/configurable through the repo’s normal config path.
    RAGCriticPipelineConfig is added in autorag_research/pipelines/generation/rag_critic.py:117-150, but pipeline discovery only reads YAMLs under configs/pipelines/generation (autorag_research/cli/utils.py:56-66). This PR does not add a matching configs/pipelines/generation/rag_critic.yaml, so the new pipeline never shows up in the config-driven CLI flow. I verified that locally with:

    • uv run --python 3.13 python - <<'PY' ... discover_pipelines('generation') ... PY → rag_critic present: False

    Given the PR summary explicitly says it adds a repo-idiomatic pipeline and config, this leaves the feature incomplete for normal users. Please add the YAML config (and ideally a small discovery regression test).

  2. [HIGH] The JSON parser accepts top-level arrays, but the callers unconditionally treat the result as a dict and crash.
    _parse_json_payload() explicitly extracts either an object or an array in autorag_research/pipelines/generation/rag_critic.py:200-209, but _plan_actions() immediately does payload.get("actions", []) at autorag_research/pipelines/generation/rag_critic.py:331-345 and _decompose_query() does payload.get("sub_questions", []) at autorag_research/pipelines/generation/rag_critic.py:369-374. If an LLM returns a valid top-level JSON array (which is common in practice), the corrective loop raises AttributeError instead of falling back.

    I reproduced that locally with a mocked planner response of '[{"action": "generate_answer"}]':

    • uv run --python 3.13 python - <<'PY' ... await pipeline._plan_actions(...) ... PY → AttributeError: 'list' object has no attribute 'get'

    Please normalize list payloads into the expected dict shapes (or reject non-dicts before .get/.setdefault) and add a regression test.
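A minimal shape for the suggested normalization — wrapping a top-level array under the expected key before any `.get(...)` call — could look like this (a sketch; the actual fix in the PR may differ):

```python
def normalize_payload(payload, list_key):
    """Coerce a parsed LLM payload into a dict holding a list under `list_key`.

    Sketch of the suggested fix: a dict passes through unchanged, a top-level
    JSON array is wrapped under `list_key`, and any other type degrades to an
    empty list instead of raising AttributeError downstream.
    """
    if isinstance(payload, dict):
        return payload
    if isinstance(payload, list):
        return {list_key: payload}
    return {list_key: []}
```

The planner path would then call `normalize_payload(parsed, "actions")` and the decomposition path `normalize_payload(parsed, "sub_questions")` before reading fields.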

Real Result

  • uv run --python 3.13 pytest -q tests/autorag_research/pipelines/generation/test_rag_critic_pipeline.py → 5 passed in 38.96s
  • uv run --python 3.13 ruff check autorag_research/pipelines/generation/rag_critic.py tests/autorag_research/pipelines/generation/test_rag_critic_pipeline.py → All checks passed!
  • uv run --python 3.13 python - <<'PY' ... discover_pipelines('generation') ... PY → rag_critic present: False
  • uv run --python 3.13 python - <<'PY' ... mocked planner returns a top-level array ... PY → AttributeError: 'list' object has no attribute 'get'

Recommendation: REQUEST CHANGES

The review surfaced two user-facing gaps in the new generation pipeline:
config-driven discovery could not find RAG-Critic because the repo lacked
its generation YAML, and planner/decomposition JSON parsing crashed when an
LLM returned a valid top-level array instead of an object. This adds the
repo-standard config entry, hardens both callers to accept list-or-dict
payloads, and locks the behavior with focused regressions.

Constraint: Pipeline discovery is driven only by YAML files under configs/pipelines/generation
Constraint: LLM JSON responses may legitimately be top-level arrays
Rejected: Restrict _parse_json_payload to objects only | would reject valid planner/decomposition outputs already tolerated by the parser
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: Keep rag_critic.yaml aligned with RAGCriticPipelineConfig whenever config fields change
Tested: uv run pytest tests/autorag_research/cli/test_utils.py tests/autorag_research/pipelines/generation/test_rag_critic_pipeline.py -q
Tested: uv run python snippet verifying discover_pipelines('generation') includes rag_critic and planner array payload no longer crashes
Tested: make check
Tested: uv build
Tested: make clean-docker && make docker-up && make docker-wait && make test-only
Not-tested: Manual CLI invocation of a generation run using the new rag_critic config

vkehfdl1 commented Apr 6, 2026

Addressed the review feedback on feature/#184:

  • added configs/pipelines/generation/rag_critic.yaml so discover_pipelines("generation") now exposes rag_critic through the normal config-driven CLI flow
  • hardened RAGCriticPipeline planner and decomposition parsing so valid top-level JSON arrays are normalized instead of crashing on .get(...)
  • added focused regressions for generation discovery plus planner/decomposition array payload handling

Verification:

  • make check
  • uv build
  • make clean-docker && make docker-up && make docker-wait && make test-only
  • uv run pytest tests/autorag_research/cli/test_utils.py tests/autorag_research/pipelines/generation/test_rag_critic_pipeline.py -q


vkehfdl1 commented Apr 6, 2026

Round 2 review on feature/#184 vs origin/dev after re-running the $code-review workflow locally.

I re-checked the two blocking issues from round 1 and did not find any additional blocking problems in the current patch.

  • The pipeline is now discoverable through the normal config path via configs/pipelines/generation/rag_critic.yaml, and the generation discovery regression is covered in tests/autorag_research/cli/test_utils.py:141-142.
  • The planner/decomposition parsing path now handles top-level JSON arrays without dict-only .get(...) crashes in autorag_research/pipelines/generation/rag_critic.py:338-352 and 376-386, with targeted regression coverage in tests/autorag_research/pipelines/generation/test_rag_critic_pipeline.py:67-133.

Real Result

  • make check → passed
  • uv build → passed
  • uv run pytest tests/autorag_research/cli/test_utils.py tests/autorag_research/pipelines/generation/test_rag_critic_pipeline.py -q → 26 passed in 2.74s
  • uv run pytest tests/autorag_research/pipelines/generation/test_rag_critic_pipeline.py -q -k 'top_level_array_payload or decomposition_array_payload' → 2 passed, 5 deselected in 2.75s
  • uv run python - <<'PY' ... discover_pipelines('generation') ... PY → rag_critic present: True

Recommendation: APPROVE


The approved feature branch already fixed discoverability and top-level
array parsing, but the unit tests still exercised DB-backed pipeline
construction for helper-level behavior. This cleanup keeps the tested
behavior intact while extracting shared payload-list normalization in the
pipeline and switching helper-focused unit tests to a lightweight
non-persistent fixture, which makes the verification path more robust and
faster without changing the feature contract.

Constraint: Must preserve the approved feature behavior and repo verification flow
Rejected: Leave helper tests DB-backed | unnecessary coupling to container state for unit-only behavior
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: Keep helper-focused RAG-Critic tests persistence-free unless the behavior under test requires repository integration
Tested: uv run pytest tests/autorag_research/pipelines/generation/test_rag_critic_pipeline.py -q; make check; uv build; make clean-docker && make docker-up && make docker-wait && make test-only; architect verification
Not-tested: Additional LLM prompt-quality scenarios beyond the existing regression coverage

vkehfdl1 commented Apr 6, 2026

Follow-up on feature/#184:

  • kept the approved RAG-Critic discoverability + array-payload fix set intact on PR Feature/#184 #334
  • simplified RAGCriticPipeline payload handling by extracting shared top-level-array/keyed-list normalization
  • converted helper-focused RAG-Critic unit tests to a lightweight non-persistent fixture so the array/config coverage no longer depends on DB-backed pipeline setup

Verification:

  • uv run pytest tests/autorag_research/pipelines/generation/test_rag_critic_pipeline.py -q
  • make check
  • uv build
  • make clean-docker && make docker-up && make docker-wait && make test-only
  • architect verification: APPROVED


vkehfdl1 commented Apr 6, 2026

Round 3 review on feature/#184 vs origin/dev after re-running the $code-review workflow locally.

I re-checked the approved discoverability + array-payload fix set after the follow-up simplification in aba22de, and I did not find any new blocking issues.

  • The shared top-level-array normalization in autorag_research/pipelines/generation/rag_critic.py:202-222 keeps the planner/decomposition paths simple while preserving the round-1 fix at autorag_research/pipelines/generation/rag_critic.py:333-389.
  • The helper-focused unit coverage now uses the lightweight non-persistent fixture in tests/autorag_research/pipelines/generation/test_rag_critic_pipeline.py:21-54, while the DB-backed integration path is still exercised by tests/autorag_research/pipelines/generation/test_rag_critic_pipeline.py:319-365.
  • The config-driven discovery path remains wired through configs/pipelines/generation/rag_critic.yaml and tests/autorag_research/cli/test_utils.py:139-143.

Real Result

  • make check → passed
  • uv build → passed
  • make clean-docker && make docker-up && make docker-wait → PostgreSQL test environment recreated successfully
  • uv run pytest tests/autorag_research/cli/test_utils.py tests/autorag_research/pipelines/generation/test_rag_critic_pipeline.py -q → 27 passed in 2.51s
  • make test-only → 1270 passed, 63 deselected, 2 warnings in 47.64s
  • uv run python - <<'PY' ... discover_pipelines('generation') ... PY → rag_critic present: True
  • uv run python - <<'PY' ... top-level planner/decomposition arrays ... PY → actions: [{'action': 'retrieval'}, {'action': 'generate_answer'}]; sub_questions: ['What is RAG?', 'How does the critic refine it?']

Recommendation: APPROVE

The approved RAG-Critic follow-up already covered config discovery and
list-shaped planner/decomposition payloads, but a singleton
recommended_actions string still reached fallback paths and expanded into
per-character actions. Normalize string-or-list action metadata before
storing critique state or building planner fallbacks, and lock the fix
with focused regressions.
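The string-or-list normalization described above (later exercised as `RAGCriticPipeline._normalize_string_list` in the round-4 verification) can be sketched roughly like this; the exact implementation in the PR may differ:

```python
def normalize_string_list(value):
    """Coerce string-or-list action metadata into a clean list of strings.

    Sketch of the fix: a bare string becomes a one-element list (rather than
    being iterated per character), list entries that are not strings are
    dropped, and any other type degrades to an empty list.
    """
    if isinstance(value, str):
        return [value]
    if isinstance(value, list):
        return [item for item in value if isinstance(item, str)]
    return []
```

Centralizing this helper lets both the stored critique metadata and the planner fallback path share one definition of "list of actions".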

Constraint: LLM structured outputs may return singleton strings even when prompts ask for lists
Rejected: Coerce strings only in planner fallback | would leave critique metadata inconsistent for audit traces
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: Keep critique/planner payload normalization centralized so fallback action handling stays consistent with stored metadata
Tested: uv run pytest -q tests/autorag_research/pipelines/generation/test_rag_critic_pipeline.py -k 'string_recommended_actions'; uv run pytest -q tests/autorag_research/cli/test_utils.py tests/autorag_research/pipelines/generation/test_rag_critic_pipeline.py; uv run python - <<'PY' ... singleton recommended_actions repro ... PY; make check; uv build; make clean-docker && make docker-up && make docker-wait && make test-only
Not-tested: Malformed non-string/non-list recommended_actions payloads beyond existing fallback behavior

vkehfdl1 commented Apr 6, 2026

Follow-up landed on feature/#184:

  • normalized singleton recommended_actions payloads in RAGCriticPipeline so critique metadata and planner fallbacks keep list semantics instead of expanding into per-character actions
  • added focused regressions for critic normalization and planner fallback handling when the model returns a singleton action string
  • re-verified the branch with targeted RAG-Critic tests, direct helper repro, make check, uv build, and a clean-docker make test-only run


vkehfdl1 commented Apr 6, 2026

REJECT — the RAG-Critic changes test cleanly, but PR #334 is currently not mergeable against dev.

Short reason
The feature itself looks good after the follow-up fixes, but feature/#184 now conflicts with current origin/dev, so it is not ready to land as-is.

Real Result

  • make check → passed
  • uv build → passed
  • make clean-docker && make docker-up && make docker-wait && make test-only → 1272 passed, 63 deselected, 2 warnings in 42.95s
  • uv run pytest tests/autorag_research/cli/test_utils.py tests/autorag_research/pipelines/generation/test_rag_critic_pipeline.py -q → 29 passed in 2.27s
  • uv run python3 - <<'PY' ... RAGCriticPipeline._normalize_string_list("generate_answer") ... PY → ['generate_answer']
  • gh pr view 334 --repo NomaDamas/AutoRAG-Research --json mergeable,mergeStateStatus → {"mergeStateStatus":"DIRTY","mergeable":"CONFLICTING"}
  • git merge-tree $(git merge-base origin/dev HEAD) origin/dev HEAD → conflict markers in autorag_research/pipelines/generation/__init__.py

Concrete evidence
git merge-tree shows autorag_research/pipelines/generation/__init__.py changed on both sides: current origin/dev adds SelfRAGPipeline / SelfRAGPipelineConfig, while this branch adds RAGCriticPipeline / RAGCriticPipelineConfig in the same import/export block. The dry-run merge produces conflict markers there, so the PR cannot be merged cleanly until it is rebased and that file is resolved.

Once the __init__.py conflict is resolved against current dev, I’d be comfortable re-checking for approval.

vkehfdl1 added 2 commits April 9, 2026 01:22
Resolve conflict in pipelines/generation/__init__.py by keeping both
RAGCriticPipeline and SelfRAGPipeline imports.
This moves the local pipeline beyond a simple JSON action loop by adding
an official-style planner/executor path, published critic-checkpoint
integration, and stricter adapters around the public RAG-Critic outputs.

The goal is to reduce the gap with the paper and official repository
while keeping the implementation compatible with AutoRAG-Research's
existing generation/retrieval abstractions and safety expectations.

Constraint: Must fit AutoRAG-Research BaseGenerationPipeline and existing retrieval wrappers
Constraint: Must not add new dependencies or require FlashRAG runtime integration
Rejected: Direct exec of planner-generated Python | too unsafe for repository integration
Rejected: Full FlashRAG port | too invasive for this PR and incompatible with existing pipeline abstractions
Confidence: medium
Scope-risk: moderate
Reversibility: clean
Directive: Keep official-style python_agent execution on the restricted AST path unless a broader sandbox model is introduced
Tested: uv run pytest tests/autorag_research/pipelines/generation/test_rag_critic_pipeline.py -q -k 'not integration'; uv run pytest tests/autorag_research/cli/test_utils.py tests/autorag_research/test_injection.py tests/autorag_research/pipelines/generation/test_rag_critic_pipeline.py -q -k 'not integration'; uv run ruff check autorag_research/pipelines/generation/rag_critic.py tests/autorag_research/pipelines/generation/test_rag_critic_pipeline.py
Not-tested: End-to-end vLLM smoke test against dongguanting/RAG-Critic-3B; full make check remains blocked by existing deptry noise under .omx/review-worktrees

vkehfdl1 commented Apr 9, 2026

Implemented a closer-to-official RAG-Critic integration and pushed it on branch feature/#184 in commit 0050f44.

What changed:

  • added support for the published dongguanting/RAG-Critic-3B critic checkpoint via separate critic_llm
  • added critic_output_format: rag_critic_tags adapter for published Judgement/tag1/tag2/tag3 outputs
  • added planner_output_format: python_agent to follow the official plan_agent.py / execute_agent.py style more closely
  • added a restricted AST-based executor for official-style planner code using the local function surface (Retrieval, RewriteQuery, DecomposeQuery, RefineDoc, GenerateAnswer)
  • kept the earlier hardening fixes for invalid planner top_k and blank rewrites
  • added config example: configs/llm/rag-critic-3b-vllm.yaml
  • expanded regression coverage for published critic parsing and python-agent execution
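A restricted AST-based executor of the kind described above can be sketched roughly as follows. This is illustrative only: the real executor in rag_critic.py may differ, and `ALLOWED_CALLS` simply mirrors the local function surface listed in this comment.

```python
import ast

# Whitelisted planner call names, mirroring the local function surface above.
ALLOWED_CALLS = {"Retrieval", "RewriteQuery", "DecomposeQuery", "RefineDoc", "GenerateAnswer"}


def execute_planner_code(code, functions):
    """Parse planner-emitted Python with ast.parse and dispatch only whitelisted calls.

    Sketch only: no exec/eval is ever used. Any call to a name outside
    ALLOWED_CALLS raises, and arguments must be literals (ast.literal_eval).
    """
    results = []
    tree = ast.parse(code)
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            if not isinstance(node.func, ast.Name) or node.func.id not in ALLOWED_CALLS:
                raise ValueError("disallowed call in planner code")
            args = [ast.literal_eval(arg) for arg in node.args]
            results.append(functions[node.func.id](*args))
    return results
```

Keeping execution on a parse-and-dispatch path like this is what makes planner-generated "code" safe to run inside the repository, compared with directly exec-ing it.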

Validation:

  • uv run pytest tests/autorag_research/pipelines/generation/test_rag_critic_pipeline.py -q -k 'not integration' ✅
  • uv run pytest tests/autorag_research/cli/test_utils.py tests/autorag_research/test_injection.py tests/autorag_research/pipelines/generation/test_rag_critic_pipeline.py -q -k 'not integration' ✅
  • uv run ruff check autorag_research/pipelines/generation/rag_critic.py tests/autorag_research/pipelines/generation/test_rag_critic_pipeline.py ✅
  • make check still hits pre-existing deptry noise under .omx/review-worktrees/..., but lint/type portions relevant to this patch are green

Remaining gap vs official repo:

  • no full FlashRAG runtime/index/corpus port
  • no end-to-end smoke test yet against a live vLLM-served dongguanting/RAG-Critic-3B
  • official offline artifact flow (retrieval_data, previous_answer_data, error_data) is still adapted into AutoRAG runtime state rather than reproduced 1:1

The local .omx directory is workspace/runtime state and should not affect
version control or dependency scanning.

This adds .omx to gitignore and deptry excludes so local orchestration
artifacts stop polluting status output and project-level dependency checks.

Constraint: Must avoid changing unrelated tooling behavior
Rejected: Leave .omx visible and clean it manually | noisy and brittle across sessions
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: Keep transient OMX state excluded unless a specific .omx artifact is intentionally promoted into version control
Tested: git status --short; uv run deptry .
Not-tested: Full make check after this commit