
Feature/#184 #334

Open
vkehfdl1 wants to merge 7 commits into dev from feature/#184

Conversation


@vkehfdl1 vkehfdl1 commented Apr 6, 2026

Summary

  • add a repo-idiomatic RAGCriticPipeline and config that model a critic → planner → executor refinement loop over the existing generation pipeline abstraction
  • support rewrite, retrieval, decomposition, refinement, and answer-regeneration actions with structured metadata for auditability
  • add focused RAG-Critic tests covering JSON parsing, config persistence, multi-action correction flow, working-query retrieval defaults, and end-to-end pipeline persistence/token aggregation
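As a rough illustration, the critic → planner → executor refinement loop described above can be sketched as follows. This is a hedged sketch only: the function names, the verdict/action shapes, and `MAX_ROUNDS` are illustrative assumptions, not the actual `RAGCriticPipeline` API.

```python
MAX_ROUNDS = 3  # illustrative bound on corrective rounds, not the repo's actual config field


def run_critic_loop(query, retrieve, generate, critique, plan):
    """Bounded corrective loop: draft an answer, critique it, plan and execute fixes.

    Sketch of the refinement pattern; the real pipeline builds on the repo's
    BaseGenerationPipeline abstraction and supports more action types.
    """
    docs = retrieve(query)
    answer = generate(query, docs)
    for _ in range(MAX_ROUNDS):
        verdict = critique(query, docs, answer)  # e.g. {"correct": bool, "recommended_actions": [...]}
        if verdict.get("correct"):
            break
        for action in plan(verdict):  # e.g. [{"action": "retrieval"}, {"action": "generate_answer"}]
            kind = action.get("action")
            if kind == "retrieval":
                docs = retrieve(query)
            elif kind == "generate_answer":
                answer = generate(query, docs)
    return answer
```

The loop terminates either when the critic accepts the answer or when the round budget is exhausted, which is what keeps the corrective process bounded.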

Verification

  • make check
  • uv build
  • make test-only

Add a repo-idiomatic RAG-Critic generation pipeline with a critic, planner,
and bounded corrective action loop over the existing retrieval/generation
abstractions. The test suite locks helper parsing, multi-action execution,
and service-layer persistence so the new pipeline behaves like the rest of
the framework.

Constraint: Must fit the existing BaseGenerationPipeline + retrieval composition model
Constraint: No new dependencies or config-system changes
Rejected: Full paper-faithful multi-agent reproduction | too broad for the current pipeline abstraction and issue scope
Rejected: A single-pass critic-only wrapper | would miss the paper's planner/executor correction loop
Confidence: medium
Scope-risk: moderate
Reversibility: clean
Directive: Keep planner/critic outputs JSON-based unless you also harden parser fallbacks and tests together
Tested: uv run pytest -q tests/autorag_research/pipelines/generation/test_rag_critic_pipeline.py; make check; uv build; make test-only
Not-tested: Live LLM compliance with the structured prompt contracts
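One hedged way to implement the kind of parser fallback the directive above refers to — extracting a JSON payload from free-form LLM text — is sketched below. This is illustrative only; the repo's actual `_parse_json_payload` may be shaped differently.

```python
import json
import re


def parse_json_payload(text):
    """Best-effort extraction of the first JSON object or array from LLM output.

    Sketch only. Returns the parsed payload, or None when no valid JSON
    object/array can be found.
    """
    # Fast path: the whole response is already valid JSON.
    try:
        return json.loads(text)
    except (json.JSONDecodeError, TypeError):
        pass
    # Fallback: grab the first {...} or [...] span, e.g. inside markdown fences.
    match = re.search(r"(\{.*\}|\[.*\])", text or "", flags=re.DOTALL)
    if match:
        try:
            return json.loads(match.group(1))
        except json.JSONDecodeError:
            return None
    return None
```

Note that a parser like this can legitimately return a top-level list, so every caller must be prepared for list-or-dict results.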

vkehfdl1 commented Apr 6, 2026

I ran the $code-review workflow and verified this locally in a PR worktree.

Findings

  1. [HIGH] The new pipeline is not actually discoverable/configurable through the repo’s normal config path.
    RAGCriticPipelineConfig is added in autorag_research/pipelines/generation/rag_critic.py:117-150, but pipeline discovery only reads YAMLs under configs/pipelines/generation (autorag_research/cli/utils.py:56-66). This PR does not add a matching configs/pipelines/generation/rag_critic.yaml, so the new pipeline never shows up in the config-driven CLI flow. I verified that locally with:

    • uv run --python 3.13 python - <<'PY' ... discover_pipelines('generation') ... PY → rag_critic present: False

    Given the PR summary explicitly says it adds a repo-idiomatic pipeline and config, this leaves the feature incomplete for normal users. Please add the YAML config (and ideally a small discovery regression test).

  2. [HIGH] The JSON parser accepts top-level arrays, but the callers unconditionally treat the result as a dict and crash.
    _parse_json_payload() explicitly extracts either an object or an array in autorag_research/pipelines/generation/rag_critic.py:200-209, but _plan_actions() immediately does payload.get("actions", []) at autorag_research/pipelines/generation/rag_critic.py:331-345 and _decompose_query() does payload.get("sub_questions", []) at autorag_research/pipelines/generation/rag_critic.py:369-374. If an LLM returns a valid top-level JSON array (which is common in practice), the corrective loop raises AttributeError instead of falling back.

    I reproduced that locally with a mocked planner response of '[{"action": "generate_answer"}]':

    • uv run --python 3.13 python - <<'PY' ... await pipeline._plan_actions(...) ... PY → AttributeError: 'list' object has no attribute 'get'

    Please normalize list payloads into the expected dict shapes (or reject non-dicts before .get/.setdefault) and add a regression test.
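A minimal shape for the suggested normalization — wrapping a top-level array under the expected key before any `.get(...)` call — could look like this (a sketch; the actual fix in the PR may differ):

```python
def normalize_payload(payload, list_key):
    """Coerce a parsed LLM payload into a dict holding a list under `list_key`.

    Sketch of the suggested fix: a dict passes through unchanged, a top-level
    JSON array is wrapped under `list_key`, and any other type degrades to an
    empty list instead of raising AttributeError downstream.
    """
    if isinstance(payload, dict):
        return payload
    if isinstance(payload, list):
        return {list_key: payload}
    return {list_key: []}
```

The planner path would then call `normalize_payload(parsed, "actions")` and the decomposition path `normalize_payload(parsed, "sub_questions")` before reading fields.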

Real Result

  • uv run --python 3.13 pytest -q tests/autorag_research/pipelines/generation/test_rag_critic_pipeline.py → 5 passed in 38.96s
  • uv run --python 3.13 ruff check autorag_research/pipelines/generation/rag_critic.py tests/autorag_research/pipelines/generation/test_rag_critic_pipeline.py → All checks passed!
  • uv run --python 3.13 python - <<'PY' ... discover_pipelines('generation') ... PY → rag_critic present: False
  • uv run --python 3.13 python - <<'PY' ... mocked planner returns a top-level array ... PY → AttributeError: 'list' object has no attribute 'get'

Recommendation: REQUEST CHANGES

The review surfaced two user-facing gaps in the new generation pipeline:
config-driven discovery could not find RAG-Critic because the repo lacked
its generation YAML, and planner/decomposition JSON parsing crashed when an
LLM returned a valid top-level array instead of an object. This adds the
repo-standard config entry, hardens both callers to accept list-or-dict
payloads, and locks the behavior with focused regressions.

Constraint: Pipeline discovery is driven only by YAML files under configs/pipelines/generation
Constraint: LLM JSON responses may legitimately be top-level arrays
Rejected: Restrict _parse_json_payload to objects only | would reject valid planner/decomposition outputs already tolerated by the parser
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: Keep rag_critic.yaml aligned with RAGCriticPipelineConfig whenever config fields change
Tested: uv run pytest tests/autorag_research/cli/test_utils.py tests/autorag_research/pipelines/generation/test_rag_critic_pipeline.py -q
Tested: uv run python snippet verifying discover_pipelines('generation') includes rag_critic and planner array payload no longer crashes
Tested: make check
Tested: uv build
Tested: make clean-docker && make docker-up && make docker-wait && make test-only
Not-tested: Manual CLI invocation of a generation run using the new rag_critic config

vkehfdl1 commented Apr 6, 2026

Addressed the review feedback on feature/#184:

  • added configs/pipelines/generation/rag_critic.yaml so discover_pipelines("generation") now exposes rag_critic through the normal config-driven CLI flow
  • hardened RAGCriticPipeline planner and decomposition parsing so valid top-level JSON arrays are normalized instead of crashing on .get(...)
  • added focused regressions for generation discovery plus planner/decomposition array payload handling

Verification:

  • make check
  • uv build
  • make clean-docker && make docker-up && make docker-wait && make test-only
  • uv run pytest tests/autorag_research/cli/test_utils.py tests/autorag_research/pipelines/generation/test_rag_critic_pipeline.py -q


vkehfdl1 commented Apr 6, 2026

Round 2 review on feature/#184 vs origin/dev after re-running the $code-review workflow locally.

I re-checked the two blocking issues from round 1 and did not find any additional blocking problems in the current patch.

  • The pipeline is now discoverable through the normal config path via configs/pipelines/generation/rag_critic.yaml, and the generation discovery regression is covered in tests/autorag_research/cli/test_utils.py:141-142.
  • The planner/decomposition parsing path now handles top-level JSON arrays without dict-only .get(...) crashes in autorag_research/pipelines/generation/rag_critic.py:338-352 and 376-386, with targeted regression coverage in tests/autorag_research/pipelines/generation/test_rag_critic_pipeline.py:67-133.

Real Result

  • make check → passed
  • uv build → passed
  • uv run pytest tests/autorag_research/cli/test_utils.py tests/autorag_research/pipelines/generation/test_rag_critic_pipeline.py -q → 26 passed in 2.74s
  • uv run pytest tests/autorag_research/pipelines/generation/test_rag_critic_pipeline.py -q -k 'top_level_array_payload or decomposition_array_payload' → 2 passed, 5 deselected in 2.75s
  • uv run python - <<'PY' ... discover_pipelines('generation') ... PY → rag_critic present: True

Recommendation: APPROVE


The approved feature branch already fixed discoverability and top-level
array parsing, but the unit tests still exercised DB-backed pipeline
construction for helper-level behavior. This cleanup keeps the tested
behavior intact while extracting shared payload-list normalization in the
pipeline and switching helper-focused unit tests to a lightweight
non-persistent fixture, which makes the verification path more robust and
faster without changing the feature contract.

Constraint: Must preserve the approved feature behavior and repo verification flow
Rejected: Leave helper tests DB-backed | unnecessary coupling to container state for unit-only behavior
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: Keep helper-focused RAG-Critic tests persistence-free unless the behavior under test requires repository integration
Tested: uv run pytest tests/autorag_research/pipelines/generation/test_rag_critic_pipeline.py -q; make check; uv build; make clean-docker && make docker-up && make docker-wait && make test-only; architect verification
Not-tested: Additional LLM prompt-quality scenarios beyond the existing regression coverage

vkehfdl1 commented Apr 6, 2026

Follow-up on feature/#184:

  • kept the approved RAG-Critic discoverability + array-payload fix set intact on PR Feature/#184 #334
  • simplified RAGCriticPipeline payload handling by extracting shared top-level-array/keyed-list normalization
  • converted helper-focused RAG-Critic unit tests to a lightweight non-persistent fixture so the array/config coverage no longer depends on DB-backed pipeline setup

Verification:

  • uv run pytest tests/autorag_research/pipelines/generation/test_rag_critic_pipeline.py -q
  • make check
  • uv build
  • make clean-docker && make docker-up && make docker-wait && make test-only
  • architect verification: APPROVED


vkehfdl1 commented Apr 6, 2026

Round 3 review on feature/#184 vs origin/dev after re-running the $code-review workflow locally.

I re-checked the approved discoverability + array-payload fix set after the follow-up simplification in aba22de, and I did not find any new blocking issues.

  • The shared top-level-array normalization in autorag_research/pipelines/generation/rag_critic.py:202-222 keeps the planner/decomposition paths simple while preserving the round-1 fix at autorag_research/pipelines/generation/rag_critic.py:333-389.
  • The helper-focused unit coverage now uses the lightweight non-persistent fixture in tests/autorag_research/pipelines/generation/test_rag_critic_pipeline.py:21-54, while the DB-backed integration path is still exercised by tests/autorag_research/pipelines/generation/test_rag_critic_pipeline.py:319-365.
  • The config-driven discovery path remains wired through configs/pipelines/generation/rag_critic.yaml and tests/autorag_research/cli/test_utils.py:139-143.

Real Result

  • make check → passed
  • uv build → passed
  • make clean-docker && make docker-up && make docker-wait → PostgreSQL test environment recreated successfully
  • uv run pytest tests/autorag_research/cli/test_utils.py tests/autorag_research/pipelines/generation/test_rag_critic_pipeline.py -q → 27 passed in 2.51s
  • make test-only → 1270 passed, 63 deselected, 2 warnings in 47.64s
  • uv run python - <<'PY' ... discover_pipelines('generation') ... PY → rag_critic present: True
  • uv run python - <<'PY' ... top-level planner/decomposition arrays ... PY → actions: [{'action': 'retrieval'}, {'action': 'generate_answer'}]; sub_questions: ['What is RAG?', 'How does the critic refine it?']

Recommendation: APPROVE

The approved RAG-Critic follow-up already covered config discovery and
list-shaped planner/decomposition payloads, but a singleton
recommended_actions string still reached fallback paths and expanded into
per-character actions. Normalize string-or-list action metadata before
storing critique state or building planner fallbacks, and lock the fix
with focused regressions.
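The string-or-list normalization described above (later exercised as `RAGCriticPipeline._normalize_string_list` in the round-4 verification) can be sketched roughly like this; the exact implementation in the PR may differ:

```python
def normalize_string_list(value):
    """Coerce string-or-list action metadata into a clean list of strings.

    Sketch of the fix: a bare string becomes a one-element list (rather than
    being iterated per character), list entries that are not strings are
    dropped, and any other type degrades to an empty list.
    """
    if isinstance(value, str):
        return [value]
    if isinstance(value, list):
        return [item for item in value if isinstance(item, str)]
    return []
```

Centralizing this helper lets both the stored critique metadata and the planner fallback path share one definition of "list of actions".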

Constraint: LLM structured outputs may return singleton strings even when prompts ask for lists
Rejected: Coerce strings only in planner fallback | would leave critique metadata inconsistent for audit traces
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: Keep critique/planner payload normalization centralized so fallback action handling stays consistent with stored metadata
Tested: uv run pytest -q tests/autorag_research/pipelines/generation/test_rag_critic_pipeline.py -k 'string_recommended_actions'; uv run pytest -q tests/autorag_research/cli/test_utils.py tests/autorag_research/pipelines/generation/test_rag_critic_pipeline.py; uv run python - <<'PY' ... singleton recommended_actions repro ... PY; make check; uv build; make clean-docker && make docker-up && make docker-wait && make test-only
Not-tested: Malformed non-string/non-list recommended_actions payloads beyond existing fallback behavior

vkehfdl1 commented Apr 6, 2026

Follow-up landed on feature/#184:

  • normalized singleton recommended_actions payloads in RAGCriticPipeline so critique metadata and planner fallbacks keep list semantics instead of expanding into per-character actions
  • added focused regressions for critic normalization and planner fallback handling when the model returns a singleton action string
  • re-verified the branch with targeted RAG-Critic tests, direct helper repro, make check, uv build, and a clean-docker make test-only run


vkehfdl1 commented Apr 6, 2026

REJECT — the RAG-Critic changes test cleanly, but PR #334 is currently not mergeable against dev.

Short reason
The feature itself looks good after the follow-up fixes, but feature/#184 now conflicts with current origin/dev, so it is not ready to land as-is.

Real Result

  • make check → passed
  • uv build → passed
  • make clean-docker && make docker-up && make docker-wait && make test-only → 1272 passed, 63 deselected, 2 warnings in 42.95s
  • uv run pytest tests/autorag_research/cli/test_utils.py tests/autorag_research/pipelines/generation/test_rag_critic_pipeline.py -q → 29 passed in 2.27s
  • uv run python3 - <<'PY' ... RAGCriticPipeline._normalize_string_list("generate_answer") ... PY → ['generate_answer']
  • gh pr view 334 --repo NomaDamas/AutoRAG-Research --json mergeable,mergeStateStatus → {"mergeStateStatus":"DIRTY","mergeable":"CONFLICTING"}
  • git merge-tree $(git merge-base origin/dev HEAD) origin/dev HEAD → conflict markers in autorag_research/pipelines/generation/__init__.py

Concrete evidence
git merge-tree shows autorag_research/pipelines/generation/__init__.py changed on both sides: current origin/dev adds SelfRAGPipeline / SelfRAGPipelineConfig, while this branch adds RAGCriticPipeline / RAGCriticPipelineConfig in the same import/export block. The dry-run merge produces conflict markers there, so the PR cannot be merged cleanly until it is rebased and that file is resolved.

Once the __init__.py conflict is resolved against current dev, I’d be comfortable re-checking for approval.

vkehfdl1 added 2 commits April 9, 2026 01:22
Resolve conflict in pipelines/generation/__init__.py by keeping both
RAGCriticPipeline and SelfRAGPipeline imports.
This moves the local pipeline beyond a simple JSON action loop by adding
an official-style planner/executor path, published critic-checkpoint
integration, and stricter adapters around the public RAG-Critic outputs.

The goal is to reduce the gap with the paper and official repository
while keeping the implementation compatible with AutoRAG-Research's
existing generation/retrieval abstractions and safety expectations.

Constraint: Must fit AutoRAG-Research BaseGenerationPipeline and existing retrieval wrappers
Constraint: Must not add new dependencies or require FlashRAG runtime integration
Rejected: Direct exec of planner-generated Python | too unsafe for repository integration
Rejected: Full FlashRAG port | too invasive for this PR and incompatible with existing pipeline abstractions
Confidence: medium
Scope-risk: moderate
Reversibility: clean
Directive: Keep official-style python_agent execution on the restricted AST path unless a broader sandbox model is introduced
Tested: uv run pytest tests/autorag_research/pipelines/generation/test_rag_critic_pipeline.py -q -k 'not integration'; uv run pytest tests/autorag_research/cli/test_utils.py tests/autorag_research/test_injection.py tests/autorag_research/pipelines/generation/test_rag_critic_pipeline.py -q -k 'not integration'; uv run ruff check autorag_research/pipelines/generation/rag_critic.py tests/autorag_research/pipelines/generation/test_rag_critic_pipeline.py
Not-tested: End-to-end vLLM smoke test against dongguanting/RAG-Critic-3B; full make check remains blocked by existing deptry noise under .omx/review-worktrees

vkehfdl1 commented Apr 9, 2026

Implemented a closer-to-official RAG-Critic integration and pushed it on branch feature/#184 in commit 0050f44.

What changed:

  • added support for the published dongguanting/RAG-Critic-3B critic checkpoint via separate critic_llm
  • added critic_output_format: rag_critic_tags adapter for published Judgement/tag1/tag2/tag3 outputs
  • added planner_output_format: python_agent to follow the official plan_agent.py / execute_agent.py style more closely
  • added a restricted AST-based executor for official-style planner code using the local function surface (Retrieval, RewriteQuery, DecomposeQuery, RefineDoc, GenerateAnswer)
  • kept the earlier hardening fixes for invalid planner top_k and blank rewrites
  • added config example: configs/llm/rag-critic-3b-vllm.yaml
  • expanded regression coverage for published critic parsing and python-agent execution
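A restricted AST-based executor of the kind described above can be sketched roughly as follows. This is illustrative only: the real executor in rag_critic.py may differ, and `ALLOWED_CALLS` simply mirrors the local function surface listed in this comment.

```python
import ast

# Whitelisted planner call names, mirroring the local function surface above.
ALLOWED_CALLS = {"Retrieval", "RewriteQuery", "DecomposeQuery", "RefineDoc", "GenerateAnswer"}


def execute_planner_code(code, functions):
    """Parse planner-emitted Python with ast.parse and dispatch only whitelisted calls.

    Sketch only: no exec/eval is ever used. Any call to a name outside
    ALLOWED_CALLS raises, and arguments must be literals (ast.literal_eval).
    """
    results = []
    tree = ast.parse(code)
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            if not isinstance(node.func, ast.Name) or node.func.id not in ALLOWED_CALLS:
                raise ValueError("disallowed call in planner code")
            args = [ast.literal_eval(arg) for arg in node.args]
            results.append(functions[node.func.id](*args))
    return results
```

Keeping execution on a parse-and-dispatch path like this is what makes planner-generated "code" safe to run inside the repository, compared with directly exec-ing it.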

Validation:

  • uv run pytest tests/autorag_research/pipelines/generation/test_rag_critic_pipeline.py -q -k 'not integration' ✅
  • uv run pytest tests/autorag_research/cli/test_utils.py tests/autorag_research/test_injection.py tests/autorag_research/pipelines/generation/test_rag_critic_pipeline.py -q -k 'not integration' ✅
  • uv run ruff check autorag_research/pipelines/generation/rag_critic.py tests/autorag_research/pipelines/generation/test_rag_critic_pipeline.py ✅
  • make check still hits pre-existing deptry noise under .omx/review-worktrees/..., but lint/type portions relevant to this patch are green

Remaining gap vs official repo:

  • no full FlashRAG runtime/index/corpus port
  • no end-to-end smoke test yet against a live vLLM-served dongguanting/RAG-Critic-3B
  • official offline artifact flow (retrieval_data, previous_answer_data, error_data) is still adapted into AutoRAG runtime state rather than reproduced 1:1

The local .omx directory is workspace/runtime state and should not affect
version control or dependency scanning.

This adds .omx to gitignore and deptry excludes so local orchestration
artifacts stop polluting status output and project-level dependency checks.

Constraint: Must avoid changing unrelated tooling behavior
Rejected: Leave .omx visible and clean it manually | noisy and brittle across sessions
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: Keep transient OMX state excluded unless a specific .omx artifact is intentionally promoted into version control
Tested: git status --short; uv run deptry .
Not-tested: Full make check after this commit