Conversation
Implement BARTScore faithfulness, precision, recall, and F1 variants using a cached BART seq2seq backend. The change also adds TDD coverage, metric configs, and docs so issue #336 can be exercised through the existing generation-evaluation surface.
Constraint: Must match the NeurIPS 2021 BARTScore direction definitions without adding new dependencies
Constraint: Must preserve existing metric APIs and config discovery patterns
Rejected: Shipping only a single BARTScore variant | issue requires faithfulness, precision, recall, and F1
Rejected: Deferring docs/config YAMLs to a follow-up | would leave the metric undiscoverable from normal project workflows
Confidence: high
Scope-risk: moderate
Reversibility: clean
Directive: Keep precision as reference→answer and recall as answer→reference to preserve paper-aligned semantics
Tested: uv run pytest tests/autorag_research/evaluation/metrics/test_generation.py -q
Tested: make check
Tested: uv run mkdocs build --strict
Tested: uv run python - <<'PY' ... BARTScore smoke run on cpu
Tested: make test
Not-tested: Alternate checkpoints beyond facebook/bart-large-cnn
Related: #336
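BARTScore scores a direction x→y as the model's average token log-likelihood of generating y conditioned on x, roughly score(x→y) = (1/|y|) Σ_t log p(y_t | y_<t, x); the four variants differ only in which strings feed each direction. A minimal sketch of that wiring, with a hypothetical `score_fn` standing in for the cached BART backend (the helper name, signature, and the arithmetic-mean F combination are assumptions, not this project's actual API):

```python
from statistics import fmean
from typing import Callable

# score_fn(src, tgt) is assumed to return the average token log-likelihood
# of generating `tgt` conditioned on `src` under a seq2seq model such as BART.
ScoreFn = Callable[[str, str], float]

def bartscore_variants(source: str, answer: str, reference: str,
                       score_fn: ScoreFn) -> dict[str, float]:
    """Hypothetical helper illustrating the four BARTScore directions."""
    precision = score_fn(reference, answer)   # reference -> answer
    recall = score_fn(answer, reference)      # answer -> reference
    return {
        "faithfulness": score_fn(source, answer),  # source -> answer
        "precision": precision,
        "recall": recall,
        # One common convention: combine the two directions by arithmetic mean
        # (scores are negative log-likelihoods, so harmonic mean is not meaningful).
        "f1": fmean([precision, recall]),
    }
```

Injecting the scorer keeps the direction logic testable without loading a checkpoint; the real backend would wrap `facebook/bart-large-cnn` behind the same two-argument interface.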
Round 1 review for PR #342. I ran the requested verification.
Review result
Real Result
Given the code inspection plus the verification above, this looks good to me for this round.
BARTScore now fails fast with an actionable dependency message instead of surfacing raw optional-import failures, and the docs now explain the runtime extra needed for both CPU and accelerator-backed usage. Added regression coverage for dependency guidance and device auto-detection.
Constraint: torch and transformers remain optional extras rather than core dependencies
Rejected: Move BARTScore runtime packages into the default install set | would increase baseline install weight for unrelated workflows
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: Keep BARTScore runtime guards and install docs aligned with the optional dependency group names
Tested: uv run pytest tests/autorag_research/evaluation/metrics/test_generation.py -q
Tested: uv run autorag-research --config-path=<repo>/configs show metrics
Tested: make check
Tested: uv run mkdocs build --strict
Tested: CPU BARTScore smoke run on device=cpu
Tested: make test
Not-tested: package-install flow from a fresh external environment
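The fail-fast guard amounts to a pre-flight availability check that raises one actionable error instead of letting a raw ImportError surface from deep inside torch or transformers. A sketch of the pattern; the helper name and the extras-group shown in the hint are illustrative, not the project's actual identifiers:

```python
import importlib.util
from collections.abc import Iterable

def require_optional_deps(packages: Iterable[str], install_hint: str) -> None:
    """Raise a single actionable ImportError if any optional package is missing.

    Hypothetical helper: checks availability up front so callers see one
    install instruction rather than an opaque import traceback.
    """
    # find_spec returns None for absent top-level packages without importing them.
    missing = [pkg for pkg in packages if importlib.util.find_spec(pkg) is None]
    if missing:
        raise ImportError(
            f"BARTScore requires the optional package(s): {', '.join(missing)}. "
            f"Install them with: {install_hint}"
        )
```

A metric constructor would call something like `require_optional_deps(["torch", "transformers"], "pip install 'autorag-research[bartscore]'")` before touching the model; the extras-group name here is a placeholder for whatever the project's optional dependency group is actually called.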
Follow-up implemented.
Summary:
Verification:
Round 2 review for PR #342. I ran the requested verification.
Review result
Real Result
Given the code inspection plus the verification above, this still looks good to me for this round.
Final follow-up verification completed.
Summary:
Verification:
Round 3 review for PR #342. I ran the requested verification.
Review result
Real Result
Given the code inspection plus the verification above, this looks good to me for the final round.
Final follow-up verification completed.
Summary:
Verification:
APPROVE — the PR head is still the reviewed commit.
Short reason
Real Result
Concrete evidence
Merged origin/dev into feature/#336 and resolved the overlapping generation metric changes by keeping both the BARTScore additions from the PR branch and the UniEval additions already landed on dev.
Constraint: PR #342 must stay reviewable without dropping either metric family
Rejected: Take the dev side of the conflict wholesale | would remove the new BARTScore API surface from this PR
Rejected: Take the feature side of the conflict wholesale | would drop UniEval exports and configs already present on dev
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: Keep generation metric exports additive when multiple metric families land in parallel
Tested: pytest generation metrics
Tested: pytest CLI metric discovery
Tested: pytest UniEval metrics
Tested: autorag-research show metrics
Tested: make check
Tested: mkdocs build --strict
Tested: BARTScore CPU smoke run
Tested: make test
Not-tested: GPU-backed BARTScore execution
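The "keep exports additive" directive amounts to resolving the conflicted export list as an order-preserving union of both sides rather than picking one side wholesale. A sketch with illustrative metric names (the helper and the lists are stand-ins, not the project's real module code):

```python
def merge_exports_additively(ours: list[str], theirs: list[str]) -> list[str]:
    """Order-preserving union: keep every name from both sides, drop duplicates."""
    seen: set[str] = set()
    merged: list[str] = []
    for name in [*ours, *theirs]:
        if name not in seen:
            seen.add(name)
            merged.append(name)
    return merged

# Illustrative stand-ins for the two sides of the conflicted export list.
dev_side = ["unieval_faithfulness", "unieval_coherence"]
feature_side = ["bartscore_faithfulness", "bartscore_precision",
                "bartscore_recall", "bartscore_f1"]
resolved = merge_exports_additively(dev_side, feature_side)
```

The same union rule applies to the config YAML discovery lists: both metric families remain registered, so neither branch's API surface is lost in the merge.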
Merged.
Verified after the merge update:
Pushed the resolved branch tip.
REJECT — the current PR head is no longer mergeable with the base branch.
Short reason
Real Result
Concrete evidence
Summary
Verification
Closes #336