
Feature/#336 #342

Open
vkehfdl1 wants to merge 3 commits into dev from feature/#336

Conversation

Collaborator

vkehfdl1 commented Apr 6, 2026

Summary

  • add BARTScore generation metrics for faithfulness, precision, recall, and F1
  • expose new metric configs/YAMLs and document the metric in MkDocs
  • add TDD coverage plus a CPU smoke run for the new metric family

Verification

  • uv run pytest tests/autorag_research/evaluation/metrics/test_generation.py -q
  • make check
  • uv run mkdocs build --strict
  • uv run python - <<'PY' ... BARTScore smoke run on cpu
  • make test

Closes #336

Implement BARTScore faithfulness, precision, recall, and F1 variants using a cached BART seq2seq backend. The change also adds TDD coverage, metric configs, and docs so issue #336 can be exercised through the existing generation-evaluation surface.

Constraint: Must match the NeurIPS 2021 BARTScore direction definitions without adding new dependencies
Constraint: Must preserve existing metric APIs and config discovery patterns
Rejected: Shipping only a single BARTScore variant | issue requires faithfulness, precision, recall, and F1
Rejected: Deferring docs/config YAMLs to a follow-up | would leave the metric undiscoverable from normal project workflows
Confidence: high
Scope-risk: moderate
Reversibility: clean
Directive: Keep precision as reference→answer and recall as answer→reference to preserve paper-aligned semantics
Tested: uv run pytest tests/autorag_research/evaluation/metrics/test_generation.py -q
Tested: make check
Tested: uv run mkdocs build --strict
Tested: uv run python - <<'PY' ... BARTScore smoke run on cpu
Tested: make test
Not-tested: Alternate checkpoints beyond facebook/bart-large-cnn
Related: #336

vkehfdl1 commented Apr 6, 2026

Round 1 review for PR #342 (feature/#336)

I ran the requested $code-review workflow, inspected the local diff against origin/dev...HEAD, and performed real verification locally.

Review result

  • No blocking issues found in the BARTScore implementation, config exposure, or docs changes.
  • The direction wiring is consistent with the PR/issue intent:
    • faithfulness = context → answer
    • precision = reference → answer
    • recall = answer → reference
    • F1 = mean of precision and recall (as documented in this PR)
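
The direction wiring above can be sketched independently of the model backend. In the snippet below, the `pairwise_scorer` callable stands in for the cached BART seq2seq backend; the function and argument names are illustrative, not the PR's actual API:

```python
# Sketch of the BARTScore direction wiring, assuming a pairwise scorer
# that returns one score per (source, target) pair.
from statistics import mean
from typing import Callable, Sequence

Scorer = Callable[[Sequence[str], Sequence[str]], list[float]]

def bart_score_family(
    contexts: Sequence[str],
    answers: Sequence[str],
    references: Sequence[str],
    pairwise_scorer: Scorer,
) -> dict[str, list[float]]:
    faithfulness = pairwise_scorer(contexts, answers)   # context -> answer
    precision = pairwise_scorer(references, answers)    # reference -> answer
    recall = pairwise_scorer(answers, references)       # answer -> reference
    f1 = [mean(pr) for pr in zip(precision, recall)]    # mean of P and R
    return {
        "faithfulness": faithfulness,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

With the real backend, each score is the mean log-likelihood of the target tokens given the source under BART, which is why all reported values are negative and higher (closer to zero) is better.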

Real Result

  • uv run pytest tests/autorag_research/evaluation/metrics/test_generation.py -q
    • Result: 14 passed in 13.81s
  • uv run pytest tests/autorag_research/cli/test_utils.py -q -k discover_metrics
    • Result: 2 passed, 17 deselected in 0.01s
  • make check
    • Result: passed (pre-commit, ty, deptry all clean)
  • uv run mkdocs build --strict
    • Result: built successfully in 4.24s
  • CPU smoke run (facebook/bart-large-cnn, device=cpu)
    • identical text: faithfulness/precision/recall/f1 = [-0.3970629870891571]
    • precision sanity check: same = [-0.3970629870891571], wrong = [-2.7505297660827637]

Given the code inspection plus the verification above, this looks good to me for this round.

BARTScore now fails fast with an actionable dependency message instead of
surfacing raw optional-import failures, and the docs now explain the
runtime extra needed for both CPU and accelerator-backed usage. Added
regression coverage for dependency guidance and device auto-detection.

Constraint: torch and transformers remain optional extras rather than core dependencies
Rejected: Move BARTScore runtime packages into the default install set | would increase baseline install weight for unrelated workflows
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: Keep BARTScore runtime guards and install docs aligned with the optional dependency group names
Tested: uv run pytest tests/autorag_research/evaluation/metrics/test_generation.py -q
Tested: uv run autorag-research --config-path=<repo>/configs show metrics
Tested: make check
Tested: uv run mkdocs build --strict
Tested: CPU BARTScore smoke run on device=cpu
Tested: make test
Not-tested: package-install flow from a fresh external environment

vkehfdl1 commented Apr 6, 2026

Follow-up implemented on feature/#336.

Summary:

  • added a BARTScore runtime guard with actionable install guidance for missing torch / transformers
  • added regression coverage for optional-dependency guidance and safer auto-device fallback when torch.backends.mps is unavailable
  • updated README + MkDocs install/discoverability docs so BARTScore is documented on the main install path and concept pages
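
The safer auto-device fallback can be sketched as below. The torch module is passed as a parameter here purely to keep the sketch self-contained and testable; the real code presumably imports torch directly, and the function name is an assumption:

```python
# Sketch of auto-device selection that tolerates torch builds where
# torch.backends.mps does not exist (the failure mode fixed above).
def auto_device(torch_module) -> str:
    """Prefer cuda, then mps, then cpu."""
    if torch_module.cuda.is_available():
        return "cuda"
    # getattr guards against builds without the mps backend attribute
    mps = getattr(torch_module.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"
```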

Verification:

  • uv run pytest tests/autorag_research/evaluation/metrics/test_generation.py -q → 19 passed
  • uv run autorag-research --config-path=<repo>/configs show metrics | rg 'bart_score_' → all four BARTScore configs listed
  • make check → passed
  • uv run mkdocs build --strict → passed
  • CPU smoke run (device=cpu) → faithfulness/precision/recall/f1 all executed successfully; wrong precision scored lower than identical text
  • make test → 1309 passed, 64 deselected


vkehfdl1 commented Apr 6, 2026

Round 2 review for PR #342 (feature/#336)

I ran the requested $code-review workflow, reviewed the local diff against origin/dev...HEAD, and re-verified the PR in an isolated worktree at commit 60f7193.

Review result

  • No blocking issues found in the BARTScore implementation, config exposure, or docs follow-up.
  • The optional-dependency/runtime guard looks good, and the auto-device fallback now handles missing torch.backends.mps safely.
  • I did not find correctness, regression, or discoverability problems in this round.

Real Result

  • uv run pytest tests/autorag_research/evaluation/metrics/test_generation.py -q
    • Result: 19 passed in 53.07s
  • uv run pytest tests/autorag_research/cli/test_utils.py -q -k discover_metrics
    • Result: 2 passed, 17 deselected in 0.02s
  • uv run autorag-research --config-path=<repo>/configs show metrics | rg 'bart_score_'
    • Result: all four BARTScore configs were discoverable:
      • bart_score_f1
      • bart_score_faithfulness
      • bart_score_precision
      • bart_score_recall
  • make check
    • Result: passed
  • uv run mkdocs build --strict
    • Result: passed; docs built successfully in 3.59s
  • CPU smoke run (device=cpu)
    • Result:
      • faithfulness = [-0.9646367430686951]
      • precision_same = [-0.5200337767601013]
      • recall_same = [-0.6211605072021484]
      • f1_same = [-0.5705971419811249]
      • precision_wrong = [-1.432390809059143]
    • Sanity check: the mismatched answer scored lower than the matching answer for precision
  • make test
    • Result: 1309 passed, 64 deselected in 59.45s

Given the code inspection plus the verification above, this still looks good to me for this round.


vkehfdl1 commented Apr 6, 2026

Final follow-up verification completed on feature/#336.

Summary:

  • re-verified the existing BARTScore faithfulness, precision, recall, and F1 implementation at 60f7193
  • confirmed metric discovery, docs exposure, optional dependency guidance, and CPU execution behavior remain correct
  • no further code changes were required after final architect review

Verification:

  • uv run pytest tests/autorag_research/evaluation/metrics/test_generation.py -q → 19 passed in 14.29s
  • uv run pytest tests/autorag_research/cli/test_utils.py -q -k discover_metrics → 2 passed, 17 deselected in 0.01s
  • uv run autorag-research --config-path=/Users/jeffrey/Projects/AutoRAG-Research-issue-336/configs show metrics | rg 'bart_score_' → all four BARTScore configs listed
  • make check → passed
  • uv run mkdocs build --strict → passed
  • CPU smoke run (device=cpu) → faithfulness/precision/recall/f1 executed successfully; wrong precision scored lower than identical text
  • make test → 1309 passed, 64 deselected
  • architect verification → APPROVED; no blocking issues found


vkehfdl1 commented Apr 6, 2026

Round 3 review for PR #342 (feature/#336)

I ran the requested $code-review workflow, reviewed the local diff against origin/dev...HEAD, and re-verified the PR in an isolated worktree at commit 60f7193.

Review result

  • No blocking issues found in the BARTScore implementation, config exposure, or documentation updates.
  • The metric direction wiring remains correct and consistent with the PR intent:
    • faithfulness = context → answer
    • precision = reference → answer
    • recall = answer → reference
    • F1 = mean of precision and recall
  • The optional dependency guidance, metric discoverability, and CPU execution behavior all checked out in this final round.

Real Result

  • uv run pytest tests/autorag_research/evaluation/metrics/test_generation.py -q
    • Result: 19 passed in 51.27s
  • uv run pytest tests/autorag_research/cli/test_utils.py -q -k discover_metrics
    • Result: 2 passed, 17 deselected in 0.01s
  • uv run autorag-research --config-path=/tmp/autorag-pr342-round3/configs show metrics | rg 'bart_score_'
    • Result: all four BARTScore configs were discoverable:
      • bart_score_f1
      • bart_score_faithfulness
      • bart_score_precision
      • bart_score_recall
  • make check
    • Result: passed (pre-commit, ty, deptry all clean)
  • uv run mkdocs build --strict
    • Result: passed; docs built successfully in 54.24 seconds
  • CPU smoke run (device=cpu, batch_size=1)
    • Result:
      • faithfulness = [-0.3970629870891571]
      • precision_same = [-0.3970629870891571]
      • recall_same = [-0.3970629870891571]
      • f1_same = [-0.3970629870891571]
      • precision_wrong = [-2.7505297660827637]
    • Sanity check: the mismatched answer scored lower than the matching answer for precision
  • make test
    • Result: 1309 passed, 64 deselected in 45.15s

Given the code inspection plus the verification above, this looks good to me for the final round.


vkehfdl1 commented Apr 6, 2026

Final follow-up verification completed on feature/#336.

Summary:

  • re-verified the existing BARTScore faithfulness, precision, recall, and F1 implementation at 60f7193
  • confirmed metric discovery, docs exposure, optional dependency guidance, and CPU execution behavior remain correct
  • no further code changes were required after final architect sign-off; branch feature/#336 remains aligned with PR Feature/#336 #342 targeting dev

Verification:

  • uv run pytest tests/autorag_research/evaluation/metrics/test_generation.py -q → 19 passed in 13.28s
  • uv run pytest tests/autorag_research/cli/test_utils.py -q -k discover_metrics → 2 passed, 17 deselected in 0.01s
  • uv run autorag-research --config-path=/Users/jeffrey/Projects/AutoRAG-Research-issue-336/configs show metrics | rg 'bart_score_' → all four BARTScore configs listed
  • make check → passed
  • uv run mkdocs build --strict → passed
  • CPU smoke run (device=cpu, batch_size=1) → faithfulness/precision/recall/f1 executed successfully; wrong precision scored lower than identical text
  • make test → 1309 passed, 64 deselected in 45.39s
  • architect verification → APPROVED; no blocking issues found


vkehfdl1 commented Apr 6, 2026

APPROVE — PR head is still 60f7193, the same commit reviewed in the last rounds, and the rerun checks I performed on that exact revision all passed.

Short reason

  • The change stays tightly scoped to the new BARTScore metric family, its config exposure, tests, and docs, and I did not find new blocking issues on the final head.

Real Result

  • git diff --stat origin/dev...HEAD on 60f7193 shows a focused 15-file change set: BARTScore implementation/config/docs/tests (650 insertions, 5 deletions).
  • uv run pytest tests/autorag_research/evaluation/metrics/test_generation.py -q → 19 passed in 52.16s
  • uv run pytest tests/autorag_research/cli/test_utils.py -q -k discover_metrics → 2 passed, 17 deselected in 0.02s
  • uv run autorag-research --config-path=/tmp/autorag-pr342-final/configs show metrics | rg 'bart_score_' → all four configs are discoverable:
    • bart_score_f1
    • bart_score_faithfulness
    • bart_score_precision
    • bart_score_recall
  • uv run mkdocs build --strict → passed; docs built in 3.85 seconds

Concrete evidence

  • The current PR head from GitHub is 60f719366821ef332ba41f3cfcf1f5a23a392d22, which matches the already-reviewed final revision.
  • The diff remains limited to:
    • autorag_research/evaluation/metrics/generation.py
    • four new configs/metrics/generation/bart_score_*.yaml
    • docs/MkDocs navigation updates
    • tests/autorag_research/evaluation/metrics/test_generation.py
  • Metric discovery output explicitly lists the four new BARTScore entries, and the strict docs build succeeded on the same checkout.

Merged origin/dev into feature/#336 and resolved the overlapping generation metric changes by keeping both the BARTScore additions from the PR branch and the UniEval additions already landed on dev.

Constraint: PR #342 must stay reviewable without dropping either metric family
Rejected: Take the dev side of the conflict wholesale | would remove the new BARTScore API surface from this PR
Rejected: Take the feature side of the conflict wholesale | would drop UniEval exports and configs already present on dev
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: Keep generation metric exports additive when multiple metric families land in parallel
Tested: pytest generation metrics; pytest CLI metric discovery; pytest UniEval metrics; autorag-research show metrics; make check; mkdocs build --strict; BARTScore CPU smoke run; make test
Not-tested: GPU-backed BARTScore execution

vkehfdl1 commented Apr 6, 2026

Merged dev into feature/#336 and resolved the generation-metrics conflicts by keeping both change sets:

  • preserved the PR's BARTScore exports/configs/functions
  • retained the dev branch's UniEval exports/configs/tests
  • reconciled autorag_research/evaluation/metrics/__init__.py and autorag_research/evaluation/metrics/generation.py so both metric families remain discoverable together

Verified after the merge update:

  • uv run pytest tests/autorag_research/evaluation/metrics/test_generation.py -q
  • uv run pytest tests/autorag_research/cli/test_utils.py -q -k discover_metrics
  • uv run pytest tests/autorag_research/evaluation/metrics/test_unieval.py -q
  • uv run autorag-research --config-path=/Users/jeffrey/Projects/AutoRAG-Research-issue-336/configs show metrics | rg 'bart_score_|unieval_'
  • make check
  • uv run mkdocs build --strict
  • uv run python - <<'PY' ... CPU BARTScore smoke run (device=cpu, batch_size=1)
  • make test (1325 passed, 64 deselected)

Pushed the resolved branch tip: 3f8b5f0.


vkehfdl1 commented Apr 6, 2026

REJECT — current PR head is no longer mergeable with dev.

Short reason

  • The BARTScore change itself verifies cleanly, but PR Feature/#336 #342 is currently blocked by merge conflicts against the latest dev, so it is not ready to merge as-is.

Real Result

  • Verified in an isolated worktree at 3f8b5f0602bae57c487c600a164c270790c1e356.
  • uv run pytest tests/autorag_research/evaluation/metrics/test_generation.py -q → 19 passed in 59.51s
  • uv run pytest tests/autorag_research/cli/test_utils.py -q -k discover_metrics → 2 passed, 17 deselected in 0.02s
  • uv run autorag-research --config-path=/tmp/autorag-pr342-final-verdict/configs show metrics | rg 'bart_score_|unieval_' → all BARTScore + UniEval metrics listed
  • uv sync --all-extras --all-groups && make check → passed
  • uv run mkdocs build --strict → passed
  • CPU smoke run (device=cpu, batch_size=1) → faithfulness=[-0.7046902775764465], precision_same=[-0.3970629870891571], recall_same=[-0.3970629870891571], f1_same=[-0.3970629870891571], precision_wrong=[-2.110044479370117]
  • make test → 1325 passed, 64 deselected

Concrete evidence

  • gh pr view 342 --repo NomaDamas/AutoRAG-Research --json headRefOid,mergeStateStatus,mergeable returned:
    • headRefOid = 3f8b5f0602bae57c487c600a164c270790c1e356
    • mergeStateStatus = DIRTY
    • mergeable = CONFLICTING
  • Local merge simulation against current origin/dev (92c76c2c81e1dd2d4eb5ce6d2f91f4a94dfb287f) confirmed blocking conflicts in:
    • autorag_research/evaluation/metrics/__init__.py
    • autorag_research/evaluation/metrics/generation.py
    • tests/autorag_research/evaluation/metrics/test_generation.py
  • The conflict is with newer dev changes that added exact_match / token_f1; this branch previously merged an older dev parent (f0418c587288e9c6fa98c731d2c866ca5ed1fe0e), so it needs one more rebase/merge before approval.
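
A merge simulation like the one above can be reproduced end to end in a throwaway repo. File and branch names below are demonstration stand-ins, not the project's real paths:

```shell
# Create two branches that edit the same line, then simulate the merge
# with --no-commit --no-ff to surface conflicts without finalizing.
tmp=$(mktemp -d)
cd "$tmp"
git init -q -b dev
git config user.email demo@example.com
git config user.name demo
echo base > metrics.py
git add metrics.py && git commit -qm base
git checkout -qb feature
echo bart > metrics.py && git commit -qam bart
git checkout -q dev
echo unieval > metrics.py && git commit -qam unieval
git checkout -q feature
if ! git merge --no-commit --no-ff dev >/dev/null 2>&1; then
  # list the files left in a conflicted (Unmerged) state
  conflicts=$(git diff --name-only --diff-filter=U)
  echo "conflicted: $conflicts"
  git merge --abort
fi
```

`git merge --abort` restores the branch afterwards, so the simulation never mutates the branch under review.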
