
Feature/#336 #342

Open
vkehfdl1 wants to merge 3 commits into dev from feature/#336

Conversation

Collaborator

vkehfdl1 commented Apr 6, 2026

Summary

  • add BARTScore generation metrics for faithfulness, precision, recall, and F1
  • expose new metric configs/YAMLs and document the metric in MkDocs
  • add TDD coverage plus a CPU smoke run for the new metric family

Verification

  • uv run pytest tests/autorag_research/evaluation/metrics/test_generation.py -q
  • make check
  • uv run mkdocs build --strict
  • uv run python - <<'PY' ... BARTScore smoke run on cpu
  • make test

Closes #336

Implement BARTScore faithfulness, precision, recall, and F1 variants using a cached BART seq2seq backend. The change also adds TDD coverage, metric configs, and docs so issue #336 can be exercised through the existing generation-evaluation surface.

Constraint: Must match the NeurIPS 2021 BARTScore direction definitions without adding new dependencies
Constraint: Must preserve existing metric APIs and config discovery patterns
Rejected: Shipping only a single BARTScore variant | issue requires faithfulness, precision, recall, and F1
Rejected: Deferring docs/config YAMLs to a follow-up | would leave the metric undiscoverable from normal project workflows
Confidence: high
Scope-risk: moderate
Reversibility: clean
Directive: Keep precision as reference→answer and recall as answer→reference to preserve paper-aligned semantics
Tested: uv run pytest tests/autorag_research/evaluation/metrics/test_generation.py -q
Tested: make check
Tested: uv run mkdocs build --strict
Tested: uv run python - <<'PY' ... BARTScore smoke run on cpu
Tested: make test
Not-tested: Alternate checkpoints beyond facebook/bart-large-cnn
Related: #336

vkehfdl1 commented Apr 6, 2026

Round 1 review for PR #342 (feature/#336)

I ran the requested $code-review workflow, inspected the local diff against origin/dev...HEAD, and performed real verification locally.

Review result

  • No blocking issues found in the BARTScore implementation, config exposure, or docs changes.
  • The direction wiring is consistent with the PR/issue intent:
    • faithfulness = context → answer
    • precision = reference → answer
    • recall = answer → reference
    • F1 = mean of precision and recall (as documented in this PR)
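
The direction wiring above can be sketched independently of the model backend. In the snippet below, the `pairwise_scorer` callable stands in for the cached BART seq2seq backend; the function and argument names are illustrative, not the PR's actual API:

```python
# Sketch of the BARTScore direction wiring, assuming a pairwise scorer
# that returns one score per (source, target) pair.
from statistics import mean
from typing import Callable, Sequence

Scorer = Callable[[Sequence[str], Sequence[str]], list[float]]

def bart_score_family(
    contexts: Sequence[str],
    answers: Sequence[str],
    references: Sequence[str],
    pairwise_scorer: Scorer,
) -> dict[str, list[float]]:
    faithfulness = pairwise_scorer(contexts, answers)   # context -> answer
    precision = pairwise_scorer(references, answers)    # reference -> answer
    recall = pairwise_scorer(answers, references)       # answer -> reference
    f1 = [mean(pr) for pr in zip(precision, recall)]    # mean of P and R
    return {
        "faithfulness": faithfulness,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

With the real backend, each score is the mean log-likelihood of the target tokens given the source under BART, which is why all reported values are negative and higher (closer to zero) is better.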

Real Result

  • uv run pytest tests/autorag_research/evaluation/metrics/test_generation.py -q
    • Result: 14 passed in 13.81s
  • uv run pytest tests/autorag_research/cli/test_utils.py -q -k discover_metrics
    • Result: 2 passed, 17 deselected in 0.01s
  • make check
    • Result: passed (pre-commit, ty, deptry all clean)
  • uv run mkdocs build --strict
    • Result: built successfully in 4.24s
  • CPU smoke run (facebook/bart-large-cnn, device=cpu)
    • identical text: faithfulness/precision/recall/f1 = [-0.3970629870891571]
    • precision sanity check: same = [-0.3970629870891571], wrong = [-2.7505297660827637]

Given the code inspection plus the verification above, this looks good to me for this round.

BARTScore now fails fast with an actionable dependency message instead of
surfacing raw optional-import failures, and the docs now explain the
runtime extra needed for both CPU and accelerator-backed usage. Added
regression coverage for dependency guidance and device auto-detection.

Constraint: torch and transformers remain optional extras rather than core dependencies
Rejected: Move BARTScore runtime packages into the default install set | would increase baseline install weight for unrelated workflows
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: Keep BARTScore runtime guards and install docs aligned with the optional dependency group names
Tested: uv run pytest tests/autorag_research/evaluation/metrics/test_generation.py -q
Tested: uv run autorag-research --config-path=<repo>/configs show metrics
Tested: make check
Tested: uv run mkdocs build --strict
Tested: CPU BARTScore smoke run on device=cpu
Tested: make test
Not-tested: package-install flow from a fresh external environment

vkehfdl1 commented Apr 6, 2026

Follow-up implemented on feature/#336.

Summary:

  • added a BARTScore runtime guard with actionable install guidance for missing torch / transformers
  • added regression coverage for optional-dependency guidance and safer auto-device fallback when torch.backends.mps is unavailable
  • updated README + MkDocs install/discoverability docs so BARTScore is documented on the main install path and concept pages
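
The safer auto-device fallback can be sketched as below. The torch module is passed as a parameter here purely to keep the sketch self-contained and testable; the real code presumably imports torch directly, and the function name is an assumption:

```python
# Sketch of auto-device selection that tolerates torch builds where
# torch.backends.mps does not exist (the failure mode fixed above).
def auto_device(torch_module) -> str:
    """Prefer cuda, then mps, then cpu."""
    if torch_module.cuda.is_available():
        return "cuda"
    # getattr guards against builds without the mps backend attribute
    mps = getattr(torch_module.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"
```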

Verification:

  • uv run pytest tests/autorag_research/evaluation/metrics/test_generation.py -q → 19 passed
  • uv run autorag-research --config-path=<repo>/configs show metrics | rg 'bart_score_' → all four BARTScore configs listed
  • make check → passed
  • uv run mkdocs build --strict → passed
  • CPU smoke run (device=cpu) → faithfulness/precision/recall/f1 all executed successfully; wrong precision scored lower than identical text
  • make test → 1309 passed, 64 deselected


vkehfdl1 commented Apr 6, 2026

Round 2 review for PR #342 (feature/#336)

I ran the requested $code-review workflow, reviewed the local diff against origin/dev...HEAD, and re-verified the PR in an isolated worktree at commit 60f7193.

Review result

  • No blocking issues found in the BARTScore implementation, config exposure, or docs follow-up.
  • The optional-dependency/runtime guard looks good, and the auto-device fallback now handles missing torch.backends.mps safely.
  • I did not find correctness, regression, or discoverability problems in this round.

Real Result

  • uv run pytest tests/autorag_research/evaluation/metrics/test_generation.py -q
    • Result: 19 passed in 53.07s
  • uv run pytest tests/autorag_research/cli/test_utils.py -q -k discover_metrics
    • Result: 2 passed, 17 deselected in 0.02s
  • uv run autorag-research --config-path=<repo>/configs show metrics | rg 'bart_score_'
    • Result: all four BARTScore configs were discoverable:
      • bart_score_f1
      • bart_score_faithfulness
      • bart_score_precision
      • bart_score_recall
  • make check
    • Result: passed
  • uv run mkdocs build --strict
    • Result: passed; docs built successfully in 3.59s
  • CPU smoke run (device=cpu)
    • Result:
      • faithfulness = [-0.9646367430686951]
      • precision_same = [-0.5200337767601013]
      • recall_same = [-0.6211605072021484]
      • f1_same = [-0.5705971419811249]
      • precision_wrong = [-1.432390809059143]
    • Sanity check: the mismatched answer scored lower than the matching answer for precision
  • make test
    • Result: 1309 passed, 64 deselected in 59.45s

Given the code inspection plus the verification above, this still looks good to me for this round.


vkehfdl1 commented Apr 6, 2026

Final follow-up verification completed on feature/#336.

Summary:

  • re-verified the existing BARTScore faithfulness, precision, recall, and F1 implementation at 60f7193
  • confirmed metric discovery, docs exposure, optional dependency guidance, and CPU execution behavior remain correct
  • no further code changes were required after final architect review

Verification:

  • uv run pytest tests/autorag_research/evaluation/metrics/test_generation.py -q → 19 passed in 14.29s
  • uv run pytest tests/autorag_research/cli/test_utils.py -q -k discover_metrics → 2 passed, 17 deselected in 0.01s
  • uv run autorag-research --config-path=/Users/jeffrey/Projects/AutoRAG-Research-issue-336/configs show metrics | rg 'bart_score_' → all four BARTScore configs listed
  • make check → passed
  • uv run mkdocs build --strict → passed
  • CPU smoke run (device=cpu) → faithfulness/precision/recall/f1 executed successfully; wrong precision scored lower than identical text
  • make test → 1309 passed, 64 deselected
  • architect verification → APPROVED; no blocking issues found


vkehfdl1 commented Apr 6, 2026

Round 3 review for PR #342 (feature/#336)

I ran the requested $code-review workflow, reviewed the local diff against origin/dev...HEAD, and re-verified the PR in an isolated worktree at commit 60f7193.

Review result

  • No blocking issues found in the BARTScore implementation, config exposure, or documentation updates.
  • The metric direction wiring remains correct and consistent with the PR intent:
    • faithfulness = context → answer
    • precision = reference → answer
    • recall = answer → reference
    • F1 = mean of precision and recall
  • The optional dependency guidance, metric discoverability, and CPU execution behavior all checked out in this final round.

Real Result

  • uv run pytest tests/autorag_research/evaluation/metrics/test_generation.py -q
    • Result: 19 passed in 51.27s
  • uv run pytest tests/autorag_research/cli/test_utils.py -q -k discover_metrics
    • Result: 2 passed, 17 deselected in 0.01s
  • uv run autorag-research --config-path=/tmp/autorag-pr342-round3/configs show metrics | rg 'bart_score_'
    • Result: all four BARTScore configs were discoverable:
      • bart_score_f1
      • bart_score_faithfulness
      • bart_score_precision
      • bart_score_recall
  • make check
    • Result: passed (pre-commit, ty, deptry all clean)
  • uv run mkdocs build --strict
    • Result: passed; docs built successfully in 54.24 seconds
  • CPU smoke run (device=cpu, batch_size=1)
    • Result:
      • faithfulness = [-0.3970629870891571]
      • precision_same = [-0.3970629870891571]
      • recall_same = [-0.3970629870891571]
      • f1_same = [-0.3970629870891571]
      • precision_wrong = [-2.7505297660827637]
    • Sanity check: the mismatched answer scored lower than the matching answer for precision
  • make test
    • Result: 1309 passed, 64 deselected in 45.15s

Given the code inspection plus the verification above, this looks good to me for the final round.


vkehfdl1 commented Apr 6, 2026

Final follow-up verification completed on feature/#336.

Summary:

  • re-verified the existing BARTScore faithfulness, precision, recall, and F1 implementation at 60f7193
  • confirmed metric discovery, docs exposure, optional dependency guidance, and CPU execution behavior remain correct
  • no further code changes were required after final architect sign-off; branch feature/#336 remains aligned with PR Feature/#336 #342 targeting dev

Verification:

  • uv run pytest tests/autorag_research/evaluation/metrics/test_generation.py -q → 19 passed in 13.28s
  • uv run pytest tests/autorag_research/cli/test_utils.py -q -k discover_metrics → 2 passed, 17 deselected in 0.01s
  • uv run autorag-research --config-path=/Users/jeffrey/Projects/AutoRAG-Research-issue-336/configs show metrics | rg 'bart_score_' → all four BARTScore configs listed
  • make check → passed
  • uv run mkdocs build --strict → passed
  • CPU smoke run (device=cpu, batch_size=1) → faithfulness/precision/recall/f1 executed successfully; wrong precision scored lower than identical text
  • make test → 1309 passed, 64 deselected in 45.39s
  • architect verification → APPROVED; no blocking issues found


vkehfdl1 commented Apr 6, 2026

APPROVE — PR head is still 60f7193, the same commit reviewed in the last rounds, and the rerun checks I performed on that exact revision all passed.

Short reason

  • The change stays tightly scoped to the new BARTScore metric family, its config exposure, tests, and docs, and I did not find new blocking issues on the final head.

Real Result

  • git diff --stat origin/dev...HEAD on 60f7193 shows a focused 15-file change set: BARTScore implementation/config/docs/tests (650 insertions, 5 deletions).
  • uv run pytest tests/autorag_research/evaluation/metrics/test_generation.py -q → 19 passed in 52.16s
  • uv run pytest tests/autorag_research/cli/test_utils.py -q -k discover_metrics → 2 passed, 17 deselected in 0.02s
  • uv run autorag-research --config-path=/tmp/autorag-pr342-final/configs show metrics | rg 'bart_score_' → all four configs are discoverable:
    • bart_score_f1
    • bart_score_faithfulness
    • bart_score_precision
    • bart_score_recall
  • uv run mkdocs build --strict → passed; docs built in 3.85 seconds

Concrete evidence

  • The current PR head from GitHub is 60f719366821ef332ba41f3cfcf1f5a23a392d22, which matches the already-reviewed final revision.
  • The diff remains limited to:
    • autorag_research/evaluation/metrics/generation.py
    • four new configs/metrics/generation/bart_score_*.yaml
    • docs/MkDocs navigation updates
    • tests/autorag_research/evaluation/metrics/test_generation.py
  • Metric discovery output explicitly lists the four new BARTScore entries, and the strict docs build succeeded on the same checkout.

Merged origin/dev into feature/#336 and resolved the overlapping generation metric changes by keeping both the BARTScore additions from the PR branch and the UniEval additions already landed on dev.

Constraint: PR #342 must stay reviewable without dropping either metric family
Rejected: Take the dev side of the conflict wholesale | would remove the new BARTScore API surface from this PR
Rejected: Take the feature side of the conflict wholesale | would drop UniEval exports and configs already present on dev
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: Keep generation metric exports additive when multiple metric families land in parallel
Tested: pytest generation metrics; pytest CLI metric discovery; pytest UniEval metrics; autorag-research show metrics; make check; mkdocs build --strict; BARTScore CPU smoke run; make test
Not-tested: GPU-backed BARTScore execution

vkehfdl1 commented Apr 6, 2026

Merged dev into feature/#336 and resolved the generation-metrics conflicts by keeping both change sets:

  • preserved the PR's BARTScore exports/configs/functions
  • retained the dev branch's UniEval exports/configs/tests
  • reconciled autorag_research/evaluation/metrics/__init__.py and autorag_research/evaluation/metrics/generation.py so both metric families remain discoverable together

Verified after the merge update:

  • uv run pytest tests/autorag_research/evaluation/metrics/test_generation.py -q
  • uv run pytest tests/autorag_research/cli/test_utils.py -q -k discover_metrics
  • uv run pytest tests/autorag_research/evaluation/metrics/test_unieval.py -q
  • uv run autorag-research --config-path=/Users/jeffrey/Projects/AutoRAG-Research-issue-336/configs show metrics | rg 'bart_score_|unieval_'
  • make check
  • uv run mkdocs build --strict
  • uv run python - <<'PY' ... CPU BARTScore smoke run (device=cpu, batch_size=1)
  • make test (1325 passed, 64 deselected)

Pushed the resolved branch tip: 3f8b5f0.


vkehfdl1 commented Apr 6, 2026

REJECT — current PR head is no longer mergeable with dev.

Short reason

  • The BARTScore change itself verifies cleanly, but PR Feature/#336 #342 is currently blocked by merge conflicts against the latest dev, so it is not ready to merge as-is.

Real Result

  • Verified in an isolated worktree at 3f8b5f0602bae57c487c600a164c270790c1e356.
  • uv run pytest tests/autorag_research/evaluation/metrics/test_generation.py -q → 19 passed in 59.51s
  • uv run pytest tests/autorag_research/cli/test_utils.py -q -k discover_metrics → 2 passed, 17 deselected in 0.02s
  • uv run autorag-research --config-path=/tmp/autorag-pr342-final-verdict/configs show metrics | rg 'bart_score_|unieval_' → all BARTScore + UniEval metrics listed
  • uv sync --all-extras --all-groups && make check → passed
  • uv run mkdocs build --strict → passed
  • CPU smoke run (device=cpu, batch_size=1) → faithfulness=[-0.7046902775764465], precision_same=[-0.3970629870891571], recall_same=[-0.3970629870891571], f1_same=[-0.3970629870891571], precision_wrong=[-2.110044479370117]
  • make test → 1325 passed, 64 deselected

Concrete evidence

  • gh pr view 342 --repo NomaDamas/AutoRAG-Research --json headRefOid,mergeStateStatus,mergeable returned:
    • headRefOid = 3f8b5f0602bae57c487c600a164c270790c1e356
    • mergeStateStatus = DIRTY
    • mergeable = CONFLICTING
  • Local merge simulation against current origin/dev (92c76c2c81e1dd2d4eb5ce6d2f91f4a94dfb287f) confirmed blocking conflicts in:
    • autorag_research/evaluation/metrics/__init__.py
    • autorag_research/evaluation/metrics/generation.py
    • tests/autorag_research/evaluation/metrics/test_generation.py
  • The conflict is with newer dev changes that added exact_match / token_f1; this branch previously merged an older dev parent (f0418c587288e9c6fa98c731d2c866ca5ed1fe0e), so it needs one more rebase/merge before approval.
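
A merge simulation like the one above can be reproduced end to end in a throwaway repo. File and branch names below are demonstration stand-ins, not the project's real paths:

```shell
# Create two branches that edit the same line, then simulate the merge
# with --no-commit --no-ff to surface conflicts without finalizing.
tmp=$(mktemp -d)
cd "$tmp"
git init -q -b dev
git config user.email demo@example.com
git config user.name demo
echo base > metrics.py
git add metrics.py && git commit -qm base
git checkout -qb feature
echo bart > metrics.py && git commit -qam bart
git checkout -q dev
echo unieval > metrics.py && git commit -qam unieval
git checkout -q feature
if ! git merge --no-commit --no-ff dev >/dev/null 2>&1; then
  # list the files left in a conflicted (Unmerged) state
  conflicts=$(git diff --name-only --diff-filter=U)
  echo "conflicted: $conflicts"
  git merge --abort
fi
```

`git merge --abort` restores the branch afterwards, so the simulation never mutates the branch under review.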
