
Feature/#337 #341

Merged
vkehfdl1 merged 4 commits into dev from feature/#337 on Apr 6, 2026

Conversation

@vkehfdl1 vkehfdl1 commented Apr 6, 2026

Summary

  • add SQuAD-style Exact Match and Token F1 generation metrics
  • expose the new metrics through built-in metric exports and generation metric configs
  • cover normalization, multi-reference scoring, empty-answer handling, and config discovery with tests

Verification

  • make check
  • uv run pytest tests/autorag_research/evaluation/metrics/test_generation.py tests/autorag_research/cli/test_utils.py -q
  • make test

This adds SQuAD-style Exact Match and Token F1 generation metrics,
reusing the existing normalization helper and wiring them into the
built-in metric exports and config discovery surface.
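For readers unfamiliar with these metrics, here is a minimal self-contained sketch of how SQuAD-style Exact Match and Token F1 are typically computed. The inline `normalize` below stands in for the project's existing `normalize_string` helper (which the PR reuses rather than reimplementing), so details such as article stripping may differ from the actual implementation in `autorag_research/evaluation/metrics/generation.py`:

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """SQuAD-style normalization sketch: lowercase, drop punctuation and
    articles, squeeze whitespace. Stands in for the project's normalize_string."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(pred: str, ref: str) -> float:
    """1.0 iff the normalized strings are identical."""
    return float(normalize(pred) == normalize(ref))


def token_f1(pred: str, ref: str) -> float:
    """Harmonic mean of token precision and recall over normalized tokens."""
    pred_tokens = normalize(pred).split()
    ref_tokens = normalize(ref).split()
    if not pred_tokens or not ref_tokens:
        # Empty-normalized-answer edge case: fall back to exact comparison.
        return float(pred_tokens == ref_tokens)
    # Bag-of-words overlap: repeated tokens count up to their min multiplicity.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Token F1 treats answers as bags of words, so a token repeated in the prediction only counts up to its multiplicity in the reference, which is the behavior the added regression tests lock in.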

Constraint: Must stay deterministic and dependency-free for benchmark evaluation
Constraint: Must reuse existing normalization logic instead of reimplementing it
Rejected: Add a separate normalization helper in generation metrics | duplicates normalize_string
Rejected: Implement only one of EM or Token F1 | issue requires both standard QA metrics
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: Keep EM/Token F1 aligned with SQuAD-style normalization if normalization logic changes
Tested: make check; uv run pytest tests/autorag_research/evaluation/metrics/test_generation.py tests/autorag_research/cli/test_utils.py -q; make test
Not-tested: gpu/data-marked test paths

vkehfdl1 commented Apr 6, 2026

Ran $code-review on the PR diff and I don't have any blocking review findings.

Real Result

  • make check ✅ passed under Python 3.10.16 after recreating the worktree venv (ruff check, ruff format, ty check, and deptry all passed).
  • uv run pytest tests/autorag_research/evaluation/metrics/test_generation.py tests/autorag_research/cli/test_utils.py -q ✅ passed: 36 passed in 60.74s (0:01:00).
  • make test ⚠️ did not finish cleanly in my environment, but the failure was outside this PR's diff: the run reached the new metric coverage and later stopped in unchanged OpenAI logprob tests because OPENAI_API_KEY was not set.

Concrete evidence

  • Full run before stopping: 1302 passed, 64 deselected, 1 failed, 4 errors in 65.92s.
  • Failing tests were all in unchanged tests/autorag_research/pipelines/generation/test_main_rag_pipeline.py (TestLogprobsWithOpenAI::* and test_openai_without_logprobs_falls_back).
  • Error raised: openai.OpenAIError: The api_key client option must be set either by passing api_key to the client or by setting the OPENAI_API_KEY environment variable.

Given the diff scope (generation.py, built-in metric exports/config discovery, and the added tests) plus the passing targeted verification, I did not find a PR-specific issue to block merge.

The feature branch was already functionally complete, so this follow-up stayed narrowly scoped to regression coverage and small helper cleanup. It adds explicit Exact Match empty-normalization coverage, verifies Token F1 bag-of-words behavior with repeated tokens, and routes Hugging Face best-reference scoring through the shared generation helper so multi-reference semantics stay aligned across generation metrics.

Constraint: Keep issue #337 scoped to generation metrics and tests only
Rejected: Broaden into unrelated API-test cleanup | outside the approved PR scope
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: Keep generation metrics on the shared best-reference helper so deterministic and evaluate-backed scoring stay aligned
Tested: uv run pytest tests/autorag_research/evaluation/metrics/test_generation.py tests/autorag_research/cli/test_utils.py -q
Tested: make check
Tested: make test
Not-tested: None

vkehfdl1 commented Apr 6, 2026

Follow-up shipped on feature/#337.

  • added explicit Exact Match coverage for empty normalized answers so the empty-answer edge case is exercised for both new deterministic metrics
  • added a Token F1 regression for repeated-token overlap to lock in the intended bag-of-words scoring behavior
  • reused the shared best-reference helper inside huggingface_evaluate() so multi-reference generation scoring stays aligned across metric implementations
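The shared best-reference helper itself is not shown in this thread; as a rough sketch, multi-reference scoring usually just keeps the maximum per-reference score. The name `best_reference_score` and the empty-reference policy below are assumptions for illustration, not the project's actual API:

```python
from typing import Callable, Sequence


def best_reference_score(
    prediction: str,
    references: Sequence[str],
    score_fn: Callable[[str, str], float],
) -> float:
    """Score a prediction against every reference and keep the best score.

    Hypothetical helper: the real shared helper may differ in name,
    signature, and how it treats an empty reference list.
    """
    if not references:
        return 0.0
    return max(score_fn(prediction, ref) for ref in references)


# Example with a trivial score function: 1.0 on a case-insensitive match.
score = best_reference_score(
    "Paris",
    ["paris", "the city of light"],
    lambda pred, ref: float(pred.lower() == ref.lower()),
)
print(score)  # 1.0 — the first reference matches
```

Routing `huggingface_evaluate()` through the same helper means deterministic and evaluate-backed metrics select the best reference by identical semantics.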

Verification:

  • uv run pytest tests/autorag_research/evaluation/metrics/test_generation.py tests/autorag_research/cli/test_utils.py -q
  • make check
  • make test


vkehfdl1 commented Apr 6, 2026

Superseded by the earlier implementation update on this PR.

The branch already contains the verified issue #337 generation-metric work on feature/#337, and the canonical signed follow-up comment remains here:


vkehfdl1 commented Apr 6, 2026

Ran $code-review on the PR diff in a clean local worktree at 301b5a7, and I still don't have any blocking review findings.

Real Result

  • make check ✅ passed after syncing the review worktree with uv sync --all-groups --all-extras
  • uv run pytest tests/autorag_research/evaluation/metrics/test_generation.py tests/autorag_research/cli/test_utils.py -q ✅ passed: 38 passed in 34.23s
  • make test ✅ passed locally: 1309 passed, 64 deselected in 65.79s (0:01:05)

Concrete evidence

  • The targeted generation-metric tests now cover the added edge cases end-to-end: SQuAD-style normalization for Exact Match, empty-normalized-answer behavior for both deterministic metrics, repeated-token overlap for Token F1, best-reference selection, and generation metric config discovery.
  • The full non-GPU/non-data suite completed cleanly after bringing up PostgreSQL in the review worktree, so I did not find a regression outside the targeted metric area either.

No merge-blocking issue found from this review pass.


vkehfdl1 commented Apr 6, 2026

Implemented the approved issue #337 generation-metric follow-up on feature/#337.

  • added SQuAD-style exact_match and token_f1 generation metrics with shared best-reference scoring
  • exposed both metrics through built-in exports and generation metric configs
  • covered normalization, multi-reference selection, empty-normalized answers, repeated-token overlap, and config discovery with tests

Verification:

  • uv run pytest tests/autorag_research/evaluation/metrics/test_generation.py tests/autorag_research/cli/test_utils.py -q ✅ (38 passed in 12.15s)
  • make check
  • make test ✅ (1309 passed, 64 deselected in 43.73s)


vkehfdl1 commented Apr 6, 2026

Ran $code-review on the PR diff in a clean local worktree at 301b5a7, and I still don't have any merge-blocking review findings.

Real Result

  • make check ✅ passed after uv sync --all-groups --all-extras
  • uv run pytest tests/autorag_research/evaluation/metrics/test_generation.py tests/autorag_research/cli/test_utils.py -q ✅ passed: 38 passed in 31.24s
  • make test ✅ passed locally in the clean worktree: 1309 passed, 64 deselected in 52.38s

Concrete evidence

  • The diff is scoped to the new generation metrics, their built-in exports/configs, and related tests only (6 files changed in the PR review scope).
  • The targeted generation-metric tests exercise the intended edge cases end-to-end: SQuAD-style normalization for Exact Match, best-reference selection, empty-normalized-answer behavior for both deterministic metrics, repeated-token overlap for Token F1, and generation metric config discovery.
  • Manual smoke check on the new metric functions matched expectations:
    • exact_match → [1.0, 0.0, 1.0]
    • token_f1 → [1.0, 0.6666666666666666, 1.0]
  • The delegated $code-review pass also came back with 0 blocking issues on the same diff.
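For intuition on how a fractional Token F1 such as 0.666… arises: it is the harmonic mean of token-level precision and recall. The actual smoke-check inputs are not shown in this thread, so the strings below are invented purely to illustrate the arithmetic:

```python
from collections import Counter

pred_tokens = "madrid".split()        # 1 predicted token (hypothetical input)
ref_tokens = "madrid spain".split()   # 2 reference tokens (hypothetical input)

# Bag-of-words overlap via multiset intersection.
overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())  # 1

precision = overlap / len(pred_tokens)  # 1.0
recall = overlap / len(ref_tokens)      # 0.5
f1 = 2 * precision * recall / (precision + recall)
print(f1)  # 0.6666666666666666
```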

No merge-blocking issue found from this round.


vkehfdl1 commented Apr 6, 2026

Superseded by the later implementation update on this PR.

Canonical signed follow-up for job 17b9987511e94d1c9d7f2380b0211576:

The branch already contained the approved Exact Match and Token F1
metric implementation, so this follow-up records a fresh verified
handoff without widening the settled product diff.

Constraint: Main worktree has unrelated unstaged changes, so verification ran in a clean detached worktree
Rejected: Add a no-op code edit | would widen a settled diff without product value
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: Keep future issue #337 follow-ups scoped to generation-metric behavior and avoid mixing in unrelated local edits
Tested: make check; uv run pytest tests/autorag_research/evaluation/metrics/test_generation.py tests/autorag_research/cli/test_utils.py -q; make test; manual exact_match/token_f1 smoke check
Not-tested: Additional changes beyond the already-approved issue #337 diff

vkehfdl1 commented Apr 6, 2026

Re-verified the approved issue #337 follow-up on feature/#337 and kept the PR scoped to the shipped generation-metric work.

  • added SQuAD-style exact_match and token_f1 generation metrics with shared best-reference scoring
  • exposed both metrics through built-in exports and generation metric configs
  • covered normalization, multi-reference selection, empty-normalized answers, repeated-token overlap, and config discovery with tests

Verification:

  • make check
  • uv run pytest tests/autorag_research/evaluation/metrics/test_generation.py tests/autorag_research/cli/test_utils.py -q ✅ (38 passed in 54.96s)
  • make test ✅ (1309 passed, 64 deselected in 60.07s)
  • manual smoke check ✅ (exact_match → [1.0, 0.0, 1.0], token_f1 → [1.0, 0.6666666666666666, 1.0])


vkehfdl1 commented Apr 6, 2026

APPROVE

Short reason: the diff is correctly scoped to issue #337 and I did not find any merge-blocking code issues.

Real Result:

  • make check
  • uv run pytest tests/autorag_research/evaluation/metrics/test_generation.py tests/autorag_research/cli/test_utils.py -q ✅ (38 passed in 55.08s)
  • make test ✅ (1309 passed, 64 deselected in 61.06s)

Concrete evidence:

  • Reviewed the full PR diff (origin/dev...origin/feature/#337): 6 files changed, scoped to generation metrics, built-in exports/configs, and tests.
  • autorag_research/evaluation/metrics/generation.py adds SQuAD-style exact_match and token_f1 with shared best-reference scoring.
  • autorag_research/evaluation/metrics/__init__.py plus configs/metrics/generation/exact_match.yaml and configs/metrics/generation/token_f1.yaml expose the new metrics through built-in imports/config discovery.
  • tests/autorag_research/evaluation/metrics/test_generation.py and tests/autorag_research/cli/test_utils.py cover normalization, best-reference selection, empty-normalized answers, repeated-token overlap, and config discovery.
  • Diagnostics on the modified Python files returned 0 errors, and the delegated code-review pass also found 0 blocking issues.

Resolving the merge from dev required keeping both sides of the generation-metric surface. The branch now preserves the SQuAD-style exact-match/token-F1 work from feature/#337 while retaining the UniEval exports, configs, and coverage that landed on dev. The conflict resolution keeps metric discovery and built-in exports coherent after the branch update.

Constraint: Merge update had to preserve both feature/#337 answer metrics and dev's newer UniEval additions
Rejected: Favor only the PR branch conflict hunks | would have dropped UniEval exports and config discovery coverage from dev
Rejected: Favor only dev's conflict hunks | would have removed feature/#337 exact-match/token-F1 exports and discovery assertions
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: Keep generation metric exports and config-discovery assertions in sync whenever new built-in metrics land on either branch
Tested: uv run pytest tests/autorag_research/evaluation/metrics/test_generation.py tests/autorag_research/evaluation/metrics/test_unieval.py tests/autorag_research/cli/test_utils.py -q
Tested: make check
Tested: make test

vkehfdl1 commented Apr 6, 2026

Merged dev into feature/#337 and resolved the remaining conflicts by keeping both sides of the generation-metrics surface:

  • preserved the PR's exact_match / token_f1 exports and config-discovery coverage
  • kept the newer UniEval metric exports, configs, and tests from dev
  • updated the shared metric export surface and CLI discovery assertion so both metric families stay available together after the merge

Verified:

  • uv run pytest tests/autorag_research/evaluation/metrics/test_generation.py tests/autorag_research/evaluation/metrics/test_unieval.py tests/autorag_research/cli/test_utils.py -q
  • make check
  • make test


vkehfdl1 commented Apr 6, 2026

APPROVE

Short reason: the live PR head is still narrowly scoped to issue #337's generation-metric work, and I did not find a merge-blocking defect in the current diff after the dev merge.

Real Result:

  • make check
  • uv run pytest tests/autorag_research/evaluation/metrics/test_generation.py tests/autorag_research/evaluation/metrics/test_unieval.py tests/autorag_research/cli/test_utils.py -q ✅ (54 passed in 65.54s)
  • manual smoke check ✅ (exact_match → [1.0, 0.0, 1.0], token_f1 → [0.6666666666666666, 0.8, 1.0])
  • make test ⚠️ did not finish cleanly in my environment: 1320 passed, 5 failed, 64 deselected, 1 error in 61.92s; the failures were in unchanged DB-backed CLI / main_rag_pipeline tests after PostgreSQL dropped during the run

Concrete evidence:

  • Reviewed the current live diff against origin/dev: git diff --stat origin/dev...HEAD shows 6 changed files, all limited to generation metrics, built-in exports/configs, and tests.
  • autorag_research/evaluation/metrics/generation.py adds SQuAD-style exact_match and token_f1, plus shared best-reference scoring reused by huggingface_evaluate().
  • autorag_research/evaluation/metrics/__init__.py, configs/metrics/generation/exact_match.yaml, and configs/metrics/generation/token_f1.yaml expose the new built-in metrics cleanly.
  • tests/autorag_research/evaluation/metrics/test_generation.py and tests/autorag_research/cli/test_utils.py cover normalization, best-reference selection, empty-normalized answers, repeated-token overlap, and config discovery; tests/autorag_research/evaluation/metrics/test_unieval.py also passed, so the merge resolution did not break the newer UniEval surface from dev.
  • The make test failures were outside the diff scope: tests/autorag_research/cli/commands/test_e2e_pipeline.py, tests/autorag_research/cli/commands/test_show.py, and tests/autorag_research/pipelines/generation/test_main_rag_pipeline.py.
  • Diagnostics on the modified Python files returned 0 errors.

@vkehfdl1 vkehfdl1 merged commit 92c76c2 into dev Apr 6, 2026
6 checks passed
@vkehfdl1 vkehfdl1 deleted the feature/#337 branch April 6, 2026 22:48