Conversation
This adds SQuAD-style Exact Match and Token F1 generation metrics, reusing the existing normalization helper and wiring them into the built-in metric exports and config discovery surface.

Constraint: Must stay deterministic and dependency-free for benchmark evaluation
Constraint: Must reuse existing normalization logic instead of reimplementing it
Rejected: Add a separate normalization helper in generation metrics | duplicates normalize_string
Rejected: Implement only one of EM or Token F1 | issue requires both standard QA metrics
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: Keep EM/Token F1 aligned with SQuAD-style normalization if normalization logic changes
Tested: make check; uv run pytest tests/autorag_research/evaluation/metrics/test_generation.py tests/autorag_research/cli/test_utils.py -q; make test
Not-tested: gpu/data-marked test paths
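For context, the metric pair described above can be sketched as follows. This is an illustrative, dependency-free sketch of SQuAD-style scoring, not the repository's actual implementation — the real code reuses the existing `normalize_string` helper, whose exact behavior may differ from this approximation.

```python
import re
import string
from collections import Counter


def normalize_string(text: str) -> str:
    """SQuAD-style normalization: lowercase, strip punctuation,
    drop English articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 iff the normalized strings are identical, else 0.0."""
    return float(normalize_string(prediction) == normalize_string(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Bag-of-words F1 over normalized tokens (repeats count, per SQuAD)."""
    pred_tokens = normalize_string(prediction).split()
    ref_tokens = normalize_string(reference).split()
    if not pred_tokens or not ref_tokens:
        # SQuAD convention: empty vs. empty scores 1.0, otherwise 0.0.
        return float(pred_tokens == ref_tokens)
    common = Counter(pred_tokens) & Counter(ref_tokens)  # multiset overlap
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Both metrics are deterministic and depend only on the standard library, matching the stated constraint.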
Ran $code-review on the PR diff and I don't have any blocking review findings.

Real Result
Concrete evidence
Given the diff scope (
The feature branch was already functionally complete, so this follow-up stayed narrowly scoped to regression coverage and small helper cleanup. It adds explicit Exact Match empty-normalization coverage, verifies Token F1 bag-of-words behavior with repeated tokens, and routes Hugging Face best-reference scoring through the shared generation helper so multi-reference semantics stay aligned across generation metrics.

Constraint: Keep issue #337 scoped to generation metrics and tests only
Rejected: Broaden into unrelated API-test cleanup | outside the approved PR scope
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: Keep generation metrics on the shared best-reference helper so deterministic and evaluate-backed scoring stay aligned
Tested: uv run pytest tests/autorag_research/evaluation/metrics/test_generation.py tests/autorag_research/cli/test_utils.py -q
Tested: make check
Tested: make test
Not-tested: None
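The shared best-reference routing mentioned above can be sketched roughly like this; the helper name and signature here are assumptions for illustration, not the repository's actual API. The idea is SQuAD-style multi-reference semantics: a prediction is credited with the best score it achieves against any single gold answer, regardless of whether the per-pair scorer is a deterministic metric or an evaluate-backed one.

```python
from typing import Callable, Sequence


def best_reference_score(
    prediction: str,
    references: Sequence[str],
    score_fn: Callable[[str, str], float],
) -> float:
    """Score a prediction against each reference and keep the maximum.

    Routing every multi-reference metric through one helper like this
    keeps deterministic and evaluate-backed scoring aligned.
    """
    if not references:
        return 0.0
    return max(score_fn(prediction, ref) for ref in references)
```

For example, an exact-match scorer passed as `score_fn` yields 1.0 as soon as any reference matches.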
Follow-up shipped on
Verification:
Superseded by the earlier implementation update on this PR. The branch already contains the verified issue #337 generation-metric work on
Ran $code-review on the PR diff in a clean local worktree at

Real Result
Concrete evidence
No merge-blocking issue found from this review pass.
Implemented the approved issue #337 generation-metric follow-up on
Verification:
Ran $code-review on the PR diff in a clean local worktree at

Real Result
Concrete evidence
No merge-blocking issue found from this round.
Superseded by the later implementation update on this PR. Canonical signed follow-up for job
The branch already contained the approved Exact Match and Token F1 metric implementation, so this follow-up records a fresh verified handoff without widening the settled product diff.

Constraint: Main worktree has unrelated unstaged changes, so verification ran in a clean detached worktree
Rejected: Add a no-op code edit | would widen a settled diff without product value
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: Keep future issue #337 follow-ups scoped to generation-metric behavior and avoid mixing in unrelated local edits
Tested: make check; uv run pytest tests/autorag_research/evaluation/metrics/test_generation.py tests/autorag_research/cli/test_utils.py -q; make test; manual exact_match/token_f1 smoke check
Not-tested: Additional changes beyond the already-approved issue #337 diff
Re-verified the approved issue #337 follow-up on
Verification:
APPROVE

Short reason: the diff is correctly scoped to issue #337 and I did not find any merge-blocking code issues.

Real Result:
Concrete evidence:
Resolving the merge from dev required keeping both sides of the generation-metric surface. The branch now preserves the SQuAD-style exact-match/token-F1 work from feature/#337 while retaining the UniEval exports, configs, and coverage that landed on dev. The conflict resolution keeps metric discovery and built-in exports coherent after the branch update.

Constraint: Merge update had to preserve both feature/#337 answer metrics and dev's newer UniEval additions
Rejected: Favor only the PR branch conflict hunks | would have dropped UniEval exports and config discovery coverage from dev
Rejected: Favor only dev's conflict hunks | would have removed feature/#337 exact-match/token-F1 exports and discovery assertions
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: Keep generation metric exports and config-discovery assertions in sync whenever new built-in metrics land on either branch
Tested: uv run pytest tests/autorag_research/evaluation/metrics/test_generation.py tests/autorag_research/evaluation/metrics/test_unieval.py tests/autorag_research/cli/test_utils.py -q
Tested: make check
Tested: make test
Merged
Verified:
APPROVE

Short reason: the live PR head is still narrowly scoped to issue #337's generation-metric work, and I did not find a merge-blocking defect in the current diff after the

Real Result:
Concrete evidence:
Summary
Verification
make check
uv run pytest tests/autorag_research/evaluation/metrics/test_generation.py tests/autorag_research/cli/test_utils.py -q
make test