
Feature/#337 #341

Merged
vkehfdl1 merged 4 commits into dev from feature/#337 on Apr 6, 2026

Conversation

@vkehfdl1 vkehfdl1 commented Apr 6, 2026

Summary

  • add SQuAD-style Exact Match and Token F1 generation metrics
  • expose the new metrics through built-in metric exports and generation metric configs
  • cover normalization, multi-reference scoring, empty-answer handling, and config discovery with tests

Verification

  • make check
  • uv run pytest tests/autorag_research/evaluation/metrics/test_generation.py tests/autorag_research/cli/test_utils.py -q
  • make test

This adds SQuAD-style Exact Match and Token F1 generation metrics,
reusing the existing normalization helper and wiring them into the
built-in metric exports and config discovery surface.
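For readers unfamiliar with these metrics, here is a minimal self-contained sketch of how SQuAD-style Exact Match and Token F1 are typically computed. The inline `normalize` below stands in for the project's existing `normalize_string` helper (which the PR reuses rather than reimplementing), so details such as article stripping may differ from the actual implementation in `autorag_research/evaluation/metrics/generation.py`:

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """SQuAD-style normalization sketch: lowercase, drop punctuation and
    articles, squeeze whitespace. Stands in for the project's normalize_string."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(pred: str, ref: str) -> float:
    """1.0 iff the normalized strings are identical."""
    return float(normalize(pred) == normalize(ref))


def token_f1(pred: str, ref: str) -> float:
    """Harmonic mean of token precision and recall over normalized tokens."""
    pred_tokens = normalize(pred).split()
    ref_tokens = normalize(ref).split()
    if not pred_tokens or not ref_tokens:
        # Empty-normalized-answer edge case: fall back to exact comparison.
        return float(pred_tokens == ref_tokens)
    # Bag-of-words overlap: repeated tokens count up to their min multiplicity.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Token F1 treats answers as bags of words, so a token repeated in the prediction only counts up to its multiplicity in the reference, which is the behavior the added regression tests lock in.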

Constraint: Must stay deterministic and dependency-free for benchmark evaluation
Constraint: Must reuse existing normalization logic instead of reimplementing it
Rejected: Add a separate normalization helper in generation metrics | duplicates normalize_string
Rejected: Implement only one of EM or Token F1 | issue requires both standard QA metrics
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: Keep EM/Token F1 aligned with SQuAD-style normalization if normalization logic changes
Tested: make check; uv run pytest tests/autorag_research/evaluation/metrics/test_generation.py tests/autorag_research/cli/test_utils.py -q; make test
Not-tested: gpu/data-marked test paths

vkehfdl1 commented Apr 6, 2026

Ran $code-review on the PR diff and I don't have any blocking review findings.

Real Result

  • make check ✅ passed under Python 3.10.16 after recreating the worktree venv (ruff check, ruff format, ty check, and deptry all passed).
  • uv run pytest tests/autorag_research/evaluation/metrics/test_generation.py tests/autorag_research/cli/test_utils.py -q ✅ passed: 36 passed in 60.74s (0:01:00).
  • make test ⚠️ did not finish cleanly in my environment, but the failure was outside this PR's diff: the run reached the new metric coverage and later stopped in unchanged OpenAI logprob tests because OPENAI_API_KEY was not set.

Concrete evidence

  • Full run before stopping: 1302 passed, 64 deselected, 1 failed, 4 errors in 65.92s.
  • Failing tests were all in unchanged tests/autorag_research/pipelines/generation/test_main_rag_pipeline.py (TestLogprobsWithOpenAI::* and test_openai_without_logprobs_falls_back).
  • Error raised: openai.OpenAIError: The api_key client option must be set either by passing api_key to the client or by setting the OPENAI_API_KEY environment variable.

Given the diff scope (generation.py, built-in metric exports/config discovery, and the added tests) plus the passing targeted verification, I did not find a PR-specific issue to block merge.

The feature branch was already functionally complete, so this follow-up stayed narrowly scoped to regression coverage and small helper cleanup. It adds explicit Exact Match empty-normalization coverage, verifies Token F1 bag-of-words behavior with repeated tokens, and routes Hugging Face best-reference scoring through the shared generation helper so multi-reference semantics stay aligned across generation metrics.

Constraint: Keep issue #337 scoped to generation metrics and tests only
Rejected: Broaden into unrelated API-test cleanup | outside the approved PR scope
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: Keep generation metrics on the shared best-reference helper so deterministic and evaluate-backed scoring stay aligned
Tested: uv run pytest tests/autorag_research/evaluation/metrics/test_generation.py tests/autorag_research/cli/test_utils.py -q
Tested: make check
Tested: make test
Not-tested: None

vkehfdl1 commented Apr 6, 2026

Follow-up shipped on feature/#337.

  • added explicit Exact Match coverage for empty normalized answers so the empty-answer edge case is exercised for both new deterministic metrics
  • added a Token F1 regression for repeated-token overlap to lock in the intended bag-of-words scoring behavior
  • reused the shared best-reference helper inside huggingface_evaluate() so multi-reference generation scoring stays aligned across metric implementations
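The shared best-reference helper itself is not shown in this thread; as a rough sketch, multi-reference scoring usually just keeps the maximum per-reference score. The name `best_reference_score` and the empty-reference policy below are assumptions for illustration, not the project's actual API:

```python
from typing import Callable, Sequence


def best_reference_score(
    prediction: str,
    references: Sequence[str],
    score_fn: Callable[[str, str], float],
) -> float:
    """Score a prediction against every reference and keep the best score.

    Hypothetical helper: the real shared helper may differ in name,
    signature, and how it treats an empty reference list.
    """
    if not references:
        return 0.0
    return max(score_fn(prediction, ref) for ref in references)


# Example with a trivial score function: 1.0 on a case-insensitive match.
score = best_reference_score(
    "Paris",
    ["paris", "the city of light"],
    lambda pred, ref: float(pred.lower() == ref.lower()),
)
print(score)  # 1.0 — the first reference matches
```

Routing `huggingface_evaluate()` through the same helper means deterministic and evaluate-backed metrics select the best reference by identical semantics.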

Verification:

  • uv run pytest tests/autorag_research/evaluation/metrics/test_generation.py tests/autorag_research/cli/test_utils.py -q
  • make check
  • make test


vkehfdl1 commented Apr 6, 2026

Superseded by the earlier implementation update on this PR.

The branch already contains the verified issue #337 generation-metric work on feature/#337, and the canonical signed follow-up comment remains here:


vkehfdl1 commented Apr 6, 2026

Ran $code-review on the PR diff in a clean local worktree at 301b5a7, and I still don't have any blocking review findings.

Real Result

  • make check ✅ passed after syncing the review worktree with uv sync --all-groups --all-extras
  • uv run pytest tests/autorag_research/evaluation/metrics/test_generation.py tests/autorag_research/cli/test_utils.py -q ✅ passed: 38 passed in 34.23s
  • make test ✅ passed locally: 1309 passed, 64 deselected in 65.79s (0:01:05)

Concrete evidence

  • The targeted generation-metric tests now cover the added edge cases end-to-end: SQuAD-style normalization for Exact Match, empty-normalized-answer behavior for both deterministic metrics, repeated-token overlap for Token F1, best-reference selection, and generation metric config discovery.
  • The full non-GPU/non-data suite completed cleanly after bringing up PostgreSQL in the review worktree, so I did not find a regression outside the targeted metric area either.

No merge-blocking issue found from this review pass.


vkehfdl1 commented Apr 6, 2026

Implemented the approved issue #337 generation-metric follow-up on feature/#337.

  • added SQuAD-style exact_match and token_f1 generation metrics with shared best-reference scoring
  • exposed both metrics through built-in exports and generation metric configs
  • covered normalization, multi-reference selection, empty-normalized answers, repeated-token overlap, and config discovery with tests

Verification:

  • uv run pytest tests/autorag_research/evaluation/metrics/test_generation.py tests/autorag_research/cli/test_utils.py -q ✅ (38 passed in 12.15s)
  • make check
  • make test ✅ (1309 passed, 64 deselected in 43.73s)


vkehfdl1 commented Apr 6, 2026

Ran $code-review on the PR diff in a clean local worktree at 301b5a7, and I still don't have any merge-blocking review findings.

Real Result

  • make check ✅ passed after uv sync --all-groups --all-extras
  • uv run pytest tests/autorag_research/evaluation/metrics/test_generation.py tests/autorag_research/cli/test_utils.py -q ✅ passed: 38 passed in 31.24s
  • make test ✅ passed locally in the clean worktree: 1309 passed, 64 deselected in 52.38s

Concrete evidence

  • The diff is scoped to the new generation metrics, their built-in exports/configs, and related tests only (6 files changed in the PR review scope).
  • The targeted generation-metric tests exercise the intended edge cases end-to-end: SQuAD-style normalization for Exact Match, best-reference selection, empty-normalized-answer behavior for both deterministic metrics, repeated-token overlap for Token F1, and generation metric config discovery.
  • Manual smoke check on the new metric functions matched expectations:
    • exact_match → [1.0, 0.0, 1.0]
    • token_f1 → [1.0, 0.6666666666666666, 1.0]
  • The delegated $code-review pass also came back with 0 blocking issues on the same diff.
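For intuition on how a fractional Token F1 such as 0.666… arises: it is the harmonic mean of token-level precision and recall. The actual smoke-check inputs are not shown in this thread, so the strings below are invented purely to illustrate the arithmetic:

```python
from collections import Counter

pred_tokens = "madrid".split()        # 1 predicted token (hypothetical input)
ref_tokens = "madrid spain".split()   # 2 reference tokens (hypothetical input)

# Bag-of-words overlap via multiset intersection.
overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())  # 1

precision = overlap / len(pred_tokens)  # 1.0
recall = overlap / len(ref_tokens)      # 0.5
f1 = 2 * precision * recall / (precision + recall)
print(f1)  # 0.6666666666666666
```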

No merge-blocking issue found from this round.


vkehfdl1 commented Apr 6, 2026

Superseded by the later implementation update on this PR.

Canonical signed follow-up for job 17b9987511e94d1c9d7f2380b0211576:

The branch already contained the approved Exact Match and Token F1
metric implementation, so this follow-up records a fresh verified
handoff without widening the settled product diff.

Constraint: Main worktree has unrelated unstaged changes, so verification ran in a clean detached worktree
Rejected: Add a no-op code edit | would widen a settled diff without product value
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: Keep future issue #337 follow-ups scoped to generation-metric behavior and avoid mixing in unrelated local edits
Tested: make check; uv run pytest tests/autorag_research/evaluation/metrics/test_generation.py tests/autorag_research/cli/test_utils.py -q; make test; manual exact_match/token_f1 smoke check
Not-tested: Additional changes beyond the already-approved issue #337 diff

vkehfdl1 commented Apr 6, 2026

Re-verified the approved issue #337 follow-up on feature/#337 and kept the PR scoped to the shipped generation-metric work.

  • added SQuAD-style exact_match and token_f1 generation metrics with shared best-reference scoring
  • exposed both metrics through built-in exports and generation metric configs
  • covered normalization, multi-reference selection, empty-normalized answers, repeated-token overlap, and config discovery with tests

Verification:

  • make check
  • uv run pytest tests/autorag_research/evaluation/metrics/test_generation.py tests/autorag_research/cli/test_utils.py -q ✅ (38 passed in 54.96s)
  • make test ✅ (1309 passed, 64 deselected in 60.07s)
  • manual smoke check ✅ (exact_match → [1.0, 0.0, 1.0], token_f1 → [1.0, 0.6666666666666666, 1.0])


vkehfdl1 commented Apr 6, 2026

APPROVE

Short reason: the diff is correctly scoped to issue #337 and I did not find any merge-blocking code issues.

Real Result:

  • make check
  • uv run pytest tests/autorag_research/evaluation/metrics/test_generation.py tests/autorag_research/cli/test_utils.py -q ✅ (38 passed in 55.08s)
  • make test ✅ (1309 passed, 64 deselected in 61.06s)

Concrete evidence:

  • Reviewed the full PR diff (origin/dev...origin/feature/#337): 6 files changed, scoped to generation metrics, built-in exports/configs, and tests.
  • autorag_research/evaluation/metrics/generation.py adds SQuAD-style exact_match and token_f1 with shared best-reference scoring.
  • autorag_research/evaluation/metrics/__init__.py plus configs/metrics/generation/exact_match.yaml and configs/metrics/generation/token_f1.yaml expose the new metrics through built-in imports/config discovery.
  • tests/autorag_research/evaluation/metrics/test_generation.py and tests/autorag_research/cli/test_utils.py cover normalization, best-reference selection, empty-normalized answers, repeated-token overlap, and config discovery.
  • Diagnostics on the modified Python files returned 0 errors, and the delegated code-review pass also found 0 blocking issues.

Resolving the merge from dev required keeping both sides of the generation-metric surface. The branch now preserves the SQuAD-style exact-match/token-F1 work from feature/#337 while retaining the UniEval exports, configs, and coverage that landed on dev. The conflict resolution keeps metric discovery and built-in exports coherent after the branch update.

Constraint: Merge update had to preserve both feature/#337 answer metrics and dev's newer UniEval additions
Rejected: Favor only the PR branch conflict hunks | would have dropped UniEval exports and config discovery coverage from dev
Rejected: Favor only dev's conflict hunks | would have removed feature/#337 exact-match/token-F1 exports and discovery assertions
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: Keep generation metric exports and config-discovery assertions in sync whenever new built-in metrics land on either branch
Tested: uv run pytest tests/autorag_research/evaluation/metrics/test_generation.py tests/autorag_research/evaluation/metrics/test_unieval.py tests/autorag_research/cli/test_utils.py -q
Tested: make check
Tested: make test

vkehfdl1 commented Apr 6, 2026

Merged dev into feature/#337 and resolved the remaining conflicts by keeping both sides of the generation-metrics surface:

  • preserved the PR's exact_match / token_f1 exports and config-discovery coverage
  • kept the newer UniEval metric exports, configs, and tests from dev
  • updated the shared metric export surface and CLI discovery assertion so both metric families stay available together after the merge

Verified:

  • uv run pytest tests/autorag_research/evaluation/metrics/test_generation.py tests/autorag_research/evaluation/metrics/test_unieval.py tests/autorag_research/cli/test_utils.py -q
  • make check
  • make test


vkehfdl1 commented Apr 6, 2026

APPROVE

Short reason: the live PR head is still narrowly scoped to issue #337's generation-metric work, and I did not find a merge-blocking defect in the current diff after the dev merge.

Real Result:

  • make check
  • uv run pytest tests/autorag_research/evaluation/metrics/test_generation.py tests/autorag_research/evaluation/metrics/test_unieval.py tests/autorag_research/cli/test_utils.py -q ✅ (54 passed in 65.54s)
  • manual smoke check ✅ (exact_match → [1.0, 0.0, 1.0], token_f1 → [0.6666666666666666, 0.8, 1.0])
  • make test ⚠️ did not finish cleanly in my environment: 1320 passed, 5 failed, 64 deselected, 1 error in 61.92s; the failures were in unchanged DB-backed CLI / main_rag_pipeline tests after PostgreSQL dropped during the run

Concrete evidence:

  • Reviewed the current live diff against origin/dev: git diff --stat origin/dev...HEAD shows 6 changed files, all limited to generation metrics, built-in exports/configs, and tests.
  • autorag_research/evaluation/metrics/generation.py adds SQuAD-style exact_match and token_f1, plus shared best-reference scoring reused by huggingface_evaluate().
  • autorag_research/evaluation/metrics/__init__.py, configs/metrics/generation/exact_match.yaml, and configs/metrics/generation/token_f1.yaml expose the new built-in metrics cleanly.
  • tests/autorag_research/evaluation/metrics/test_generation.py and tests/autorag_research/cli/test_utils.py cover normalization, best-reference selection, empty-normalized answers, repeated-token overlap, and config discovery; tests/autorag_research/evaluation/metrics/test_unieval.py also passed, so the merge resolution did not break the newer UniEval surface from dev.
  • The make test failures were outside the diff scope: tests/autorag_research/cli/commands/test_e2e_pipeline.py, tests/autorag_research/cli/commands/test_show.py, and tests/autorag_research/pipelines/generation/test_main_rag_pipeline.py.
  • Diagnostics on the modified Python files returned 0 errors.

@vkehfdl1 vkehfdl1 merged commit 92c76c2 into dev Apr 6, 2026
6 checks passed
@vkehfdl1 vkehfdl1 deleted the feature/#337 branch April 6, 2026 22:48