Hi 👋
I’m currently building an evaluation framework for long-context / QA benchmarks (e.g. HotpotQA), and I ran into a usability gap in `evaluate` that I wanted to discuss before opening a PR.
Context / motivation
While evaluating HotpotQA, I wanted to use the standard QA token-level F1 (the same one used in SQuAD: token overlap, max over references, averaged across examples).
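For concreteness, here is a minimal sketch of that per-example score, assuming the usual SQuAD-style normalization (lowercasing, stripping punctuation and articles, whitespace tokenization); the helper name `token_f1` is just for illustration:

```python
import re
import string
from collections import Counter
from typing import List


def normalize(text: str) -> str:
    # Roughly the SQuAD normalization: lowercase, drop punctuation and
    # articles, collapse whitespace.
    text = "".join(ch for ch in text.lower() if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def token_f1(prediction: str, references: List[str]) -> float:
    # Token-overlap F1 against each gold reference, keeping the best score.
    best = 0.0
    pred_tokens = normalize(prediction).split()
    for reference in references:
        ref_tokens = normalize(reference).split()
        common = Counter(pred_tokens) & Counter(ref_tokens)
        num_same = sum(common.values())
        if num_same == 0:
            continue
        precision = num_same / len(pred_tokens)
        recall = num_same / len(ref_tokens)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best


# token_f1("the Eiffel Tower", ["Eiffel Tower"]) -> 1.0
```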
I noticed that:
- This exact logic already exists in `metrics/squad/compute_score.py` (and is correct).
- However, it is tightly coupled to the SQuAD data schema, making it hard to use “out of the box” for other QA datasets or generic LLM evaluation pipelines.
- As a result, I had to manually re-implement token-level F1 for my own benchmarking, even though the canonical implementation already lives in `evaluate`.
This feels like a missed opportunity for reuse, especially since many QA datasets (HotpotQA, Natural Questions, LongBench QA, etc.) use the same evaluation logic.
Proposal
Rather than adding a new implementation, I’d like to explore making the existing SQuAD token-level F1 reusable / pluggable for other datasets.
I see three possible approaches and would love feedback on which direction is preferred:
Option A: Extract a generic QA F1 metric
- Factor out the token-level F1 logic (normalization + token overlap + max-over-refs) into a generic QA metric with a simple interface: `predictions: List[str]`, `references: List[List[str]]` (see the sketch after this list).
- Keep the current `squad` metric as a thin wrapper that adapts SQuAD’s schema to this generic metric.
- This avoids duplication and creates a single source of truth.
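As a rough illustration of the interface (not a final design), the generic metric could average the per-example score sketched above, reusing the `token_f1` helper; the function name `compute_qa_f1` and the 0–100 scaling (matching the existing `squad` metric) are placeholders:

```python
from typing import Dict, List


def compute_qa_f1(predictions: List[str], references: List[List[str]]) -> Dict[str, float]:
    # Per-example max-over-references F1, averaged over the dataset and
    # scaled to 0-100 like the existing `squad` metric.
    scores = [token_f1(pred, refs) for pred, refs in zip(predictions, references)]
    return {"f1": 100.0 * sum(scores) / len(scores)}
```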
Option B: Add a thin “QA F1” wrapper that reuses SQuAD logic
- Introduce a new metric (e.g. `qa_f1`) that internally calls the existing SQuAD scoring code (a sketch follows this list).
- Expose it with a dataset-agnostic input format (strings + list of references).
- Minimal refactor, but improves usability significantly.
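A minimal sketch of what such a wrapper could do, assuming the current SQuAD schema (`prediction_text`, `answers` / `answer_start`); `qa_f1` here is the hypothetical new metric, and the dummy `answer_start` values exist only to satisfy the schema:

```python
from typing import Dict, List

import evaluate


def qa_f1(predictions: List[str], references: List[List[str]]) -> Dict[str, float]:
    # Adapt plain strings to the SQuAD schema and reuse the existing scorer.
    squad = evaluate.load("squad")
    preds = [{"id": str(i), "prediction_text": p} for i, p in enumerate(predictions)]
    refs = [
        {"id": str(i), "answers": {"text": list(r), "answer_start": [0] * len(r)}}
        for i, r in enumerate(references)
    ]
    return {"f1": squad.compute(predictions=preds, references=refs)["f1"]}


# qa_f1(["Paris"], [["Paris", "the capital of France"]]) -> {"f1": 100.0}
```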
Option C: Extend the existing squad metric interface
- Allow `evaluate.load("squad")` to optionally accept a simpler input format (in addition to the current SQuAD JSON schema), as illustrated below.
- Lowest surface-area change, but still enables reuse for non-SQuAD datasets.
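To make the contrast concrete, here is roughly how the two input formats would compare; the simpler call is hypothetical and is not supported today:

```python
import evaluate

squad = evaluate.load("squad")

# Current SQuAD schema (works today):
squad.compute(
    predictions=[{"id": "0", "prediction_text": "Paris"}],
    references=[{"id": "0", "answers": {"text": ["Paris"], "answer_start": [0]}}],
)

# Proposed, simpler format (hypothetical, not supported yet):
# squad.compute(predictions=["Paris"], references=[["Paris"]])
```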
Why this seems useful
- Avoids re-implementing a very standard QA metric across projects.
- Makes `evaluate` easier to use for modern LLM benchmarking workflows.
- Keeps existing SQuAD behavior intact while improving modularity.
- Complements `exact_match` (which already exists as a standalone metric).
If this direction makes sense, I’m happy to implement whichever option aligns best with the project’s design goals and contribution guidelines. Please let me know the preferred option.
Thanks for your time, and happy to iterate on the approach.