Hi 👋
I’m currently building an evaluation framework for long-context / QA benchmarks (e.g. HotpotQA), and I ran into a usability gap in `evaluate` that I wanted to discuss before opening a PR.
Context / motivation
While evaluating HotpotQA, I wanted to use the standard QA token-level F1 (the same one used in SQuAD: token overlap, max over references, averaged across examples).
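For concreteness, here is a minimal sketch of that per-example score, assuming the usual SQuAD-style normalization (lowercasing, stripping punctuation and articles, whitespace tokenization); the helper name `token_f1` is just for illustration:

```python
import re
import string
from collections import Counter
from typing import List


def normalize(text: str) -> str:
    # Roughly the SQuAD normalization: lowercase, drop punctuation and
    # articles, collapse whitespace.
    text = "".join(ch for ch in text.lower() if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def token_f1(prediction: str, references: List[str]) -> float:
    # Token-overlap F1 against each gold reference, keeping the best score.
    best = 0.0
    pred_tokens = normalize(prediction).split()
    for reference in references:
        ref_tokens = normalize(reference).split()
        common = Counter(pred_tokens) & Counter(ref_tokens)
        num_same = sum(common.values())
        if num_same == 0:
            continue
        precision = num_same / len(pred_tokens)
        recall = num_same / len(ref_tokens)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best


# token_f1("the Eiffel Tower", ["Eiffel Tower"]) -> 1.0
```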
I noticed that:
- This exact logic already exists in `metrics/squad/compute_score.py` (and is correct).
- However, it is tightly coupled to the SQuAD data schema, making it hard to use “out of the box” for other QA datasets or generic LLM evaluation pipelines.
- As a result, I had to manually re-implement token-level F1 for my own benchmarking, even though the canonical implementation already lives in `evaluate`.
This feels like a missed opportunity for reuse, especially since many QA datasets (HotpotQA, Natural Questions, LongBench QA, etc.) use the same evaluation logic.
Proposal
Rather than adding a new implementation, I’d like to explore making the existing SQuAD token-level F1 reusable / pluggable for other datasets.
I see three possible approaches and would love feedback on which direction is preferred:
Option A: Extract a generic QA F1 metric
- Factor out the token-level F1 logic (normalization + token overlap + max-over-refs) into a generic QA metric with a simple interface: `predictions: List[str]`, `references: List[List[str]]` (see the sketch after this list).
- Keep the current `squad` metric as a thin wrapper that adapts SQuAD’s schema to this generic metric.
- This avoids duplication and creates a single source of truth.
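As a rough illustration of the interface (not a final design), the generic metric could average the per-example score sketched above, reusing the `token_f1` helper; the function name `compute_qa_f1` and the 0–100 scaling (matching the existing `squad` metric) are placeholders:

```python
from typing import Dict, List


def compute_qa_f1(predictions: List[str], references: List[List[str]]) -> Dict[str, float]:
    # Per-example max-over-references F1, averaged over the dataset and
    # scaled to 0-100 like the existing `squad` metric.
    scores = [token_f1(pred, refs) for pred, refs in zip(predictions, references)]
    return {"f1": 100.0 * sum(scores) / len(scores)}
```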
Option B: Add a thin “QA F1” wrapper that reuses SQuAD logic
- Introduce a new metric (e.g. `qa_f1`) that internally calls the existing SQuAD scoring code (a sketch follows this list).
- Expose it with a dataset-agnostic input format (strings + list of references).
- Minimal refactor, but improves usability significantly.
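A minimal sketch of what such a wrapper could do, assuming the current SQuAD schema (`prediction_text`, `answers` / `answer_start`); `qa_f1` here is the hypothetical new metric, and the dummy `answer_start` values exist only to satisfy the schema:

```python
from typing import Dict, List

import evaluate


def qa_f1(predictions: List[str], references: List[List[str]]) -> Dict[str, float]:
    # Adapt plain strings to the SQuAD schema and reuse the existing scorer.
    squad = evaluate.load("squad")
    preds = [{"id": str(i), "prediction_text": p} for i, p in enumerate(predictions)]
    refs = [
        {"id": str(i), "answers": {"text": list(r), "answer_start": [0] * len(r)}}
        for i, r in enumerate(references)
    ]
    return {"f1": squad.compute(predictions=preds, references=refs)["f1"]}


# qa_f1(["Paris"], [["Paris", "the capital of France"]]) -> {"f1": 100.0}
```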
Option C: Extend the existing squad metric interface
- Allow `evaluate.load("squad")` to optionally accept a simpler input format (in addition to the current SQuAD JSON schema), as illustrated below.
- Lowest surface-area change, but still enables reuse for non-SQuAD datasets.
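To make the contrast concrete, here is roughly how the two input formats would compare; the simpler call is hypothetical and is not supported today:

```python
import evaluate

squad = evaluate.load("squad")

# Current SQuAD schema (works today):
squad.compute(
    predictions=[{"id": "0", "prediction_text": "Paris"}],
    references=[{"id": "0", "answers": {"text": ["Paris"], "answer_start": [0]}}],
)

# Proposed, simpler format (hypothetical, not supported yet):
# squad.compute(predictions=["Paris"], references=[["Paris"]])
```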
Why this seems useful
- Avoids re-implementing a very standard QA metric across projects.
- Makes `evaluate` easier to use for modern LLM benchmarking workflows.
- Keeps existing SQuAD behavior intact while improving modularity.
- Complements `exact_match` (which already exists as a standalone metric).
If this direction makes sense, I’m happy to implement whichever option aligns best with the project’s design goals and contribution guidelines. Please let me know the preferred option.
Thanks for your time, and happy to iterate on the approach.