Proposal: Make SQuAD QA F1 logic reusable as a generic QA metric in evaluate #724

@Ayesha-Imr

Description

Hi 👋

I’m currently building an evaluation framework for long-context / QA benchmarks (e.g. HotpotQA), and I ran into a usability gap in evaluate that I wanted to discuss before opening a PR.

Context / motivation

While evaluating HotpotQA, I wanted to use the standard QA token-level F1 (the same one used in SQuAD: token overlap, max over references, averaged across examples).

I noticed that:

  • This exact logic already exists in metrics/squad/compute_score.py (and is correct).
  • However, it is tightly coupled to the SQuAD data schema, making it hard to use “out of the box” for other QA datasets or generic LLM evaluation pipelines.
  • As a result, I had to manually re-implement token-level F1 for my own benchmarking, even though the canonical implementation already lives in evaluate.
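
For concreteness, this is roughly what I ended up re-implementing. It is a minimal sketch of the SQuAD-style token F1, with the normalization approximated from the usual SQuAD rules and helper names of my own choosing:

    import re
    import string
    from collections import Counter
    from typing import List

    def normalize(text: str) -> str:
        """Lowercase, drop punctuation and articles, collapse whitespace (SQuAD-style)."""
        text = text.lower()
        text = "".join(ch for ch in text if ch not in set(string.punctuation))
        text = re.sub(r"\b(a|an|the)\b", " ", text)
        return " ".join(text.split())

    def token_f1(prediction: str, reference: str) -> float:
        """Token-overlap F1 between one prediction and one reference."""
        pred_tokens = normalize(prediction).split()
        ref_tokens = normalize(reference).split()
        common = Counter(pred_tokens) & Counter(ref_tokens)
        num_same = sum(common.values())
        if num_same == 0:
            return 0.0
        precision = num_same / len(pred_tokens)
        recall = num_same / len(ref_tokens)
        return 2 * precision * recall / (precision + recall)

    def qa_f1(predictions: List[str], references: List[List[str]]) -> float:
        """Max over references per example, averaged across examples (0-100 scale)."""
        scores = [
            max(token_f1(pred, ref) for ref in refs)
            for pred, refs in zip(predictions, references)
        ]
        return 100.0 * sum(scores) / len(scores)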

This feels like a missed opportunity for reuse, especially since many QA datasets (HotpotQA, Natural Questions, LongBench QA, etc.) use the same evaluation logic.

Proposal

Rather than adding a new implementation, I’d like to explore making the existing SQuAD token-level F1 reusable / pluggable for other datasets.

I see three possible approaches and would love feedback on which direction is preferred:

Option A: Extract a generic QA F1 metric

  • Factor out the token-level F1 logic (normalization + token overlap + max-over-refs) into a generic QA metric with a simple interface:

    predictions: List[str]
    references: List[List[str]]
  • Keep the current squad metric as a thin wrapper that adapts SQuAD’s schema to this generic metric (roughly as sketched below).

  • This avoids duplication and creates a single source of truth.
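
To make Option A concrete, the squad metric could become a thin adapter over the generic helper, roughly like this (a sketch only; qa_f1 is the hypothetical generic function from the snippet above, and the dict formats are the ones the current squad metric accepts, as far as I can tell):

    from typing import Dict, List

    def squad_f1_via_generic(predictions: List[Dict], references: List[Dict]) -> float:
        """Adapt the SQuAD schema to the generic metric, keeping `squad` a thin wrapper.

        predictions: [{"id": ..., "prediction_text": ...}, ...]
        references:  [{"id": ..., "answers": {"text": [...], "answer_start": [...]}}, ...]
        """
        answers_by_id = {ref["id"]: ref["answers"]["text"] for ref in references}
        preds: List[str] = []
        refs: List[List[str]] = []
        for pred in predictions:
            preds.append(pred["prediction_text"])
            refs.append(answers_by_id[pred["id"]])
        # qa_f1 is the extracted generic metric sketched earlier in this issue.
        return qa_f1(preds, refs)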

Option B: Add a thin “QA F1” wrapper that reuses SQuAD logic

  • Introduce a new metric (e.g. qa_f1) that internally calls the existing SQuAD scoring code (example usage below).
  • Expose it with a dataset-agnostic input format (strings + list of references).
  • Minimal refactor, but improves usability significantly.
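
For Option B, the usage I have in mind would look something like this (purely illustrative; the qa_f1 name and output key are placeholders, not an existing metric):

    import evaluate

    # Hypothetical: a "qa_f1" metric does not exist in evaluate today.
    qa_f1 = evaluate.load("qa_f1")

    results = qa_f1.compute(
        predictions=["Barack Obama", "in 1998"],           # plain strings
        references=[["Barack Obama", "Obama"], ["1998"]],  # one or more references per example
    )
    print(results)  # e.g. {"f1": ...}, computed by the existing SQuAD scoring code under the hood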

Option C: Extend the existing squad metric interface

  • Allow evaluate.load("squad") to optionally accept a simpler input format (in addition to the current SQuAD JSON schema), as illustrated below.
  • Lowest surface-area change, but still enables reuse for non-SQuAD datasets.
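
For Option C, the contrast would be roughly the following (the first call reflects the current schema as I understand it; the second is the proposed extension and is not supported today):

    import evaluate

    squad = evaluate.load("squad")

    # Today: SQuAD-schema dicts.
    results = squad.compute(
        predictions=[{"id": "q1", "prediction_text": "Barack Obama"}],
        references=[{"id": "q1", "answers": {"text": ["Barack Obama", "Obama"], "answer_start": [0, 0]}}],
    )

    # Proposed (hypothetical): optionally also accept plain strings and reference lists.
    results = squad.compute(
        predictions=["Barack Obama"],
        references=[["Barack Obama", "Obama"]],
    )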

Why this seems useful

  • Avoids re-implementing a very standard QA metric across projects.
  • Makes evaluate easier to use for modern LLM benchmarking workflows.
  • Keeps existing SQuAD behavior intact while improving modularity.
  • Complements exact_match (which already exists as a standalone metric).
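
For reference, exact_match already exposes this kind of dataset-agnostic interface, which is the shape I would hope a reusable F1 metric could follow (the value in the comment is illustrative):

    import evaluate

    exact_match = evaluate.load("exact_match")
    results = exact_match.compute(
        predictions=["Barack Obama", "in 1998"],
        references=["Barack Obama", "1998"],
    )
    print(results)  # e.g. {"exact_match": 0.5}; plain strings in, no dataset-specific schema required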

If this direction makes sense, I’m happy to implement whichever option aligns best with the project’s design goals and contribution guidelines. Please let me know the preferred option.

Thanks for your time, and happy to iterate on the approach.
