
Conversation

@mkhi238 (Contributor) commented on Oct 9, 2025

Description

This PR adds the FEVER (Fact Extraction and VERification) metric to the evaluate library.

What does this metric do?

The FEVER metric evaluates fact verification systems against evidence retrieved from Wikipedia. It includes:

  • Label accuracy: Measures how often predicted labels match gold labels
  • FEVER score: The official metric, which requires both a correct label and complete evidence retrieval
  • Evidence precision/recall/F1: Micro-averaged metrics for evidence retrieval
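
For context, here is a minimal usage sketch. It assumes the standard evaluate.load / compute API; the prediction and reference schema shown below (a label plus evidence given as page/sentence pairs) is illustrative only, see the README.md in this PR for the actual field names.

import evaluate

# Load the FEVER metric added by this PR.
fever = evaluate.load("fever")

# Hypothetical input schema: each prediction carries a label and the retrieved
# evidence as [page, sentence_id] pairs; each reference carries the gold label
# and one or more complete gold evidence sets. Check README.md for the real format.
predictions = [
    {"label": "SUPPORTS", "evidence": [["Some_Wikipedia_Page", 0]]},
    {"label": "NOT ENOUGH INFO", "evidence": []},
]
references = [
    {"label": "SUPPORTS", "evidence": [[["Some_Wikipedia_Page", 0]]]},
    {"label": "NOT ENOUGH INFO", "evidence": [[]]},
]

results = fever.compute(predictions=predictions, references=references)
# Expected outputs, per the description above: label accuracy, FEVER score,
# and micro-averaged evidence precision/recall/F1.
print(results)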

Checklist

  • Implementation in fever.py
  • Comprehensive test suite with 12 tests (all passing)
  • README.md with metric card and examples
  • Gradio app for interactive testing
  • Code formatted with black and isort
  • All tests pass

Tests:

All tests pass successfully:

test_fever.py
............
----------------------------------------------------------------------
Ran 12 tests in 0.142s

OK

Note: This is my first contribution to the evaluate library. I'd appreciate any feedback or suggestions for improvement!

@lhoestq (Member) left a comment:

lgtm !

@lhoestq merged commit 9e0a446 into huggingface:main on Nov 14, 2025
@lhoestq (Member) commented on Nov 14, 2025

the metric is available now, with its page at https://huggingface.co/spaces/evaluate-metric/fever :)
