Description
Implement BARTScore (Yuan et al., NeurIPS 2021) as a generation evaluation metric.
What it measures
Directional semantic evaluation: faithfulness, precision, recall, F1.
How it works
- Uses pretrained BART model to compute log-probability of one text given another
- Three configurations:
BARTScore(context → answer): faithfulness to retrieved context
BARTScore(reference → answer): precision
BARTScore(answer → reference): recall
- Information-theoretic grounding (conditional log-likelihood)
- Deterministic, no LLM-as-judge
Why
- Published at NeurIPS — strong venue credibility
- Provides directional evaluation (faithfulness vs precision vs recall)
- Complements AlignScore (NLI-based) with probability-based approach → triangulation
- NeurIPS 2026 ED Track submission
Reference
Yuan et al., "BARTScore: Evaluating Generated Text as Text Generation," NeurIPS 2021
Description
Implement BARTScore (Yuan et al., NeurIPS 2021) as a generation evaluation metric.
What it measures
Directional semantic evaluation: faithfulness, precision, recall, F1.
How it works
BARTScore(context → answer): faithfulness to retrieved contextBARTScore(reference → answer): precisionBARTScore(answer → reference): recallWhy
Reference
Yuan et al., "BARTScore: Evaluating Generated Text as Text Generation," NeurIPS 2021