Description
Implement AlignScore (Zha et al., ACL 2023) as a generation evaluation metric.
What it measures
Factual consistency / faithfulness of generated text to source context.
How it works
- RoBERTa-based model trained on 15 datasets (4.7M examples): NLI, fact verification, QA, paraphrase
- Splits generated text into chunks, scores each against source, aggregates
- Outperforms GPT-3.5-based evaluation on SummaC and TRUE benchmarks
- Fully deterministic, no LLM-as-judge
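The chunk-score-aggregate flow above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `align_fn` stands in for the trained RoBERTa alignment model, and `toy_overlap` is a hypothetical stand-in scorer used only to make the example self-contained. The aggregation (max over context chunks per generated sentence, then mean over sentences) follows the scheme described in the paper.

```python
import re

def split_sentences(text):
    """Naive sentence splitter; the real system uses its own chunking
    (context is split into roughly 350-token chunks)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def align_score(context_chunks, generated_text, align_fn):
    """Mean-over-sentences of max-over-context-chunks alignment scores."""
    sentences = split_sentences(generated_text)
    if not sentences:
        return 0.0
    per_sentence = [
        max(align_fn(chunk, sent) for chunk in context_chunks)
        for sent in sentences
    ]
    return sum(per_sentence) / len(per_sentence)

def toy_overlap(chunk, claim):
    """Hypothetical alignment function: fraction of claim tokens
    present in the context chunk. The real metric uses a trained model."""
    ctx = set(chunk.lower().split())
    tokens = [t.strip(".,!?") for t in claim.lower().split()]
    return sum(t in ctx for t in tokens) / len(tokens)

# A fully supported sentence scores 1.0; an unsupported one drags the mean down.
score = align_score(
    ["the cat sat on the mat"],
    "The cat sat on the mat. Dogs fly south.",
    toy_overlap,
)
```

Swapping `toy_overlap` for the trained alignment model recovers the deterministic, judge-free property claimed above: the same inputs always yield the same score.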
Why
- Replaces RAGAS Faithfulness with an academically rigorous alternative
- Reference-free (evaluates against retrieved context only)
- NeurIPS 2026 ED Track submission
Reference
Zha et al., "AlignScore: Evaluating Factual Consistency with a Unified Alignment Function," ACL 2023