Awesome Agent-as-a-Judge

Welcome to Awesome Agent-as-a-Judge! πŸ‘‹ This repository collects the papers behind the survey "A Survey on Agent-as-a-Judge", in which LLM-based agents act as judges that evaluate many types of outputs, including natural language generation, code generation, mathematical reasoning, and more.

(Figure: Agent-as-a-Judge illustration)
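To make the paradigm concrete, the sketch below shows one minimal shape an agent judge can take: prompt an LLM once per evaluation criterion, then aggregate the per-criterion verdicts into a score. This is an illustrative assumption, not the method of any paper listed here; `call_llm`, `Verdict`, and `judge` are hypothetical names, and `call_llm` is a mock so the example stays self-contained.

```python
# A minimal, illustrative agent-as-a-judge loop. Nothing here comes from a
# specific paper in this list; `call_llm` stands in for any LLM client.
from dataclasses import dataclass


def call_llm(prompt: str) -> str:
    # Mock response in the "<score>|<rationale>" format the judge expects.
    # Replace with a real chat-completion call in practice.
    return "4|Covers the task, with minor omissions."


@dataclass
class Verdict:
    criterion: str
    score: int  # 1 (poor) to 5 (excellent)
    rationale: str


def judge(task: str, candidate: str, criteria: list[str]) -> float:
    """Score a candidate output per criterion, then average the scores."""
    verdicts = []
    for criterion in criteria:
        prompt = (
            f"Task: {task}\n"
            f"Candidate output: {candidate}\n"
            f"Rate the output on '{criterion}' from 1 to 5.\n"
            "Answer exactly as: <score>|<rationale>"
        )
        score, _, rationale = call_llm(prompt).partition("|")
        verdicts.append(Verdict(criterion, int(score.strip()), rationale.strip()))
    # Real agent judges add planning, tool calls, debate, or memory here;
    # the taxonomy below organizes the papers along exactly those axes.
    return sum(v.score for v in verdicts) / len(verdicts)


print(judge("Summarize the article.", "A short summary...", ["faithfulness", "fluency"]))
```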

πŸ“‘ Table of Contents

  • πŸ“Š Taxonomy
  • πŸ“š Papers
  • πŸͺ΄ Acknowledgments
  • πŸ“– Citation

πŸ“Š Taxonomy

(Figure: Agent-as-a-Judge taxonomy)

πŸ“š Papers

Methodologies

Multi-Agent Collaboration

  • [2024/01] ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate | paper | Venue: ICLR 2024
  • [2024/12] M-MAD: Multidimensional Multi-Agent Debate for Advanced Machine Translation Evaluation | paper | Venue: ACL 2025
  • [2024/11] SAGEval: The frontiers of Satisfactory Agent based NLG Evaluation | paper
  • [2025/11] HiMATE: A Hierarchical Multi-Agent Framework for Machine Translation Evaluation | paper | Venue: EMNLP 2025
  • [2025/05] CAFES: A Collaborative Multi-Agent Framework for Multi-Granular Multimodal Essay Scoring | paper
  • [2025/03] GEMA-Score: Granular Explainable Multi-Agent Score for Radiology Report Evaluation | paper
  • [2025/07] CourtEval: A Courtroom-Based Multi-Agent Evaluation Framework | paper | Venue: ACL 2025
  • [2025/02] Faithful, Unfaithful or Ambiguous? Multi-Agent Debate with Initial Stance for Summary Evaluation | paper | Venue: NAACL 2025
  • [2023/03] Large Language Models are Diverse Role-Players for Summarization Evaluation | paper | Venue: NLPCC 2023

Planning

  • [2024/05] MATEval: A Multi-Agent Discussion Framework for Advancing Open-Ended Text Evaluation | paper | Venue: DASFAA 2024
  • [2024/12] Evaluation Agent: Efficient and Promptable Evaluation Framework for Visual Generative Models | paper | Venue: ACL 2025
  • [2025/04] EvalAgent: Discovering Implicit Evaluation Criteria from the Web | paper | Venue: COLM 2025
  • [2025/05] AGENT-X: Adaptive Guideline-based Expert Network for Threshold-free AI-generated teXt detection | paper
  • [2025/11] Large Language Models as User-Agents for Evaluating Task-Oriented-Dialogue Systems | paper
  • [2025/10] Online Rubrics Elicitation from Pairwise Comparisons | paper

Tool Integration

  • [2024/10] Agent-as-a-Judge: Evaluate Agents with Agent | paper | Venue: ICML 2025
  • [2025/04] CodeVisionary: An Agent-based Framework for Evaluating Large Language Models in Code Generation | paper | Venue: ASE 2025
  • [2024/12] Evaluation Agent: Efficient and Promptable Evaluation Framework for Visual Generative Models | paper | Venue: ACL 2025
  • [2025/12] ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning | paper
  • [2025/11] HERMES: Towards Efficient and Verifiable Mathematical Reasoning in LLMs | paper
  • [2025/04] VerifiAgent: A Unified Verification Agent in Language Model Reasoning | paper | Venue: EMNLP 2025
  • [2025/02] Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals | paper | Venue: ACL 2025

Memory and Personalization

  • [2025/11] HERMES: Towards Efficient and Verifiable Mathematical Reasoning in LLMs | paper
  • [2025/12] ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning | paper
  • [2024/10] Agent-as-a-Judge: Evaluate Agents with Agent | paper | Venue: ICML 2025
  • [2025/05] Teaching Language Models to Evolve with Users: Dynamic Profile Modeling for Personalized Alignment | paper | Venue: NeurIPS 2025
  • [2025/06] SynthesizeMe! Inducing Persona-Guided Prompts for Personalized Reward Models in LLMs | paper | Venue: ACL 2025
  • [2025/08] PersRM-R1: Enhance Personalized Reward Modeling with Reinforcement Learning | paper
  • [2025/02] FSPO: Few-Shot Preference Optimization of Synthetic Preference Data in LLMs Elicits Effective Personalization to Real Users | paper

Optimization Paradigms

  • [2024/12] Evaluation Agent: Efficient and Promptable Evaluation Framework for Visual Generative Models | paper | Venue: ACL 2025
  • [2025/11] HERMES: Towards Efficient and Verifiable Mathematical Reasoning in LLMs | paper
  • [2025/04] Multi-Agent LLM Judge: Automatic Personalized LLM Judge Design for Evaluating Natural Language Generation Applications | paper
  • [2024/11] SAGEval: The frontiers of Satisfactory Agent based NLG Evaluation | paper
  • [2025/05] AGENT-X: Adaptive Guideline-based Expert Network for Threshold-free AI-generated teXt detection | paper
  • [2025/06] SynthesizeMe! Inducing Persona-Guided Prompts for Personalized Reward Models in LLMs | paper | Venue: ACL 2025
  • [2025/10] Incentivizing Agentic Reasoning in LLM Judges via Tool-Integrated Reinforcement Learning | paper
  • [2025/12] ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning | paper

Applications

Professional Domains

Education

  • [2025/09] AutoSCORE: Enhancing Automated Scoring with Multi-Agent Large Language Models via Structured Component Recognition | paper
  • [2025/07] Multi-Agent-as-Judge: Aligning LLM-Agent-Based Automated Evaluation with Multi-Dimensional Human Evaluation | paper
  • [2024/10] A LLM-Powered Automatic Grading Framework with Human-Level Guidelines Optimization | paper
  • [2024/05] Grade Like a Human: Rethinking Automated Assessment with Multi-Agent LLMs | paper

Finance

  • [2025/07] FinResearchBench: A Logic Tree based Agent-as-a-Judge Evaluation Framework for Financial Research Agents | paper | Venue: ICAIF 2025
  • [2025/02] FinDeepResearch: Evaluating Deep Research Agents in Rigorous Financial Analysis | paper
  • [2025/02] Standard Benchmarks Fail -- Auditing LLM Agents in Finance Must Prioritize Risk | paper
  • [2025/07] From Tasks to Teams: A Risk-First Evaluation Framework for Multi-Agent LLM Systems in Finance | paper | Venue: ICML Workshop 2025

Law

  • [2024/03] AgentsCourt: Building Judicial Decision-Making Agents with Court Debate Simulation and Legal Knowledge Augmentation | paper | Venue: EMNLP 2024
  • [2025/09] SAMVAD: A Multi-Agent System for Simulating Judicial Deliberation Dynamics in India | paper
  • [2024/12] AgentsBench: A Multi-Agent LLM Simulation Framework for Legal Judgment Prediction | paper

Medicine

  • [2025/07] Multi-Agent-as-Judge: Aligning LLM-Agent-Based Automated Evaluation with Multi-Dimensional Human Evaluation | paper
  • [2025/03] GEMA-Score: Granular Explainable Multi-Agent Score for Radiology Report Evaluation | paper
  • [2024/02] Benchmarking Large Language Models on Communicative Medical Coaching: A Dataset and a Novel System | paper | Venue: ACL 2024
  • [2024/02] AI Hospital: Benchmarking Large Language Models in a Multi-Agent Medical Interaction Simulator | paper | Venue: COLING 2025

General Domains

Multimodal and Vision

  • [2025/04] CIGEval: A Unified Agentic Framework for Evaluating Conditional Image Generation | paper | Venue: ACL 2025
  • [2024/12] Evaluation Agent: Efficient and Promptable Evaluation Framework for Visual Generative Models | paper | Venue: ACL 2025
  • [2024/10] LRQ-Fact: LLM-Generated Relevant Questions for Multimodal Fact-Checking | paper
  • [2025/12] ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning | paper

Conversation and Interaction

  • [2025/01] IntellAgent: A Multi-Agent Framework for Evaluating Conversational AI Systems | paper
  • [2025/05] ESC-Judge: A Framework for Comparing Emotional Support Conversational Agents | paper | Venue: EMNLP 2025
  • [2025/05] Sentient Agent as a Judge: Evaluating Higher-Order Social Cognition | paper
  • [2025/01] PSYCHE: A Multi-faceted Patient Simulation Framework for Evaluation of Psychiatric Assessment Conversational Agent | paper
  • [2025/11] Large Language Models as User-Agents for Evaluating Task-Oriented-Dialogue Systems | paper

Fact-Checking

  • [2025/02] FACT-AUDIT: An Adaptive Multi-Agent Framework for Dynamic Fact-Checking Evaluation | paper | Venue: ACL 2025
  • [2025/05] UrduFactCheck: An Agentic Fact-Checking Framework for Urdu | paper | Venue: EMNLP 2025
  • [2025/01] NarrativeFactScore: Agent-as-Judge for Factual Summarization of Long Narratives | paper | Venue: EMNLP 2025

Math and Code

  • [2025/11] HERMES: Towards Efficient and Verifiable Mathematical Reasoning in LLMs | paper
  • [2025/04] VerifiAgent: A Unified Verification Agent in Language Model Reasoning | paper | Venue: EMNLP 2025
  • [2025/08] CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward | paper | Venue: EMNLP 2025
  • [2025/04] xVerify: Efficient Answer Verifier for Reasoning Model Evaluations | paper
  • [2025/02] Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals | paper | Venue: ACL 2025
  • [2025/02] Multi-Agent Verification: Scaling Test-Time Compute with Multiple Verifiers | paper | Venue: COLM 2025
  • [2025/02] Popper: Automated Hypothesis Validation with Agentic Sequential Falsifications | paper | Venue: ICML 2025
  • [2025/04] CodeVisionary: An Agent-based Framework for Evaluating Large Language Models in Code Generation | paper | Venue: ASE 2025

πŸͺ΄ Acknowledgments

We would like to thank the contributors, open-source projects, and research communities whose work made this collection possible. This repository builds upon the excellent research in agent-based evaluation methods.

πŸ“– Citation

@misc{you2026agentasajudge,
      title={Agent-as-a-Judge}, 
      author={Runyang You and Hongru Cai and Caiqi Zhang and Qiancheng Xu and Meng Liu and Tiezheng Yu and Yongqi Li and Wenjie Li},
      year={2026},
      eprint={2601.05111},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2601.05111}, 
}

⭐ Thank you for visiting Awesome Agent-as-a-Judge! ⭐
