Abstract
LLM-as-a-Judge has revolutionized AI evaluation by leveraging large language models for scalable assessments. However, as evaluands become increasingly complex, specialized, and multi-step, the reliability of LLM-as-a-Judge has become constrained by inherent biases, shallow single-pass reasoning, and the inability to verify assessments against real-world observations. This has catalyzed the transition to Agent-as-a-Judge, where agentic judges employ planning, tool-augmented verification, multi-agent collaboration, and persistent memory to enable more robust, verifiable, and nuanced evaluations. Despite the rapid proliferation of agentic evaluation systems, the field lacks a unified framework to navigate this shifting landscape. To bridge this gap, we present the first comprehensive survey tracing this evolution. Specifically, we identify key dimensions that characterize this paradigm shift and establish a developmental taxonomy. We organize core methodologies and survey applications across general and professional domains. Furthermore, we analyze frontier challenges and identify promising research directions, ultimately providing a clear roadmap for the next generation of agentic evaluation.
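To make the components named in the abstract concrete (planning, tool-augmented verification, multi-agent collaboration, persistent memory), the sketch below shows what a minimal Agent-as-a-Judge evaluation loop might look like. This is an illustrative sketch only, not the survey's framework: all names (`plan_criteria`, `verify_with_tools`, `JudgeMemory`, `panel_judge`) are hypothetical, and the `llm` argument is any callable that maps a prompt string to text, stubbed here so the example runs without API access.

```python
# Illustrative sketch of an Agent-as-a-Judge evaluation loop (hypothetical names).
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class JudgeMemory:
    """Persistent memory: past verdicts the judge can consult across evaluations."""
    records: List[Dict] = field(default_factory=list)

    def recall(self, task: str, k: int = 3) -> List[Dict]:
        # Naive retrieval: most recent records mentioning the task string.
        hits = [r for r in self.records if task.lower() in r["task"].lower()]
        return hits[-k:]

def plan_criteria(llm: Callable[[str], str], task: str) -> List[str]:
    """Planning step: ask the judge model to decompose the task into criteria."""
    reply = llm(f"List 3 short evaluation criteria for the task: {task}")
    return [line.strip("- ").strip() for line in reply.splitlines() if line.strip()]

def verify_with_tools(candidate: str, tools: Dict[str, Callable[[str], bool]]) -> Dict[str, bool]:
    """Tool-augmented verification: run external checks (tests, retrieval, linters)."""
    return {name: tool(candidate) for name, tool in tools.items()}

def agent_judge(llm, task, candidate, tools, memory: JudgeMemory) -> Dict:
    criteria = plan_criteria(llm, task)
    evidence = verify_with_tools(candidate, tools)
    history = memory.recall(task)
    verdict = llm(
        f"Task: {task}\nRubric: {criteria}\nTool evidence: {evidence}\n"
        f"Similar past cases: {history}\nGive a score from 1-5 with a one-line rationale."
    )
    record = {"task": task, "criteria": criteria, "evidence": evidence, "verdict": verdict}
    memory.records.append(record)
    return record

def panel_judge(llms: List[Callable[[str], str]], task, candidate, tools) -> List[Dict]:
    """Multi-agent collaboration: several judges evaluate independently; their
    verdicts can then be aggregated (e.g., majority vote or a meta-judge)."""
    shared_memory = JudgeMemory()
    return [agent_judge(llm, task, candidate, tools, shared_memory) for llm in llms]

if __name__ == "__main__":
    # Stub LLM and tool so the sketch runs without any API access.
    stub_llm = lambda p: "- correctness\n- completeness\n- style" if "criteria" in p else "4: looks solid"
    stub_tools = {"unit_tests_pass": lambda code: "def " in code}
    print(panel_judge([stub_llm, stub_llm], "write a sorting function",
                      "def sort(xs): return sorted(xs)", stub_tools))
```

In a real system the stubbed pieces would be replaced by actual model calls and verifiers (test runners, retrieval, execution sandboxes), and the aggregation over the panel's verdicts would be an explicit design choice rather than a simple list of records.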
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Jenius Agent: Towards Experience-Driven Accuracy Optimization in Real-World Scenarios (2026)
- The Path Ahead for Agentic AI: Challenges and Opportunities (2026)
- Step-DeepResearch Technical Report (2025)
- AI Agent Systems: Architectures, Applications, and Evaluation (2026)
- ARCANE: A Multi-Agent Framework for Interpretable and Configurable Alignment (2025)
- Environment Scaling for Interactive Agentic Experience Collection: A Survey (2025)
- Towards Comprehensive Stage-wise Benchmarking of Large Language Models in Fact-Checking (2026)