This repository presents an exploratory analysis of 500+ AI agent projects, with a primary focus on the presence and reliability of test and evaluation implementations.
The analysis aims to provide an overview of current testing practices in agent-based AI systems. In addition to identifying whether tests are present, the study extracts high-level characteristics such as frameworks, architectural patterns, and usage contexts.
The repository is organized into two main sections:
- Frameworks: analysis of AI agent frameworks and their testing support.
- Use Cases: analysis of real-world AI agent projects and how testing is applied in practice.
The goal is to support discussions in software engineering and verification & validation by highlighting gaps, trends, and opportunities for improving test practices in AI agent development.
Specifically, the analysis automatically maps and classifies AI agents across multiple frameworks, extracting insights related to:
- Implementation patterns (notebooks vs. structured repositories)
- Use of advanced techniques (RAG, multi-agent systems, workflows)
- Technical quality indicators (presence of tests, link validity)
- Application domains and task types
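As an illustration, the attributes extracted per agent could be collected into a record like the following (the field names are hypothetical, not taken from the repository):

```python
from dataclasses import dataclass

@dataclass
class AgentRecord:
    """One analyzed agent, with the characteristics extracted during classification."""
    name: str
    framework: str            # e.g. "Autogen", "CrewAI", "LangGraph", "Agno"
    source_type: str          # "notebook", "repo", "script", or "docs"
    has_tests: bool = False   # any test/evaluation evidence found
    uses_rag: bool = False
    is_multi_agent: bool = False
    link_valid: bool = False
    intent: str = "other"     # "code", "chat", "retrieval", "data", or "other"

record = AgentRecord(name="AgentEval", framework="Autogen",
                     source_type="notebook", has_tests=True)
```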
Framework-specific parsers were implemented to extract structured information from the repository documentation.
parse_langgraph_agents(readme_text)
→ Extracts: Use Case | Industry | Description | Link

Test presence was inferred using lightweight heuristics, depending on the source type.
| Context | Evidence Searched |
|---|---|
| Notebook / Script | assert, unittest, pytest, def test, @test, await aevaluate, evaluate( |
| Repository | test/, tests/, pytest.ini, tox.ini, nose.cfg, *_test.py, test_*.py |
These heuristics may generate false positives and were followed by manual verification.
Classification relied exclusively on textual metadata (Use Case, Industry, Description).
| Category | Keywords |
|---|---|
| RAG | rag, retrieval, document, knowledge base, vector |
| Multi-Agent | multi-agent, hierarchical, supervisor, collaboration |
| Workflow | graph, state, conditional (graph-based) |
| Intent | code, chat, retrieval, data |
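These keyword rules can be sketched as follows; case-insensitive substring matching and the intent priority order are assumptions, not documented behavior:

```python
# Keyword buckets from the tables above; matching runs over the concatenated
# textual metadata (Use Case, Industry, Description), lower-cased.
CATEGORY_KEYWORDS = {
    "rag": ["rag", "retrieval", "document", "knowledge base", "vector"],
    "multi-agent": ["multi-agent", "hierarchical", "supervisor", "collaboration"],
    "workflow": ["graph", "state", "conditional"],
}
INTENT_KEYWORDS = {
    "code": ["code"], "chat": ["chat"], "retrieval": ["retrieval"], "data": ["data"],
}

def classify(metadata: str):
    """Return (matched categories, intent) for one agent's textual metadata."""
    text = metadata.lower()
    categories = {cat for cat, kws in CATEGORY_KEYWORDS.items()
                  if any(kw in text for kw in kws)}
    # First intent bucket with a keyword hit wins; otherwise "other".
    intent = next((i for i, kws in INTENT_KEYWORDS.items()
                   if any(kw in text for kw in kws)), "other")
    return categories, intent
```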
Agent Distribution by Framework
| Framework | Total | Notebook | Repo | Script | Docs | Tests | Test % | RAG | Multi-Agent | Valid Links |
|---|---|---|---|---|---|---|---|---|---|---|
| Autogen | 61 | 59 | 0 | 1 | 1 | 5 | 8.2% | 10 | 13 | 60 |
| CrewAI | 22 | 0 | 22 | 0 | 0 | 0 | 0.0% | 2 | 0 | 22 |
| LangGraph | 20 | 0 | 0 | 20 | 0 | 0 | 0.0% | 10 | 3 | 0 |
| Agno | 18 | 0 | 0 | 18 | 0 | 0 | 0.0% | 6 | 0 | 0 |
Agent Distribution by Intent
| Intent | Count |
|---|---|
| Other / Generic | 55 |
| Chat | 26 |
| Retrieval | 22 |
| Code | 14 |
| Data | 4 |
Initially, five agents were flagged as potentially containing tests.
After manual inspection:
| Framework | Agent | Assessment |
|---|---|---|
| Autogen | Chat with OpenAI Assistant with Retrieval Augmentation | ❌ No real tests (analysis script only) |
| Autogen | Multimodal Agent Chat with DALLE and GPT-4V | ❌ Assertions used as exceptions |
| Autogen | Multimodal Agent Chat with Llava | ❌ Assertions not related to testing |
| Autogen | AgentEval | ✅ Valid evaluation framework demonstrated |
| Autogen | Optimize for Code Generation | ✅ Evaluation routines present |
- Total agents: 121
- Frameworks analyzed: 4
- With tests: 5 (2 validated)
- Using RAG: 28
- Multi-agent systems: 16
- Valid links: 82
- Invalid links: 39
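As a sanity check, the headline figures are consistent with the per-framework distribution table; a minimal recomputation (tallies copied from that table):

```python
# Per-framework totals and flagged tests, as reported in the distribution table.
frameworks = {
    "Autogen":   {"total": 61, "tests": 5},
    "CrewAI":    {"total": 22, "tests": 0},
    "LangGraph": {"total": 20, "tests": 0},
    "Agno":      {"total": 18, "tests": 0},
}

total_agents = sum(f["total"] for f in frameworks.values())
with_tests = sum(f["tests"] for f in frameworks.values())
test_pct = {name: round(100 * f["tests"] / f["total"], 1)
            for name, f in frameworks.items()}
```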
This section analyzes real-world AI agent projects, focusing on whether testing artifacts are present and what they actually validate.
| Agent | Domain | Tests | Evidence | Created | Last Commit | Manual Test Analysis | Notes |
|---|---|---|---|---|---|---|---|
| Product Recommendation Agent | Retail | ✅ | tests/ folder | 2023-09-07 | 2025-09-28 | LLM-as-a-judge evaluations without testing framework | |
| Property Pricing Agent | Real Estate | ✅ | tests/ folder | 2024-07-27 | 2026-01-11 | Utility-level tests, minimal integration | |
| Energy Demand Forecasting Agent | Energy | ✅ | agent_evaluation/ | 2024-07-01 | 2024-07-02 | Performance metrics vs ground truth, not software tests | |
| Recruitment Recommendation Agent | HR | ✅ | test/ folder | 2024-08-17 | 2024-09-09 | Scenario execution, deterministic checks | |
| Logistics Optimization Agent | Supply Chain | ✅ | test/ subfolder | 2023-07-31 | 2025-11-25 | Module-level tests unrelated to agent behavior | |
➡️ In most cases, tests validate auxiliary code or models, not the agent’s behavior, reasoning, or decision-making.
The vast majority of AI agent projects do not implement meaningful tests. When tests are present, they typically target infrastructure or model performance, rather than agent-level correctness or behavioral guarantees.