Automated experimentation for AI agents.
Agents run tasks, analyze failures, improve themselves, and repeat.
AutoAgentLab turns agent improvement into an automated research loop. Instead of manually tweaking prompts, let your AI agents experiment on themselves:
Tasks → Agent → Evaluation → Research → Mutation → Repeat
Agents that research themselves.
```shell
pip install -e .
export OPENAI_API_KEY="sk-..."
autoagentlab run qa
```

30 seconds to your first experiment.
```
════════════════════════════════════════════════════════════
🧪 AutoAgentLab — Agent Evolution
════════════════════════════════════════════════════════════
Tasks: 20 | Max iterations: 5
Model: gpt-4o-mini

── Iteration 1 ─────────────────────────────────────
📊 Accuracy: 54% ██████████████████░░░░░░░░░░░░░░ (11/20)
❌ 9 failure(s)
🔬 Analyzing failures...
💡 Suggestion: Add step-by-step reasoning instructions...
🧬 Generating improved prompt...
✅ Mutation accepted — new accuracy: 67%

── Iteration 2 ─────────────────────────────────────
📊 Accuracy: 67% ████████████████████████░░░░░░░░ (13/20)
❌ 7 failure(s)
🔬 Analyzing failures...
💡 Suggestion: Use structured output format...
🧬 Generating improved prompt...
✅ Mutation accepted — new accuracy: 82%

── Iteration 3 ─────────────────────────────────────
📊 Accuracy: 82% █████████████████████████████░░░ (16/20)
...

════════════════════════════════════════════════════════════
📈 Evolution Summary
════════════════════════════════════════════════════════════
✅ Iteration 1: 54%
✅ Iteration 2: 67%
✅ Iteration 3: 82%

Best: Iteration 3 — 82%
```
AI is doing research on itself. 🤯
```
┌──────────────┐
│  Benchmark   │  A set of (question, answer) tasks
│    Tasks     │
└──────┬───────┘
       │
       ▼
┌──────────────┐
│    Agent     │  LLM + system prompt
│              │
└──────┬───────┘
       │ answers
       ▼
┌──────────────┐
│  Evaluator   │  Score accuracy, collect failures
│              │
└──────┬───────┘
       │ failures
       ▼
┌──────────────┐
│  Researcher  │  Analyze failure patterns
│              │  → improvement suggestion
└──────┬───────┘
       │ suggestion
       ▼
┌──────────────┐
│   Mutator    │  Rewrite agent prompt
│              │  using suggestion
└──────┬───────┘
       │ new prompt
       ▼
┌──────────────┐
│  New Agent   │  Accept if better,
│              │  reject otherwise
└──────┬───────┘
       │
       └──────── Repeat ─────────┘
```
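The boxes above boil down to an accept-if-better loop. Here is an illustrative sketch of that control flow — the `evaluate`, `research`, and `mutate` callables are stand-ins for this example, not AutoAgentLab's actual API (the real logic lives in `Evaluator`, `Researcher`, `Mutator`, and `ExperimentLoop`):

```python
def run_loop(prompt, tasks, evaluate, research, mutate, max_iterations=5):
    """Evaluate, research failures, mutate the prompt, keep it only if it scores better."""
    best_prompt = prompt
    best_acc, failures = evaluate(best_prompt, tasks)   # Evaluator
    history = [best_acc]
    for _ in range(max_iterations):
        if not failures:
            break                                       # nothing left to fix
        suggestion = research(failures)                 # Researcher
        candidate = mutate(best_prompt, suggestion)     # Mutator
        acc, new_failures = evaluate(candidate, tasks)
        history.append(acc)
        if acc > best_acc:                              # accept only improvements
            best_prompt, best_acc, failures = candidate, acc, new_failures
    return best_prompt, history
```

The accept-if-better gate is what makes the process monotone: a bad mutation never degrades the agent, it is simply discarded and the next iteration researches the same failures again.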
```
AutoAgentLab/
├── autoagentlab/
│   ├── __init__.py          # Public API
│   ├── agent.py             # LLM-powered agent
│   ├── evaluator.py         # Accuracy scoring
│   ├── researcher.py        # Failure analysis
│   ├── mutator.py           # Prompt rewriting
│   ├── loop.py              # Experiment orchestrator
│   └── cli.py               # CLI entry point
├── benchmarks/
│   └── qa.json              # 20 general-knowledge Q&A tasks
├── examples/
│   └── custom_benchmark.py  # Programmatic usage example
├── pyproject.toml
├── LICENSE
└── README.md
```
```shell
# Run with default settings (5 iterations, gpt-4o-mini)
autoagentlab run qa

# Custom iterations and model
autoagentlab run qa --iterations 10 --model gpt-4o

# Custom initial prompt
autoagentlab run qa --prompt "You are a trivia expert. Be precise."

# List available benchmarks
autoagentlab list
```

Or drive experiments from Python:

```python
from autoagentlab import Agent, ExperimentLoop

# Define your tasks
tasks = [
    ["What is 15 * 13?", "195"],
    ["What is the derivative of x^2?", "2x"],
    ["What is 2^10?", "1024"],
]

# Create an agent
agent = Agent("You are a math tutor. Answer concisely.")

# Run the experiment
loop = ExperimentLoop(agent, tasks, max_iterations=5)
history = loop.run()

# Check results
best = max(history, key=lambda r: r.accuracy)
print(f"Best: {best.accuracy * 100:.0f}% at iteration {best.iteration}")
```

| Module | Class | Purpose |
|---|---|---|
| `agent.py` | `Agent` | Wraps an LLM with a mutable system prompt |
| `evaluator.py` | `Evaluator` | Scores accuracy, records failure cases |
| `researcher.py` | `Researcher` | Analyzes failures, suggests improvements |
| `mutator.py` | `Mutator` | Rewrites the prompt based on suggestions |
| `loop.py` | `ExperimentLoop` | Orchestrates the full improvement cycle |
| `cli.py` | — | CLI interface (`autoagentlab run`/`list`) |
| Environment Variable | Default | Description |
|---|---|---|
| `OPENAI_API_KEY` | *required* | Your OpenAI API key |
| `OPENAI_BASE_URL` | `https://api.openai.com/v1` | Custom API endpoint (for compatible providers) |
| `AUTOAGENTLAB_MODEL` | `gpt-4o-mini` | Default LLM model |

Works with any OpenAI-compatible API — use Ollama, Together AI, Groq, etc. by setting `OPENAI_BASE_URL`.
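For example, to run against a local Ollama server (this assumes Ollama's OpenAI-compatible endpoint on its default port, and that you have already pulled the model named here):

```shell
# Point AutoAgentLab at a local Ollama instance
export OPENAI_BASE_URL="http://localhost:11434/v1"
export OPENAI_API_KEY="ollama"         # Ollama ignores the key, but it must be non-empty
export AUTOAGENTLAB_MODEL="llama3.2"   # any model you have pulled locally
autoagentlab run qa
```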
- Create a JSON file in `benchmarks/`:

```json
[
    ["Your question here?", "expected answer"],
    ["Another question?", "another answer"]
]
```

- Run it:
```shell
autoagentlab run your_benchmark_name
```

Roadmap:

- Tool Discovery — agents auto-discover and use tools (calculator, search, Python)
- Workflow Mutation — evolve multi-step reasoning chains
- Agent Lineage — track evolution history as a tree
- Multi-objective — optimize for accuracy + cost + latency
- Population-based — evolve a population of agents in parallel
MIT — see LICENSE.
AutoAgentLab — Let your agents do the research. 🧬