Agent Factory

An LLM knowledge base and course for building production-quality AI agents.

Built on Karpathy's LLM-KB pattern: raw sources are collected from 40+ repositories and 25 reference documents, LLM-compiled into a structured wiki of 80+ articles, then queried by meta-skills to produce better agents. Sibling project to Skill Factory, which teaches how to write the SKILL.md files that augment these agents.

flowchart TD
    subgraph ingest ["DATA INGEST"]
        Articles["Articles"]
        Papers["Papers"]
        Repos["Repos"]
        Datasets["Datasets"]
        Images["Images"]
        ClipArticle["clip-article.md"]
    end

    subgraph rawStore ["raw/"]
        RawDocs["docs/SOURCES.md"]
        RawRepos["repos/SOURCES.md"]
        RawDatasets["datasets/ JSON+CSV"]
        RawImages["images/"]
    end

    subgraph extraTools ["EXTRA TOOLS"]
        SearchWiki["search-wiki.ts"]
        WikiStats["wiki-stats.ts"]
        CheckLinks["check-links.ts"]
        ValidateAgent["validate-agent.ts"]
    end

    subgraph llmEngine ["LLM ENGINE"]
        Compile["compile-wiki.md"]
        QA["research-qa.md"]
        Linting["health-check.md"]
        Indexing["build-index.md"]
    end

    subgraph wikiStore ["KNOWLEDGE STORE"]
        Wiki["Wiki .md\n80+ articles\n35 concepts\n26 research\n22 examples\nINDEX + GLOSSARY\nBACKLINKS"]
    end

    subgraph outputs ["OUTPUTS"]
        Markdown["Markdown articles"]
        Slides["Marp slides"]
        Charts["Matplotlib charts"]
    end

    subgraph ide ["IDE FRONTEND"]
        Obsidian["Obsidian vault\nGraph view\nSearch\nSlide preview"]
    end

    subgraph futureBlock ["FUTURE EXPLORATIONS"]
        SyntheticData["Synthetic data gen\n+ finetuning"]
        ProductVision["Product vision\nWeb UI + API"]
    end

    Articles & Papers & Repos & Datasets & Images --> rawStore
    ClipArticle --> rawStore
    rawStore --> Compile
    rawStore --> QA
    rawStore --> Linting
    rawStore --> Indexing
    Compile --> Wiki
    QA --> Wiki
    Linting --> Wiki
    Indexing --> Wiki
    extraTools --> Wiki
    Wiki --> Markdown
    Wiki --> Slides
    Wiki --> Charts
    Charts -->|"filed back"| Wiki
    rawStore & Wiki & Slides & Charts --> Obsidian
    Wiki -.-> futureBlock

What This Is

This repo is six things:

A knowledge base -- 80+ wiki articles distilled from 40+ top AI agent repos (700K+ stars combined), covering the patterns, anti-patterns, and techniques used to build autonomous agents
A 23-module course -- Zero-to-production curriculum (~15 hours, six blocks of roughly 2.5 hours each) with exercises, code examples, and framework walkthroughs across LangGraph, OpenAI Agents SDK, and Anthropic
An Agent Maker -- An interactive agent-maker/SKILL.md that asks 6 forcing questions, challenges assumptions, and guides you through 8 phases from idea to validated agent
A Prompt Decomposer -- A prompt-decomposer/SKILL.md that takes a large prompt, codebase, or system description and identifies sections that could become agent components
A meta-skill -- An authoring/SKILL.md that queries the wiki to help you review, score, and improve existing agents against the AGENT_SPEC quality standard
A quality standard -- AGENT_SPEC.md defines 8 dimensions for scoring agent projects, with a canonical project structure and minimum quality bar

Why This Exists

Building a production agent is harder than it looks. Most agents fail at architecture (no clear loop or state management), tools (too many, poorly described, no error handling), safety (no guardrails, no sandbox, no approval gates), or testing (no behavioral tests, no trace analysis, no baselines).

This project codifies what works and what doesn't, drawn from analyzing the architectures and patterns of 40+ real agent systems across the ecosystem -- from OpenClaw (344K stars) to AutoAgent (714 stars). It covers the full spectrum: single-agent tools, multi-agent orchestration, self-improving harness loops, protocol interoperability, and production deployment.

The key insight: Skills augment agents. Agents are the autonomous systems themselves. Skill Factory teaches you to write SKILL.md files. Agent Factory teaches you to build the agent that reads them.

Directory Structure

agent-factory/
├── SKILL.md                     # Entry point: A/B/C/D concierge router
├── AGENT_SPEC.md                # Quality standard (8 dimensions, scoring guide)
├── README.md                    # You are here
│
├── agent-maker/                 # Interactive agent creator
│   └── SKILL.md                 # 8-phase guided creation with forcing questions
│
├── authoring/                   # The review meta-skill
│   └── SKILL.md                 # Wiki-backed agent review and improvement
│
├── prompt-decomposer/           # System-to-agents extractor
│   └── SKILL.md                 # Analyze prompts/codebases, suggest agent components
│
├── course/                      # 23-module agent building course
│   ├── README.md                # Course overview and learning path
│   ├── 01-what-are-agents.md    # Foundation: agents vs chatbots vs copilots
│   ├── 02-agent-architectures.md # Single, multi-agent, hierarchical
│   ├── 03-the-agent-loop.md     # Plan-act-observe-reflect, ReAct pattern
│   ├── 04-system-prompts-for-agents.md # Persona, constraints, tool instructions
│   ├── 05-tool-design-and-integration.md # Function calling, MCP, structured I/O
│   ├── 06-memory-and-context-engineering.md # Memory types, context rot, ByteRover
│   ├── 07-planning-and-reasoning.md # CoT, ToT, task decomposition, AutoAgent
│   ├── 08-error-handling-and-recovery.md # Retries, fallbacks, circuit breakers, Hermes
│   ├── 09-agent-design-patterns.md # Ng's 4 patterns, ReAct, Reflexion
│   ├── 10-multi-agent-patterns.md # Supervisor, pipeline, swarm, Paperclip
│   ├── 11-anti-patterns.md      # God agent, over-tooling, premature autonomy
│   ├── 12-state-management.md   # Checkpointing, persistence, git-as-memory
│   ├── 13-framework-selection.md # Decision matrix: LangGraph vs CrewAI vs SDK
│   ├── 14-building-with-langgraph.md # State graphs, persistence, human-in-the-loop
│   ├── 15-building-with-openai-agents-sdk.md # Agent class, handoffs, guardrails
│   ├── 16-building-with-anthropic.md # Tool use, extended thinking, computer use
│   ├── 17-agent-evaluation-and-testing.md # Behavioral tests, benchmarks, A/B testing
│   ├── 18-safety-and-guardrails.md # Sandboxing, prompt injection, approval gates
│   ├── 19-observability-and-debugging.md # Tracing, cost tracking, debugging long runs
│   ├── 20-deployment-and-scaling.md # Serverless vs containers, rate limits, scaling
│   ├── 21-protocols-and-interoperability.md # MCP servers, A2A, tool marketplaces
│   ├── 22-self-improvement-and-harness-engineering.md # Learning loops, harnesses
│   └── 23-capstone-build-a-production-agent.md # End-to-end project (3 tracks)
│
├── wiki/                        # LLM-compiled knowledge base
│   ├── INDEX.md                 # Master table of contents (start here)
│   ├── GLOSSARY.md              # 67 terms with definitions
│   ├── concepts/                # 35 core agent-building concept articles
│   │   ├── agent-loop.md
│   │   ├── tool-design.md
│   │   ├── memory-systems.md
│   │   ├── context-engineering.md
│   │   ├── harness-engineering.md
│   │   ├── self-improving-agents.md
│   │   ├── autonomous-loops.md
│   │   └── ... (28 more)
│   ├── research/                # 23 ecosystem analysis and deep-dive articles + 3 raw data files
│   │   ├── hermes-agent-deep-dive.md
│   │   ├── paperclip-orchestration-analysis.md
│   │   ├── autoagent-harness-patterns.md
│   │   ├── context-memory-architecture.md
│   │   ├── openclaw-scale-analysis.md
│   │   ├── framework-comparison.md
│   │   └── ... (17 more)
│   ├── examples/
│   │   ├── good/                # 14 exemplary agents with annotations
│   │   └── bad/                 # 8 anti-pattern agents with analysis
│   └── queries/                 # Filed Q&A and update logs
│
├── scripts/                     # Automation & CLI tools
│   ├── validate-agent.ts        # Agent project linter (Bun/Node)
│   ├── search-wiki.ts           # Full-text wiki search CLI
│   ├── wiki-stats.ts            # Article count, word count, orphan detection
│   ├── check-links.ts           # Internal link validator
│   ├── compile-wiki.md          # LLM runbook: compile raw/ into wiki/
│   ├── health-check.md          # LLM runbook: audit wiki quality (10 checks)
│   ├── research-qa.md           # LLM runbook: wiki-grounded Q&A engine
│   ├── build-index.md           # LLM runbook: regenerate indexes + backlinks
│   ├── clip-article.md          # LLM runbook: web clipper for new sources
│   ├── update-sources.md        # LLM runbook: monthly discovery + update
│   └── discovery-keywords.txt   # Keywords for finding new agent repos
│
├── raw/                         # Source material (data ingest)
│   ├── docs/
│   │   └── SOURCES.md           # 25 reference documents (URLs + summaries)
│   ├── repos/
│   │   └── SOURCES.md           # 40+ repo manifest with tiers and descriptions
│   ├── datasets/                # Structured data (JSON, CSV)
│   │   ├── SOURCES.md           # Dataset manifest
│   │   ├── autoresearch-scores.json  # 20 agents x 13 dimensions
│   │   ├── wave-improvements.csv     # 7 waves with before/after/delta
│   │   └── agent-progression.csv     # Score progression per agent
│   └── images/                  # Architecture diagrams and visual references
│
├── outputs/                     # Generated artifacts
│   ├── slides/                  # Marp-format presentation decks
│   │   ├── agent-quality-overview.md  # What makes a 9/10 agent
│   │   ├── course-summary.md          # 23 modules in 6 blocks
│   │   └── autoresearch-results.md    # 7 waves, 20 agents, 100 iterations
│   └── charts/                  # Generated visualizations
│       └── generate-charts.py   # Radar, bar, heatmap, delta charts
│
├── future/                      # Design docs for future work
│   ├── synthetic-data.md        # Synthetic training data from wiki
│   ├── product-vision.md        # Web UI, API, self-improving system
│   └── finetuning-plan.md       # Fine-tune a model on the wiki
│
├── .obsidian/                   # Obsidian vault config (IDE frontend)
│   ├── app.json                 # Editor settings
│   ├── workspace.json           # Default layout
│   ├── graph.json               # Graph view with color groups
│   └── snippets/wiki-style.css  # Wiki styling
│
└── VAULT.md                     # How to open in Obsidian

Quick Start

Tell your AI agent to read SKILL.md in this repo. It will ask what you want to do:

A) Review and improve an existing agent -- guided brainstorm review or quick AGENT_SPEC report
B) Brainstorm and create a new agent -- interactive 8-phase Agent Maker with forcing questions
C) Learn about agent building -- 23-module course or wiki browsing
D) Extract agent components from a codebase/prompt -- decompose into modules

Or jump directly:

# Review an agent
Read SKILL.md and help me review my agent at path/to/my-agent/

# Create an agent
Read agent-maker/SKILL.md and help me create an agent for [your idea]

# Break a system into agents
Read prompt-decomposer/SKILL.md and analyze this prompt for agent components

# Take the course
Read course/README.md

Additional tools

Validate an agent project:

bun scripts/validate-agent.ts path/to/your-agent/

Checks: README exists with architecture section, system prompt file present, tool definitions exist, tests directory non-empty, no committed secrets, system prompt has persona/constraints/tool instructions.

Compile the wiki (after adding new sources):

Read scripts/compile-wiki.md and compile the wiki.

Run a health check:

Read scripts/health-check.md and run a health check.

Course Overview

The course/ directory contains a 23-module curriculum organized in 6 blocks, from "what is an agent?" to shipping a production agent. Each module has learning objectives, content sections with code examples, exercises, and further reading that links into the wiki.

Block 1: Foundations (Modules 01-04, ~2.5 hours)

#	Module	Time	What You Learn
01	What Are Agents	30 min	Agents vs chatbots vs copilots vs workflows; anatomy of an agent; the autonomy spectrum
02	Agent Architectures	40 min	Single, multi-agent, hierarchical; architecture decision tree; case studies
03	The Agent Loop	40 min	Plan-act-observe-reflect; ReAct pattern dissected; state machines vs free-form; loop termination
04	System Prompts for Agents	30 min	Persona + constraints + tool instructions + guardrails; how agent prompts differ from chat prompts
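
The plan-act-observe-reflect cycle and its termination conditions from Module 03 can be sketched as a small state machine. This is an illustrative Python sketch, not code taken from the course: the Phase enum, the TRANSITIONS table, and the goal_reached callback are all hypothetical names, and a real loop would call a model and tools inside each phase.

```python
from enum import Enum, auto


class Phase(Enum):
    PLAN = auto()
    ACT = auto()
    OBSERVE = auto()
    REFLECT = auto()
    DONE = auto()


# Transition table: the phases each phase is allowed to move to.
TRANSITIONS = {
    Phase.PLAN: {Phase.ACT},
    Phase.ACT: {Phase.OBSERVE},
    Phase.OBSERVE: {Phase.REFLECT},
    Phase.REFLECT: {Phase.PLAN, Phase.DONE},
    Phase.DONE: set(),
}


def run_loop(goal_reached, max_steps=20):
    """Walk the loop until DONE or until max_steps transitions have been used.

    goal_reached(cycles) is a stand-in for the reflect step's decision.
    Returns (trace, finished).
    """
    phase = Phase.PLAN
    trace = [phase]
    cycles = 0
    for _ in range(max_steps):
        if phase is Phase.DONE:
            return trace, True
        if phase is Phase.REFLECT:
            cycles += 1
            nxt = Phase.DONE if goal_reached(cycles) else Phase.PLAN
        else:
            (nxt,) = TRANSITIONS[phase]  # exactly one legal successor
        if nxt not in TRANSITIONS[phase]:
            raise ValueError(f"illegal transition {phase} -> {nxt}")
        phase = nxt
        trace.append(phase)
    return trace, phase is Phase.DONE
```

The loop ends in one of two ways: the reflect step declares the goal met, or the step budget runs out and the caller sees finished=False instead of an agent that spins forever.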

Block 2: Core Building Blocks (Modules 05-08, ~2.5 hours)

#	Module	Time	What You Learn
05	Tool Design and Integration	45 min	Function calling schemas, MCP servers, tool routing, error handling; gstack + Composio patterns
06	Memory and Context Engineering	45 min	Short-term/long-term/episodic/semantic memory; RAG for agents; context rot (Willison); ByteRover's context tree (96%+ LoCoMo)
07	Planning and Reasoning	40 min	Chain of Thought, Tree of Thoughts, task decomposition; AutoAgent's program.md as planning interface
08	Error Handling and Recovery	30 min	Retries with backoff, fallback chains, circuit breakers, dead letter queues; Hermes Agent's fallback provider chains
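
Retries with backoff and fallback chains, as covered in Module 08, can be combined in a few lines. The sketch below is a generic Python illustration under stated assumptions: RetryableError, call_with_retries, and call_with_fallbacks are hypothetical names, and it does not reproduce Hermes Agent's actual provider chain.

```python
import random
import time


class RetryableError(Exception):
    """Hypothetical marker for transient failures (timeouts, rate limits)."""


def call_with_retries(fn, max_attempts=3, base_delay_s=0.5, sleep=time.sleep):
    """Call fn, retrying transient failures with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RetryableError:
            if attempt == max_attempts - 1:
                raise  # retries exhausted, let the caller decide
            delay = base_delay_s * (2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))


def call_with_fallbacks(providers, **retry_kwargs):
    """Try each provider in order, moving on when one exhausts its retries."""
    last_error = None
    for provider in providers:
        try:
            return call_with_retries(provider, **retry_kwargs)
        except RetryableError as exc:
            last_error = exc
    raise RuntimeError("all providers failed") from last_error
```

Only errors marked retryable are retried. Anything else propagates immediately, which keeps permanent failures such as bad credentials from burning the retry budget.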

Block 3: Patterns and Anti-Patterns (Modules 09-12, ~2.5 hours)

#	Module	Time	What You Learn
09	Agent Design Patterns	45 min	Andrew Ng's 4 patterns (reflection, tool use, planning, multi-agent); ReAct and Reflexion deep dives; neural-maze implementations
10	Multi-Agent Patterns	45 min	Supervisor, sequential, parallel, swarm, debate; Paperclip's team orchestration; CrewAI roles; Swarm handoffs
11	Anti-Patterns	35 min	God agent, over-tooling, premature multi-agent, context abuse, sycophantic loops, premature autonomy; AutoGPT lessons
12	State Management	35 min	Checkpointing, persistence, conversation threading, context window management; LangGraph checkpointing; Ralph's git-as-memory
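
Checkpointing, the first topic in Module 12, comes down to persisting state after every step so an interrupted run can resume where it stopped. The following Python sketch shows the idea with a JSON file. It is a simplified stand-in, not LangGraph's checkpointer API or Ralph's git workflow, and the function names and state shape are assumptions.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


def save_checkpoint(path: Path, state: dict) -> None:
    """Write state atomically: write a temp file, then rename it over the target."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(state, handle)
    os.replace(tmp_name, path)


def load_checkpoint(path: Path) -> dict:
    """Load saved state, or start fresh when no checkpoint exists."""
    if path.is_file():
        return json.loads(path.read_text(encoding="utf-8"))
    return {"step": 0, "results": []}


def run_steps(path: Path, total_steps: int, stop_after: Optional[int] = None) -> dict:
    """Run steps from the last checkpoint; stop_after simulates an interruption."""
    state = load_checkpoint(path)
    executed = 0
    while state["step"] < total_steps:
        if stop_after is not None and executed >= stop_after:
            break
        state["results"].append(f"result-{state['step']}")  # placeholder for real work
        state["step"] += 1
        save_checkpoint(path, state)
        executed += 1
    return state
```

Running run_steps twice against the same file picks up at the saved step rather than repeating finished work.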

Block 4: Frameworks (Modules 13-16, ~2.5 hours)

#	Module	Time	What You Learn
13	Framework Selection	30 min	Decision matrix: LangGraph vs CrewAI vs OpenAI Agents SDK vs Anthropic vs raw API; when to go framework-free
14	Building with LangGraph	45 min	State machines as directed graphs, persistence, human-in-the-loop, streaming; building a research agent
15	Building with OpenAI Agents SDK	45 min	Agent class, handoffs, guardrails, tracing, MCP integration; building a customer service agent
16	Building with Anthropic	45 min	Tool use, extended thinking, computer use, orchestrator-workers; building a code analysis agent

Block 5: Production (Modules 17-20, ~2.5 hours)

#	Module	Time	What You Learn
17	Agent Evaluation and Testing	40 min	Behavioral tests, trace analysis, benchmarks (SWE-Bench, HumanEval), compliance scoring, A/B testing; AutoAgent's hill-climbing
18	Safety and Guardrails	40 min	Sandboxing (Docker, e2b), prompt injection defense, approval gates, confidence thresholds; OpenClaw's permissions
19	Observability and Debugging	35 min	Tracing (LangSmith, Braintrust), cost tracking, debugging 50-step traces; Paperclip's cost dashboard; LiteLLM
20	Deployment and Scaling	35 min	Serverless vs containers vs edge, rate limiting, cost optimization at scale; Hermes Agent's VPS-to-GPU range

Block 6: Mastery (Modules 21-23, ~2.5 hours)

#	Module	Time	What You Learn
21	Protocols and Interoperability	40 min	MCP server authoring, A2A communication, tool marketplaces, cross-org agent composition
22	Self-Improvement and Harness Engineering	45 min	Learning loops, skill creation from experience, score-driven hill-climbing, self-modifying agents; AutoAgent, Hermes, Ouroboros
23	Capstone: Build a Production Agent	60 min	End-to-end project. Track A: coding assistant. Track B: multi-agent research team. Track C: self-improving agent with harness

What the Wiki Covers

Core Concepts (35 articles)

Category	Articles	What You Learn
Agent Core	agent-loop, planning-strategies, state-management, agent-lifecycle	The fundamental cycle, how agents plan and persist, lifecycle from prototype to production
Tools	tool-design, tool-selection, structured-outputs	Building tools agents can use effectively; routing; schema enforcement (Pydantic, Zod)
Memory	memory-systems, agent-memory-patterns, context-window-management, context-engineering	Short/long/episodic/semantic memory; scratchpad vs RAG vs knowledge base; Lutke/Karpathy context engineering; Willison's context rot
Safety	guardrails, human-in-the-loop, sandboxing, agent-security	Input validation, approval gates, Docker/e2b isolation, prompt injection defense, least privilege
Multi-Agent	multi-agent-orchestration, agent-handoffs, agent-orchestration-platforms	Supervisor/pipeline/fan-out/swarm; clean handoffs; Paperclip-style team management
Production	error-recovery, observability, cost-optimization, deployment-patterns, rate-limiting	Retries, circuit breakers; tracing; token budgets; serverless vs containers; backpressure
Design	prompt-engineering-for-agents, agent-personas, progressive-complexity, agent-composition, model-selection, agent-ux	System prompt design; role identity; start simple; modular agents; cost/quality tradeoffs; streaming UX
Testing	feedback-loops, agent-evaluation, agent-testing-patterns	Reflection, Reflexion, critic agents; behavioral tests; unit/integration/trace-based testing
Advanced	harness-engineering, self-improving-agents, autonomous-loops	Meta-agent control structures; learning from experience; score-driven hill-climbing; self-modifying agents

Research (23 deep dives + 3 raw data files)

Article	What You Learn
anatomy-of-a-good-agent	What makes an agent production-quality: architecture clarity, error handling, testing, docs
anti-patterns	Comprehensive catalog: god agent, over-tooling, premature multi-agent, context abuse, and more
framework-comparison	LangGraph vs CrewAI vs OpenAI SDK vs Anthropic vs AutoGen: architecture, strengths, maturity
andrew-ng-patterns	Deep dive into reflection, tool use, planning, multi-agent collaboration with implementations
lilian-weng-survey	Analysis of the foundational "LLM Powered Autonomous Agents" survey; what changed since
anthropic-agent-patterns	"Building Effective Agents" dissected: workflows before agents, prompt chaining, orchestrator-workers
openai-agent-patterns	OpenAI's guide + Agents SDK patterns: handoffs, guardrails, tracing
mcp-deep-dive	Model Context Protocol architecture: servers, clients, tools, resources, sampling
a2a-deep-dive	Agent-to-Agent protocol: cross-framework communication, task delegation
multi-agent-landscape	Survey of multi-agent approaches 2024-2026: what works and what doesn't
agent-evaluation-methods	How to evaluate agents: behavioral tests, trace analysis, SWE-Bench, LLM-as-judge
production-case-studies	Real-world agents in production: gstack, Cursor, Devin, customer service
gstack-agent-analysis	Browse daemon, QA skill, ship workflow: how production skills compose into an agent
agent-vs-workflow	When to use an agent vs a deterministic workflow: decision framework
cost-analysis	Economics of running agents: token costs, model selection, caching ROI
hermes-agent-deep-dive	Self-improving agent with learning loop, persistent skills, multi-platform, fallback chains
paperclip-orchestration-analysis	Zero-human company orchestration: agent teams, goal assignment, cost tracking
autoagent-harness-patterns	program.md as human interface, meta-agent hill-climbing, Docker isolation
context-memory-architecture	ByteRover's context tree, knowledge storage, 96%+ LoCoMo accuracy
openclaw-scale-analysis	Architecture of the 344K-star agent project: multi-channel, autonomous coding at scale
classic-framework	CLASSic operational evaluation: Cost, Latency, Accuracy, Stability, Security across 2,100+ enterprise messages
adarubric-evaluation	AdaRubric task-adaptive evaluation: 3-stage pipeline, Pearson r=0.79 human correlation, DPO training gains
karpathy-autoresearch	Autoresearch self-improvement pattern: program.md + objective scorer, 700 experiments, 11% training loss improvement

Curated Examples (22 annotated)

14 good examples with detailed annotations explaining what makes each one work:

Example	Pattern	Key Takeaway
LangGraph ReAct Agent	State machine + tool loop	Explicit state transitions, clean tool routing
CrewAI Research Team	Role-based multi-agent	Clean role separation, sequential pipeline
OpenAI Agents Customer Service	Handoff architecture	Triage -> specialist handoffs with guardrails
gstack Browse Daemon	Focused tool agent	Structured output, diff-based verification
Anthropic Computer Use	Progressive capability	Screen reading with safety gates
Goose Extensible Agent	Plugin architecture	Composition over monolith
phidata Knowledge Agent	RAG pipeline agent	Clean memory pattern, focused retrieval
Simple Function Calling	Minimal agent	20 lines, simplest possible agent loop
MCP Tool Server	Protocol-based tools	Framework-agnostic, composable
Multi-Agent Code Review	Parallel fan-out	Linter + security + style agents, structured aggregation
Hermes Self-Improving	Learning loop	Skills from experience, promotion criteria
Paperclip Agent Team	Orchestration platform	Goal assignment, cost tracking, agent-as-employee
AutoAgent Harness Loop	Meta-agent optimization	program.md + benchmarks + hill-climbing
Ralph PRD Completion	Autonomous build loop	Git as memory, iterative building, test verification
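
The parallel fan-out pattern from the Multi-Agent Code Review example can be sketched with the standard library alone. In this hypothetical Python sketch the three reviewers are plain functions standing in for LLM-backed agents, and the finding format and severity order are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

SEVERITY_ORDER = {"high": 0, "medium": 1, "low": 2}


def lint_reviewer(diff):
    return [
        {"reviewer": "lint", "severity": "low", "note": f"line {i} exceeds 100 chars"}
        for i, line in enumerate(diff.splitlines(), 1)
        if len(line) > 100
    ]


def security_reviewer(diff):
    if "eval(" in diff:
        return [{"reviewer": "security", "severity": "high", "note": "eval() on input"}]
    return []


def style_reviewer(diff):
    if "\t" in diff:
        return [{"reviewer": "style", "severity": "medium", "note": "tab indentation"}]
    return []


def review(diff, reviewers):
    """Run all reviewers in parallel and merge their findings into one report."""
    findings, errors = [], {}
    with ThreadPoolExecutor(max_workers=max(1, len(reviewers))) as pool:
        futures = {name: pool.submit(fn, diff) for name, fn in reviewers.items()}
        for name, future in futures.items():
            try:
                findings.extend(future.result(timeout=30))
            except Exception as exc:  # one failing reviewer must not sink the rest
                errors[name] = str(exc)
    findings.sort(key=lambda f: SEVERITY_ORDER[f["severity"]])
    return {"findings": findings, "errors": errors}
```

Each reviewer sees the same input and returns the same finding shape, so the aggregation step stays a simple merge and sort.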

8 anti-pattern examples showing exactly what goes wrong:

Anti-Pattern	What Fails	Root Cause
God Agent	Tool confusion, context overflow, untestable	50+ tools, 5000-token prompt, no specialization
Over-Tooled Agent	Decision paralysis, wrong tool selection	100+ overlapping tools, LLM can't choose
Chatbot Pretending to Be Agent	No agency, no tools, no loop	Just a chatbot with a fancy system prompt
Framework Soup	Conflicting abstractions, maintenance nightmare	LangGraph AND CrewAI AND AutoGen in one project
No-Guardrails Agent	Deletes production database	Full system access, no sandbox, no approval
Infinite Loop Agent	Cost explosion, no recovery	No circuit breaker, no max iterations, no fallback
Context-Stuffing Agent	Context rot, high cost, poor performance	Dumps entire codebase into 200K context window
Premature Autonomy Agent	Unauthorized purchases, emails, data modification	Full autonomy before guardrails exist
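
The Infinite Loop Agent row above comes down to missing hard limits. A minimal Python sketch of a budget circuit breaker follows. The limit names mirror the `max_steps`, `max_wall_time_s`, and `max_spend_usd` fields named in AGENT_SPEC.md, but the class itself, its default values, and the injected clock are assumptions for illustration.

```python
import time


class BudgetExceeded(RuntimeError):
    """Raised when an agent run hits one of its hard limits."""


class CircuitBreaker:
    def __init__(self, max_steps=25, max_wall_time_s=300.0, max_spend_usd=1.0,
                 clock=time.monotonic):
        self.max_steps = max_steps
        self.max_wall_time_s = max_wall_time_s
        self.max_spend_usd = max_spend_usd
        self._clock = clock
        self._started = clock()
        self.steps = 0
        self.spend_usd = 0.0

    def tick(self, cost_usd=0.0):
        """Record one completed step and raise if any limit is now exceeded."""
        self.steps += 1
        self.spend_usd += cost_usd
        elapsed = self._clock() - self._started
        if self.steps > self.max_steps:
            raise BudgetExceeded(f"max_steps={self.max_steps} exceeded")
        if elapsed > self.max_wall_time_s:
            raise BudgetExceeded(f"max_wall_time_s={self.max_wall_time_s} exceeded")
        if self.spend_usd > self.max_spend_usd:
            raise BudgetExceeded(f"max_spend_usd={self.max_spend_usd} exceeded")
```

An agent loop would call breaker.tick(cost_usd=...) after every model or tool call and treat BudgetExceeded as a signal to stop and report, not to retry.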

Key Discoveries

Patterns and insights identified across the 40+ source repos:

Context Engineering > Prompt Engineering -- Tobi Lutke (Shopify CEO) coined "context engineering" as the real skill: curating the information environment an LLM operates in, not just writing prompts. Karpathy endorsed this shift. Source: Lutke tweets, Karpathy commentary (2025-2026).
Context Rot -- Simon Willison identified 4 patterns of context degradation in long-running agents: poisoning (wrong info injected), distraction (irrelevant context competing), confusion (conflicting instructions), clash (sessions bleeding together). Source: simonwillison.net.
Harness Engineering -- Kevin Gu's AutoAgent showed that you can "program the meta-agent, not the harness": write a program.md that specifies behavior, then let a meta-agent modify it, benchmark the result, and hill-climb on scores. The agent optimizes itself. Source: kevinrgu/autoagent.
Workflows Before Agents -- Anthropic's "Building Effective Agents" guide advocates starting with augmented LLMs, moving to workflows (prompt chaining, routing, parallelization), and only using agents when you need the full loop. Most teams over-agent. Source: anthropic.com.
Andrew Ng's 4 Patterns -- Reflection, tool use, planning, and multi-agent collaboration. These four patterns transform an LLM from a single-shot generator into an agentic system. Each has a distinct implementation shape. Source: deeplearning.ai/the-batch.
Self-Improving Agents -- Hermes Agent (Nous Research) learns from interactions, creates new skills autonomously, and maintains persistent knowledge across sessions. The learning loop is: interact -> evaluate -> extract skill -> promote to permanent. Source: NousResearch/hermes-agent.
Git as Memory -- Ralph uses git as the agent's memory store: commit state, branch for exploration, diff for comparison, log for history. Simple, auditable, already in every developer's workflow. Source: snarktank/ralph.
Agent-as-Employee -- Paperclip treats agents as employees in a "zero-human company": assign goals, track costs, manage performance, provide skills library. The orchestration layer is a React dashboard. Source: paperclipai/paperclip.
Context Tree Architecture -- ByteRover builds a hierarchical context tree for coding agents that achieves 96%+ accuracy on the LoCoMo long-context benchmark. Portable memory with cloud sync across 20+ LLM providers. Source: campfirein/byterover-cli.
The Progressive Complexity Trap -- The #1 anti-pattern is premature multi-agent: adding agents when a single agent with better tools would suffice. Start with the simplest thing that could work. Source: Anthropic, confirmed across AutoGPT's evolution.
Micro-Agents -- Some agents are 20 lines of code. The LLM already knows HOW to do things -- it just needs the loop, tools, and guardrails. Don't over-engineer the first version. Source: OpenAI Swarm (educational), mattpocock/skills pattern.
Compliance Measurement -- Don't hope your agent works, measure it. Generate behavioral specs, run scenarios, capture traces, classify adherence. AutoAgent takes this further with score-driven hill-climbing: the meta-agent modifies itself until benchmarks improve. Source: kevinrgu/autoagent.
CLASSic Framework for Operational Evaluation -- The CLASSic framework (Zylos Research, 2026) evaluates agents across 5 production-readiness dimensions: Cost, Latency, Accuracy, Stability, Security. Grounded in 2,100+ enterprise messages across 7 industry domains. Key finding: agents that ace behavioral tests can score ≤3/10 on Cost, making them unshippable. Source: wiki/research/classic-framework.md.
AdaRubric: Task-Adaptive Evaluation -- Fixed rubrics fail because different agent domains need different quality dimensions. AdaRubric (arXiv:2603.21362) generates task-specific rubrics achieving Pearson r=0.79 human correlation. The DimensionAwareFilter prevents high aggregate scores from masking critical per-dimension failures. Source: wiki/research/adarubric-evaluation.md.
Karpathy Autoresearch Pattern -- The autoresearch pattern (Karpathy, 2025-2026) formalizes score-driven self-improvement: program.md + objective scorer + minimal diffs + automatic revert. 700 experiments in 2 days, 11% training loss improvement on NanoGPT. Generalizes beyond ML to prompt engineering, code optimization, and factory quality improvement. Source: github.com/karpathy/autoresearch, wiki/research/karpathy-autoresearch.md.
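
Several of the discoveries above (Harness Engineering, Compliance Measurement, the Karpathy Autoresearch Pattern) share one mechanism: propose a small change, score it, keep it only if the score improves, and otherwise revert. The Python sketch below shows that loop on a toy problem. It is a generic hill-climber under stated assumptions, not AutoAgent's or autoresearch's actual code. TARGET, score, and mutate are hypothetical stand-ins for a benchmark and a meta-agent edit.

```python
import random


def hill_climb(candidate, score, mutate, iterations=100, seed=0):
    """Score-driven improvement with automatic revert.

    Returns (best, best_score, history), where history holds the best score
    after each iteration.
    """
    rng = random.Random(seed)
    best, best_score = candidate, score(candidate)
    history = [best_score]
    for _ in range(iterations):
        trial = mutate(best, rng)      # minimal diff against the current best
        trial_score = score(trial)     # objective scorer (benchmark run)
        if trial_score > best_score:
            best, best_score = trial, trial_score  # keep the change
        # otherwise the trial is discarded, i.e. the change is reverted
        history.append(best_score)
    return best, best_score, history


# Toy stand-ins: the "program" is a list of ints, the benchmark rewards
# closeness to TARGET, and a mutation nudges one entry by one.
TARGET = [3, 1, 4]


def score(candidate):
    return -sum((a - b) ** 2 for a, b in zip(candidate, TARGET))


def mutate(candidate, rng):
    trial = list(candidate)
    index = rng.randrange(len(trial))
    trial[index] += rng.choice([-1, 1])
    return trial
```

Because a change is kept only when the score strictly improves, the recorded best score never decreases, which is what makes the loop safe to run unattended.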

Source Repositories Analyzed

40+ repositories organized into 7 tiers, totaling 700K+ GitHub stars:

Tier 1: Major Agent Systems (analyze deeply)

Repo	Stars	Key Contribution
openclaw/openclaw	344K	Most-starred agent project. Multi-channel personal AI, autonomous coding at scale
Significant-Gravitas/AutoGPT	177K	Pioneer autonomous agent. Lessons from autonomous loops and their limitations
paperclipai/paperclip	45K	"Zero-human company" orchestration. Agent teams, goal assignment, cost tracking
NousResearch/hermes-agent	23K	Self-improving agent with learning loop. Persistent skills, multi-platform

Tier 2: Frameworks (analyze architecture patterns)

Repo	Stars	Key Contribution
crewaiInc/crewAI	44K	Role-based multi-agent with YAML config. Fastest to prototype
langchain-ai/langgraph	25K	Directed graph state machines. Production-grade persistence, human-in-the-loop
openai/openai-agents-python	19K	Handoff-based architecture, native MCP, guardrails, tracing
google/adk-python	18K	Google's Agent Development Kit. Multimodal, hierarchical orchestration
microsoft/autogen	--	AG2. Pioneered multi-agent conversation patterns
openai/swarm	--	Lightweight multi-agent handoff patterns (educational)
pydantic/pydantic-ai	--	Type-safe agent framework on Pydantic
huggingface/smolagents	--	Minimal agent framework from HuggingFace

Tier 3: Agent Optimizers and Harness Engineering

Repo	Stars	Key Contribution
campfirein/byterover-cli	3.8K	Context tree, knowledge storage, cloud sync, 96%+ LoCoMo accuracy
kevinrgu/autoagent	714	program.md as planning interface, meta-agent hill-climbing, Docker isolation
greyhaven-ai/autocontext	679	Recursive self-improving harness for agents
cobusgreyling/ai_harness_engineering	--	Complete harness engineering implementation
walkinglabs/awesome-harness-engineering	--	Curated harness engineering tools and guides

Tier 4: Autonomous Coding Agents

Repo	Stars	Key Contribution
snarktank/ralph	14K	Autonomous PRD completion loop. Git as memory store
AndyMik90/Aperant	13K	Multi-agent coding framework: plans, builds, validates automatically
block/goose	--	Extensible agent with plugin architecture
garrytan/gstack	--	Production agent skills: browse, QA, ship, deploy
razzant/ouroboros	457	Self-modifying agent that writes its own code

Tier 5: Protocols and Infrastructure

Repo	Stars	Key Contribution
modelcontextprotocol/servers	--	MCP reference server implementations
a2aproject/A2A	--	Agent-to-Agent protocol specification
e2b-dev/e2b	--	Sandboxed code execution for agents
BerriAI/litellm	--	Unified LLM API gateway
composiodev/composio	--	Tool integration platform for agents
vercel/ai	--	AI SDK with streaming and tool support

Tier 6 & 7: Tools, Capabilities, and Learning

Browser automation (Stagehand), web scraping (Firecrawl), knowledge agents (phidata), RAG frameworks (LlamaIndex), Andrew Ng's patterns implemented (neural-maze/agentic_patterns), Anthropic and OpenAI cookbooks, Anthropic courses, and curated awesome lists.

Full manifest with descriptions: raw/repos/SOURCES.md

Reference Documents (25)

In addition to repos, 25 reference documents are indexed in raw/docs/SOURCES.md:

Research Papers and Surveys

Document	Author	Key Contribution
LLM Powered Autonomous Agents	Lilian Weng	Foundational survey: planning, memory, tool use for agents
ReAct: Synergizing Reasoning and Acting	Shunyu Yao et al.	The ReAct pattern: interleaved reasoning traces and actions
Reflexion: Verbal Reinforcement Learning	Noah Shinn et al.	Agents that reflect on failures and improve over attempts
Google "Agents" White Paper	Google DeepMind	Agent types, orchestration, evaluation

Industry Guides

Document	Author	Key Contribution
Building Effective Agents	Anthropic	Workflows before agents; prompt chaining, routing, parallelization, orchestrator-workers
Harness Design for Long-Running Apps	Anthropic Engineering	Production harness: error handling, context management, state persistence
A Practical Guide to Building Agents	OpenAI	Step-by-step: when to use agents, design foundations, guardrails
4 Agentic Design Patterns	Andrew Ng	Reflection, tool use, planning, multi-agent collaboration
Context Engineering	Tobi Lutke / Karpathy	Curating the information environment, shift from prompt engineering
Context Rot Patterns	Simon Willison	Poisoning, distraction, confusion, clash in long-running agents

Framework, Protocol, and Agent-Specific Docs

MCP specification, A2A specification, OpenAI Agents SDK docs, LangGraph docs, CrewAI docs, Anthropic tool use + extended thinking docs, AutoAgent's program.md pattern, ByteRover's context memory architecture, Hermes Agent architecture, Paperclip orchestration docs.

Key People Referenced

| Person | Contribution | Referenced In |
| --- | --- | --- |
| Andrej Karpathy | LLM-KB pattern (this project's architecture), context engineering | SKILL.md, README, course/06 |
| Andrew Ng | 4 agentic design patterns: reflection, tool use, planning, multi-agent | course/09, wiki/research/andrew-ng-patterns |
| Lilian Weng | "LLM Powered Autonomous Agents" survey (planning, memory, tools) | course/01, wiki/research/lilian-weng-survey |
| Shunyu Yao | ReAct paper: reasoning + acting interleaved | course/03, course/09, wiki/concepts/agent-loop |
| Noah Shinn | Reflexion: self-improvement through verbal feedback | course/09, wiki/concepts/self-improving-agents |
| Tobi Lutke | Popularized "context engineering" as the core AI skill | course/06, wiki/concepts/context-engineering |
| Simon Willison | Context rot patterns: poisoning, distraction, confusion, clash | course/06, wiki/concepts/context-engineering |
| Harrison Chase | LangGraph: agent state machines, persistence, graph-based control flow | course/14, wiki/research/framework-comparison |
| Joao Moura | CrewAI: role-based multi-agent design, crew orchestration | course/10, wiki/research/multi-agent-landscape |
| Kevin Gu | AutoAgent: harness engineering, program.md, meta-agent hill-climbing | course/22, wiki/research/autoagent-harness-patterns |
| Garry Tan | gstack: production agent skills (browse, QA, ship, deploy workflows) | course/05, wiki/research/gstack-agent-analysis |
| Nous Research team | Hermes Agent: self-improving agent, learning loop, persistent skills | course/22, wiki/research/hermes-agent-deep-dive |
| Paperclip team | Multi-agent orchestration platform, "zero-human company" | course/10, wiki/research/paperclip-orchestration-analysis |
| ByteRover team | Context tree architecture, portable memory, 96%+ LoCoMo | course/06, wiki/research/context-memory-architecture |

Quality Standard (AGENT_SPEC.md)

Agent projects are evaluated against AGENT_SPEC.md across 8 dimensions, each scored 0-10:

| Dimension | What a 10 looks like |
| --- | --- |
| 1. Architecture | Typed state machine (enum + transition table), circuit breakers (`max_steps`, `max_wall_time_s`, `max_spend_usd`), clear boundaries |
| 2. System Prompt | Persona, tool discipline, refusal paths, HITL gates with timeout, memory strategy, cost awareness, version tag |
| 3. Tool Design | Per-tool error taxonomy with `retryable` flag, timeouts + backoff, idempotency for mutations, cursor pagination |
| 4. Memory | Ephemeral vs durable distinction, retention policies, PII redaction, schema migration plan |
| 5. Safety | SECURITY.md with domain threat model, HITL for destructive ops, sandboxed execution, adversarial testing |
| 6. Testing | 4 scenario classes: happy path, error recovery, adversarial, regression; CI-integrated with tool mocks |
| 7. Observability | Tracing (trace_id/span_id), domain-specific SLOs with alerts, cost tracking per request, health endpoint |
| 8. Documentation | Mermaid architecture diagram, env variable matrix, known limitations, deployment + rollback guide |
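
The "typed state machine (enum + transition table)" in the Architecture row can be sketched as follows. This is a minimal illustration, not code from AGENT_SPEC.md: the state names, the `TRANSITIONS` table, and the `transition` helper are all assumptions chosen for the example.

```python
from enum import Enum, auto


class AgentState(Enum):
    """Hypothetical agent states; a real agent would define its own."""
    IDLE = auto()
    PLANNING = auto()
    ACTING = auto()
    AWAITING_APPROVAL = auto()
    DONE = auto()
    FAILED = auto()


# Transition table: every legal move is listed explicitly.
# Anything not listed here is rejected by transition().
TRANSITIONS: dict[AgentState, frozenset[AgentState]] = {
    AgentState.IDLE: frozenset({AgentState.PLANNING}),
    AgentState.PLANNING: frozenset(
        {AgentState.ACTING, AgentState.DONE, AgentState.FAILED}
    ),
    AgentState.ACTING: frozenset(
        {AgentState.PLANNING, AgentState.AWAITING_APPROVAL, AgentState.FAILED}
    ),
    AgentState.AWAITING_APPROVAL: frozenset(
        {AgentState.ACTING, AgentState.FAILED}
    ),
    AgentState.DONE: frozenset(),    # terminal
    AgentState.FAILED: frozenset(),  # terminal
}


class IllegalTransition(Exception):
    """Raised when the agent tries a move the table does not allow."""


def transition(current: AgentState, target: AgentState) -> AgentState:
    """Return the new state, or raise if the move is not in the table."""
    if target not in TRANSITIONS[current]:
        raise IllegalTransition(f"{current.name} -> {target.name} is not allowed")
    return target
```

Because the current state is always a single enum member, it can be serialized by name, which is one plausible way to support checkpoint and resume.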

Minimum viable score: 5/10 average, no dimension below 3/10. Autoresearch-grade: >= 9.0 overall.
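
The thresholds above can be expressed as a small check. This sketch assumes the "overall" score is the unweighted mean of the 8 dimensions and that the per-dimension floor also applies to the Autoresearch-grade tier; AGENT_SPEC.md may define either differently. The dimension keys and tier labels are hypothetical names for illustration.

```python
from statistics import mean

# Hypothetical keys for the 8 AGENT_SPEC dimensions.
DIMENSIONS = (
    "architecture", "system_prompt", "tool_design", "memory",
    "safety", "testing", "observability", "documentation",
)


def classify(scores: dict[str, float]) -> str:
    """Map per-dimension scores (0-10) to a tier label.

    Assumes 'overall' means the unweighted mean across dimensions.
    """
    missing = set(DIMENSIONS) - set(scores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")

    values = [scores[d] for d in DIMENSIONS]
    average = mean(values)
    lowest = min(values)

    # Minimum viable: 5/10 average and no dimension below 3/10.
    if average < 5 or lowest < 3:
        return "below-minimum"
    # Autoresearch-grade: >= 9.0 overall.
    if average >= 9.0:
        return "autoresearch-grade"
    return "viable"
```

Under these assumptions, an agent scoring 10 on seven dimensions and 2 on the eighth averages 9.0 but still fails, because one dimension falls below the 3/10 floor.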

Canonical agent project structure

my-agent/
├── README.md                # What it does, how to run, architecture diagram
├── system-prompt.md         # The agent's core instructions
├── tools/                   # Tool definitions (function schemas, MCP servers)
├── src/                     # Agent logic (loop, state, orchestration)
├── tests/                   # Behavioral test cases
└── deploy/                  # Deployment configuration

Scripts and Tools

CLI tools (Bun/Node)

| Script | Purpose | Usage |
| --- | --- | --- |
| `validate-agent.ts` | Lint agent projects against AGENT_SPEC | `bun scripts/validate-agent.ts /path/to/agent` |
| `search-wiki.ts` | Full-text search across wiki articles | `bun scripts/search-wiki.ts "circuit breaker"` |
| `wiki-stats.ts` | Article count, word count, orphans, categories | `bun scripts/wiki-stats.ts wiki/` |
| `check-links.ts` | Validate internal markdown links | `bun scripts/check-links.ts wiki/ course/` |

LLM runbooks

| Script | Purpose | Usage |
| --- | --- | --- |
| `compile-wiki.md` | Compile raw sources into wiki articles | `Read scripts/compile-wiki.md` |
| `health-check.md` | Audit wiki quality (10 checks: orphans, stale links, inconsistencies, gaps, connections) | `Read scripts/health-check.md` |
| `research-qa.md` | Wiki-grounded Q&A engine with citations | `Read scripts/research-qa.md` |
| `build-index.md` | Regenerate INDEX.md, BACKLINKS.md, category tags | `Read scripts/build-index.md` |
| `clip-article.md` | Ingest web articles into raw/docs/ | `Read scripts/clip-article.md` |
| `update-sources.md` | Monthly discovery and integration of new repos/docs | `Read scripts/update-sources.md` |
| `discovery-keywords.txt` | 24 search phrases for finding new agent repos | Used by update-sources.md |

Chart generation (Python)

| Script | Purpose | Usage |
| --- | --- | --- |
| `outputs/charts/generate-charts.py` | Generate radar, bar, heatmap, delta charts from grading data | `python outputs/charts/generate-charts.py` |

Monthly Auto-Updates

Once a month, the update-sources.md runbook discovers new agent repos and docs and incrementally recompiles the wiki.

To run:

Read scripts/update-sources.md and run the monthly update.

What it does:

1. Searches GitHub using discovery-keywords.txt phrases
2. Filters for AI agent repos by relevance
3. Scores candidates and keeps high-quality ones
4. Updates raw/repos/SOURCES.md
5. Incrementally recompiles affected wiki articles
6. Regenerates INDEX.md and GLOSSARY.md
7. Logs everything to wiki/queries/

Architecture: The Karpathy Pattern

This project implements the full LLM Knowledge Base architecture, covering every stage from data ingest to outputs:

Data Ingest

Sources flow into raw/ through four channels:

Articles, papers, repos -- curated in raw/docs/SOURCES.md and raw/repos/SOURCES.md
Datasets -- structured JSON/CSV in raw/datasets/ (autoresearch scores, wave improvements)
Images -- architecture diagrams in raw/images/
Web clipper -- scripts/clip-article.md LLM runbook for ingesting new articles

LLM Engine

Four processing stages transform raw sources into compiled knowledge:

Compile (compile-wiki.md) -- raw sources → wiki articles
Q&A (research-qa.md) -- wiki-grounded research answers, filed to wiki/queries/
Linting (health-check.md) -- 10 checks: orphans, stale links, inconsistencies, missing data, suggested articles, connections
Indexing (build-index.md) -- summaries, backlinks, category tagging, INDEX.md regeneration

Extra Tools (CLI)

search-wiki.ts -- full-text search across wiki articles
wiki-stats.ts -- article count, word count, orphan detection, category breakdown
check-links.ts -- validate all internal markdown links
validate-agent.ts -- lint agent projects against AGENT_SPEC

Knowledge Store

85+ articles with YAML frontmatter (category, tags), auto-generated BACKLINKS.md, and cross-linked INDEX.md and GLOSSARY.md.
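
To make the backlink and orphan ideas concrete, here is a rough sketch of how a backlinks map could be derived from article text. It is not the logic of build-index.md or wiki-stats.ts. It assumes standard markdown links of the form `[text](path.md)` with paths written relative to the wiki root; Obsidian-style `[[wikilinks]]` and relative path resolution are not handled. The function names and sample paths are hypothetical.

```python
import re
from collections import defaultdict

# Matches [text](target.md) and [text](target.md#anchor); captures target.md.
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)#\s]+\.md)(?:#[^)]*)?\)")


def build_backlinks(articles: dict[str, str]) -> dict[str, list[str]]:
    """Map each link target to the sorted list of articles that link to it.

    `articles` maps an article path to its markdown text.
    """
    inbound: dict[str, set[str]] = defaultdict(set)
    for source, text in articles.items():
        for target in LINK_RE.findall(text):
            if target.startswith(("http://", "https://")):
                continue  # external link, not part of the wiki graph
            if target != source:
                inbound[target].add(source)
    return {target: sorted(sources) for target, sources in sorted(inbound.items())}


def find_orphans(articles: dict[str, str]) -> list[str]:
    """Return articles that no other article links to."""
    backlinks = build_backlinks(articles)
    return sorted(path for path in articles if path not in backlinks)
```

A usage example with made-up article paths:

```python
wiki = {
    "concepts/agent-loop.md": "See [ReAct](research/react.md).",
    "research/react.md": "Back to [the loop](concepts/agent-loop.md#steps).",
    "concepts/lonely.md": "Nothing links here.",
}
print(build_backlinks(wiki))
print(find_orphans(wiki))
```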

Outputs

Markdown -- wiki articles, the primary output
Slides -- Marp-format decks in outputs/slides/ (quality overview, course summary, autoresearch results)
Charts -- Matplotlib visualizations in outputs/charts/ (radar, heatmap, bar, delta); charts feed back into the wiki

IDE Frontend

Obsidian vault config in .obsidian/ with graph view (color-coded by category), file explorer, and search. See VAULT.md for setup.

Future Explorations

Design docs in future/ for synthetic data generation, wiki fine-tuning, and a product vision beyond scripts.

Companion Projects

Skill Factory

Skill Factory is Agent Factory's sibling project. Together they cover the full stack of AI agent development:

| Project | Domain | Teaches |
| --- | --- | --- |
| Agent Factory | The autonomous system | Architecture, orchestration, tools, memory, deployment |
| Skill Factory | The capability layer | Writing SKILL.md files that augment agents |

Skills are markdown files loaded by agents at runtime. Agents are the autonomous systems that read and execute them. You need both: an agent without skills has limited capabilities, and skills without an agent have no runtime.

Factory Showcase

Factory Showcase is a testing companion with 20 agents and 20 skills created using both factories, then evaluated through a 5-cycle Karpathy loop plus a 7-wave autoresearch improvement loop (~100 iterations). Final scores: AGENT_SPEC mean 9.04/10, CLASSic mean 9.02/10. It contains grading reports, per-wave learning logs documenting what raises and lowers agent scores, and a comprehensive LEARNINGS.md that distills the empirical findings for anyone building agents.

What Makes a 9/10 Agent (Autoresearch Distillate)

These findings come from running a Karpathy-style autoresearch loop across 20 agents over ~100 iterations. Each wave targeted one quality dimension, measured the impact, and logged what worked. The full data is in Factory Showcase LEARNINGS.md.

Ranked by impact (score improvement per wave)

| Change | Dimension | Score Δ | Why it works |
| --- | --- | --- | --- |
| Observability from zero | Observability | +9.0 | Agents without tracing, SLOs, or cost tracking have 0/10 observability. Adding all three in one pass produces the largest jump. |
| Real state machines | Source Code | +5.0 | Typed enum states + transition tables replace stubs. Agents can't enter undefined states. Enables checkpoint/resume. |
| SECURITY.md + threat model | Safety | +4.0 | Domain-specific threats beat generic "security is important" notes. |
| 4 test types | Testing | +4.0 | Happy path + error recovery + adversarial + regression. Happy-path-only testing is the #1 testing anti-pattern. |
| Error taxonomy per tool | Tool Design | +3.0 | `retryable` flag on each error code lets the agent distinguish "retry" from "give up." |
| README + Mermaid + env matrix | Documentation | +3.0 | Architecture diagrams, env var tables, and honest known limitations. |
| HITL + refusal + memory strategy | System Prompt | +2.5 | Human approval gates with timeout behavior + explicit "when NOT to act" rules. |
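
The "error taxonomy per tool" row is easiest to see in code. The sketch below shows one way a `retryable` flag could drive retry-with-backoff. The error codes, the `ToolError` class, and the `call_with_retry` helper are illustrative assumptions, not the error schema shipped in agent-maker/SKILL.md.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Hypothetical taxonomy for a single tool: error code -> retryable?
ERROR_TAXONOMY = {
    "RATE_LIMITED": True,
    "UPSTREAM_TIMEOUT": True,
    "INVALID_ARGUMENT": False,
    "PERMISSION_DENIED": False,
}


class ToolError(Exception):
    """A tool failure carrying a documented code and its retryable flag."""

    def __init__(self, code: str, message: str = ""):
        super().__init__(message or code)
        self.code = code
        self.retryable = ERROR_TAXONOMY[code]  # unknown codes fail loudly


def call_with_retry(
    tool: Callable[[], T],
    max_attempts: int = 3,
    base_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `tool`, retrying only errors flagged retryable, with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return tool()
        except ToolError as err:
            # Give up immediately on non-retryable errors or when out of attempts.
            if not err.retryable or attempt == max_attempts:
                raise
            sleep(base_delay_s * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable so tests can record the delays instead of actually waiting.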

Anti-patterns that tank scores

NotImplementedError stubs — skeleton code scores 0/10 on source code
Happy-path-only tests — no evidence of failure handling = Testing < 5
Template slop — identical copy-paste across agents is detected and penalized
No circuit breakers — max_steps + max_wall_time_s + max_spend_usd are the essential trio
Generic SLOs — same latency target for a real-time responder and a batch planner shows no domain understanding
Missing error taxonomy — tools without error docs force agents to guess retry behavior
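
As an illustration of the circuit-breaker trio named above, the following sketch tracks all three budgets in one object. Only the three limit names (`max_steps`, `max_wall_time_s`, `max_spend_usd`) come from this document; the class, its `charge` method, and the exception type are assumptions for the example.

```python
import time
from dataclasses import dataclass, field


class BudgetExceeded(Exception):
    """Raised when any one of the three limits is exceeded."""


@dataclass
class CircuitBreaker:
    max_steps: int
    max_wall_time_s: float
    max_spend_usd: float
    steps: int = 0
    spend_usd: float = 0.0
    # Monotonic clock so wall-time checks are immune to system clock changes.
    started_at: float = field(default_factory=time.monotonic)

    def charge(self, cost_usd: float = 0.0) -> None:
        """Record one agent step and its cost, then trip if a limit is exceeded."""
        self.steps += 1
        self.spend_usd += cost_usd

        if self.steps > self.max_steps:
            raise BudgetExceeded("max_steps")
        if time.monotonic() - self.started_at > self.max_wall_time_s:
            raise BudgetExceeded("max_wall_time_s")
        if self.spend_usd > self.max_spend_usd:
            raise BudgetExceeded("max_spend_usd")
```

An agent loop would call `charge()` once per iteration, passing the cost of that iteration's model and tool calls, and treat `BudgetExceeded` as a signal to stop and report rather than continue.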

The 9/10 agent checklist

If you read one thing before building an agent, check these boxes:

How this changes what Agent Factory produces

agent-maker/SKILL.md now includes copy-pasteable templates for all of the above (error schema, env matrix, SECURITY.md outline, SLO table, state enum). AGENT_SPEC.md now encodes these findings in its score anchors: a 10 on Testing requires four scenario classes, a 10 on Safety requires SECURITY.md, and a new "Autoresearch-grade" tier (≥ 9.0) exists for agents meeting all empirical quality drivers.

Course modules 03, 05, 06, 08, 11, 12, 17, 18, 19, 20, 22, and 23 now include empirical notes linking their topic to these ranked findings.

Contributing

Contributions welcome:

Add a source repo: Update raw/repos/SOURCES.md, recompile the wiki
Write a wiki article: Follow the patterns in existing concept/research articles
Add a curated example: Good or bad, with annotations explaining why
Improve the course: Each module should be self-contained with exercises
Improve the spec: AGENT_SPEC.md evolves as the ecosystem matures
Report a gap: Open an issue if the wiki is missing a pattern you've seen

License

MIT
