akijain2000/agent-factory

Agent Factory

An LLM knowledge base and course for building production-quality AI agents.

Built on Karpathy's LLM-KB pattern: raw sources are collected from 40+ repositories and 25 reference documents, LLM-compiled into a structured wiki of 80+ articles, then queried by meta-skills to produce better agents. Sibling project to Skill Factory, which teaches how to write the SKILL.md files that augment these agents.

```mermaid
flowchart TD
    subgraph ingest ["DATA INGEST"]
        Articles["Articles"]
        Papers["Papers"]
        Repos["Repos"]
        Datasets["Datasets"]
        Images["Images"]
        ClipArticle["clip-article.md"]
    end

    subgraph rawStore ["raw/"]
        RawDocs["docs/SOURCES.md"]
        RawRepos["repos/SOURCES.md"]
        RawDatasets["datasets/ JSON+CSV"]
        RawImages["images/"]
    end

    subgraph extraTools ["EXTRA TOOLS"]
        SearchWiki["search-wiki.ts"]
        WikiStats["wiki-stats.ts"]
        CheckLinks["check-links.ts"]
        ValidateAgent["validate-agent.ts"]
    end

    subgraph llmEngine ["LLM ENGINE"]
        Compile["compile-wiki.md"]
        QA["research-qa.md"]
        Linting["health-check.md"]
        Indexing["build-index.md"]
    end

    subgraph wikiStore ["KNOWLEDGE STORE"]
        Wiki["Wiki .md\n80+ articles\n35 concepts\n23 research\n22 examples\nINDEX + GLOSSARY\nBACKLINKS"]
    end

    subgraph outputs ["OUTPUTS"]
        Markdown["Markdown articles"]
        Slides["Marp slides"]
        Charts["Matplotlib charts"]
    end

    subgraph ide ["IDE FRONTEND"]
        Obsidian["Obsidian vault\nGraph view\nSearch\nSlide preview"]
    end

    subgraph futureBlock ["FUTURE EXPLORATIONS"]
        SyntheticData["Synthetic data gen\n+ finetuning"]
        ProductVision["Product vision\nWeb UI + API"]
    end

    Articles & Papers & Repos & Datasets & Images --> rawStore
    ClipArticle --> rawStore
    rawStore --> Compile
    rawStore --> QA
    rawStore --> Linting
    rawStore --> Indexing
    Compile --> Wiki
    QA --> Wiki
    Linting --> Wiki
    Indexing --> Wiki
    extraTools --> Wiki
    Wiki --> Markdown
    Wiki --> Slides
    Wiki --> Charts
    Charts -->|"filed back"| Wiki
    rawStore & Wiki & Slides & Charts --> Obsidian
    Wiki -.-> futureBlock
```

What This Is

This repo is six things:

  1. A knowledge base -- 80+ wiki articles distilled from 40+ top AI agent repos (700K+ stars combined), covering every pattern, anti-pattern, and technique for building autonomous agents
  2. A 23-module course -- Zero-to-production curriculum (~11-13 hours) with exercises, code examples, and framework walkthroughs across LangGraph, OpenAI Agents SDK, and Anthropic
  3. An Agent Maker -- An interactive agent-maker/SKILL.md that asks 6 forcing questions, challenges assumptions, and guides you through 8 phases from idea to validated agent
  4. A Prompt Decomposer -- A prompt-decomposer/SKILL.md that takes a large prompt, codebase, or system description and identifies sections that could become agent components
  5. A meta-skill -- An authoring/SKILL.md that queries the wiki to help you review, score, and improve existing agents against the AGENT_SPEC quality standard
  6. A quality standard -- AGENT_SPEC.md defines 8 dimensions for scoring agent projects, with a canonical project structure and minimum quality bar

Why This Exists

Building a production agent is harder than it looks. Most agents fail at architecture (no clear loop or state management), tools (too many, poorly described, no error handling), safety (no guardrails, no sandbox, no approval gates), or testing (no behavioral tests, no trace analysis, no baselines).

This project codifies what works and what doesn't, drawn from analyzing the architectures and patterns of 40+ real agent systems across the ecosystem -- from OpenClaw (344K stars) to AutoAgent (714 stars). It covers the full spectrum: single-agent tools, multi-agent orchestration, self-improving harness loops, protocol interoperability, and production deployment.

The key insight: Skills augment agents. Agents are the autonomous systems themselves. Skill Factory teaches you to write SKILL.md files. Agent Factory teaches you to build the agent that reads them.


Directory Structure

agent-factory/
├── SKILL.md                     # Entry point: A/B/C/D concierge router
├── AGENT_SPEC.md                # Quality standard (8 dimensions, scoring guide)
├── README.md                    # You are here
│
├── agent-maker/                 # Interactive agent creator
│   └── SKILL.md                 # 8-phase guided creation with forcing questions
│
├── authoring/                   # The review meta-skill
│   └── SKILL.md                 # Wiki-backed agent review and improvement
│
├── prompt-decomposer/           # System-to-agents extractor
│   └── SKILL.md                 # Analyze prompts/codebases, suggest agent components
│
├── course/                      # 23-module agent building course
│   ├── README.md                # Course overview and learning path
│   ├── 01-what-are-agents.md    # Foundation: agents vs chatbots vs copilots
│   ├── 02-agent-architectures.md # Single, multi-agent, hierarchical
│   ├── 03-the-agent-loop.md     # Plan-act-observe-reflect, ReAct pattern
│   ├── 04-system-prompts-for-agents.md # Persona, constraints, tool instructions
│   ├── 05-tool-design-and-integration.md # Function calling, MCP, structured I/O
│   ├── 06-memory-and-context-engineering.md # Memory types, context rot, ByteRover
│   ├── 07-planning-and-reasoning.md # CoT, ToT, task decomposition, AutoAgent
│   ├── 08-error-handling-and-recovery.md # Retries, fallbacks, circuit breakers, Hermes
│   ├── 09-agent-design-patterns.md # Ng's 4 patterns, ReAct, Reflexion
│   ├── 10-multi-agent-patterns.md # Supervisor, pipeline, swarm, Paperclip
│   ├── 11-anti-patterns.md      # God agent, over-tooling, premature autonomy
│   ├── 12-state-management.md   # Checkpointing, persistence, git-as-memory
│   ├── 13-framework-selection.md # Decision matrix: LangGraph vs CrewAI vs SDK
│   ├── 14-building-with-langgraph.md # State graphs, persistence, human-in-the-loop
│   ├── 15-building-with-openai-agents-sdk.md # Agent class, handoffs, guardrails
│   ├── 16-building-with-anthropic.md # Tool use, extended thinking, computer use
│   ├── 17-agent-evaluation-and-testing.md # Behavioral tests, benchmarks, A/B testing
│   ├── 18-safety-and-guardrails.md # Sandboxing, prompt injection, approval gates
│   ├── 19-observability-and-debugging.md # Tracing, cost tracking, debugging long runs
│   ├── 20-deployment-and-scaling.md # Serverless vs containers, rate limits, scaling
│   ├── 21-protocols-and-interoperability.md # MCP servers, A2A, tool marketplaces
│   ├── 22-self-improvement-and-harness-engineering.md # Learning loops, harnesses
│   └── 23-capstone-build-a-production-agent.md # End-to-end project (3 tracks)
│
├── wiki/                        # LLM-compiled knowledge base
│   ├── INDEX.md                 # Master table of contents (start here)
│   ├── GLOSSARY.md              # 67 terms with definitions
│   ├── concepts/                # 35 core agent-building concept articles
│   │   ├── agent-loop.md
│   │   ├── tool-design.md
│   │   ├── memory-systems.md
│   │   ├── context-engineering.md
│   │   ├── harness-engineering.md
│   │   ├── self-improving-agents.md
│   │   ├── autonomous-loops.md
│   │   └── ... (28 more)
│   ├── research/                # 23 ecosystem analysis and deep-dive articles + 3 raw data files
│   │   ├── hermes-agent-deep-dive.md
│   │   ├── paperclip-orchestration-analysis.md
│   │   ├── autoagent-harness-patterns.md
│   │   ├── context-memory-architecture.md
│   │   ├── openclaw-scale-analysis.md
│   │   ├── framework-comparison.md
│   │   └── ... (17 more)
│   ├── examples/
│   │   ├── good/                # 14 exemplary agents with annotations
│   │   └── bad/                 # 8 anti-pattern agents with analysis
│   └── queries/                 # Filed Q&A and update logs
│
├── scripts/                     # Automation & CLI tools
│   ├── validate-agent.ts        # Agent project linter (Bun/Node)
│   ├── search-wiki.ts           # Full-text wiki search CLI
│   ├── wiki-stats.ts            # Article count, word count, orphan detection
│   ├── check-links.ts           # Internal link validator
│   ├── compile-wiki.md          # LLM runbook: compile raw/ into wiki/
│   ├── health-check.md          # LLM runbook: audit wiki quality (10 checks)
│   ├── research-qa.md           # LLM runbook: wiki-grounded Q&A engine
│   ├── build-index.md           # LLM runbook: regenerate indexes + backlinks
│   ├── clip-article.md          # LLM runbook: web clipper for new sources
│   ├── update-sources.md        # LLM runbook: monthly discovery + update
│   └── discovery-keywords.txt   # Keywords for finding new agent repos
│
├── raw/                         # Source material (data ingest)
│   ├── docs/
│   │   └── SOURCES.md           # 25 reference documents (URLs + summaries)
│   ├── repos/
│   │   └── SOURCES.md           # 40+ repo manifest with tiers and descriptions
│   ├── datasets/                # Structured data (JSON, CSV)
│   │   ├── SOURCES.md           # Dataset manifest
│   │   ├── autoresearch-scores.json  # 20 agents x 13 dimensions
│   │   ├── wave-improvements.csv     # 7 waves with before/after/delta
│   │   └── agent-progression.csv     # Score progression per agent
│   └── images/                  # Architecture diagrams and visual references
│
├── outputs/                     # Generated artifacts
│   ├── slides/                  # Marp-format presentation decks
│   │   ├── agent-quality-overview.md  # What makes a 9/10 agent
│   │   ├── course-summary.md          # 23 modules in 6 blocks
│   │   └── autoresearch-results.md    # 7 waves, 20 agents, 100 iterations
│   └── charts/                  # Generated visualizations
│       └── generate-charts.py   # Radar, bar, heatmap, delta charts
│
├── future/                      # Design docs for future work
│   ├── synthetic-data.md        # Synthetic training data from wiki
│   ├── product-vision.md        # Web UI, API, self-improving system
│   └── finetuning-plan.md       # Fine-tune a model on the wiki
│
├── .obsidian/                   # Obsidian vault config (IDE frontend)
│   ├── app.json                 # Editor settings
│   ├── workspace.json           # Default layout
│   ├── graph.json               # Graph view with color groups
│   └── snippets/wiki-style.css  # Wiki styling
│
└── VAULT.md                     # How to open in Obsidian

Quick Start

Tell your AI agent to read SKILL.md in this repo. It will ask what you want to do:

  • A) Review and improve an existing agent -- guided brainstorm review or quick AGENT_SPEC report
  • B) Brainstorm and create a new agent -- interactive 8-phase Agent Maker with forcing questions
  • C) Learn about agent building -- 23-module course or wiki browsing
  • D) Extract agent components from a codebase/prompt -- decompose into modules

Or jump directly:

```
# Review an agent
Read SKILL.md and help me review my agent at path/to/my-agent/

# Create an agent
Read agent-maker/SKILL.md and help me create an agent for [your idea]

# Break a system into agents
Read prompt-decomposer/SKILL.md and analyze this prompt for agent components

# Take the course
Read course/README.md
```

Additional tools

Validate an agent project:

```sh
bun scripts/validate-agent.ts path/to/your-agent/
```

Checks: README exists with architecture section, system prompt file present, tool definitions exist, tests directory non-empty, no committed secrets, system prompt has persona/constraints/tool instructions.
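The checklist above can be sketched as a pure function over the project's file list. This is an illustrative sketch only, not the actual logic of validate-agent.ts; file names like `system-prompt.md` and the exact check set are assumptions.

```typescript
// Hypothetical sketch of validate-agent-style checks over a file listing.
// Real checks would also read file contents (architecture section, persona, etc.).

type CheckResult = { check: string; pass: boolean };

function validateAgentFiles(files: string[]): CheckResult[] {
  const has = (name: string) =>
    files.some((f) => f === name || f.startsWith(name + "/"));
  return [
    { check: "README.md present", pass: has("README.md") },
    { check: "system prompt file present", pass: has("system-prompt.md") },
    { check: "tests directory non-empty", pass: files.some((f) => f.startsWith("tests/")) },
    { check: "no committed .env secrets", pass: !has(".env") },
  ];
}

// Example: a project with no tests fails exactly one check.
const results = validateAgentFiles(["README.md", "system-prompt.md", ".gitignore"]);
console.log(results.filter((r) => !r.pass).map((r) => r.check));
// logs the single failing check: "tests directory non-empty"
```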

Compile the wiki (after adding new sources):

Read scripts/compile-wiki.md and compile the wiki.

Run a health check:

Read scripts/health-check.md and run a health check.

Course Overview

The course/ directory contains a 23-module curriculum organized in 6 blocks, from "what is an agent?" to shipping a production agent. Each module has learning objectives, content sections with code examples, exercises, and further reading that links into the wiki.

Block 1: Foundations (Modules 01-04, ~2.5 hours)

| # | Module | Time | What You Learn |
|---|--------|------|----------------|
| 01 | What Are Agents | 30 min | Agents vs chatbots vs copilots vs workflows; anatomy of an agent; the autonomy spectrum |
| 02 | Agent Architectures | 40 min | Single, multi-agent, hierarchical; architecture decision tree; case studies |
| 03 | The Agent Loop | 40 min | Plan-act-observe-reflect; ReAct pattern dissected; state machines vs free-form; loop termination |
| 04 | System Prompts for Agents | 30 min | Persona + constraints + tool instructions + guardrails; how agent prompts differ from chat prompts |
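The plan-act-observe loop taught in Module 03 fits in a few lines. Below is a minimal sketch with a stubbed decision step standing in for the LLM; `decide` and `runTool` are hypothetical stand-ins, not code from the course.

```typescript
// Minimal plan-act-observe loop with a hard termination guard.
type Action = { tool: string; input: string } | { finish: string };

// Stub policy: calls one tool, then answers. A real agent asks an LLM here.
function decide(observations: string[]): Action {
  if (observations.length === 0) return { tool: "search", input: "capital of France" };
  return { finish: `Answer based on: ${observations[0]}` };
}

// Stub tool executor; a real agent would dispatch to registered tools.
function runTool(tool: string, input: string): string {
  return `result of ${tool}("${input}")`;
}

function agentLoop(maxSteps = 5): string {
  const observations: string[] = [];
  for (let step = 0; step < maxSteps; step++) {       // loop termination
    const action = decide(observations);               // plan
    if ("finish" in action) return action.finish;      // reflect / stop
    observations.push(runTool(action.tool, action.input)); // act + observe
  }
  return "max steps reached";
}

console.log(agentLoop());
```

The `maxSteps` guard is the "loop termination" topic from Module 03: every agent loop needs a non-LLM exit condition.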

Block 2: Core Building Blocks (Modules 05-08, ~2.5 hours)

| # | Module | Time | What You Learn |
|---|--------|------|----------------|
| 05 | Tool Design and Integration | 45 min | Function calling schemas, MCP servers, tool routing, error handling; gstack + Composio patterns |
| 06 | Memory and Context Engineering | 45 min | Short-term/long-term/episodic/semantic memory; RAG for agents; context rot (Willison); ByteRover's context tree (96%+ LoCoMo) |
| 07 | Planning and Reasoning | 40 min | Chain of Thought, Tree of Thoughts, task decomposition; AutoAgent's program.md as planning interface |
| 08 | Error Handling and Recovery | 30 min | Retries with backoff, fallback chains, circuit breakers, dead letter queues; Hermes Agent's fallback provider chains |
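Module 08's retry-with-backoff and fallback-chain ideas compose naturally. A generic sketch follows; the attempt counts and delays are illustrative defaults, not Hermes Agent's actual configuration.

```typescript
// Retry a flaky call with exponential backoff: 100ms, 200ms, 400ms, ...
async function withRetry<T>(fn: () => Promise<T>, attempts = 3, baseMs = 100): Promise<T> {
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      if (i === attempts - 1) throw err; // out of retries
      await new Promise((r) => setTimeout(r, baseMs * 2 ** i));
    }
  }
  throw new Error("unreachable");
}

// Fallback chain: exhaust retries on each provider before moving to the next.
async function withFallback<T>(providers: Array<() => Promise<T>>): Promise<T> {
  let lastErr: unknown;
  for (const p of providers) {
    try {
      return await withRetry(p);
    } catch (err) {
      lastErr = err; // this provider is down; try the next
    }
  }
  throw lastErr;
}
```

Wrapping every provider call this way is what turns a single-provider outage into a latency blip instead of a failed run.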

Block 3: Patterns and Anti-Patterns (Modules 09-12, ~2.5 hours)

| # | Module | Time | What You Learn |
|---|--------|------|----------------|
| 09 | Agent Design Patterns | 45 min | Andrew Ng's 4 patterns (reflection, tool use, planning, multi-agent); ReAct and Reflexion deep dives; neural-maze implementations |
| 10 | Multi-Agent Patterns | 45 min | Supervisor, sequential, parallel, swarm, debate; Paperclip's team orchestration; CrewAI roles; Swarm handoffs |
| 11 | Anti-Patterns | 35 min | God agent, over-tooling, premature multi-agent, context abuse, sycophantic loops, premature autonomy; AutoGPT lessons |
| 12 | State Management | 35 min | Checkpointing, persistence, conversation threading, context window management; LangGraph checkpointing; Ralph's git-as-memory |

Block 4: Frameworks (Modules 13-16, ~2.5 hours)

| # | Module | Time | What You Learn |
|---|--------|------|----------------|
| 13 | Framework Selection | 30 min | Decision matrix: LangGraph vs CrewAI vs OpenAI Agents SDK vs Anthropic vs raw API; when to go framework-free |
| 14 | Building with LangGraph | 45 min | State machines as directed graphs, persistence, human-in-the-loop, streaming; building a research agent |
| 15 | Building with OpenAI Agents SDK | 45 min | Agent class, handoffs, guardrails, tracing, MCP integration; building a customer service agent |
| 16 | Building with Anthropic | 45 min | Tool use, extended thinking, computer use, orchestrator-workers; building a code analysis agent |

Block 5: Production (Modules 17-20, ~2.5 hours)

| # | Module | Time | What You Learn |
|---|--------|------|----------------|
| 17 | Agent Evaluation and Testing | 40 min | Behavioral tests, trace analysis, benchmarks (SWE-Bench, HumanEval), compliance scoring, A/B testing; AutoAgent's hill-climbing |
| 18 | Safety and Guardrails | 40 min | Sandboxing (Docker, e2b), prompt injection defense, approval gates, confidence thresholds; OpenClaw's permissions |
| 19 | Observability and Debugging | 35 min | Tracing (LangSmith, Braintrust), cost tracking, debugging 50-step traces; Paperclip's cost dashboard; LiteLLM |
| 20 | Deployment and Scaling | 35 min | Serverless vs containers vs edge, rate limiting, cost optimization at scale; Hermes Agent's VPS-to-GPU range |

Block 6: Mastery (Modules 21-23, ~2.5 hours)

| # | Module | Time | What You Learn |
|---|--------|------|----------------|
| 21 | Protocols and Interoperability | 40 min | MCP server authoring, A2A communication, tool marketplaces, cross-org agent composition |
| 22 | Self-Improvement and Harness Engineering | 45 min | Learning loops, skill creation from experience, score-driven hill-climbing, self-modifying agents; AutoAgent, Hermes, Ouroboros |
| 23 | Capstone: Build a Production Agent | 60 min | End-to-end project. Track A: coding assistant. Track B: multi-agent research team. Track C: self-improving agent with harness |

What the Wiki Covers

Core Concepts (35 articles)

| Category | Articles | What You Learn |
|----------|----------|----------------|
| Agent Core | agent-loop, planning-strategies, state-management, agent-lifecycle | The fundamental cycle, how agents plan and persist, lifecycle from prototype to production |
| Tools | tool-design, tool-selection, structured-outputs | Building tools agents can use effectively; routing; schema enforcement (Pydantic, Zod) |
| Memory | memory-systems, agent-memory-patterns, context-window-management, context-engineering | Short/long/episodic/semantic memory; scratchpad vs RAG vs knowledge base; Lutke/Karpathy context engineering; Willison's context rot |
| Safety | guardrails, human-in-the-loop, sandboxing, agent-security | Input validation, approval gates, Docker/e2b isolation, prompt injection defense, least privilege |
| Multi-Agent | multi-agent-orchestration, agent-handoffs, agent-orchestration-platforms | Supervisor/pipeline/fan-out/swarm; clean handoffs; Paperclip-style team management |
| Production | error-recovery, observability, cost-optimization, deployment-patterns, rate-limiting | Retries, circuit breakers; tracing; token budgets; serverless vs containers; backpressure |
| Design | prompt-engineering-for-agents, agent-personas, progressive-complexity, agent-composition, model-selection, agent-ux | System prompt design; role identity; start simple; modular agents; cost/quality tradeoffs; streaming UX |
| Testing | feedback-loops, agent-evaluation, agent-testing-patterns | Reflection, Reflexion, critic agents; behavioral tests; unit/integration/trace-based testing |
| Advanced | harness-engineering, self-improving-agents, autonomous-loops | Meta-agent control structures; learning from experience; score-driven hill-climbing; self-modifying agents |

Research (23 deep dives + 3 raw data files)

| Article | What You Learn |
|---------|----------------|
| anatomy-of-a-good-agent | What makes an agent production-quality: architecture clarity, error handling, testing, docs |
| anti-patterns | Comprehensive catalog: god agent, over-tooling, premature multi-agent, context abuse, and more |
| framework-comparison | LangGraph vs CrewAI vs OpenAI SDK vs Anthropic vs AutoGen: architecture, strengths, maturity |
| andrew-ng-patterns | Deep dive into reflection, tool use, planning, multi-agent collaboration with implementations |
| lilian-weng-survey | Analysis of the foundational "LLM Powered Autonomous Agents" survey; what changed since |
| anthropic-agent-patterns | "Building Effective Agents" dissected: workflows before agents, prompt chaining, orchestrator-workers |
| openai-agent-patterns | OpenAI's guide + Agents SDK patterns: handoffs, guardrails, tracing |
| mcp-deep-dive | Model Context Protocol architecture: servers, clients, tools, resources, sampling |
| a2a-deep-dive | Agent-to-Agent protocol: cross-framework communication, task delegation |
| multi-agent-landscape | Survey of multi-agent approaches 2024-2026: what works and what doesn't |
| agent-evaluation-methods | How to evaluate agents: behavioral tests, trace analysis, SWE-Bench, LLM-as-judge |
| production-case-studies | Real-world agents in production: gstack, Cursor, Devin, customer service |
| gstack-agent-analysis | Browse daemon, QA skill, ship workflow: how production skills compose into an agent |
| agent-vs-workflow | When to use an agent vs a deterministic workflow: decision framework |
| cost-analysis | Economics of running agents: token costs, model selection, caching ROI |
| hermes-agent-deep-dive | Self-improving agent with learning loop, persistent skills, multi-platform, fallback chains |
| paperclip-orchestration-analysis | Zero-human company orchestration: agent teams, goal assignment, cost tracking |
| autoagent-harness-patterns | program.md as human interface, meta-agent hill-climbing, Docker isolation |
| context-memory-architecture | ByteRover's context tree, knowledge storage, 96%+ LoCoMo accuracy |
| openclaw-scale-analysis | Architecture of the 344K-star agent project: multi-channel, autonomous coding at scale |
| classic-framework | CLASSic operational evaluation: Cost, Latency, Accuracy, Stability, Security across 2,100+ enterprise messages |
| adarubric-evaluation | AdaRubric task-adaptive evaluation: 3-stage pipeline, Pearson r=0.79 human correlation, DPO training gains |
| karpathy-autoresearch | Autoresearch self-improvement pattern: program.md + objective scorer, 700 experiments, 11% training loss improvement |

Curated Examples (22 annotated)

14 good examples with detailed annotations explaining what makes each one work:

| Example | Pattern | Key Takeaway |
|---------|---------|--------------|
| LangGraph ReAct Agent | State machine + tool loop | Explicit state transitions, clean tool routing |
| CrewAI Research Team | Role-based multi-agent | Clean role separation, sequential pipeline |
| OpenAI Agents Customer Service | Handoff architecture | Triage -> specialist handoffs with guardrails |
| gstack Browse Daemon | Focused tool agent | Structured output, diff-based verification |
| Anthropic Computer Use | Progressive capability | Screen reading with safety gates |
| Goose Extensible Agent | Plugin architecture | Composition over monolith |
| phidata Knowledge Agent | RAG pipeline agent | Clean memory pattern, focused retrieval |
| Simple Function Calling | Minimal agent | 20 lines, simplest possible agent loop |
| MCP Tool Server | Protocol-based tools | Framework-agnostic, composable |
| Multi-Agent Code Review | Parallel fan-out | Linter + security + style agents, structured aggregation |
| Hermes Self-Improving | Learning loop | Skills from experience, promotion criteria |
| Paperclip Agent Team | Orchestration platform | Goal assignment, cost tracking, agent-as-employee |
| AutoAgent Harness Loop | Meta-agent optimization | program.md + benchmarks + hill-climbing |
| Ralph PRD Completion | Autonomous build loop | Git as memory, iterative building, test verification |

8 anti-pattern examples showing exactly what goes wrong:

| Anti-Pattern | What Fails | Root Cause |
|--------------|------------|------------|
| God Agent | Tool confusion, context overflow, untestable | 50+ tools, 5000-token prompt, no specialization |
| Over-Tooled Agent | Decision paralysis, wrong tool selection | 100+ overlapping tools, LLM can't choose |
| Chatbot Pretending to Be Agent | No agency, no tools, no loop | Just a chatbot with a fancy system prompt |
| Framework Soup | Conflicting abstractions, maintenance nightmare | LangGraph AND CrewAI AND AutoGen in one project |
| No-Guardrails Agent | Deletes production database | Full system access, no sandbox, no approval |
| Infinite Loop Agent | Cost explosion, no recovery | No circuit breaker, no max iterations, no fallback |
| Context-Stuffing Agent | Context rot, high cost, poor performance | Dumps entire codebase into 200K context window |
| Premature Autonomy Agent | Unauthorized purchases, emails, data modification | Full autonomy before guardrails exist |
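The Infinite Loop Agent is the easiest of these to guard against mechanically: enforce the hard caps AGENT_SPEC.md names (max_steps, max_wall_time_s, max_spend_usd). A minimal sketch with illustrative limit values:

```typescript
// Circuit breaker with the three hard caps from AGENT_SPEC.md.
// The default limits here are illustrative, not the spec's values.
class CircuitBreaker {
  private steps = 0;
  private spendUsd = 0;
  private readonly startMs = Date.now();

  constructor(
    private maxSteps = 25,
    private maxWallTimeS = 300,
    private maxSpendUsd = 5.0,
  ) {}

  recordStep(costUsd: number): void {
    this.steps++;
    this.spendUsd += costUsd;
  }

  // Returns which limit tripped, or null if the loop may continue.
  tripped(): string | null {
    if (this.steps >= this.maxSteps) return "max_steps";
    if ((Date.now() - this.startMs) / 1000 >= this.maxWallTimeS) return "max_wall_time_s";
    if (this.spendUsd >= this.maxSpendUsd) return "max_spend_usd";
    return null;
  }
}
```

The agent loop checks `tripped()` before every iteration and halts (or escalates to a human) on any non-null result.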

Key Discoveries

Patterns and insights identified across the 40+ source repos:

  1. Context Engineering > Prompt Engineering -- Tobi Lutke (Shopify CEO) coined "context engineering" as the real skill: curating the information environment an LLM operates in, not just writing prompts. Karpathy endorsed this shift. Source: Lutke tweets, Karpathy commentary (2025-2026).

  2. Context Rot -- Simon Willison identified 4 patterns of context degradation in long-running agents: poisoning (wrong info injected), distraction (irrelevant context competing), confusion (conflicting instructions), clash (sessions bleeding together). Source: simonwillison.net.

  3. Harness Engineering -- Kevin Gu's AutoAgent showed that you can "program the meta-agent, not the harness": write a program.md that specifies behavior, then let a meta-agent modify it, benchmark the result, and hill-climb on scores. The agent optimizes itself. Source: kevinrgu/autoagent.

  4. Workflows Before Agents -- Anthropic's "Building Effective Agents" guide advocates starting with augmented LLMs, moving to workflows (prompt chaining, routing, parallelization), and only using agents when you need the full loop. Most teams over-agent. Source: anthropic.com.

  5. Andrew Ng's 4 Patterns -- Reflection, tool use, planning, and multi-agent collaboration. These four patterns transform an LLM from a single-shot generator into an agentic system. Each has a distinct implementation shape. Source: deeplearning.ai/the-batch.

  6. Self-Improving Agents -- Hermes Agent (Nous Research) learns from interactions, creates new skills autonomously, and maintains persistent knowledge across sessions. The learning loop is: interact -> evaluate -> extract skill -> promote to permanent. Source: NousResearch/hermes-agent.

  7. Git as Memory -- Ralph uses git as the agent's memory store: commit state, branch for exploration, diff for comparison, log for history. Simple, auditable, already in every developer's workflow. Source: snarktank/ralph.

  8. Agent-as-Employee -- Paperclip treats agents as employees in a "zero-human company": assign goals, track costs, manage performance, provide skills library. The orchestration layer is a React dashboard. Source: paperclipai/paperclip.

  9. Context Tree Architecture -- ByteRover builds a hierarchical context tree for coding agents that achieves 96%+ accuracy on the LoCoMo long-context benchmark. Portable memory with cloud sync across 20+ LLM providers. Source: campfirein/byterover-cli.

  10. The Progressive Complexity Trap -- The #1 anti-pattern is premature multi-agent: adding agents when a single agent with better tools would suffice. Start with the simplest thing that could work. Source: Anthropic, confirmed across AutoGPT's evolution.

  11. Micro-Agents -- Some agents are 20 lines of code. The LLM already knows HOW to do things -- it just needs the loop, tools, and guardrails. Don't over-engineer the first version. Source: OpenAI Swarm (educational), mattpocock/skills pattern.

  12. Compliance Measurement -- Don't hope your agent works, measure it. Generate behavioral specs, run scenarios, capture traces, classify adherence. AutoAgent takes this further with score-driven hill-climbing: the meta-agent modifies itself until benchmarks improve. Source: kevinrgu/autoagent.

  13. CLASSic Framework for Operational Evaluation -- The CLASSic framework (Zylos Research, 2026) evaluates agents across 5 production-readiness dimensions: Cost, Latency, Accuracy, Stability, Security. Grounded in 2,100+ enterprise messages across 7 industry domains. Key finding: agents that ace behavioral tests can score ≤3/10 on Cost, making them unshippable. Source: wiki/research/classic-framework.md.

  14. AdaRubric: Task-Adaptive Evaluation -- Fixed rubrics fail because different agent domains need different quality dimensions. AdaRubric (arXiv:2603.21362) generates task-specific rubrics achieving Pearson r=0.79 human correlation. The DimensionAwareFilter prevents high aggregate scores from masking critical per-dimension failures. Source: wiki/research/adarubric-evaluation.md.

  15. Karpathy Autoresearch Pattern -- The autoresearch pattern (Karpathy, 2025-2026) formalizes score-driven self-improvement: program.md + objective scorer + minimal diffs + automatic revert. 700 experiments in 2 days, 11% training loss improvement on NanoGPT. Generalizes beyond ML to prompt engineering, code optimization, and factory quality improvement. Source: github.com/karpathy/autoresearch, wiki/research/karpathy-autoresearch.md.
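Discoveries 3, 12, and 15 share one control loop: propose a minimal change, score it objectively, keep it only if the score improves, otherwise revert. A generic sketch follows; the `mutate` and `score` callbacks are stand-ins for a meta-agent editing program.md and a benchmark run, not code from any of the cited repos.

```typescript
// Generic score-driven hill-climb: keep a candidate only if it beats the best.
function hillClimb(
  initial: string,
  mutate: (program: string) => string, // minimal diff to the program
  score: (program: string) => number,  // objective scorer (benchmark)
  iterations = 10,
): { program: string; score: number } {
  let best = initial;
  let bestScore = score(initial);
  for (let i = 0; i < iterations; i++) {
    const candidate = mutate(best);
    const s = score(candidate);
    if (s > bestScore) {               // keep improvements
      best = candidate;
      bestScore = s;
    }                                  // else: automatic revert (best unchanged)
  }
  return { program: best, score: bestScore };
}
```

The whole pattern hinges on `score` being objective and cheap to run; with a noisy or gameable scorer, the loop optimizes the wrong thing.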


Source Repositories Analyzed

40+ repositories organized into 7 tiers, totaling 700K+ GitHub stars:

Tier 1: Major Agent Systems (analyze deeply)

| Repo | Stars | Key Contribution |
|------|-------|------------------|
| openclaw/openclaw | 344K | Most-starred agent project. Multi-channel personal AI, autonomous coding at scale |
| significantgravitas/AutoGPT | 177K | Pioneer autonomous agent. Lessons from autonomous loops and their limitations |
| paperclipai/paperclip | 45K | "Zero-human company" orchestration. Agent teams, goal assignment, cost tracking |
| NousResearch/hermes-agent | 23K | Self-improving agent with learning loop. Persistent skills, multi-platform |

Tier 2: Frameworks (analyze architecture patterns)

| Repo | Stars | Key Contribution |
|------|-------|------------------|
| crewaiInc/crewAI | 44K | Role-based multi-agent with YAML config. Fastest to prototype |
| langchain-ai/langgraph | 25K | Directed graph state machines. Production-grade persistence, human-in-the-loop |
| openai/openai-agents-python | 19K | Handoff-based architecture, native MCP, guardrails, tracing |
| google/adk-python | 18K | Google's Agent Development Kit. Multimodal, hierarchical orchestration |
| microsoft/autogen | -- | AG2. Pioneered multi-agent conversation patterns |
| openai/swarm | -- | Lightweight multi-agent handoff patterns (educational) |
| pydantic/pydantic-ai | -- | Type-safe agent framework on Pydantic |
| huggingface/smolagents | -- | Minimal agent framework from HuggingFace |

Tier 3: Agent Optimizers and Harness Engineering

| Repo | Stars | Key Contribution |
|------|-------|------------------|
| campfirein/byterover-cli | 3.8K | Context tree, knowledge storage, cloud sync, 96%+ LoCoMo accuracy |
| kevinrgu/autoagent | 714 | program.md as planning interface, meta-agent hill-climbing, Docker isolation |
| greyhaven-ai/autocontext | 679 | Recursive self-improving harness for agents |
| cobusgreyling/ai_harness_engineering | -- | Complete harness engineering implementation |
| walkinglabs/awesome-harness-engineering | -- | Curated harness engineering tools and guides |

Tier 4: Autonomous Coding Agents

| Repo | Stars | Key Contribution |
|------|-------|------------------|
| snarktank/ralph | 14K | Autonomous PRD completion loop. Git as memory store |
| AndyMik90/Aperant | 13K | Multi-agent coding framework: plans, builds, validates automatically |
| block/goose | -- | Extensible agent with plugin architecture |
| garrytan/gstack | -- | Production agent skills: browse, QA, ship, deploy |
| razzant/ouroboros | 457 | Self-modifying agent that writes its own code |

Tier 5: Protocols and Infrastructure

| Repo | Stars | Key Contribution |
|------|-------|------------------|
| modelcontextprotocol/servers | -- | MCP reference server implementations |
| a2aproject/A2A | -- | Agent-to-Agent protocol specification |
| e2b-dev/e2b | -- | Sandboxed code execution for agents |
| BerriAI/litellm | -- | Unified LLM API gateway |
| composiodev/composio | -- | Tool integration platform for agents |
| vercel/ai | -- | AI SDK with streaming and tool support |

Tier 6 & 7: Tools, Capabilities, and Learning

Browser automation (Stagehand), web scraping (Firecrawl), knowledge agents (phidata), RAG frameworks (LlamaIndex), Andrew Ng's patterns implemented (neural-maze/agentic_patterns), Anthropic and OpenAI cookbooks, Anthropic courses, and curated awesome lists.

Full manifest with descriptions: raw/repos/SOURCES.md


Reference Documents (25)

In addition to repos, 25 reference documents are indexed in raw/docs/SOURCES.md:

Research Papers and Surveys

| Document | Author | Key Contribution |
|----------|--------|------------------|
| LLM Powered Autonomous Agents | Lilian Weng | Foundational survey: planning, memory, tool use for agents |
| ReAct: Synergizing Reasoning and Acting | Shunyu Yao et al. | The ReAct pattern: interleaved reasoning traces and actions |
| Reflexion: Verbal Reinforcement Learning | Noah Shinn et al. | Agents that reflect on failures and improve over attempts |
| Google "Agents" White Paper | Google DeepMind | Agent types, orchestration, evaluation |

Industry Guides

| Document | Author | Key Contribution |
|----------|--------|------------------|
| Building Effective Agents | Anthropic | Workflows before agents; prompt chaining, routing, parallelization, orchestrator-workers |
| Harness Design for Long-Running Apps | Anthropic Engineering | Production harness: error handling, context management, state persistence |
| A Practical Guide to Building Agents | OpenAI | Step-by-step: when to use agents, design foundations, guardrails |
| 4 Agentic Design Patterns | Andrew Ng | Reflection, tool use, planning, multi-agent collaboration |
| Context Engineering | Tobi Lutke / Karpathy | Curating the information environment, shift from prompt engineering |
| Context Rot Patterns | Simon Willison | Poisoning, distraction, confusion, clash in long-running agents |

Framework, Protocol, and Agent-Specific Docs

MCP specification, A2A specification, OpenAI Agents SDK docs, LangGraph docs, CrewAI docs, Anthropic tool use + extended thinking docs, AutoAgent's program.md pattern, ByteRover's context memory architecture, Hermes Agent architecture, Paperclip orchestration docs.


Key People Referenced

| Person | Contribution | Referenced In |
|--------|--------------|---------------|
| Andrej Karpathy | LLM-KB pattern (this project's architecture), context engineering | SKILL.md, README, course/06 |
| Andrew Ng | 4 agentic design patterns: reflection, tool use, planning, multi-agent | course/09, wiki/research/andrew-ng-patterns |
| Lilian Weng | "LLM Powered Autonomous Agents" survey (planning, memory, tools) | course/01, wiki/research/lilian-weng-survey |
| Shunyu Yao | ReAct paper: reasoning + acting interleaved | course/03, course/09, wiki/concepts/agent-loop |
| Noah Shinn | Reflexion: self-improvement through verbal feedback | course/09, wiki/concepts/self-improving-agents |
| Tobi Lutke | Coined "context engineering" as the core AI skill | course/06, wiki/concepts/context-engineering |
| Simon Willison | Context rot patterns: poisoning, distraction, confusion, clash | course/06, wiki/concepts/context-engineering |
| Harrison Chase | LangGraph: agent state machines, persistence, graph-based control flow | course/14, wiki/research/framework-comparison |
| Joao Moura | CrewAI: role-based multi-agent design, crew orchestration | course/10, wiki/research/multi-agent-landscape |
| Kevin Gu | AutoAgent: harness engineering, program.md, meta-agent hill-climbing | course/22, wiki/research/autoagent-harness-patterns |
| Garry Tan | gstack: production agent skills (browse, QA, ship, deploy workflows) | course/05, wiki/research/gstack-agent-analysis |
| Nous Research team | Hermes Agent: self-improving agent, learning loop, persistent skills | course/22, wiki/research/hermes-agent-deep-dive |
| Paperclip team | Multi-agent orchestration platform, "zero-human company" | course/10, wiki/research/paperclip-orchestration-analysis |
| ByteRover team | Context tree architecture, portable memory, 96%+ LoCoMo | course/06, wiki/research/context-memory-architecture |

Quality Standard (AGENT_SPEC.md)

Agent projects are evaluated against AGENT_SPEC.md across 8 dimensions, each scored 0-10:

| Dimension | What a 10 looks like |
| --- | --- |
| 1. Architecture | Typed state machine (enum + transition table), circuit breakers (max_steps, max_wall_time_s, max_spend_usd), clear boundaries |
| 2. System Prompt | Persona, tool discipline, refusal paths, HITL gates with timeout, memory strategy, cost awareness, version tag |
| 3. Tool Design | Per-tool error taxonomy with retryable flag, timeouts + backoff, idempotency for mutations, cursor pagination |
| 4. Memory | Ephemeral vs durable distinction, retention policies, PII redaction, schema migration plan |
| 5. Safety | SECURITY.md with domain threat model, HITL for destructive ops, sandboxed execution, adversarial testing |
| 6. Testing | 4 scenario classes: happy path, error recovery, adversarial, regression; CI-integrated with tool mocks |
| 7. Observability | Tracing (trace_id/span_id), domain-specific SLOs with alerts, cost tracking per request, health endpoint |
| 8. Documentation | Mermaid architecture diagram, env variable matrix, known limitations, deployment + rollback guide |

Minimum viable score: 5/10 average, no dimension below 3/10. Autoresearch-grade: >= 9.0 overall.
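The gating rule above can be sketched as a small function. This is an illustrative sketch, not the actual grading code; the dimension names and the `grade` helper are invented for the example, while the thresholds mirror the spec.

```python
def grade(scores: dict[str, float]) -> str:
    """Classify an 8-dimension scorecard (each 0-10) per the AGENT_SPEC gates."""
    mean = sum(scores.values()) / len(scores)
    if mean >= 9.0:
        return "autoresearch-grade"          # meets the highest tier
    if mean >= 5.0 and min(scores.values()) >= 3.0:
        return "minimum-viable"              # 5/10 average, no dimension below 3
    return "fail"

scores = {
    "architecture": 6, "system_prompt": 5, "tool_design": 5, "memory": 4,
    "safety": 5, "testing": 5, "observability": 3, "documentation": 7,
}
print(grade(scores))  # mean 5.0, minimum 3 -> "minimum-viable"
```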

Canonical agent project structure

my-agent/
├── README.md                # What it does, how to run, architecture diagram
├── system-prompt.md         # The agent's core instructions
├── tools/                   # Tool definitions (function schemas, MCP servers)
├── src/                     # Agent logic (loop, state, orchestration)
├── tests/                   # Behavioral test cases
└── deploy/                  # Deployment configuration
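For a quick start, the layout above can be scaffolded with a few lines of Python. This is a convenience sketch, not a script shipped by the repo; the `scaffold` helper and the placeholder file contents are invented for illustration.

```python
from pathlib import Path

# Canonical layout from the README; trailing "/" marks a directory.
LAYOUT = {
    "README.md": "# my-agent\n",
    "system-prompt.md": "# System prompt\n",
    "tools/": None,
    "src/": None,
    "tests/": None,
    "deploy/": None,
}

def scaffold(root: str) -> None:
    base = Path(root)
    for name, content in LAYOUT.items():
        path = base / name
        if name.endswith("/"):
            path.mkdir(parents=True, exist_ok=True)
        else:
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_text(content)

scaffold("my-agent")
```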

Scripts and Tools

CLI tools (Bun/Node)

| Script | Purpose | Usage |
| --- | --- | --- |
| validate-agent.ts | Lint agent projects against AGENT_SPEC | `bun scripts/validate-agent.ts /path/to/agent` |
| search-wiki.ts | Full-text search across wiki articles | `bun scripts/search-wiki.ts "circuit breaker"` |
| wiki-stats.ts | Article count, word count, orphans, categories | `bun scripts/wiki-stats.ts wiki/` |
| check-links.ts | Validate internal markdown links | `bun scripts/check-links.ts wiki/ course/` |

LLM runbooks

| Script | Purpose | Usage |
| --- | --- | --- |
| compile-wiki.md | Compile raw sources into wiki articles | Read scripts/compile-wiki.md |
| health-check.md | Audit wiki quality (10 checks: orphans, stale links, inconsistencies, gaps, connections) | Read scripts/health-check.md |
| research-qa.md | Wiki-grounded Q&A engine with citations | Read scripts/research-qa.md |
| build-index.md | Regenerate INDEX.md, BACKLINKS.md, category tags | Read scripts/build-index.md |
| clip-article.md | Ingest web articles into raw/docs/ | Read scripts/clip-article.md |
| update-sources.md | Monthly discovery and integration of new repos/docs | Read scripts/update-sources.md |
| discovery-keywords.txt | 24 search phrases for finding new agent repos | Used by update-sources.md |

Chart generation (Python)

| Script | Purpose | Usage |
| --- | --- | --- |
| outputs/charts/generate-charts.py | Generate radar, bar, heatmap, delta charts from grading data | `python outputs/charts/generate-charts.py` |
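A radar chart of the 8 AGENT_SPEC dimensions is the kind of output generate-charts.py produces. The sketch below is a minimal stand-in, not the repo's script: the scores are invented, and the real script reads its data from raw/datasets/.

```python
import math
import matplotlib
matplotlib.use("Agg")  # headless rendering, no display needed
import matplotlib.pyplot as plt

dims = ["Architecture", "System Prompt", "Tool Design", "Memory",
        "Safety", "Testing", "Observability", "Documentation"]
scores = [9, 8.5, 9, 8, 9.5, 9, 8.5, 9]  # illustrative values

# One spoke per dimension; repeat the first point to close the polygon.
angles = [2 * math.pi * i / len(dims) for i in range(len(dims))] + [0.0]
values = scores + scores[:1]

fig, ax = plt.subplots(subplot_kw={"polar": True})
ax.plot(angles, values)
ax.fill(angles, values, alpha=0.25)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(dims, fontsize=7)
ax.set_ylim(0, 10)
fig.savefig("radar.png", dpi=150)
```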

Monthly Auto-Updates

The knowledge base auto-discovers new repos and recompiles the wiki.

To run:

Read scripts/update-sources.md and run the monthly update.

What it does:

  1. Searches GitHub using discovery-keywords.txt phrases
  2. Filters for AI agent repos by relevance
  3. Scores candidates and keeps high-quality ones
  4. Updates raw/repos/SOURCES.md
  5. Incrementally recompiles affected wiki articles
  6. Regenerates INDEX.md and GLOSSARY.md
  7. Logs everything to wiki/queries/
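Steps 2-3 above (filter and score candidates) could look roughly like this. The heuristics, field names, and threshold are invented for illustration; the runbook's actual scoring criteria live in update-sources.md.

```python
def score_candidate(repo: dict) -> float:
    """Crude relevance score for a discovered repo (hypothetical heuristics)."""
    score = 0.0
    text = (repo.get("description") or "").lower()
    if any(kw in text for kw in ("agent", "llm", "tool use", "orchestration")):
        score += 2.0                                  # topical relevance
    score += min(repo.get("stars", 0) / 1000, 3.0)    # popularity, capped
    if repo.get("has_readme"):
        score += 1.0                                  # basic documentation
    return score

candidates = [
    {"name": "acme/agent-kit", "description": "LLM agent toolkit", "stars": 4200, "has_readme": True},
    {"name": "acme/dotfiles", "description": "my dotfiles", "stars": 12, "has_readme": True},
]
kept = [c for c in candidates if score_candidate(c) >= 3.0]
print([c["name"] for c in kept])  # -> ['acme/agent-kit']
```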

Architecture: The Karpathy Pattern

This project implements the full LLM Knowledge Base architecture, covering every stage from data ingest to outputs:

Data Ingest

Sources flow into raw/ through four channels:

  • Articles, papers, repos -- curated in raw/docs/SOURCES.md and raw/repos/SOURCES.md
  • Datasets -- structured JSON/CSV in raw/datasets/ (autoresearch scores, wave improvements)
  • Images -- architecture diagrams in raw/images/
  • Web clipper -- scripts/clip-article.md LLM runbook for ingesting new articles

LLM Engine

Four processing stages transform raw sources into compiled knowledge:

  • Compile (compile-wiki.md) -- raw sources → wiki articles
  • Q&A (research-qa.md) -- wiki-grounded research answers, filed to wiki/queries/
  • Linting (health-check.md) -- 10 checks: orphans, stale links, inconsistencies, missing data, suggested articles, connections
  • Indexing (build-index.md) -- summaries, backlinks, category tagging, INDEX.md regeneration

Extra Tools (CLI)

  • search-wiki.ts -- full-text search across wiki articles
  • wiki-stats.ts -- article count, word count, orphan detection, category breakdown
  • check-links.ts -- validate all internal markdown links
  • validate-agent.ts -- lint agent projects against AGENT_SPEC

Knowledge Store

85+ articles with YAML frontmatter (category, tags), auto-generated BACKLINKS.md, and cross-linked INDEX.md and GLOSSARY.md.
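The frontmatter described above can be read with a few lines of Python. This is a sketch for flat `key: value` frontmatter only, without a YAML library; real articles may use richer structures that need a proper parser.

```python
def parse_frontmatter(text: str) -> dict:
    """Extract simple key: value pairs between the leading --- fences."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of frontmatter block
        if ":" in line:
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip()
    return meta

article = """---
category: concepts
tags: context-engineering, memory
---
# Context Engineering
"""
print(parse_frontmatter(article))  # -> {'category': 'concepts', 'tags': 'context-engineering, memory'}
```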

Outputs

  • Markdown -- wiki articles, the primary output
  • Slides -- Marp-format decks in outputs/slides/ (quality overview, course summary, autoresearch results)
  • Charts -- Matplotlib visualizations in outputs/charts/ (radar, heatmap, bar, delta); charts feed back into the wiki

IDE Frontend

Obsidian vault config in .obsidian/ with graph view (color-coded by category), file explorer, and search. See VAULT.md for setup.

Future Explorations

Design docs in future/ for synthetic data generation, wiki fine-tuning, and a product vision beyond scripts.


Companion Projects

Skill Factory

Skill Factory is Agent Factory's sibling project. Together they cover the full stack of AI agent development:

| Project | Domain | Teaches |
| --- | --- | --- |
| Agent Factory | The autonomous system | Architecture, orchestration, tools, memory, deployment |
| Skill Factory | The capability layer | Writing SKILL.md files that augment agents |

Skills are markdown files loaded by agents at runtime. Agents are the autonomous systems that read and execute them. You need both: an agent without skills has limited capabilities, and skills without an agent have no runtime.

Factory Showcase

Factory Showcase is a testing companion with 20 agents and 20 skills created using both factories, then evaluated through a 5-cycle Karpathy loop + 7-wave autoresearch improvement loop (~100 iterations). Final scores: AGENT_SPEC mean 9.04/10, CLASSic mean 9.02/10. Contains grading reports, per-wave learning logs documenting what increases and decreases agent scores, and a comprehensive LEARNINGS.md that distills the empirical findings for anyone building agents.


What Makes a 9/10 Agent (Autoresearch Distillate)

These findings come from running a Karpathy-style autoresearch loop across 20 agents over ~100 iterations. Each wave targeted one quality dimension, measured the impact, and logged what worked. The full data is in Factory Showcase LEARNINGS.md.

Ranked by impact (score improvement per wave)

| Change | Dimension | Score Δ | Why it works |
| --- | --- | --- | --- |
| Observability from zero | Observability | +9.0 | Agents without tracing, SLOs, or cost tracking have 0/10 observability. Adding all three in one pass produces the largest jump. |
| Real state machines | Source Code | +5.0 | Typed enum states + transition tables replace stubs. Agents can't enter undefined states. Enables checkpoint/resume. |
| SECURITY.md + threat model | Safety | +4.0 | Domain-specific threats beat generic "security is important" notes. |
| 4 test types | Testing | +4.0 | Happy path + error recovery + adversarial + regression. Happy-path-only testing is the #1 testing anti-pattern. |
| Error taxonomy per tool | Tool Design | +3.0 | retryable flag on each error code lets the agent distinguish "retry" from "give up." |
| README + Mermaid + env matrix | Documentation | +3.0 | Architecture diagrams, env var tables, and honest known limitations. |
| HITL + refusal + memory strategy | System Prompt | +2.5 | Human approval gates with timeout behavior + explicit "when NOT to act" rules. |
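The "error taxonomy per tool" finding can be sketched as a small schema: each tool declares its error codes with a retryable flag, so the loop can decide between backing off and giving up. The tool name, error codes, and `should_retry` helper here are illustrative, not from the spec.

```python
# Hypothetical per-tool error taxonomy with retryable flags.
ERROR_TAXONOMY = {
    "search_web": {
        "RATE_LIMITED": {"retryable": True,  "backoff_s": 30},
        "TIMEOUT":      {"retryable": True,  "backoff_s": 5},
        "BAD_QUERY":    {"retryable": False},
    },
}

def should_retry(tool: str, code: str) -> bool:
    """Look up whether a tool error is worth retrying; unknown errors are not."""
    entry = ERROR_TAXONOMY.get(tool, {}).get(code)
    return bool(entry and entry.get("retryable"))

print(should_retry("search_web", "RATE_LIMITED"))  # True
print(should_retry("search_web", "BAD_QUERY"))     # False
```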

Anti-patterns that tank scores

  • NotImplementedError stubs — skeleton code scores 0/10 on source code
  • Happy-path-only tests — no evidence of failure handling = Testing < 5
  • Template slop — identical copy-paste across agents is detected and penalized
  • No circuit breakers — max_steps + max_wall_time_s + max_spend_usd are the essential trio
  • Generic SLOs — same latency target for a real-time responder and a batch planner shows no domain understanding
  • Missing error taxonomy — tools without error docs force agents to guess retry behavior
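The "essential trio" of circuit breakers from the list above can be sketched as a loop guard. The parameter names match the spec; the `run_agent` function and the stub step are invented for illustration.

```python
import time

def run_agent(step_fn, max_steps=50, max_wall_time_s=300, max_spend_usd=1.0):
    """Run step_fn until done or until a circuit breaker trips."""
    spend, start = 0.0, time.monotonic()
    for step in range(max_steps):
        if time.monotonic() - start > max_wall_time_s:
            return "halted: wall time"
        if spend > max_spend_usd:
            return "halted: budget"
        done, cost = step_fn(step)  # each step reports completion and cost
        spend += cost
        if done:
            return "done"
    return "halted: max steps"

# A stub step that finishes on the third iteration and costs 1 cent per step.
print(run_agent(lambda i: (i == 2, 0.01)))  # -> done
```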

The 9/10 agent checklist

If you read one thing before building an agent, check these boxes:

  • State machine with typed enum states and transition table
  • Circuit breakers: max_steps, max_wall_time_s, max_spend_usd
  • Per-tool error taxonomy with retryable flag
  • System prompt with refusal paths, HITL gates, memory strategy, cost awareness
  • 4 test scenarios: happy path, error recovery, adversarial, regression
  • SLOs with domain-specific numerical targets and alert rules
  • SECURITY.md with domain-specific threat model
  • README with Mermaid architecture diagram, env variable matrix, known limitations
  • Tracing with trace_id/span_id, cost tracking per request
  • Deploy config with Dockerfile, health check, required secrets list
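The first checklist item can be sketched as typed enum states plus a transition table, so the agent cannot enter an undefined state. The state names are illustrative; a real agent defines its own lifecycle.

```python
from enum import Enum, auto

class State(Enum):
    PLANNING = auto()
    ACTING = auto()
    REFLECTING = auto()
    DONE = auto()

# Transition table: which states each state may legally move to.
TRANSITIONS: dict[State, set[State]] = {
    State.PLANNING:   {State.ACTING},
    State.ACTING:     {State.REFLECTING, State.DONE},
    State.REFLECTING: {State.PLANNING, State.DONE},
    State.DONE:       set(),
}

def transition(current: State, nxt: State) -> State:
    """Move to nxt only if the table allows it; refuse undefined transitions."""
    if nxt not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.name} -> {nxt.name}")
    return nxt

s = transition(State.PLANNING, State.ACTING)
print(s.name)  # ACTING
```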

How this changes what Agent Factory produces

The agent-maker/SKILL.md now includes copy-pasteable templates for all of the above (error schema, env matrix, SECURITY.md outline, SLO table, state enum). The AGENT_SPEC.md now encodes these findings in its score anchors — a 10 on Testing requires four scenario classes, a 10 on Safety requires SECURITY.md, and a new "Autoresearch-grade" tier (≥ 9.0) exists for agents meeting all empirical quality drivers.

Course modules 03, 05, 06, 08, 11, 12, 17, 18, 19, 20, 22, and 23 now include empirical notes linking their topic to these ranked findings.


Contributing

Contributions welcome:

  • Add a source repo: Update raw/repos/SOURCES.md, recompile the wiki
  • Write a wiki article: Follow the patterns in existing concept/research articles
  • Add a curated example: Good or bad, with annotations explaining why
  • Improve the course: Each module should be self-contained with exercises
  • Improve the spec: AGENT_SPEC.md evolves as the ecosystem matures
  • Report a gap: Open an issue if the wiki is missing a pattern you've seen

License

MIT

About

LLM knowledge base for building production-quality AI agents. 23-module course, 80+ wiki articles, 40+ reference repos, quality spec, and meta-skills.
