A 14-agent AI research swarm. CLI tool invoked via /research "topic". Adapts Karpathy's autoresearch loop plus blitz-swarm's parallel execution.
```
research-swarm/
├── orchestrator.py    # Main entrypoint + phased execution + consensus
├── agents.py          # 14 agent definitions + role prompts
├── consensus.py       # Voting logic + keep/discard + actionability scoring
├── memory.py          # SQLite G-Memory (simplified, no Redis/LanceDB)
├── program.md         # The research program (Karpathy-style, human-editable)
├── research-log.tsv   # Persistent experiment log
├── config.toml        # Swarm configuration
├── output/            # Dated research briefs
└── memory.db          # SQLite G-Memory storage
```
Phase 1: SCOUT. All scouts blast simultaneously to gather raw signal.
| # | ID | Role | Model | Job |
|---|---|---|---|---|
| 1 | arxiv-scout | scout | haiku | Find recent papers on the topic from arXiv, conferences, preprints |
| 2 | impl-scout | scout | haiku | Find actual implementations: GitHub repos, HF models, blog posts with code |
| 3 | bench-scout | scout | haiku | Find benchmarks, comparisons, eval results related to the topic |
Output: Each scout returns { sources: [...], summary: "...", relevance_score: 0-1 }
Why haiku: Scouts do breadth, not depth. Fast and cheap. Their job is to find signal, not analyze it.
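For concreteness, a minimal sketch of the Phase 1 fan-out. The `run_agent(agent_id, prompt)` helper is an assumption (one call that invokes an agent and returns its parsed JSON); the real orchestrator.py may structure this differently.

```python
from concurrent.futures import ThreadPoolExecutor

SCOUTS = ["arxiv-scout", "impl-scout", "bench-scout"]

def run_scout_phase(topic, run_agent):
    """Fan out all scouts in parallel and collect their outputs.

    run_agent is assumed to return a dict like
    {"sources": [...], "summary": "...", "relevance_score": 0.8}.
    """
    with ThreadPoolExecutor(max_workers=len(SCOUTS)) as pool:
        futures = {pool.submit(run_agent, scout, topic): scout for scout in SCOUTS}
        return {name: future.result() for future, name in futures.items()}
```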
Phase 2: RESEARCH. All researchers blast simultaneously, with the scout findings injected as context.
| # | ID | Role | Model | Job |
|---|---|---|---|---|
| 4 | arch-researcher | researcher | sonnet | Agent architectures: planning, tool use, multi-agent coordination |
| 5 | memory-researcher | researcher | sonnet | Context management: RAG, hierarchical memory, compression |
| 6 | prompt-researcher | researcher | sonnet | Prompt engineering: system prompts, structured output, CoT |
| 7 | eval-researcher | researcher | sonnet | Evaluation: measuring agent quality, automated testing, benchmarks |
| 8 | infra-researcher | researcher | sonnet | Orchestration, parallel execution, cost optimization, caching |
Output: Each returns { findings: "...", key_points: [...], techniques: [...], confidence: 0-1 }
Context injection: Each researcher receives ALL scout outputs + their role-specific focus area. They don't see each other's output (round 1). In round 2+ (if consensus fails), they see everything.
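A rough sketch of what that injection could look like. The function name and argument shapes are assumptions, not the actual agents.py API; the 3000-char cap comes from the timeout-mitigation list later in this document.

```python
def build_researcher_context(scout_outputs, focus_area, peer_outputs=None, cap=3000):
    """Assemble the prompt context for one researcher.

    Round 1: scout findings only. Round 2+ (consensus failed): peer
    researcher outputs are appended too. Capped to keep prompts small.
    """
    parts = [f"[{name}] {out['summary']}" for name, out in scout_outputs.items()]
    if peer_outputs:  # round 2+: everyone sees everything
        parts += [f"[{name}] {out['findings']}" for name, out in peer_outputs.items()]
    context = "\n".join(parts)[:cap]
    return f"Focus area: {focus_area}\n\nFindings so far:\n{context}"
```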
Phase 3: APPLIED. Map the research findings to our actual codebase.
| # | ID | Role | Model | Job |
|---|---|---|---|---|
| 9 | codebase-auditor | applied | sonnet | Read current agent code, identify patterns and gaps |
| 10 | gap-analyst | applied | sonnet | Compare current approach to research findings, rank by impact |
| 11 | experiment-designer | applied | sonnet | Design concrete experiments to test promising techniques |
Output: Each returns { analysis: "...", recommendations: [...], priority: "high/med/low" }
Context injection: All researcher outputs + relevant file paths from our codebase. The --codebase flag points to the directory to audit (defaults to ~/Claude x LoveSpark/).
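One possible shape for the path-gathering step. Only the --codebase default comes from the text above; the recursive listing and file-type filter are guesses.

```python
from pathlib import Path

def collect_codebase_paths(codebase=None, max_files=50):
    """List source/doc files under the audited codebase for context injection."""
    root = Path(codebase) if codebase else Path.home() / "Claude x LoveSpark"
    files = [p for p in sorted(root.rglob("*")) if p.suffix in {".py", ".md", ".toml"}]
    return [str(p.relative_to(root)) for p in files[:max_files]]
```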
Phase 4: QUALITY GATE. The critic runs first; the judge then sees the critic's assessment.
| # | ID | Role | Model | Job |
|---|---|---|---|---|
| 12 | critic | quality | sonnet | Challenge hype, check reproducibility, flag weak claims |
| 13 | judge | quality | opus | Score on coverage/accuracy/actionability, cast final vote |
Output:
- Critic: { issues: [...], confidence: 0-1, verdict: "pass/concerns/fail" }
- Judge: { actionability_score: 0-10, quality_vote: "keep/discard", notes: "..." }
The keep/discard decision: If judge votes "discard", the run is logged but no research brief is produced. This is the Karpathy git reset equivalent — we don't advance on bad research.
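A minimal sketch of that gate as it might live in consensus.py. The >= 6 threshold comes from the Quality Bar in program.md; the function name is an assumption.

```python
def quality_gate(judge_output, min_actionability=6):
    """Return True (keep) only if the judge votes keep AND the brief clears the actionability bar."""
    return (
        judge_output["quality_vote"] == "keep"
        and judge_output["actionability_score"] >= min_actionability
    )
```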
Phase 5: SYNTHESIS. Only runs if the judge voted "keep".
| # | ID | Role | Model | Job |
|---|---|---|---|---|
| 14 | synthesizer | output | opus | Produce final research brief with actionable recommendations |
Output: Structured markdown research brief saved to output/.
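The dated filename visible in the research-log example further down suggests naming along these lines; this is a guess at the pattern, not the actual code in orchestrator.py.

```python
import re
from datetime import date
from pathlib import Path

def brief_path(topic, out_dir="output"):
    """Slugify the topic and date-stamp the brief, e.g. output/multi_agent_consensus_20260314.md."""
    slug = re.sub(r"[^a-z0-9]+", "_", topic.lower()).strip("_")
    return Path(out_dir) / f"{slug}_{date.today():%Y%m%d}.md"
```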
/research "how to make multi-agent consensus converge faster"
Phase 1: SCOUT ──────────────────────────── ~30s
├── arxiv-scout ─┐
├── impl-scout ─┤ parallel (haiku)
└── bench-scout ─┘
Phase 2: RESEARCH ───────────────────────── ~60-120s
├── arch-researcher ─┐
├── memory-researcher ─┤
├── prompt-researcher ─┤ parallel (sonnet)
├── eval-researcher ─┤
└── infra-researcher ─┘
Phase 3: APPLIED ────────────────────────── ~60-90s
├── codebase-auditor ─┐
├── gap-analyst ─┤ parallel (sonnet)
└── experiment-designer ─┘
Phase 4: QUALITY GATE ──────────────────── ~30-60s
├── critic → (sequential)
└── judge → keep/discard decision
Phase 5: SYNTHESIS (if kept) ────────────── ~60s
└── synthesizer → research brief
Total: ~4-6 minutes for a full research run
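A rough sketch of the phased main loop this flow implies. Here `run_phase` and `quality_gate` are assumed callables passed in, not the real orchestrator.py API.

```python
PHASES = ("scout", "research", "applied", "quality", "synthesis")

def run(topic, run_phase, quality_gate, phases=PHASES):
    """Run phases in order; skip synthesis if the judge voted discard."""
    results, keep = {}, True
    for phase in phases:
        if phase == "synthesis" and not keep:
            break  # discard: the run is still logged, but no brief is produced
        results[phase] = run_phase(phase, topic, context=results)
        if phase == "quality":
            keep = quality_gate(results["quality"]["judge"])
    return results, keep
```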
```bash
# Basic research run
python3 orchestrator.py "multi-agent consensus convergence"

# Dry run — show agent plan without executing
python3 orchestrator.py "topic" --dry-run

# Limit to specific phases
python3 orchestrator.py "topic" --phases scout,research

# Point to specific codebase for applied phase
python3 orchestrator.py "topic" --codebase ~/Claude\ x\ LoveSpark/blitz-swarm/

# Skip quality gate (keep all findings)
python3 orchestrator.py "topic" --no-gate

# Verbose logging
python3 orchestrator.py "topic" --verbose
```

The program.md file defines the research program. The human edits this to steer the swarm's research focus. This is the equivalent of Karpathy's program.md — you're not editing Python, you're programming the agents via markdown.
```markdown
# Research Program

## Focus Areas
- Multi-agent coordination and consensus
- Memory systems for agent continuity
- Prompt optimization for structured output
- Tool use patterns and MCP integration
- Cost optimization (token efficiency)

## Current Priorities
1. Make blitz-swarm consensus converge in fewer rounds
2. Reduce agent timeout issues in subprocess invocation
3. Improve G-Memory retrieval accuracy

## Codebase Context
- Primary: ~/Claude x LoveSpark/blitz-swarm/
- Secondary: ~/Claude x LoveSpark/AGENT-SWARM.md
- Skills: ~/.claude/commands/

## Quality Bar
- Actionability score >= 6 to keep
- At least 2 concrete code changes recommended
- Techniques must be implementable without new dependencies
```

research-log.tsv is the persistent log of every research run. Tab-separated.
```
date topic agents phases actionability status output_file key_finding
2026-03-14 multi-agent consensus 14 all 8 keep output/multi_agent_consensus_20260314.md Async voting with timeout reduces rounds by 40%
2026-03-14 prompt optimization 14 all 4 discard - No actionable findings for our stack
```
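Appending a row could be as simple as the sketch below; the column order is taken from the header above, but the helper itself is hypothetical.

```python
import csv
from datetime import date

def log_run(path, topic, agents, phases, actionability, status, output_file, key_finding):
    """Append one tab-separated row to research-log.tsv."""
    with open(path, "a", newline="") as f:
        csv.writer(f, delimiter="\t").writerow([
            date.today().isoformat(), topic, agents, phases,
            actionability, status, output_file, key_finding,
        ])
```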
G-Memory (memory.py) is a stripped-down version of blitz-swarm's memory: SQLite only, no Redis, no LanceDB.
Tables:
- `research_runs` — id, topic, date, actionability, status, insight
- `techniques` — id, name, source, domain, applicable_to, tried, worked
- `insights` — id, content, supporting_runs, access_count
Retrieval: On each new run, query insights and techniques tables for entries matching the topic. Inject as historical context into researcher prompts.
Compounding: After each "keep" run, extract one-sentence insight and store it. After 10+ runs, the swarm has accumulated domain knowledge that makes it progressively better.
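A hedged sketch of the schema and retrieval query, assuming the three tables listed above; the column types and the naive keyword LIKE match are guesses, not memory.py's actual logic.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS research_runs (id INTEGER PRIMARY KEY, topic TEXT,
    date TEXT, actionability INTEGER, status TEXT, insight TEXT);
CREATE TABLE IF NOT EXISTS techniques (id INTEGER PRIMARY KEY, name TEXT,
    source TEXT, domain TEXT, applicable_to TEXT, tried INTEGER, worked INTEGER);
CREATE TABLE IF NOT EXISTS insights (id INTEGER PRIMARY KEY, content TEXT,
    supporting_runs TEXT, access_count INTEGER DEFAULT 0);
"""

def retrieve_context(db_path, topic, limit=5):
    """Pull prior insights/techniques matching the topic for prompt injection."""
    con = sqlite3.connect(db_path)
    con.executescript(SCHEMA)
    like = f"%{topic.split()[0]}%"  # naive first-keyword match
    insights = con.execute(
        "SELECT content FROM insights WHERE content LIKE ? LIMIT ?", (like, limit)
    ).fetchall()
    techniques = con.execute(
        "SELECT name, applicable_to FROM techniques WHERE domain LIKE ? LIMIT ?",
        (like, limit),
    ).fetchall()
    con.close()
    return {"insights": [row[0] for row in insights], "techniques": techniques}
```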
Blitz-swarm's #1 issue: researcher agents time out. Fixes (one invocation is sketched after this list):
- Narrow prompts — each agent has ONE focused job, not "research everything about X"
- Haiku for scouts — fast model for breadth tasks
- Per-phase timeouts — 120s for scouts, 180s for researchers, 120s for applied/quality
- Prompt size cap — context injection is capped at 3000 chars per agent
- JSON output mode — `--output-format json` forces structured, concise responses
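Putting the last three mitigations together, a single agent invocation might look like this. It assumes each agent runs as a `claude -p ... --output-format json` subprocess, which the JSON output mode bullet implies; exact flags and the response envelope may differ.

```python
import json
import subprocess

TIMEOUTS = {"scout": 120, "research": 180, "applied": 120, "quality": 120}
PROMPT_CAP = 3000  # chars of injected context per agent

def invoke_agent(role_prompt, context, phase, model="haiku"):
    """Run one agent as a subprocess with a per-phase timeout and capped context."""
    prompt = f"{role_prompt}\n\n{context[:PROMPT_CAP]}"
    result = subprocess.run(
        ["claude", "-p", prompt, "--model", model, "--output-format", "json"],
        capture_output=True, text=True, timeout=TIMEOUTS[phase],
    )  # subprocess.TimeoutExpired propagates to the orchestrator on timeout
    return json.loads(result.stdout)
```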
`~/.claude/commands/research.md` — invokes the orchestrator.

```
/research "topic"
/research "topic" --dry-run
/research "topic" --phases scout,research
```
- `orchestrator.py` — main entrypoint, phased execution, CLI args
- `agents.py` — 14 agent definitions with role prompts
- `consensus.py` — quality gate voting, keep/discard logic
- `memory.py` — SQLite G-Memory (simplified)
- `program.md` — initial research program
- `config.toml` — default configuration
- `research-log.tsv` — header row
- `~/.claude/commands/research.md` — slash command skill
- Test: dry run
- Test: full run on a narrow topic
- Test: G-Memory persistence across runs