Research Swarm

A multi-agent research pipeline for investigating agentic system design through automated literature review, adversarial evaluation, and ablation testing.

Abstract

Research Swarm is an 8-agent pipeline that automates AI systems research from a single terminal command. Given a topic, it deploys scouts with live web search, domain researchers, an applied analyst, an adversarial critic, a judge, and a synthesizer — producing an actionable research brief or discarding the output if quality thresholds are not met. The system has been iteratively developed over 40 instrumented runs, with each architectural change driven by measured deficiencies in prior runs. The central finding so far: 8 well-scoped agents match or exceed the output quality of 14 agents (avg quality 6.5 vs 6.0) while producing higher actionability scores (6.2 vs 6.1), primarily by eliminating redundant researcher overlap.

Research Motivation

Multi-agent LLM systems face a practical design tension: more agents can cover more ground, but overlapping scope between agents degrades output quality through redundancy, contradictory recommendations, and "integrative compromise" — where synthesis agents average out distinct findings into generic advice. This project investigates that tension empirically by building a working research pipeline and measuring the effects of architectural changes across runs.

The broader question: can a solo developer build a disciplined, self-improving multi-agent system that produces research output of measurable and improving quality?

Research Questions

RQ1: Does reducing overlapping agent scope improve output quality and actionability? (Tested: yes — see Ablations)
RQ2: Does adversarial critique (critic + judge gate) improve factual grounding? (Tested: factuality improved from 4.0 to 5.3 after adding critic; see run data)
RQ3: What is the quality/cost tradeoff between different agent counts? (Tested via 12-run ablation across 4 configurations)
RQ4: Does cross-run memory (G-Memory) improve research quality over time? (Partially tested: 448 techniques accumulated across 40 runs, but causal effect on quality not yet isolated)
RQ5: Can semantic deduplication reduce technique overlap without losing coverage? (Tested: overlap reduced from 76% to 38% with dedup threshold 0.39)

Current Hypotheses

Scope boundaries matter more than agent count. Explicit OUT_OF_SCOPE declarations in agent prompts reduce redundancy more effectively than removing agents entirely. (Supported by ablation data — 8 well-scoped agents outperform 14 without scope boundaries.)
Adversarial evaluation improves grounding. A dedicated critic that challenges findings before a judge scores them produces more factually grounded output than self-assessment. (Partially supported — factuality rose from 4.0 to 5.3, but sample size is small and evaluator is an LLM.)
Sequential phasing outperforms parallel blast for research tasks. Scout → Research → Applied → Quality → Synthesis respects real information dependencies that parallel execution ignores. (Design bet, not yet tested against parallel alternative in this codebase.)
Narrative context injection outperforms structured data passing. Passing scout findings as natural language narratives to researchers produces better analysis than raw JSON. (Observed qualitatively; not yet measured rigorously.)

System Overview

flowchart LR
    subgraph Scout["Phase 1: Scout (Haiku, Parallel)"]
        S1[arxiv-scout]
        S2[impl-scout]
    end
    subgraph Research["Phase 2: Research (Sonnet, Parallel)"]
        R1[arch-researcher]
        R2[prompt-researcher]
    end
    subgraph Applied["Phase 3: Applied (Sonnet)"]
        A1[experiment-designer]
    end
    subgraph Gate["Phase 4: Quality Gate (Sonnet + Opus)"]
        C[critic]
        J[judge]
    end
    subgraph Synth["Phase 5: Synthesis (Opus)"]
        SY[synthesizer]
    end

    Scout --> Research --> Applied --> C --> J --> SY
    SY --> Output[(research brief)]
    J -->|DISCARD| Stop([discarded])

Architecture

Pipeline: 5 sequential phases with parallel execution within phases where dependencies allow.

Phase	Agents	Model	Role
Scout (2)	arxiv-scout, impl-scout	Haiku	Web search via DuckDuckGo MCP server; find papers, repos, benchmarks
Research (2)	arch-researcher, prompt-researcher	Sonnet	Analyze findings through domain-specific lenses
Applied (1)	experiment-designer	Sonnet	Map research to concrete code changes and experiments
Quality Gate (2)	critic, judge	Sonnet, Opus	Adversarial verification; score and keep/discard
Synthesis (1)	synthesizer	Opus	Produce final research brief from kept findings

Model tiering rationale: Haiku for fast, broad scouting (low reasoning demand). Sonnet for focused analysis (good reasoning, moderate cost). Opus for high-stakes judgment and final synthesis (highest reasoning quality).

Agent invocation: Each agent runs as a stateless claude -p subprocess. The orchestrator owns all state, passing context between phases as narrative text. No framework dependencies (no LangGraph, no CrewAI).

Key Mechanisms

Mechanism	Status	Description
Semantic deduplication	Implemented	Jaccard + bigram similarity with complete-linkage clustering (threshold 0.39)
Priority-aware compression	Implemented	Heuristic section scoring; high-priority sections preserved, low-priority compressed
Evidence labeling	Implemented	Every technique tagged: peer_reviewed, preprint, repo, unverified
Adversarial quality gate	Implemented	Critic challenges findings; judge scores coverage/accuracy/actionability/factuality
G-Memory	Implemented	SQLite with exponential decay; techniques and insights persist across runs
Web search (MCP)	Implemented	DuckDuckGo via custom MCP server; scouts ground citations in real sources
Capability registry	Implemented	Each agent declares coverage and non-coverage to prevent overlap
Narrative context injection	Implemented	Scout findings passed as prose, not raw JSON
Pre-dispatch sufficiency gating	Proposed	Block redundant agent spawns before token expenditure
Embedding-based dedup	Proposed	Replace Jaccard with vector similarity for cross-lingual matching

Evaluation

Metrics

All metrics are scored by the judge agent (Claude Opus) on a 0-10 scale per run:

Metric	Definition	Primary?
Actionability	Can these findings be implemented within a week by a solo developer? Requires specific code changes, not vague advice.	Yes
Coverage	Does the research cover the topic's major dimensions?	No
Accuracy	Are the technical claims correct?	No
Factuality	Are findings grounded in real, verifiable sources?	No

Primary metric: Actionability (>= 6 to keep). This reflects the system's purpose — producing implementable research, not literature surveys.

Quality gate logic: If actionability < 6, the run is discarded. If critic verdict is "fail," the run is discarded regardless of judge scores. All discarded runs are logged with reasons.

What these metrics do NOT capture: Inter-rater reliability (single LLM evaluator), real-world implementation success (no follow-up measurement), novelty of findings, cost-effectiveness relative to manual research.

Aggregate Results (40 runs)

Metric	Kept Runs (n=30)	All Runs (n=40)
Avg Quality	6.5	—
Avg Actionability	6.7	—
Avg Factuality	5.3	—
Avg Coverage	7.1	—
Avg Accuracy	6.7	—
Keep Rate	75% (30/40)	—
Avg Overlap (post-dedup)	0.37	—
Avg Wall Clock	620s	—
Techniques Accumulated	448	—
Insights Accumulated	31	—

Ablation: Agent Count (12 runs, 3 topics, 4 configurations)

Tested across three topics: RAG hallucination reduction, fine-tuning methods, code generation evaluation.

Config	Agent Invocations	Avg Quality	Avg Actionability	Avg Factuality	Verdict
full (14 agents)	11	6.0	6.1	4.4	baseline
lean (10 agents)	7	5.7	4.7	5.3	quality drop
minimal (8 agents)	5	6.5	6.2	5.9	adopted
skeleton (7 agents)	4	6.1	6.3	5.0	factuality risk

Interpretation: The 8-agent configuration outperformed the 14-agent baseline on all metrics. The 5 removed researchers had overlapping domain coverage; consolidating them into 2 well-scoped researchers eliminated redundancy without losing coverage. The 7-agent skeleton showed a factuality dip, suggesting the quality gate (critic + judge) is load-bearing.

Caveats: 3 topics is a small sample. All scoring is by a single LLM evaluator. The "quality" metric is a composite that may mask dimension-specific regressions.

Discard Analysis (6 runs)

Run	Topic	Actionability	Reason
20260314_044439	Automated evaluation frameworks	6.0	Quality 4.5; coverage/accuracy too low
20260314_065056	Improving factuality	0.0	Catastrophic failure — agents returned empty output
20260314_084357	RAG hallucination reduction	3.0	Accurate but not actionable for target codebase
20260314_085129	Fine-tuning methods	3.0	Topic mismatched with browser extension codebase
20260314_091940	Fine-tuning methods (10-agent)	2.0	Same mismatch, worse with fewer agents
20260314_094815	Fine-tuning methods (8-agent)	4.0	Marginal improvement but still below threshold

Pattern: Topic-codebase mismatch is the primary discard cause (3/6). The system correctly identifies and rejects research that cannot be applied to its target codebase. Catastrophic failures (empty output) occurred once and were not reproducible.

Failure Modes and Limitations

Single LLM evaluator. All quality scores come from one Claude model. No inter-rater reliability, no human evaluation baseline. Scores may reflect model preferences rather than objective quality.
Small sample sizes. 40 total runs, 12 ablation runs. Insufficient for statistical significance testing. All findings should be treated as observations, not conclusions.
Factuality plateau. Factuality scores plateaued around 5.3 despite multiple interventions (evidence labeling, critic, web search). The root cause is unclear — possibly a ceiling on what DuckDuckGo searches can verify.
Subprocess timeouts. Applied agents occasionally timeout at 200s when processing large context. Partial recovery exists but loses information.
Cost tracking gap. API costs are not tracked per run (estimated_cost_units = 0 in all metrics entries). Cost analysis requires external billing data.
No human validation. "Actionability" is scored by an LLM, not verified by implementing the recommendations.
Memory compounding unvalidated. G-Memory accumulates 448 techniques across runs, but the causal effect on output quality has not been isolated (no runs with empty vs. populated memory compared).
Dedup threshold sensitivity. Threshold 0.50 caused overlap=1.0 on 1/3 topics (all techniques merged into one cluster). Reverted to 0.39. The optimal threshold likely varies by topic.

Reproducibility

Requirements

Python 3.11+
Claude Code CLI installed and authenticated
pip install ddgs mcp httpx

Run a research topic

python3 orchestrator.py "multi-agent consensus convergence 2025"

Dry run (preview agent plan)

python3 orchestrator.py "your topic" --dry-run

Point at a specific codebase

python3 orchestrator.py "your topic" --codebase ~/path/to/your/project

Run specific phases

python3 orchestrator.py "your topic" --phases scout,research

Skip quality gate

python3 orchestrator.py "your topic" --no-gate

Run ablation experiment

python3 ablation.py                      # all 4 configs x 3 topics
python3 ablation.py --configs full,lean   # specific configs only

View metrics dashboard

python3 dashboard.py    # generates dashboard.html

Known Sources of Nondeterminism

LLM output varies between runs (temperature > 0)
Web search results change over time (DuckDuckGo index)
G-Memory state differs between runs (cumulative)
Subprocess timing affects timeout/success boundaries

Environment Assumptions

macOS or Linux (subprocess invocation via claude -p)
Internet access required for scout phase (web search)
No GPU required
Runs take 7-13 minutes per topic depending on timeout behavior

Repo Structure

research-swarm/
├── orchestrator.py       # Main entrypoint — 5-phase pipeline, CLI
├── agents.py             # 14 agent definitions, role prompts, output schemas
├── search_server.py      # MCP web search server (DuckDuckGo + page fetch)
├── consensus.py          # Quality gate — critic eval, judge eval, keep/discard
├── context.py            # Priority-aware context compression
├── dedup.py              # Semantic technique deduplication (Jaccard + bigram)
├── memory.py             # SQLite G-Memory — techniques, insights, decay
├── metrics.py            # Per-run instrumentation, regression detection
├── dashboard.py          # HTML dashboard generator (Chart.js)
├── ablation.py           # Agent-count ablation experiment runner
├── config.toml           # Model assignments, timeouts, roster, thresholds
├── program.md            # Research program definition (steers topic selection)
├── research.md           # Design decisions and source analysis
├── research-log.tsv      # Persistent log of all 40 runs
├── metrics.jsonl         # Machine-readable per-run metrics (40 entries)
├── dashboard.html        # Generated metrics dashboard
├── docs/                 # Research documentation
│   ├── RESEARCH_QUESTIONS.md
│   ├── METHODOLOGY.md
│   ├── EVALUATION.md
│   ├── ABLATIONS.md
│   ├── LIMITATIONS.md
│   ├── ROADMAP.md
│   ├── ARTIFACTS.md
│   ├── RESEARCH_LOG.md
│   └── CLAIMS_AND_EVIDENCE.md
├── results/
│   └── RESULTS.md        # Structured run results table
├── output/               # Generated research briefs (gitignored)
└── memory.db             # SQLite G-Memory storage (gitignored)

Evolution

This system has been iteratively developed over 40 instrumented runs and 16 commits. Each change was motivated by measured deficiencies:

Change	Commit	Motivation	Measured Effect
Initial 14-agent system	f00dbb9	Broad coverage design	Baseline: quality 5.9, actionability 4.3
JSON compliance (PARSE schemas, retry)	26ae486	Parse failures in agent output	Parse failure rate reduced (qualitative)
Context injection (narrative casting)	296813e	Raw JSON confused researchers	Qualitative improvement in researcher output
Role differentiation (OUT_OF_SCOPE)	94c6334	Researcher overlap	Reduced redundancy (qualitative)
Metrics system	e252b66	No way to measure changes	All subsequent changes measurable
Timeout handling	1227f72	Applied agents timing out	Partial recovery, reduced data loss
Adversarial critic + confidence gate	088764f	Hallucinated citations	Factuality 4.0 → 5.0 (measured)
Applied agent quality (CoT, confidence)	a699f52	Low applied phase success rate	Applied success 2/3 → 3/3 (measured)
Semantic deduplication	779f26b	76% technique overlap	Overlap reduced to 38% (measured)
Web search (MCP)	f742f39	Ungrounded citations	Factuality 4.0 → 6.0 (measured)
Agent count ablation (14 → 8)	f4149cc	Suspected researcher redundancy	Quality maintained, actionability +1.0 (measured)
Dedup threshold tuning	71df964, 27f0358	Threshold 0.50 over-merged	Reverted to 0.39 after overlap=1.0 on 1/3 topics

Citation

@software{tyrninoksa2026researchswarm,
  author = {Tyrninoksa, Joona},
  title = {Research Swarm: Multi-Agent Research Pipeline with Ablation-Tested Architecture},
  year = {2026},
  url = {https://github.com/Joona-t/research-swarm},
  license = {MIT}
}

This is an independent research artifact by a solo developer. It is not affiliated with any institution or research lab.

Related Projects

Blitz-Swarm — Parallel multi-agent consensus architecture (predecessor; Research Swarm was forked from this design)
Swarm Builder — 15-agent CLI skill for browser extension development (applies Research Swarm findings)

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 20 Commits
docs		docs
results		results
.gitignore		.gitignore
BUGS_AND_ITERATIONS.md		BUGS_AND_ITERATIONS.md
CITATION.cff		CITATION.cff
LICENSE		LICENSE
README.md		README.md
ablation.py		ablation.py
agents.py		agents.py
config.toml		config.toml
consensus.py		consensus.py
context.py		context.py
dashboard.html		dashboard.html
dashboard.py		dashboard.py
dedup.py		dedup.py
memory.py		memory.py
metrics.jsonl		metrics.jsonl
metrics.py		metrics.py
orchestrator.py		orchestrator.py
plan.md		plan.md
program.md		program.md
research-command.md		research-command.md
research-log.tsv		research-log.tsv
research.md		research.md
search_server.py		search_server.py

Folders and files

Latest commit

History

Repository files navigation

Research Swarm

Abstract

Research Motivation

Research Questions

Current Hypotheses

System Overview

Architecture

Key Mechanisms

Evaluation

Metrics

Aggregate Results (40 runs)

Ablation: Agent Count (12 runs, 3 topics, 4 configurations)

Discard Analysis (6 runs)

Failure Modes and Limitations

Reproducibility

Requirements

Run a research topic

Dry run (preview agent plan)

Point at a specific codebase

Run specific phases

Skip quality gate

Run ablation experiment

View metrics dashboard

Known Sources of Nondeterminism

Environment Assumptions

Repo Structure

Evolution

Citation

Related Projects

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages