A sophisticated AI-powered research system that generates comprehensive, evidence-based reports by orchestrating multiple specialized agents to gather, analyze, and synthesize information from web sources.
- Features
- Setup Instructions
- Architecture
- How I Built This
- What Makes a Great Research Report?
- Stage 1: Clearing Up the Mist (Clarification Agent)
- Stage 2: Generating Hypotheses (Hypothesis Agent)
- Stage 3: Making Hypotheses Searchable (Questions Agent)
- Stage 4: Parallel Search with Exa
- Stage 5: Filtering the Evidence (Critic Agent)
- Stage 6: Putting It All Together (Synthesizer Agent)
- The Full Pipeline
## Features

- Clarification Agent: Refines research questions through interactive dialogue
- Hypothesis Agent: Generates testable hypotheses and search-ready keywords
- Questions Agent: Creates diverse search queries (evidence, counter-evidence, keyword-based)
- Supervisor Agent: Creates search tasks with depth assignments
- Search Worker Agents: Execute parallel web searches using Exa API
- Critic Agent: Evaluates evidence quality and relevance
- Synthesizer Agent: Generates comprehensive research reports
- Parallel Web Search: Multiple search workers gather evidence simultaneously
- Evidence Quality Assessment: AI-powered filtering of relevant information
- Comprehensive Analysis: 2000-3500+ word reports with detailed sections
- Inline Citations: Automatic citation tracking with hyperlinked references
- Structured Output: Professional markdown reports with comparison tables
- Evidence-Based Analysis: All claims supported by cited sources
- Quantitative Comparisons: Detailed metrics and performance data
- Risk Assessment: Comprehensive evaluation of options and trade-offs
- Strategic Recommendations: Clear conclusions with confidence levels
- Professional Formatting: Clean markdown output with proper structure
## Setup Instructions

Prerequisites:

- Python 3.8+
- uv package manager
1. Install dependencies:

   ```bash
   uv sync
   ```

2. Set up environment variables. Create a `.env` file in the project root:

   ```
   # Required API Keys
   OPEN_AI_API_KEY=your_openai_api_key_here
   EXA_API_KEY=your_exa_api_key_here
   ```
Run the main application for an interactive experience:

```bash
uv run src/main.py
```

Test the system with a predefined example:

```bash
echo -e "2\n1\nD\n3" | uv run src/main.py
```

- Question Clarification: The system asks clarifying questions to understand your research intent
- Evidence Gathering: Multiple search workers gather information from web sources
- Quality Assessment: The critic agent filters and ranks evidence by relevance
- Report Generation: The synthesizer creates a comprehensive analysis
- Output: Professional markdown report saved to the `reports/` directory
## Architecture

```mermaid
graph TD
    A[User Input] --> B[Clarification Agent]
    B --> C[Hypothesis Agent]
    C --> D[Questions Agent]
    D --> E[Supervisor Agent]
    E --> F
    subgraph F[Search Node]
        F1[Search Worker 1]
        F2[Search Worker 2]
        F3[Search Worker N]
    end
    F --> G[Critic Agent]
    G --> H[Synthesizer Agent]
    H --> I[Research Report]
```
## How I Built This

I built this system in 2024 after being impressed by OpenAI and Gemini's Deep Research feature. I wanted to understand what makes those tools work and see if I could replicate the core ideas myself.
## What Makes a Great Research Report?

I started by asking ChatGPT: "What makes a research report actually useful?" We landed on a few key insights:
1. **The question matters more than the answer.** A vague question produces a vague report. The best research starts with a precise, falsifiable question tied to a real decision.

2. **Good hypotheses drive good research.** You can't just "search for information." You need specific claims to test, or you'll drown in tangential results.

3. **You need both depth AND breadth.** Deep dives on your hypotheses, plus broad exploration to catch things you didn't think to ask about.

4. **Evidence needs filtering.** Not everything you find is relevant or credible. Someone needs to play critic.

5. **Synthesis is where the magic happens.** Raw evidence isn't a report. You need to weave it into a coherent analysis with clear recommendations.
These principles shaped the entire architecture.
## Stage 1: Clearing Up the Mist (Clarification Agent)

The biggest insight was that the key to a good research report is asking the right question. Most users start with something vague like "Should Arsenal sign Eze?" But what do they actually mean?
So the first step is a quick clarification dialogue. The agent:
- Restates your goal in one line
- Offers 3-4 distinct interpretations (A/B/C/D): maybe you meant "Eze vs other options," or "Eze specifically vs Rodrygo," or "whether Eze fits Arsenal's system at all"
- Asks 3-5 quick questions: What decision are you making? What's your time horizon? What would change your mind?
Then it synthesizes your answers into a single falsifiable, decision-linked research question: something empirically checkable, not just an opinion piece.
```
Raw: "Should Arsenal sign Eze?"

Clarified: "Given Arsenal's current midfield composition and reported £60M budget,
would signing Eberechi Eze provide better goal contribution per 90 minutes than
alternative targets in the 2025-26 Premier League season?"
```
The prompt explicitly tells the model to be "neutral and non-leading"; we don't want it smuggling recommendations into the question itself.
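As a rough sketch of how such a prompt could be assembled (the function name and exact wording are my own illustration, not the repo's actual prompt text):

```python
def build_clarification_prompt(raw_question: str) -> str:
    # Hypothetical helper: the real agent's prompt lives in the project;
    # this only mirrors the behavior described above.
    return (
        "You are a neutral and non-leading research clarifier.\n"
        "1. Restate the user's goal in one line.\n"
        "2. Offer 3-4 distinct interpretations, labeled A/B/C/D.\n"
        "3. Ask 3-5 quick questions: what decision is being made, over what\n"
        "   time horizon, and what evidence would change the user's mind.\n"
        "Do not recommend an answer; only sharpen the question.\n\n"
        f"User question: {raw_question}"
    )

print(build_clarification_prompt("Should Arsenal sign Eze?"))
```

The model's answers to these questions are then folded back into the single clarified research question shown above.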
## Stage 2: Generating Hypotheses (Hypothesis Agent)

This is the heart of the system. A good research report is built on good hypotheses.
Once we have a clear question, the Hypothesis Agent generates 2-5 crucial, researchable-now hypotheses: the "hinge points" that matter most for the decision. If you can answer these sub-questions, you can answer the main question.
The key constraints I built into the prompt:
- Researchable-now: Can be validated with existing evidence (stats, reports, historical data), not speculative "will X happen in 5 years"
- Falsifiable: Something we could actually disprove if the evidence goes the other way
- Crucial: Answers that would significantly change the conclusion, not marginal details
Each hypothesis includes indicators β observable signals to look for in the evidence. For example:
```
Hypothesis: "Eze's goals+assists per 90 exceeds Rodrygo's in comparable league contexts"
Indicators: ["FBRef per-90 stats", "minutes played in similar roles", "league difficulty adjustment"]
```
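The shape of that structured output can be sketched as a small dataclass (field names here are my guess at the schema, not the project's actual model):

```python
from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    claim: str                                       # falsifiable statement to test
    indicators: list[str] = field(default_factory=list)  # observable signals in evidence

h = Hypothesis(
    claim="Eze's goals+assists per 90 exceeds Rodrygo's in comparable league contexts",
    indicators=["FBRef per-90 stats", "minutes played in similar roles",
                "league difficulty adjustment"],
)
```

Downstream agents can then iterate over `h.indicators` when deciding what each search query should look for.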
## Stage 3: Making Hypotheses Searchable (Questions Agent)

Here's where I realized you need both depth and breadth.
- Depth: Targeted queries to test each hypothesis, finding the specific stats and direct comparisons
- Breadth: Exploratory queries using keywords to catch context you didn't think to ask about
The Questions Agent generates ~15 diverse search queries with a specific distribution:
| Type | ~% | Purpose |
|---|---|---|
| Evidence | 40% | Find supporting evidence for hypotheses |
| Counter-Evidence | 30% | Actively seek criticism, failures, opposing views |
| Keyword-Based | 30% | Explore broader context using extracted keywords |
The counter-evidence queries are crucial: they prevent the system from just confirming whatever the user already believes. I explicitly prompt it to generate queries like "Eze weaknesses" or "why Rodrygo might be overrated."
The Hypothesis Agent also generates 12-20 query-ready keywords β entity names, metrics, years, comparison terms. These seed the breadth queries. I added specific guidance to include:
- Disambiguators (league, season, version)
- Qualifiers (years like "2024", doc types like "analysis")
- Comparison terms ("vs", "compare")
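The 40/30/30 query mix can be sketched as a simple allocation (a toy version, with Python's `round()` and the remainder going to the keyword bucket; the real agent has the LLM generate the queries themselves):

```python
def allocate_queries(total: int = 15) -> dict:
    # Approximate the 40% evidence / 30% counter-evidence / 30% keyword split,
    # giving any rounding remainder to the keyword (breadth) bucket.
    n_evidence = round(total * 0.40)
    n_counter = round(total * 0.30)
    n_keyword = total - n_evidence - n_counter
    return {"evidence": n_evidence,
            "counter_evidence": n_counter,
            "keyword": n_keyword}

print(allocate_queries())
```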
## Stage 4: Parallel Search with Exa

For the actual searching, I use Exa's API, which has a killer feature: built-in LLM summarization.
When you search, you can pass a `summary` parameter with a custom prompt, and Exa returns AI-generated summaries of each page tailored to your question. So instead of getting raw web pages and having to process them myself, I get focused summaries like:
"This FBRef page shows Eze recorded 11 goals and 6 assists in 2023-24, with 0.48 G+A per 90. His xG overperformance of +2.3 suggests some finishing luck..."
I wrote a custom summary prompt that includes the research question and hypotheses, asking Exa to "extract 3-5 key facts that support or contradict the hypotheses."
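A sketch of how that summary prompt might be composed (my paraphrase of the wording; the resulting string would be handed to the Exa client's summary option):

```python
def build_summary_prompt(question: str, hypotheses: list[str]) -> str:
    # Illustrative only: the project's actual prompt wording differs.
    hyp_lines = "\n".join(f"- {h}" for h in hypotheses)
    return (
        f"Research question: {question}\n"
        f"Hypotheses under test:\n{hyp_lines}\n"
        "Extract 3-5 key facts from this page that support or contradict "
        "the hypotheses. Include concrete numbers where present."
    )

prompt = build_summary_prompt(
    "Would signing Eze outperform alternative targets?",
    ["Eze's G+A per 90 exceeds Rodrygo's"],
)
print(prompt)
```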
All the search tasks run in parallel using `asyncio.gather()`, so 15 searches happen simultaneously instead of sequentially. This cuts the total search time dramatically.
```python
# Fan out all search tasks; return_exceptions=True keeps one failed
# search from cancelling the rest of the batch.
search_tasks = [self.agents["search_worker"].run(state, task) for task in limited_tasks]
results = await asyncio.gather(*search_tasks, return_exceptions=True)
```

## Stage 5: Filtering the Evidence (Critic Agent)

Not everything you find is useful. The Critic Agent batch-processes all evidence and scores each item on:
- Relevance (0-1): How relevant to the research question?
- Credibility (0-1): How trustworthy is this source?
Items below threshold get dropped. This prevents the final report from being polluted with tangential or unreliable information.
I process everything in a single batch LLM call for efficiency; the model sees the full evidence landscape and can make relative judgments.
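Once the scores exist, the filtering itself is simple; a minimal sketch (the threshold values here are assumptions, not the project's actual settings):

```python
def filter_evidence(items, min_relevance=0.5, min_credibility=0.4):
    # Keep items that clear both thresholds, then rank the survivors so the
    # synthesizer sees the strongest evidence first.
    kept = [e for e in items
            if e["relevance"] >= min_relevance and e["credibility"] >= min_credibility]
    return sorted(kept, key=lambda e: e["relevance"] * e["credibility"],
                  reverse=True)

evidence = [
    {"url": "a", "relevance": 0.9, "credibility": 0.8},
    {"url": "b", "relevance": 0.3, "credibility": 0.9},  # dropped: off-topic
    {"url": "c", "relevance": 0.7, "credibility": 0.6},
]
print([e["url"] for e in filter_evidence(evidence)])  # → ['a', 'c']
```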
## Stage 6: Putting It All Together (Synthesizer Agent)

Finally, everything goes into one big prompt:
- The clarified research question
- The hypotheses to test
- All the filtered evidence (numbered, with source URLs)
- A detailed exemplar analysis showing the expected quality
The exemplar is a complete ~3000 word research report I wrote as a benchmark. It shows the model exactly what "good" looks like: comparison tables, quantitative metrics, confidence levels, inline citations. This "few-shot by example" approach works way better than just describing what I want.
The prompt explicitly requires:
- Testing each hypothesis systematically (attempt to falsify before accepting)
- Inline citations in `[X](URL)` format
- Comparison tables with specific numbers
- A clear recommendation with confidence level
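Assembling that prompt is mostly string plumbing; a hypothetical sketch (function name and section wording are mine):

```python
def build_synthesis_prompt(question, hypotheses, evidence, exemplar):
    # Number the evidence so the model can cite it inline as [1](url), [2](url), ...
    ev_block = "\n".join(
        f"[{i}] {e['summary']} (source: {e['url']})"
        for i, e in enumerate(evidence, start=1)
    )
    hyp_block = "\n".join(f"- {h}" for h in hypotheses)
    return (
        f"Research question: {question}\n\n"
        f"Hypotheses to test (attempt to falsify before accepting):\n{hyp_block}\n\n"
        f"Evidence:\n{ev_block}\n\n"
        f"Exemplar report (match this quality and structure):\n{exemplar}\n\n"
        "Write a 2000-3500 word report with inline [N](URL) citations, "
        "comparison tables, and a recommendation with a confidence level."
    )
```

Numbering the evidence in the prompt is what makes the automatic citation tracking possible: the model only ever refers to sources by index, and each index maps back to a known URL.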
## The Full Pipeline

```
User Question
    ↓
[Clarification] → "What do you actually mean?" → Falsifiable research question
    ↓
[Hypothesis] → 2-5 crucial hypotheses + 12-20 search keywords
    ↓
[Questions] → ~15 queries (40% evidence, 30% counter-evidence, 30% breadth)
    ↓
[Parallel Search] → Exa API with LLM summaries → Evidence pool
    ↓
[Critic] → Filter by relevance + credibility → Cleaned evidence
    ↓
[Synthesizer] → Question + Hypotheses + Evidence → Final Report
```
The whole thing is orchestrated with LangGraph, which handles the state management and agent sequencing. Each agent reads from and writes to a shared `ResearchState` object that accumulates results as it flows through the pipeline.
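In spirit, the shared state and sequencing look something like this plain-Python sketch (the real implementation uses LangGraph's graph API, and the field names here are illustrative, not the project's actual `ResearchState`):

```python
from dataclasses import dataclass, field

@dataclass
class ResearchState:
    question: str
    hypotheses: list = field(default_factory=list)
    queries: list = field(default_factory=list)
    evidence: list = field(default_factory=list)
    report: str = ""

def run_pipeline(state, agents):
    # Each agent reads the accumulated state and writes its contribution back,
    # mirroring how LangGraph passes state from node to node.
    for agent in agents:
        state = agent(state)
    return state

def hypothesis_agent(state):
    state.hypotheses.append("toy hypothesis")
    return state

final = run_pipeline(ResearchState(question="q"), [hypothesis_agent])
print(final.hypotheses)  # → ['toy hypothesis']
```

What LangGraph adds over this naive loop is declarative node/edge wiring, per-node state merging, and the ability to fan out (as the search node does) without hand-rolling the concurrency.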