tonyseetonydo/deep-research
Deep Research Agent System

A sophisticated AI-powered research system that generates comprehensive, evidence-based reports by orchestrating multiple specialized agents to gather, analyze, and synthesize information from web sources.


Features

🤖 Multi-Agent Architecture

  • Clarification Agent: Refines research questions through interactive dialogue
  • Hypothesis Agent: Generates testable hypotheses and search-ready keywords
  • Questions Agent: Creates diverse search queries (evidence, counter-evidence, keyword-based)
  • Supervisor Agent: Creates search tasks with depth assignments
  • Search Worker Agents: Execute parallel web searches using Exa API
  • Critic Agent: Evaluates evidence quality and relevance
  • Synthesizer Agent: Generates comprehensive research reports

πŸ” Advanced Research Capabilities

  • Parallel Web Search: Multiple search workers gather evidence simultaneously
  • Evidence Quality Assessment: AI-powered filtering of relevant information
  • Comprehensive Analysis: 2000-3500+ word reports with detailed sections
  • Inline Citations: Automatic citation tracking with hyperlinked references
  • Structured Output: Professional markdown reports with comparison tables

📊 Report Features

  • Evidence-Based Analysis: All claims supported by cited sources
  • Quantitative Comparisons: Detailed metrics and performance data
  • Risk Assessment: Comprehensive evaluation of options and trade-offs
  • Strategic Recommendations: Clear conclusions with confidence levels
  • Professional Formatting: Clean markdown output with proper structure

Setup Instructions

Prerequisites

  • Python 3.8+
  • uv package manager

Installation

  1. Install dependencies

    uv sync
  2. Set up environment variables. Create a .env file in the project root:

    # Required API Keys
    OPEN_AI_API_KEY=your_openai_api_key_here
    EXA_API_KEY=your_exa_api_key_here
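With only two required keys, it helps to fail fast when one is missing. A minimal sketch of a startup check (the `check_api_keys` helper is illustrative, not part of the repo; a loader such as python-dotenv would populate `os.environ` from the .env file first):

```python
import os

# Env names match the .env keys above.
REQUIRED_KEYS = ["OPEN_AI_API_KEY", "EXA_API_KEY"]

def check_api_keys():
    """Fail fast with a clear message if any required key is missing."""
    missing = [k for k in REQUIRED_KEYS if not os.environ.get(k)]
    if missing:
        raise RuntimeError("Missing required API keys: " + ", ".join(missing))
    return {k: os.environ[k] for k in REQUIRED_KEYS}
```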

Usage

Interactive Mode

Run the main application for an interactive experience:

uv run src/main.py

Quick Test

Test the system with a predefined example:

echo -e "2\n1\nD\n3" | uv run src/main.py

Example Workflow

  1. Question Clarification: The system asks clarifying questions to understand your research intent
  2. Evidence Gathering: Multiple search workers gather information from web sources
  3. Quality Assessment: The critic agent filters and ranks evidence by relevance
  4. Report Generation: The synthesizer creates a comprehensive analysis
  5. Output: Professional markdown report saved to the reports/ directory

Architecture

graph TD
    A[User Input] --> B[Clarification Agent]
    B --> C[Hypothesis Agent]
    C --> D[Questions Agent]
    D --> E[Supervisor Agent]
    E --> F[Search Node]
    
    subgraph F[Search Node]
        F1[Search Worker 1]
        F2[Search Worker 2]
        F3[Search Worker N]
    end
    
    F --> G[Critic Agent]
    G --> H[Synthesizer Agent]
    H --> I[Research Report]

How I Built This

I built this system in 2024 after being impressed by OpenAI's and Gemini's Deep Research features. I wanted to understand what makes those tools work and see whether I could replicate the core ideas myself.

What Makes a Great Research Report?

I started by asking ChatGPT: "What makes a research report actually useful?" We landed on a few key insights:

  1. The question matters more than the answer — A vague question produces a vague report. The best research starts with a precise, falsifiable question tied to a real decision.

  2. Good hypotheses drive good research — You can't just "search for information." You need specific claims to test, or you'll drown in tangential results.

  3. You need both depth AND breadth — Deep dives on your hypotheses, plus broad exploration to catch things you didn't think to ask about.

  4. Evidence needs filtering — Not everything you find is relevant or credible. Someone needs to play critic.

  5. Synthesis is where the magic happens — Raw evidence isn't a report. You need to weave it into a coherent analysis with clear recommendations.

These principles shaped the entire architecture.


Stage 1: Clearing Up the Mist (Clarification Agent)

The biggest insight was that the key to a good research report is asking the right question. Most users start with something vague like "Should Arsenal sign Eze?" — but what do they actually mean?

So the first step is a quick clarification dialogue. The agent:

  1. Restates your goal in one line
  2. Offers 3-4 distinct interpretations (A/B/C/D) — maybe you meant "Eze vs other options," or "Eze specifically vs Rodrygo," or "whether Eze fits Arsenal's system at all"
  3. Asks 3-5 quick questions: What decision are you making? What's your time horizon? What would change your mind?

Then it synthesizes your answers into a single falsifiable, decision-linked research question — something empirically checkable, not just an opinion piece.

Raw: "Should Arsenal sign Eze?"

Clarified: "Given Arsenal's current midfield composition and reported £60M budget, 
would signing Eberechi Eze provide better goal contribution per 90 minutes than 
alternative targets in the 2025-26 Premier League season?"

The prompt explicitly tells the model to be "neutral and non-leading" — we don't want it smuggling recommendations into the question itself.


Stage 2: The Key — Generating Hypotheses

This is the heart of the system. A good research report is built on good hypotheses.

Once we have a clear question, the Hypothesis Agent generates 2-5 crucial, researchable-now hypotheses — these are the "hinge points" that matter most for the decision. If you can answer these sub-questions, you can answer the main question.

The key constraints I built into the prompt:

  • Researchable-now: Can be validated with existing evidence (stats, reports, historical data) — not speculative "will X happen in 5 years"
  • Falsifiable: Something we could actually disprove if the evidence goes the other way
  • Crucial: Answers that would significantly change the conclusion, not marginal details

Each hypothesis includes indicators — observable signals to look for in the evidence. For example:

Hypothesis: "Eze's goals+assists per 90 exceeds Rodrygo's in comparable league contexts"
Indicators: ["FBRef per-90 stats", "minutes played in similar roles", "league difficulty adjustment"]
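A hypothesis-plus-indicators pair maps naturally onto a small structured record. A minimal sketch (the `Hypothesis` class and its field names are illustrative, not the repo's actual schema):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Hypothesis:
    claim: str                                            # falsifiable statement to test
    indicators: List[str] = field(default_factory=list)   # signals to look for in evidence

h = Hypothesis(
    claim="Eze's goals+assists per 90 exceeds Rodrygo's in comparable league contexts",
    indicators=["FBRef per-90 stats", "minutes played in similar roles",
                "league difficulty adjustment"],
)
```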

Stage 3: Making Hypotheses Searchable (Questions Agent)

Here's where I realized you need both depth and breadth.

  • Depth: Targeted queries to test each hypothesis — find the specific stats, the direct comparisons
  • Breadth: Exploratory queries using keywords to catch context you didn't think to ask about

The Questions Agent generates ~15 diverse search queries with a specific distribution:

  Type               ~%    Purpose
  Evidence           40%   Find supporting evidence for hypotheses
  Counter-Evidence   30%   Actively seek criticism, failures, opposing views
  Keyword-Based      30%   Explore broader context using extracted keywords

The counter-evidence queries are crucial — they prevent the system from just confirming whatever the user already believes. I explicitly prompt it to generate queries like "Eze weaknesses" or "why Rodrygo might be overrated."

The Hypothesis Agent also generates 12-20 query-ready keywords — entity names, metrics, years, comparison terms. These seed the breadth queries. I added specific guidance to include:

  • Disambiguators (league, season, version)
  • Qualifiers (years like "2024", doc types like "analysis")
  • Comparison terms ("vs", "compare")
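The ~40/30/30 split above can be turned into a concrete query budget. A sketch of one way to do it (the `allocate_queries` helper and its rounding strategy are illustrative; leftover queries go to breadth):

```python
def allocate_queries(total=15):
    """Split a query budget using the ~40/30/30 distribution:
    evidence / counter-evidence / keyword-based breadth."""
    evidence = int(total * 0.4)
    counter_evidence = int(total * 0.3)
    keyword = total - evidence - counter_evidence  # remainder goes to breadth
    return {"evidence": evidence,
            "counter_evidence": counter_evidence,
            "keyword": keyword}
```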

Stage 4: Parallel Search with Exa

For the actual searching, I use Exa's API which has a killer feature: built-in LLM summarization.

When you search, you can pass a summary parameter with a custom prompt, and Exa returns AI-generated summaries of each page tailored to your question. So instead of getting raw web pages and having to process them myself, I get focused summaries like:

"This FBRef page shows Eze recorded 11 goals and 6 assists in 2023-24, with 0.48 G+A per 90. His xG overperformance of +2.3 suggests some finishing luck..."

I wrote a custom summary prompt that includes the research question and hypotheses, asking Exa to "extract 3-5 key facts that support or contradict the hypotheses."

All the search tasks run in parallel using asyncio.gather() — so 15 searches happen simultaneously instead of sequentially. This cuts the total search time dramatically.

search_tasks = [self.agents["search_worker"].run(state, task) for task in limited_tasks]
results = await asyncio.gather(*search_tasks, return_exceptions=True)
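The return_exceptions=True flag is what keeps one failed search from sinking the whole batch: failures come back as exception objects that can be filtered out afterwards. A self-contained sketch of the pattern (fake workers stand in for the real Exa-backed search workers):

```python
import asyncio

async def fake_search(query):
    # Stand-in for a real search worker; "bad query" simulates a failure.
    if query == "bad query":
        raise ValueError("search failed: " + query)
    return {"query": query, "summary": "summary for " + query}

async def run_all(queries):
    tasks = [fake_search(q) for q in queries]
    # One failed task comes back as an exception object instead of
    # cancelling the whole batch.
    results = await asyncio.gather(*tasks, return_exceptions=True)
    return [r for r in results if not isinstance(r, Exception)]

ok = asyncio.run(run_all(["eze per 90 stats", "bad query", "rodrygo per 90 stats"]))
```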

Stage 5: Filtering the Evidence (Critic Agent)

Not everything you find is useful. The Critic Agent batch-processes all evidence and scores each item on:

  • Relevance (0-1): How relevant to the research question?
  • Credibility (0-1): How trustworthy is this source?

Items below threshold get dropped. This prevents the final report from being polluted with tangential or unreliable information.

I process everything in a single batch LLM call for efficiency — the model sees the full evidence landscape and can make relative judgments.
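Once the critic has attached scores, the filtering step itself is simple. A sketch, assuming evidence items carry the two 0-1 scores described above (the function name and the 0.5 thresholds are illustrative, not the repo's actual values):

```python
def filter_evidence(scored_items, min_relevance=0.5, min_credibility=0.5):
    """Keep only items that clear both the relevance and credibility thresholds."""
    return [item for item in scored_items
            if item["relevance"] >= min_relevance
            and item["credibility"] >= min_credibility]

evidence = [
    {"text": "FBRef per-90 stats page", "relevance": 0.9, "credibility": 0.95},
    {"text": "fan forum hot take",      "relevance": 0.7, "credibility": 0.20},
    {"text": "unrelated news item",     "relevance": 0.1, "credibility": 0.80},
]
kept = filter_evidence(evidence)
```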


Stage 6: Putting It All Together (Synthesizer Agent)

Finally, everything goes into one big prompt:

  • The clarified research question
  • The hypotheses to test
  • All the filtered evidence (numbered, with source URLs)
  • A detailed exemplar analysis showing the expected quality

The exemplar is a complete ~3000 word research report I wrote as a benchmark. It shows the model exactly what "good" looks like β€” comparison tables, quantitative metrics, confidence levels, inline citations. This "few-shot by example" approach works way better than just describing what I want.

The prompt explicitly requires:

  • Testing each hypothesis systematically (attempt to falsify before accepting)
  • Inline citations in [X](URL) format
  • Comparison tables with specific numbers
  • A clear recommendation with confidence level
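Assembling that prompt is mostly string plumbing over the accumulated state. A sketch of the shape (the `build_synthesis_prompt` helper and its section wording are illustrative, not the repo's actual prompt):

```python
def build_synthesis_prompt(question, hypotheses, evidence, exemplar):
    """Combine question, hypotheses, numbered evidence, and the exemplar
    into one synthesis prompt."""
    hypothesis_lines = "\n".join("- " + h for h in hypotheses)
    evidence_lines = "\n".join(
        "[{}] {} (source: {})".format(i, e["summary"], e["url"])
        for i, e in enumerate(evidence, start=1)
    )
    return (
        "Research question:\n" + question + "\n\n"
        "Hypotheses (attempt to falsify each before accepting):\n"
        + hypothesis_lines + "\n\n"
        "Numbered evidence (cite inline as [N](URL)):\n" + evidence_lines + "\n\n"
        "Exemplar report showing the expected quality:\n" + exemplar
    )
```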

The Full Pipeline

User Question
    ↓
[Clarification] → "What do you actually mean?" → Falsifiable research question
    ↓
[Hypothesis] → 2-5 crucial hypotheses + 12-20 search keywords
    ↓
[Questions] → ~15 queries (40% evidence, 30% counter-evidence, 30% breadth)
    ↓
[Parallel Search] → Exa API with LLM summaries → Evidence pool
    ↓
[Critic] → Filter by relevance + credibility → Cleaned evidence
    ↓
[Synthesizer] → Question + Hypotheses + Evidence → Final Report

The whole thing is orchestrated with LangGraph, which handles the state management and agent sequencing. Each agent reads from and writes to a shared ResearchState object that accumulates everything as it flows through the pipeline.
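The actual orchestration uses LangGraph, but the underlying pattern is simple enough to sketch without it: each stage is a function that reads from and writes to the shared state. A toy stdlib version (this `ResearchState` and the stage bodies are stand-ins; the real state object has many more fields):

```python
from dataclasses import dataclass, field

@dataclass
class ResearchState:
    question: str = ""
    hypotheses: list = field(default_factory=list)
    evidence: list = field(default_factory=list)
    report: str = ""

def clarify(state):
    state.question = "clarified: " + state.question

def hypothesize(state):
    state.hypotheses = ["H1", "H2"]

def synthesize(state):
    state.report = "Report on {} ({} hypotheses tested)".format(
        state.question, len(state.hypotheses))

def run_pipeline(raw_question):
    state = ResearchState(question=raw_question)
    # LangGraph replaces this loop with a declared graph plus state management,
    # but each node still accumulates results onto the shared state.
    for stage in (clarify, hypothesize, synthesize):
        stage(state)
    return state

final = run_pipeline("Should Arsenal sign Eze?")
```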

About

My toy implementation of deep research

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages