A sophisticated AI-powered research system that generates comprehensive, evidence-based reports by orchestrating multiple specialized agents to gather, analyze, and synthesize information from web sources.
- Features
- Setup Instructions
- Architecture
- How I Built This
- What Makes a Great Research Report?
- Stage 1: Clearing Up the Mist (Clarification Agent)
- Stage 2: Generating Hypotheses (Hypothesis Agent)
- Stage 3: Making Hypotheses Searchable (Questions Agent)
- Stage 4: Parallel Search with Exa
- Stage 5: Filtering the Evidence (Critic Agent)
- Stage 6: Putting It All Together (Synthesizer Agent)
- The Full Pipeline
## Features

- Clarification Agent: Refines research questions through interactive dialogue
- Hypothesis Agent: Generates testable hypotheses and search-ready keywords
- Questions Agent: Creates diverse search queries (evidence, counter-evidence, keyword-based)
- Supervisor Agent: Creates search tasks with depth assignments
- Search Worker Agents: Execute parallel web searches using Exa API
- Critic Agent: Evaluates evidence quality and relevance
- Synthesizer Agent: Generates comprehensive research reports
- Parallel Web Search: Multiple search workers gather evidence simultaneously
- Evidence Quality Assessment: AI-powered filtering of relevant information
- Comprehensive Analysis: 2000-3500+ word reports with detailed sections
- Inline Citations: Automatic citation tracking with hyperlinked references
- Structured Output: Professional markdown reports with comparison tables
- Evidence-Based Analysis: All claims supported by cited sources
- Quantitative Comparisons: Detailed metrics and performance data
- Risk Assessment: Comprehensive evaluation of options and trade-offs
- Strategic Recommendations: Clear conclusions with confidence levels
- Professional Formatting: Clean markdown output with proper structure
## Setup Instructions

Prerequisites:

- Python 3.8+
- uv package manager
1. Install dependencies:

   ```bash
   uv sync
   ```

2. Set up environment variables. Create a `.env` file in the project root:

   ```
   # Required API Keys
   OPEN_AI_API_KEY=your_openai_api_key_here
   EXA_API_KEY=your_exa_api_key_here
   ```
Run the main application for an interactive experience:

```bash
uv run src/main.py
```

Test the system with a predefined example:

```bash
echo -e "2\n1\nD\n3" | uv run src/main.py
```

- Question Clarification: The system asks clarifying questions to understand your research intent
- Evidence Gathering: Multiple search workers gather information from web sources
- Quality Assessment: The critic agent filters and ranks evidence by relevance
- Report Generation: The synthesizer creates a comprehensive analysis
- Output: Professional markdown report saved to the `reports/` directory
## Architecture

```mermaid
graph TD
    A[User Input] --> B[Clarification Agent]
    B --> C[Hypothesis Agent]
    C --> D[Questions Agent]
    D --> E[Supervisor Agent]
    E --> F
    subgraph F[Search Node]
        F1[Search Worker 1]
        F2[Search Worker 2]
        F3[Search Worker N]
    end
    F --> G[Critic Agent]
    G --> H[Synthesizer Agent]
    H --> I[Research Report]
```
## How I Built This

I built this system in 2024 after being impressed by OpenAI and Gemini's Deep Research feature. I wanted to understand what makes those tools work and see if I could replicate the core ideas myself.
## What Makes a Great Research Report?

I started by asking ChatGPT: "What makes a research report actually useful?" We landed on a few key insights:
1. **The question matters more than the answer.** A vague question produces a vague report. The best research starts with a precise, falsifiable question tied to a real decision.

2. **Good hypotheses drive good research.** You can't just "search for information." You need specific claims to test, or you'll drown in tangential results.

3. **You need both depth AND breadth.** Deep dives on your hypotheses, plus broad exploration to catch things you didn't think to ask about.

4. **Evidence needs filtering.** Not everything you find is relevant or credible. Someone needs to play critic.

5. **Synthesis is where the magic happens.** Raw evidence isn't a report. You need to weave it into a coherent analysis with clear recommendations.
These principles shaped the entire architecture.
## Stage 1: Clearing Up the Mist (Clarification Agent)

The biggest insight was that the key to a good research report is asking the right question. Most users start with something vague like "Should Arsenal sign Eze?" But what do they actually mean?
So the first step is a quick clarification dialogue. The agent:
- Restates your goal in one line
- Offers 3-4 distinct interpretations (A/B/C/D): maybe you meant "Eze vs other options," or "Eze specifically vs Rodrygo," or "whether Eze fits Arsenal's system at all"
- Asks 3-5 quick questions: What decision are you making? What's your time horizon? What would change your mind?
Then it synthesizes your answers into a single falsifiable, decision-linked research question: something empirically checkable, not just an opinion piece.
```
Raw: "Should Arsenal sign Eze?"

Clarified: "Given Arsenal's current midfield composition and reported £60M budget,
would signing Eberechi Eze provide better goal contribution per 90 minutes than
alternative targets in the 2025-26 Premier League season?"
```
The prompt explicitly tells the model to be "neutral and non-leading"; we don't want it smuggling recommendations into the question itself.
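As a rough sketch of how such a prompt could be assembled (the function name and exact wording are my own illustration, not the repo's actual prompt text):

```python
def build_clarification_prompt(raw_question: str) -> str:
    # Hypothetical helper: the real agent's prompt lives in the project;
    # this only mirrors the behavior described above.
    return (
        "You are a neutral and non-leading research clarifier.\n"
        "1. Restate the user's goal in one line.\n"
        "2. Offer 3-4 distinct interpretations, labeled A/B/C/D.\n"
        "3. Ask 3-5 quick questions: what decision is being made, over what\n"
        "   time horizon, and what evidence would change the user's mind.\n"
        "Do not recommend an answer; only sharpen the question.\n\n"
        f"User question: {raw_question}"
    )

print(build_clarification_prompt("Should Arsenal sign Eze?"))
```

The model's answers to these questions are then folded back into the single clarified research question shown above.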
## Stage 2: Generating Hypotheses (Hypothesis Agent)

This is the heart of the system. A good research report is built on good hypotheses.
Once we have a clear question, the Hypothesis Agent generates 2-5 crucial, researchable-now hypotheses: the "hinge points" that matter most for the decision. If you can answer these sub-questions, you can answer the main question.
The key constraints I built into the prompt:
- Researchable-now: Can be validated with existing evidence (stats, reports, historical data), not speculative "will X happen in 5 years"
- Falsifiable: Something we could actually disprove if the evidence goes the other way
- Crucial: Answers that would significantly change the conclusion, not marginal details
Each hypothesis includes indicators β observable signals to look for in the evidence. For example:
```
Hypothesis: "Eze's goals+assists per 90 exceeds Rodrygo's in comparable league contexts"
Indicators: ["FBRef per-90 stats", "minutes played in similar roles", "league difficulty adjustment"]
```
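The shape of that structured output can be sketched as a small dataclass (field names here are my guess at the schema, not the project's actual model):

```python
from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    claim: str                                       # falsifiable statement to test
    indicators: list[str] = field(default_factory=list)  # observable signals in evidence

h = Hypothesis(
    claim="Eze's goals+assists per 90 exceeds Rodrygo's in comparable league contexts",
    indicators=["FBRef per-90 stats", "minutes played in similar roles",
                "league difficulty adjustment"],
)
```

Downstream agents can then iterate over `h.indicators` when deciding what each search query should look for.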
## Stage 3: Making Hypotheses Searchable (Questions Agent)

Here's where I realized you need both depth and breadth.
- Depth: Targeted queries to test each hypothesis, finding the specific stats and direct comparisons
- Breadth: Exploratory queries using keywords to catch context you didn't think to ask about
The Questions Agent generates ~15 diverse search queries with a specific distribution:
| Type | ~% | Purpose |
|---|---|---|
| Evidence | 40% | Find supporting evidence for hypotheses |
| Counter-Evidence | 30% | Actively seek criticism, failures, opposing views |
| Keyword-Based | 30% | Explore broader context using extracted keywords |
The counter-evidence queries are crucial: they prevent the system from just confirming whatever the user already believes. I explicitly prompt it to generate queries like "Eze weaknesses" or "why Rodrygo might be overrated."
The Hypothesis Agent also generates 12-20 query-ready keywords β entity names, metrics, years, comparison terms. These seed the breadth queries. I added specific guidance to include:
- Disambiguators (league, season, version)
- Qualifiers (years like "2024", doc types like "analysis")
- Comparison terms ("vs", "compare")
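The 40/30/30 query mix can be sketched as a simple allocation (a toy version, with Python's `round()` and the remainder going to the keyword bucket; the real agent has the LLM generate the queries themselves):

```python
def allocate_queries(total: int = 15) -> dict:
    # Approximate the 40% evidence / 30% counter-evidence / 30% keyword split,
    # giving any rounding remainder to the keyword (breadth) bucket.
    n_evidence = round(total * 0.40)
    n_counter = round(total * 0.30)
    n_keyword = total - n_evidence - n_counter
    return {"evidence": n_evidence,
            "counter_evidence": n_counter,
            "keyword": n_keyword}

print(allocate_queries())
```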
## Stage 4: Parallel Search with Exa

For the actual searching, I use Exa's API, which has a killer feature: built-in LLM summarization.
When you search, you can pass a `summary` parameter with a custom prompt, and Exa returns AI-generated summaries of each page tailored to your question. So instead of getting raw web pages and having to process them myself, I get focused summaries like:
"This FBRef page shows Eze recorded 11 goals and 6 assists in 2023-24, with 0.48 G+A per 90. His xG overperformance of +2.3 suggests some finishing luck..."
I wrote a custom summary prompt that includes the research question and hypotheses, asking Exa to "extract 3-5 key facts that support or contradict the hypotheses."
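A sketch of how that summary prompt might be composed (my paraphrase of the wording; the resulting string would be handed to the Exa client's summary option):

```python
def build_summary_prompt(question: str, hypotheses: list[str]) -> str:
    # Illustrative only: the project's actual prompt wording differs.
    hyp_lines = "\n".join(f"- {h}" for h in hypotheses)
    return (
        f"Research question: {question}\n"
        f"Hypotheses under test:\n{hyp_lines}\n"
        "Extract 3-5 key facts from this page that support or contradict "
        "the hypotheses. Include concrete numbers where present."
    )

prompt = build_summary_prompt(
    "Would signing Eze outperform alternative targets?",
    ["Eze's G+A per 90 exceeds Rodrygo's"],
)
print(prompt)
```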
All the search tasks run in parallel using `asyncio.gather()`, so 15 searches happen simultaneously instead of sequentially. This cuts the total search time dramatically.
```python
# Fan out all search tasks; return_exceptions=True keeps one failed
# search from cancelling the rest of the batch.
search_tasks = [self.agents["search_worker"].run(state, task) for task in limited_tasks]
results = await asyncio.gather(*search_tasks, return_exceptions=True)
```

## Stage 5: Filtering the Evidence (Critic Agent)

Not everything you find is useful. The Critic Agent batch-processes all evidence and scores each item on:
- Relevance (0-1): How relevant to the research question?
- Credibility (0-1): How trustworthy is this source?
Items below threshold get dropped. This prevents the final report from being polluted with tangential or unreliable information.
I process everything in a single batch LLM call for efficiency; the model sees the full evidence landscape and can make relative judgments.
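Once the scores exist, the filtering itself is simple; a minimal sketch (the threshold values here are assumptions, not the project's actual settings):

```python
def filter_evidence(items, min_relevance=0.5, min_credibility=0.4):
    # Keep items that clear both thresholds, then rank the survivors so the
    # synthesizer sees the strongest evidence first.
    kept = [e for e in items
            if e["relevance"] >= min_relevance and e["credibility"] >= min_credibility]
    return sorted(kept, key=lambda e: e["relevance"] * e["credibility"],
                  reverse=True)

evidence = [
    {"url": "a", "relevance": 0.9, "credibility": 0.8},
    {"url": "b", "relevance": 0.3, "credibility": 0.9},  # dropped: off-topic
    {"url": "c", "relevance": 0.7, "credibility": 0.6},
]
print([e["url"] for e in filter_evidence(evidence)])  # → ['a', 'c']
```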
## Stage 6: Putting It All Together (Synthesizer Agent)

Finally, everything goes into one big prompt:
- The clarified research question
- The hypotheses to test
- All the filtered evidence (numbered, with source URLs)
- A detailed exemplar analysis showing the expected quality
The exemplar is a complete ~3000 word research report I wrote as a benchmark. It shows the model exactly what "good" looks like: comparison tables, quantitative metrics, confidence levels, inline citations. This "few-shot by example" approach works way better than just describing what I want.
The prompt explicitly requires:
- Testing each hypothesis systematically (attempt to falsify before accepting)
- Inline citations in `[X](URL)` format
- Comparison tables with specific numbers
- A clear recommendation with confidence level
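Assembling that prompt is mostly string plumbing; a hypothetical sketch (function name and section wording are mine):

```python
def build_synthesis_prompt(question, hypotheses, evidence, exemplar):
    # Number the evidence so the model can cite it inline as [1](url), [2](url), ...
    ev_block = "\n".join(
        f"[{i}] {e['summary']} (source: {e['url']})"
        for i, e in enumerate(evidence, start=1)
    )
    hyp_block = "\n".join(f"- {h}" for h in hypotheses)
    return (
        f"Research question: {question}\n\n"
        f"Hypotheses to test (attempt to falsify before accepting):\n{hyp_block}\n\n"
        f"Evidence:\n{ev_block}\n\n"
        f"Exemplar report (match this quality and structure):\n{exemplar}\n\n"
        "Write a 2000-3500 word report with inline [N](URL) citations, "
        "comparison tables, and a recommendation with a confidence level."
    )
```

Numbering the evidence in the prompt is what makes the automatic citation tracking possible: the model only ever refers to sources by index, and each index maps back to a known URL.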
## The Full Pipeline

```
User Question
    ↓
[Clarification] → "What do you actually mean?" → Falsifiable research question
    ↓
[Hypothesis] → 2-5 crucial hypotheses + 12-20 search keywords
    ↓
[Questions] → ~15 queries (40% evidence, 30% counter-evidence, 30% breadth)
    ↓
[Parallel Search] → Exa API with LLM summaries → Evidence pool
    ↓
[Critic] → Filter by relevance + credibility → Cleaned evidence
    ↓
[Synthesizer] → Question + Hypotheses + Evidence → Final Report
```
The whole thing is orchestrated with LangGraph, which handles the state management and agent sequencing. Each agent reads from and writes to a shared `ResearchState` object that accumulates results as it flows through the pipeline.
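In spirit, the shared state and sequencing look something like this plain-Python sketch (the real implementation uses LangGraph's graph API, and the field names here are illustrative, not the project's actual `ResearchState`):

```python
from dataclasses import dataclass, field

@dataclass
class ResearchState:
    question: str
    hypotheses: list = field(default_factory=list)
    queries: list = field(default_factory=list)
    evidence: list = field(default_factory=list)
    report: str = ""

def run_pipeline(state, agents):
    # Each agent reads the accumulated state and writes its contribution back,
    # mirroring how LangGraph passes state from node to node.
    for agent in agents:
        state = agent(state)
    return state

def hypothesis_agent(state):
    state.hypotheses.append("toy hypothesis")
    return state

final = run_pipeline(ResearchState(question="q"), [hypothesis_agent])
print(final.hypotheses)  # → ['toy hypothesis']
```

What LangGraph adds over this naive loop is declarative node/edge wiring, per-node state merging, and the ability to fan out (as the search node does) without hand-rolling the concurrency.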