An agentic system for hierarchical discovery and organization of research themes from ArXiv papers. Rather than exhaustive cataloging, LitReviews identifies genuinely central topics and their relationships through iterative refinement guided by AI critique.
Practitioners in fast-moving fields face a fundamental challenge: research directions shift quickly, and staying current across thousands of new papers published each month is infeasible through manual reading. Maintaining an up-to-date mental model of current research trends is therefore difficult.
LitReviews addresses this by automatically discovering trending research themes and their relationships, allowing researchers to rapidly understand the current landscape of any field and identify where innovation is accelerating.
The system employs a structured agent graph with the following stages:
- Theme Exploration - An LLM agent executes an OODA loop (Observe-Orient-Decide-Act) to iteratively discover research themes, search for relevant papers, and build theme hierarchies
- Validation - The theme structure is validated for consistency (parent-child relationships, no cycles, proper nesting)
- Critique Evaluation - An AI critic independently evaluates each theme on distinctiveness, coherence, and paper relevance alignment
- Refinement - Based on critic feedback (severity levels: CRITICAL, MAJOR, MINOR), the agent iteratively improves themes until convergence or the iteration limit is reached
- Selective, not exhaustive: Prioritizes finding 50 highly relevant papers over 500 marginally related ones
- Hierarchical organization: Themes nest meaningfully with root, parent, and leaf-level distinctions
- Multi-dimensional relevance: Papers evaluated on topic relevance, root theme relevance, and current landscape representation
- Token-aware processing: Real-time context management prevents exceeding LLM token limits through selective consolidation
- Provider-agnostic: Supports Anthropic Claude and Google Gemini interchangeably
The graph orchestrates theme discovery and refinement through a multi-stage pipeline. The explore node iteratively discovers themes via LLM tool-use. Validation checks for structural consistency before critique. If validation fails, the system returns to exploration. Critic feedback determines whether to continue refinement (returning to explore) or finalize output.
| Component | Role |
|---|---|
| explore | LLM-driven theme discovery with tool-use (add/update/delete themes, search papers) |
| validate_themes | Structural validation of theme hierarchy |
| evaluate_single_theme | Critic assessment of individual theme quality |
| compile_critic_scores | Aggregation of critique results across all themes |
| critic_feedback | Routing logic: continue refining or finalize |
| consolidate_history_node | Context window recovery when approaching token limits |
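
The wiring below is a minimal sketch of how this routing could be expressed with LangGraph (assuming that is the state-graph library behind the orchestration in arxiv_agent.py); node names match the table, but the state fields and node bodies are illustrative placeholders, not the project's actual code:

```python
# Illustrative LangGraph wiring of the pipeline above. State fields and
# node bodies are assumptions for the sketch, not the project's code.
from typing import Literal
from typing_extensions import TypedDict
from langgraph.graph import StateGraph, END

class AgentState(TypedDict):
    themes: list          # current theme hierarchy (hypothetical field)
    validation_ok: bool   # result of structural validation
    critic_verdict: str   # "refine" or "finalize"

def route_after_validation(state: AgentState) -> Literal["explore", "evaluate_single_theme"]:
    # Failed validation sends the agent back to exploration.
    return "evaluate_single_theme" if state["validation_ok"] else "explore"

def route_after_critique(state: AgentState) -> str:
    # Critic feedback either triggers another refinement pass or finalizes.
    return END if state["critic_verdict"] == "finalize" else "explore"

graph = StateGraph(AgentState)
for name in ("explore", "validate_themes", "evaluate_single_theme",
             "compile_critic_scores", "critic_feedback"):
    graph.add_node(name, lambda state: state)  # placeholder node bodies

graph.set_entry_point("explore")
graph.add_edge("explore", "validate_themes")
graph.add_conditional_edges("validate_themes", route_after_validation)
graph.add_edge("evaluate_single_theme", "compile_critic_scores")
graph.add_edge("compile_critic_scores", "critic_feedback")
graph.add_conditional_edges("critic_feedback", route_after_critique)
app = graph.compile()
```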
The system integrates with ArXiv via a Model Context Protocol (MCP) server:
- Paper search by keyword/category
- Metadata retrieval (authors, abstract, publication date, citations)
- Citation graph parsing
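
For illustration, a search tool on such a server might look like the following, using the FastMCP helper from the MCP Python SDK together with the `arxiv` client library; the tool name and parameters here are assumptions, not the server's actual interface:

```python
# Hypothetical shape of an ArXiv search tool exposed over MCP.
import arxiv  # pip install arxiv
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("arxiv")

@mcp.tool()
def search_papers(query: str, category: str = "cs.AI", max_results: int = 10) -> list[dict]:
    """Search ArXiv by keyword within a category and return paper metadata."""
    search = arxiv.Search(
        query=f"cat:{category} AND all:{query}",
        max_results=max_results,
        sort_by=arxiv.SortCriterion.SubmittedDate,
    )
    return [
        {
            "id": result.entry_id,
            "title": result.title,
            "authors": [a.name for a in result.authors],
            "abstract": result.summary,
            "published": result.published.isoformat(),
        }
        for result in arxiv.Client().results(search)
    ]

if __name__ == "__main__":
    mcp.run()  # serves over stdio by default
```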
Theme management tools:
- `add_research_theme()` - Create a new theme with a parent relationship
- `update_research_theme()` - Modify a theme's description and paper assignments
- `validate_theme_list()` - Check hierarchy consistency
- `complete_theme_draft()` - Mark a theme as finalized
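
A hypothetical call sequence is sketched below; the actual signatures live in `research_theme_tools.py` and may differ, so every argument name here is an assumption:

```python
# Illustrative use of the theme-management tools; all parameter names
# and return shapes are assumptions, not the module's real signatures.
from research_theme_tools import (
    add_research_theme,
    update_research_theme,
    validate_theme_list,
)

# Create a root theme, then nest a child under it (hypothetical arguments).
root = add_research_theme(topic="Agentic AI", parent=None,
                          description="LLM agents that plan and act autonomously")
child = add_research_theme(topic="Self-improvement loops", parent=root,
                           description="Agents that refine their own behavior")

# Attach papers discovered during exploration (hypothetical paper IDs).
update_research_theme(theme=child, paper_ids=["2401.00001", "2402.00002"])

# Check parent-child consistency, proper nesting, and absence of cycles.
issues = validate_theme_list([root, child])
```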
Create `config.yaml` from the template (not version controlled; it contains API keys):

```yaml
provider: "anthropic"

anthropic:
  api_key: ${ANTHROPIC_API_KEY}
  model: claude-haiku-4-5

google:
  api_key: ${GOOGLE_API_KEY}
  model: gemini-2.5-flash

critic:
  max_iterations: 2

rate_limiter:
  requests_per_second: 15
```
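
The `${VAR}` placeholders imply environment-variable expansion at load time. A minimal loader sketch, assuming PyYAML (the project's actual loader may differ):

```python
# Minimal config-loader sketch: expands ${VAR} placeholders from the
# environment before parsing. Illustrative only.
import os
import re
import yaml  # pip install pyyaml

def load_config(path: str = "config.yaml") -> dict:
    with open(path) as f:
        raw = f.read()
    # Replace each ${VAR} with the value of the environment variable VAR.
    expanded = re.sub(r"\$\{(\w+)\}", lambda m: os.environ.get(m.group(1), ""), raw)
    return yaml.safe_load(expanded)

config = load_config()
provider = config["provider"]           # e.g. "anthropic"
api_key = config[provider]["api_key"]   # resolved from the environment
```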
Core Pydantic models ensure type safety and validation:

| Model | Purpose |
|---|---|
| `ResearchTheme` | Hierarchical theme with description, papers, and scoring |
| `Paper` | ArXiv paper metadata (id, title, authors, abstract, categories, age) |
| `ThemeCriticOutput` | Evaluation results with issue list (severity, recommendation) |
| `PaperAnalysis` | Paper with relevance justifications across multiple dimensions |
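
As a sketch of their shape, two of these models might look like the following; any field not named in the table above is an assumption:

```python
# Illustrative Pydantic models; fields beyond those listed in the
# table (e.g. age_days, score) are assumptions.
from pydantic import BaseModel, Field

class Paper(BaseModel):
    id: str                    # ArXiv identifier, e.g. "2401.00001"
    title: str
    authors: list[str]
    abstract: str
    categories: list[str]      # e.g. ["cs.AI", "cs.CL"]
    age_days: int              # assumed representation of "age"

class ResearchTheme(BaseModel):
    topic: str
    description: str
    parent: str | None = None  # topic of the parent theme; None for roots
    papers: list[Paper] = Field(default_factory=list)
    score: float | None = None # critic-assigned quality score (assumed)
```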
The system actively monitors context usage against a configurable `recommended_token` threshold (default: 64,000):
- 70%: Advisory message ("approaching limit")
- 80%: Warning state ("critical usage")
- 100%+: Force consolidation (summarizes conversation history)
Consolidation is order-independent and preserves theme structure while reducing verbosity of earlier exchanges.
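
The threshold logic reduces to a simple ratio check; a sketch, with token counting and the consolidation step stubbed out:

```python
# Sketch of the threshold logic described above; the returned labels
# mirror the 70% / 80% / 100%+ tiers.
RECOMMENDED_TOKENS = 64_000  # configurable recommended_token threshold

def check_context_usage(used_tokens: int) -> str:
    ratio = used_tokens / RECOMMENDED_TOKENS
    if ratio >= 1.0:
        return "consolidate"  # force-summarize conversation history
    if ratio >= 0.8:
        return "warning"      # critical usage
    if ratio >= 0.7:
        return "advisory"     # approaching limit
    return "ok"

# Example: 56,000 tokens used -> 87.5% of the threshold -> "warning"
assert check_context_usage(56_000) == "warning"
```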
Launch the interactive UI with:

```bash
streamlit run arxiv_agent_ui.py
```

Enter a research topic (e.g., "Continuous Improvement in Agentic AI") and the system will iteratively discover and refine themes.
For programmatic use:

```python
from arxiv_agent import create_and_run_agent

result = create_and_run_agent(
    topic="Your research question",
    config_path="config.yaml",
    recursion_limit=25,
)

# result contains the hierarchical ResearchTheme structure
for theme in result.themes:
    print(f"{theme.topic}: {len(theme.papers)} papers")
```

| File | Lines | Purpose |
|---|---|---|
| `arxiv_agent.py` | 1532 | State graph orchestration and node logic |
| `arxiv_server.py` | 706 | MCP server implementation for ArXiv |
| `arxiv_prompts.py` | 652 | System and dynamic instruction prompts |
| `arxiv_agent_ui.py` | 754 | Streamlit interface for interactive use |
| `research_theme_tools.py` | 516 | Theme CRUD operations and validation |
| `data_models.py` | 274 | Pydantic models for type safety |
- Iterative refinement: Critic-guided improvement ensures theme quality and distinctiveness
- Transparent reasoning: LLM maintains conversation history; decisions are traceable
- Structured output: Hierarchical themes are easily parsed and consumed by downstream systems
- Flexible scope: Token management allows handling 5-paper deep dives or 500-paper broad surveys
- Critique consistency: Critic evaluation can vary; weighted aggregation of multiple independent passes could improve stability
- Cold start problem: Initial theme discovery quality depends on LLM understanding of topic; few-shot examples in prompts could help
- Cross-theme relationships: Current design emphasizes tree structure; modeling sibling relationships or cross-cutting concerns remains unexplored
- Evaluation metrics: No ground-truth benchmark for theme hierarchy quality; developing reproducible evaluation metrics is an open question
- Python 3.13+
- API key for Claude (Anthropic) or Gemini (Google)
See `pyproject.toml` for dependency versions.
Built for researchers seeking rapid, AI-guided literature organization. Designed for reproducibility and extensibility across different research domains.
