An AI-powered research analysis agent that automates academic paper discovery, analysis, and literature review generation. Give it a research topic and get back a structured report with relevant papers, identified research gaps, and proposed directions.
- Multi-Source Discovery - Searches Semantic Scholar and arXiv simultaneously with deduplication
- LLM-Powered Analysis - Uses Google Gemini for query expansion, paper analysis, gap detection, and feasibility evaluation
- Vector-Based Ranking - Embeds paper abstracts with Gemini embeddings (768-dim) and re-ranks with LLM scoring
- PDF Processing - Downloads and parses PDFs via GROBID (TEI XML extraction) with graceful fallback to abstracts
- Research Gap Detection - Identifies unexplored areas, conflicting results, and limitations across papers
- Feasibility Scoring - Evaluates research directions on difficulty, cost, industry viability, and time-to-MVP
- Real-Time Progress - Server-Sent Events (SSE) streaming for live pipeline status updates
- Multi-Format Output - Generates Markdown reports, BibTeX references, and structured JSON
# 1. Clone the repo
git clone https://github.com/your-username/paper-scout.git
cd paper-scout
# 2. Set up environment
cp .env.example .env
# Edit .env and add your Gemini API key:
# LLM__API_KEY=your_gemini_api_key_here
# 3. Start all services
docker compose up -d
# 4. Start a research query
curl -X POST http://localhost:8080/api/v1/research \
-H "Content-Type: application/json" \
  -d '{"topic": "transformer architectures for time series forecasting"}'

INPUT: Research Topic

│
▼
┌──────────────────┐
│ 1. Query │ SYNCHRONOUS
│ Expansion │ LLM expands topic into search queries, concepts, keywords
└────────┬─────────┘
│
▼
┌──────────────────┐
│ 2. Paper │ SYNCHRONOUS
│ Discovery │ Search Semantic Scholar + arXiv, deduplicate
└────────┬─────────┘
│
▼
┌──────────────────┐
│ 3. Ranking │ SYNCHRONOUS
│ │ Embed abstracts, cosine similarity, LLM re-rank top 50
└────────┬─────────┘
│
▼
┌──────────────────┐
│ 4. Paper │ ASYNC (Worker Pool)
│ Analysis │ Download PDFs via GROBID, LLM extracts structured data
└────────┬─────────┘
│
▼
┌──────────────────┐
│ 5. Gap │ SYNCHRONOUS
│ Detection │ Cross-paper analysis identifies research gaps
└────────┬─────────┘
│
▼
┌──────────────────┐
│ 6. Feasibility │ SYNCHRONOUS
│ Evaluation │ Score each gap on difficulty, cost, viability
└────────┬─────────┘
│
▼
┌──────────────────┐
│ 7. Report │ SYNCHRONOUS
│ Generation │ Assemble Markdown report + BibTeX references
└────────┬─────────┘
│
▼
OUTPUT: Complete Research Report
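Stage 4 is the only asynchronous stage: papers fan out to a fixed-size worker pool instead of being processed one at a time. A minimal sketch of that pattern, where `Paper` and `analyze` are hypothetical stand-ins for the pipeline's real types and per-paper work:

```go
package main

import (
	"fmt"
	"sync"
)

// Paper is a stand-in for the pipeline's paper record.
type Paper struct{ Title string }

// analyze simulates stage 4's per-paper work (PDF download + LLM extraction).
func analyze(p Paper) string { return "analyzed: " + p.Title }

// analyzeAll fans papers out to `workers` goroutines and collects results.
func analyzeAll(papers []Paper, workers int) []string {
	jobs := make(chan Paper)
	results := make(chan string, len(papers))
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for p := range jobs {
				results <- analyze(p)
			}
		}()
	}
	for _, p := range papers {
		jobs <- p
	}
	close(jobs) // no more work; workers drain and exit
	wg.Wait()
	close(results)
	var out []string
	for r := range results {
		out = append(out, r)
	}
	return out
}

func main() {
	papers := []Paper{{"A"}, {"B"}, {"C"}}
	fmt.Println(len(analyzeAll(papers, 2)))
}
```

Result order is nondeterministic with a pool like this, which is why the surrounding stages in the pipeline are synchronous.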
- Circuit breaker on all external APIs (Gemini, Semantic Scholar, arXiv)
- Token bucket rate limiting per API
- Exponential backoff with jitter for retries
- Graceful degradation (PDF parse failure falls back to abstracts; LLM rerank failure falls back to embedding scores)
- Discovery retry with 3 query levels (full, broad, minimal)
Base URL: http://localhost:8080
| Method | Path | Description |
|---|---|---|
| POST | `/api/v1/research` | Start research pipeline |
| GET | `/api/v1/research/:id` | Get full research result |
| GET | `/api/v1/research/:id/status` | Get pipeline status only |
| GET | `/api/v1/research/:id/stream` | SSE stream for real-time updates |
| GET | `/api/v1/research/:id/report` | Download Markdown report |
| GET | `/api/v1/research/:id/bibtex` | Download BibTeX references |
| GET | `/health` | Health check |
curl -X POST http://localhost:8080/api/v1/research \
-H "Content-Type: application/json" \
  -d '{"topic": "large language models for code generation"}'

Response:
{
"topic_id": "550e8400-e29b-41d4-a716-446655440000",
"status": "pending",
"message": "Research started"
}

curl http://localhost:8080/api/v1/research/{topic_id}/stream

Events:
event: status
data: {"stage": "query_expansion", "progress": 0.05}
event: status
data: {"stage": "paper_discovery", "progress": 0.15}
| Variable | Description |
|---|---|
| `LLM__API_KEY` | Google Gemini API key |
All defaults live in `config/default.yaml`. Override any setting via environment variables, using `__` as the path delimiter (e.g., `LLM__MODEL` overrides `llm.model`).
| Variable | Default | Description |
|---|---|---|
| `LLM__MODEL` | `gemini-2.5-flash` | LLM model |
| `LLM__EMBEDDING_MODEL` | `gemini-embedding-001` | Embedding model |
| `LLM__REQUESTS_PER_MINUTE` | `15` | Gemini per-minute rate limit |
| `LLM__REQUESTS_PER_DAY` | `1000` | Gemini daily request limit |
| `PIPELINE__MAX_PAPERS` | `50` | Max papers to discover |
| `PIPELINE__PAPERS_TO_ANALYZE` | `20` | Papers selected for deep analysis |
| `PIPELINE__WORKER_POOL_SIZE` | `10` | Concurrent analysis workers |
| `SERVER__PORT` | `8080` | HTTP server port |
| Service | Image | Port | Purpose |
|---|---|---|---|
| postgres | `postgres:17-alpine` | 5432 | Primary database |
| redis | `redis:8-alpine` | 6379 | Cache, pipeline state, job queue |
| qdrant | `qdrant/qdrant:latest` | 6333 | Vector database (768-dim, cosine) |
| grobid | `lfoppiano/grobid:0.8.1` | 8070 | PDF parsing (TEI XML) |
- Go 1.24+
- Docker + Docker Compose
- just command runner
just setup # Install tools, copy .env
just up # Start services (detached)
just dev # Run with hot reload
just run # Run without hot reload
just test # Run tests
just test-coverage # Tests with coverage report
just fmt # Format code
just lint # Run linter
just check # Vet + build
just down # Stop services
just clean # Stop and remove volumes
just logs            # Tail app logs

├── cmd/server/            # Application entry point
├── config/ # Default configuration
├── internal/
│ ├── agent/ # Pipeline agents (discovery, ranking, analysis, etc.)
│ ├── api/ # HTTP handlers, router, middleware
│ ├── circuitbreaker/ # Circuit breaker pattern
│ ├── config/ # Configuration loading and validation
│ ├── llm/ # LLM abstraction (Gemini client, prompts, rate limiting)
│ ├── logger/ # Structured logging (zerolog)
│ ├── orchestrator/ # Pipeline orchestration and SSE streaming
│ ├── storage/
│ │ ├── postgres/ # PostgreSQL (pgx + sqlc)
│ │ ├── qdrant/ # Vector database client
│ │ └── redis/ # Cache, state, job queue
│ ├── tools/
│ │ ├── arxiv/ # arXiv API client
│ │ ├── bibtex/ # BibTeX citation generator
│ │ ├── embedding/ # Gemini embedding generator
│ │ ├── pdf/ # PDF download and GROBID parsing
│ │ └── semantic_scholar/ # Semantic Scholar API client
│ └── worker/ # Background job processing
├── migrations/ # Database migrations
├── docker-compose.yml # Service orchestration
├── Justfile # Development commands
└── go.mod
- Language: Go 1.24
- HTTP: Gin
- Database: PostgreSQL 17 (pgx/v5 + sqlc)
- Cache/Queue: Redis 8
- Vector DB: Qdrant
- LLM: Google Gemini (gemini-2.5-flash)
- Embeddings: Gemini (gemini-embedding-001, 768-dim)
- PDF Parsing: GROBID 0.8.1
- Config: koanf (YAML + env overlay)
- Logging: zerolog
MIT