This document describes the system architecture of the Web Crawler project.
```mermaid
graph TD
    User([User]) -->|Static Assets| Frontend
    User -->|/api/*| Frontend
    Frontend["Frontend<br/>(nginx + React SPA)<br/>Port: 8080"]
    Frontend -->|Reverse Proxy /api/*| Manager
    Manager["Manager<br/>(Axum HTTP API)<br/>Port: 8080"]
    Manager -->|Reads / Writes| Neo4j
    Neo4j[("Neo4j<br/>Graph Database<br/>Bolt: 7687")]
    Feeder -->|Claims / Updates| Neo4j
    Feeder["Feeder (x8)<br/>(Background Workers)<br/>Poll loop + backoff"]
```
- Technology: nginx serving a React SPA (Vite + TypeScript + Tailwind CSS)
- Port: 8080
- Role: User-facing interface for managing and monitoring crawls
- Routing: nginx serves `index.html` for all unmatched paths (SPA fallback), reverse-proxies `/api/*` requests to the Manager service, and proxies WebSocket upgrades for real-time progress
- Static asset caching: JS, CSS, images, and fonts are served with a 1-year `Cache-Control: public, immutable` header
- Technology: Rust + Axum HTTP framework
- Port: 8080 (mapped to service port 80 via ClusterIP)
- Role: API server that accepts crawl requests, queries crawl status, and serves graph data
- Key responsibilities:
- Accept and validate new crawl requests
- Fetch the root URL's HTML and extract child URLs
- DNS-resolve all extracted URLs in parallel
- Create the ROOT node and initial child URL nodes in Neo4j
- Serve REST endpoints for listing, querying, and cancelling crawls
- Provide WebSocket connections for real-time crawl progress
- Health/readiness probes for Kubernetes
- Technology: Rust background worker (no public HTTP API; only a health endpoint on port 8081)
- Replicas: 8 (configurable via `feeder.replicaCount`)
- Role: Process pending URL jobs from Neo4j
- Key responsibilities:
- Poll Neo4j for PENDING URL nodes
- Atomically claim jobs using `SET n.job_status = 'IN-PROGRESS'`
- Fetch HTML, extract URLs, resolve DNS, create child nodes
- Reclaim stale jobs stuck in IN-PROGRESS beyond a configurable timeout
- Gracefully reset claimed jobs to PENDING on shutdown
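The atomic claim in the list above can be sketched as a single Cypher statement. This is a guess at its shape, not the project's actual query: the `Url` label, the `depth`/`$max_depth` names, and the `RETURN` clause are hypothetical; only `job_status` and `claimed_at` come from the text.

```rust
// Assumed shape of the Feeder's atomic claim query. Only `job_status` and
// `claimed_at` are documented above; every other name is a placeholder.
const CLAIM_JOB: &str = "
MATCH (n:Url {job_status: 'PENDING'})
WHERE n.depth < $max_depth
WITH n LIMIT 1
SET n.job_status = 'IN-PROGRESS', n.claimed_at = timestamp()
RETURN n";

fn main() {
    // Matching and updating in one statement is what makes the claim a
    // single unit of work: the SET takes a write lock on the node, so two
    // feeders cannot both flip the same node to IN-PROGRESS.
    println!("{CLAIM_JOB}");
}
```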
- Version: 5.x (deployed via the official Neo4j Helm chart)
- Ports: HTTP browser (7474, NodePort 30074), Bolt protocol (7687, NodePort 30087)
- Role: Graph database storing crawl data — URL nodes and Lead relationships
```mermaid
sequenceDiagram
    participant U as User
    participant F as Frontend
    participant M as Manager
    participant N as Neo4j
    U->>F: Submit crawl form
    F->>M: POST /api/v1/crawls
    M->>M: Normalize URL
    M->>M: Fetch root page HTML
    M->>M: Extract URLs from anchors
    M->>M: DNS-resolve all URLs (parallel)
    M->>N: CREATE ROOT + child URL nodes (transaction)
    M-->>F: crawl_id + status response
```
```mermaid
sequenceDiagram
    participant F as Feeder
    participant N as Neo4j
    participant W as Web
    loop Poll loop with exponential backoff
        F->>N: Query PENDING URL (depth not at max)
        N-->>F: URL job (atomically SET IN-PROGRESS)
        F->>W: Fetch URL HTML content
        W-->>F: HTML response
        F->>F: Extract child URLs from HTML
        F->>N: Filter out existing URLs (Cypher dedup)
        F->>F: Normalize + DNS-resolve new URLs (parallel)
        F->>N: Batch-create child URL nodes + Lead edges (transaction)
        F->>N: SET job_status = 'COMPLETED'
    end
```
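The "Filter out existing URLs" step above pushes deduplication to the database side. One plausible shape for that query is sketched below; the `$urls` parameter, `Url` label, and `name` property are guesses from the surrounding description, not the project's actual Cypher:

```rust
// Hypothetical server-side dedup: send the candidate URLs and get back only
// those with no matching node already stored. All names are assumptions.
const FILTER_NEW_URLS: &str = "
UNWIND $urls AS candidate
OPTIONAL MATCH (existing:Url {name: candidate})
WITH candidate, existing
WHERE existing IS NULL
RETURN candidate";

fn main() {
    // The feeder would run this before batch-creating child nodes, so it
    // only resolves DNS for URLs the database has never seen.
    println!("{FILTER_NEW_URLS}");
}
```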
| Request | View |
|---|---|
| `GET /api/v1/crawls` | Crawl list with progress bars |
| `GET /api/v1/crawls/:id` | Crawl detail (progress tab) |
| `WS /api/v1/crawls/:id/ws` | WebSocket for real-time progress |
| `GET /api/v1/crawls/:id/graph` | Force-directed graph visualization |
| `GET /api/v1/crawls/:id/stats` | Aggregate statistics |
Each URL goes through this pipeline (in both Manager and Feeder):
```mermaid
flowchart LR
    A[Raw URL] --> B[1. Normalize]
    B --> C[2. Fetch HTML]
    C --> D[3. Extract URLs]
    D --> E[4. Deduplicate]
    E --> F[5. DNS Resolve]
    F --> G[6. Store in Neo4j]
```
- Normalize — Strip fragments, uppercase hostname, remove trailing slashes, split into `(name, http_type)`, e.g. `("EXAMPLE.COM/PATH", "HTTPS://")`
- Fetch HTML — GET request with timeout and user-agent header; records elapsed time as `request_time`
- Extract URLs — Regex extraction of `href` values from `<a>` tags, filtered to http/https only
- Deduplicate — (Feeder only) Server-side Cypher query filters out URLs already in the database
- DNS Resolve — Resolve hostname to IP address and domain; follows CNAME chains up to `max_dns_depth`; parallel resolution for all child URLs
- Store in Neo4j — Create URL nodes with properties + Lead relationships; uses `MERGE` to prevent duplicates from concurrent feeders
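As a concrete illustration of step 1, here is a minimal std-only sketch of the normalization rules listed above. It is an assumption-laden toy, not the project's implementation (which may well use the `url` crate and handle more edge cases):

```rust
/// Toy version of step 1 (Normalize): strip the fragment, uppercase,
/// drop trailing slashes, and split into (name, http_type). Uppercasing
/// the whole remainder (path included) follows the ("EXAMPLE.COM/PATH",
/// "HTTPS://") example above; the real code may treat paths differently.
fn normalize(raw: &str) -> (String, String) {
    // Strip the fragment, if any.
    let no_frag = raw.split('#').next().unwrap_or(raw);
    // Split the scheme off; assume http when none is present.
    let (scheme, rest) = match no_frag.split_once("://") {
        Some((s, r)) => (s, r),
        None => ("http", no_frag),
    };
    // Remove trailing slashes and uppercase what remains.
    let name = rest.trim_end_matches('/').to_uppercase();
    let http_type = format!("{}://", scheme.to_uppercase());
    (name, http_type)
}

fn main() {
    println!("{:?}", normalize("https://Example.com/path/#section"));
    // → ("EXAMPLE.COM/PATH", "HTTPS://")
}
```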
- 8 feeder replicas run independently, each executing a single-threaded poll loop
- Atomic job claiming: each feeder claims exactly one job at a time using a single Cypher query that matches a PENDING URL and atomically sets it to IN-PROGRESS with a `claimed_at` timestamp
- Stale job reclamation: if no PENDING jobs exist, feeders look for IN-PROGRESS jobs where `claimed_at` exceeds the configurable stale timeout (default: 10 minutes) and reclaim them — this handles feeder crashes without requiring a separate reaper process
- Exponential backoff: when no work is found, the poll interval doubles from `poll_min_ms` (default: 100ms) up to `poll_max_ms` (default: 30s), then resets immediately when work is found
- Duplicate prevention: child URL creation uses `MERGE` (not `CREATE`) so concurrent feeders discovering the same URL only create one node
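The backoff policy fits in a few lines. The `poll_min_ms`/`poll_max_ms` defaults below come from the text; the function itself is a sketch of the policy, not the Feeder's code:

```rust
/// Sketch of the Feeder's poll-interval policy: double on an empty poll,
/// cap at poll_max_ms, reset to poll_min_ms the moment a job is claimed.
/// Defaults are the documented 100 ms and 30 s.
fn next_interval_ms(current_ms: u64, found_work: bool) -> u64 {
    const POLL_MIN_MS: u64 = 100;
    const POLL_MAX_MS: u64 = 30_000;
    if found_work {
        POLL_MIN_MS // reset immediately once work was found
    } else {
        (current_ms * 2).min(POLL_MAX_MS) // double, capped at the max
    }
}

fn main() {
    // An idle feeder walks 100 → 200 → 400 → … and settles at the cap.
    let mut ms = 100;
    for _ in 0..10 {
        ms = next_interval_ms(ms, false);
    }
    println!("{ms}"); // prints 30000
}
```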
Both Manager and Feeder handle SIGTERM / Ctrl+C:
- Manager: Axum's `with_graceful_shutdown` drains in-flight HTTP connections before exiting
- Feeder:
  - A `watch` channel signals the main loop to stop
  - The current iteration completes (the job in progress finishes processing)
  - If a job was claimed but not yet completed, it is reset to PENDING (`job_status = 'PENDING'`, `claimed_at = NULL`) so another feeder can pick it up immediately
  - The process exits cleanly
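The Feeder's shutdown sequence can be modeled roughly as follows. This std-only sketch swaps in an `AtomicBool` for the real tokio `watch` channel and an enum for the Neo4j-side `job_status` property, purely for illustration:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Stand-in for the job_status property stored on the Neo4j node.
#[derive(Debug, PartialEq)]
enum JobStatus {
    Pending,
    InProgress,
}

// Sketch of the Feeder main loop reacting to a shutdown signal. The real
// code listens on a tokio `watch` channel; an AtomicBool models it here.
fn run(shutdown: &AtomicBool, claimed: &mut Option<JobStatus>) {
    while !shutdown.load(Ordering::SeqCst) {
        // ... claim a job, fetch HTML, create child nodes, mark COMPLETED ...
        break; // sketch: pretend one iteration ran
    }
    // On shutdown, a job that was claimed but never finished is handed back
    // (job_status = 'PENDING', claimed_at = NULL) for another feeder.
    if claimed.as_ref() == Some(&JobStatus::InProgress) {
        *claimed = Some(JobStatus::Pending);
    }
}

fn main() {
    let shutdown = AtomicBool::new(true); // SIGTERM already received
    let mut claimed = Some(JobStatus::InProgress);
    run(&shutdown, &mut claimed);
    println!("{claimed:?}"); // prints Some(Pending)
}
```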
The frontend connects to `ws://host/api/v1/crawls/:id/ws` for real-time progress updates.
Server behavior:
- Every 2 seconds, the server queries Neo4j for the crawl's current progress
- Sends a JSON message with the `CrawlProgress` schema (same as `GET /api/v1/crawls/:id`)
- If the crawl status is `"completed"`, the server sends the final message and closes the socket
- If the crawl is not found, sends `{"error": "Crawl not found"}` and closes
Message format:
```json
{
  "crawl_id": "uuid",
  "status": "running",
  "total": 786,
  "completed": 500,
  "pending": 200,
  "in_progress": 26,
  "failed": 60,
  "root_url": "https://example.com",
  "requested_depth": 3
}
```

Connection lifecycle:
- nginx proxies the WebSocket upgrade via `proxy_set_header Upgrade` / `proxy_set_header Connection "upgrade"`
- `proxy_read_timeout 86400s` keeps long-running WebSocket connections alive
- The client reconnects automatically if the connection drops (handled by the React frontend)