The application consists of two microservices:
- Frontend (port 16700): React + Elastic UI components served by nginx
- Crawler Service (port 8000): FastAPI backend that manages Elastic Open Web Crawler (JRuby) processes
The frontend communicates with the backend via REST API, proxied through nginx.
```mermaid
graph TB
    subgraph Frontend["Frontend :16700"]
        UI[React 18 + Elastic UI]
        NGINX[nginx]
        UI --> NGINX
    end
    subgraph Backend["Crawler Service :8000"]
        API[server.py<br/>FastAPI + Pydantic]
        CrawlerMgr[crawler.py<br/>Process Manager]
        subgraph Processes["Spawned Processes<br/>(subprocess.Popen)"]
            C1[Elastic Open Web Crawler<br/>Process 1]
            C2[Elastic Open Web Crawler<br/>Process 2]
            C3[Elastic Open Web Crawler<br/>Process 3]
        end
        DB[db.py<br/>Abstraction Layer]
        Models[models.py<br/>Pydantic Models]
        API --> CrawlerMgr
        CrawlerMgr -->|spawn/manage| C1
        CrawlerMgr -->|spawn/manage| C2
        CrawlerMgr -->|spawn/manage| C3
        API --> DB
        API --> Models
        CrawlerMgr --> DB
    end
    subgraph Storage["Data Storage"]
        Files[/crawler/crawls/]
        DB --> Files
    end
    ES[Elasticsearch]
    C1 --> ES
    C2 --> ES
    C3 --> ES
    NGINX -->|REST API| API
    style Frontend fill:#1893FF
    style Backend fill:#FFDF56
    style Storage fill:#F990C6
```
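As the diagram shows, `crawler.py` runs each crawl as a separate OS process via `subprocess.Popen`. A minimal sketch of what that spawn/cancel logic could look like — the function names, the crawler command line, and the directory handling here are illustrative assumptions, not the actual `crawler.py`:

```python
import subprocess
import uuid
from pathlib import Path

def spawn_crawl(crawler_cmd: list[str], crawls_dir: Path) -> tuple[str, subprocess.Popen]:
    """Spawn one crawler run; stdout/stderr stream to logs.log in its run dir.

    crawler_cmd is the crawler invocation, e.g. ["bin/crawler", "crawl", "config.yml"]
    (an assumed CLI shape -- check the real crawler.py for the exact command).
    """
    execution_id = str(uuid.uuid4())
    run_dir = crawls_dir / execution_id
    run_dir.mkdir(parents=True, exist_ok=True)
    log_file = (run_dir / "logs.log").open("w")  # file handle inherited by the child
    proc = subprocess.Popen(crawler_cmd, stdout=log_file, stderr=subprocess.STDOUT)
    return execution_id, proc

def cancel_crawl(proc: subprocess.Popen, timeout: float = 5.0) -> int:
    """Ask a running crawl to stop; escalate to SIGKILL if SIGTERM is ignored."""
    proc.terminate()
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()
        return proc.wait()
```

Redirecting the child's stdout/stderr straight into `logs.log` is what makes the later `GET /api/logs/{execution_id}` endpoint a plain file read.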
Crawl data is persisted to the filesystem in `/crawler/crawls/{execution_id}/`:

- `info.json` - Contains crawl configuration, status, statistics, and results
- `logs.log` - Raw crawler execution logs
The db.py module provides an abstraction layer that can be refactored to use Elasticsearch or another database without affecting the rest of the application.
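A minimal sketch of what such an abstraction could look like — the `CrawlStore` protocol and `FileCrawlStore` names are illustrative, not the actual `db.py` API:

```python
import json
from pathlib import Path
from typing import Protocol

class CrawlStore(Protocol):
    """Interface callers code against; a backend swap (e.g. to Elasticsearch)
    only needs a new implementation of these two methods."""
    def save(self, execution_id: str, status: dict) -> None: ...
    def load(self, execution_id: str) -> dict: ...

class FileCrawlStore:
    """File-based backend: one info.json per execution id."""

    def __init__(self, root: Path) -> None:
        self.root = root

    def _path(self, execution_id: str) -> Path:
        return self.root / execution_id / "info.json"

    def save(self, execution_id: str, status: dict) -> None:
        path = self._path(execution_id)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(status, indent=2))

    def load(self, execution_id: str) -> dict:
        return json.loads(self._path(execution_id).read_text())
```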
Backend:

- `server.py` - FastAPI application with REST endpoints
- `crawler.py` - Manages crawler process execution and lifecycle
- `db.py` - Database abstraction layer (currently file-based JSON)
- `models.py` - Pydantic models for type-safe data handling
- `elasticsearch_client.py` - Elasticsearch integration utilities
Frontend:

- `App.jsx` - Main application with crawl configuration form
- `CrawlRuns.jsx` - Lists all crawl executions with status
- `CrawlLogs.jsx` - Modal for viewing crawler logs
- Asynchronous Crawling: Submit crawl jobs that run in the background
- Real-time Monitoring: Track crawl progress with live status updates
- Process Management: Cancel running crawls
- Execution History: View all past crawls with configuration and results
- Log Streaming: Access detailed crawler logs for each execution
- Flexible Configuration: Customize crawl depth, URL limits, and extraction rules
- Elasticsearch Integration: Index crawled content directly to Elasticsearch
- Frontend: React 18 + Elastic UI + Vite + nginx
- Backend: FastAPI (Python 3.11) + Pydantic
- Crawler: Elastic Open Web Crawler (JRuby)
- Data Storage: File-based JSON (with abstraction for future DB migration)
This starts the Crawler Service and the Frontend with a Vite proxy, so changes are reflected immediately:

```shell
docker-compose up --build
```

Production builds the frontend and serves it with nginx instead of the Vite proxy:

```shell
docker-compose -f docker-compose.prod.yml up --build
```

Access the UI at: http://localhost:16700
```shell
ES_URL=https://your-elasticsearch-url
ES_API_KEY=your-api-key
PORT=8000
```

Configure crawls via JSON with the following options:
```json
{
  "domains": [
    {
      "url": "https://example.com",
      "seed_urls": ["https://example.com/start"]
    }
  ],
  "output_index": "my-crawl-index",
  "max_crawl_depth": 3,
  "max_unique_url_count": 500,
  "max_duration_seconds": 3600
}
```

You can specify Elasticsearch credentials per-crawl or use the default environment variables.
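A config like the one above could be submitted from Python roughly as follows — a stdlib-only sketch that assumes the service is reachable at whatever base URL you pass in (e.g. `http://localhost:8000`):

```python
import json
import urllib.request

def build_crawl_request(base_url: str, config: dict) -> urllib.request.Request:
    """Build the POST /api/crawl request for a crawl config."""
    return urllib.request.Request(
        f"{base_url}/api/crawl",
        data=json.dumps(config).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def start_crawl(base_url: str, config: dict) -> dict:
    """Submit a crawl job and return the initial CrawlStatus as a dict."""
    with urllib.request.urlopen(build_crawl_request(base_url, config)) as resp:
        return json.loads(resp.read())

# Example (requires the service to be running):
#   status = start_crawl("http://localhost:8000", {"domains": [{"url": "https://example.com"}]})
#   print(status["execution_id"])
```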
- `POST /api/crawl` - Start a new crawl job
  - Request body: `CrawlConfig` JSON
  - Returns: Initial `CrawlStatus`
- `POST /api/cancel/{execution_id}` - Cancel a running crawl
  - Returns: Updated `CrawlStatus`
- `GET /api/status/{execution_id}` - Get crawl status
  - Returns: Complete `CrawlStatus` with config, stats, and result
- `GET /api/crawls` - List all crawls (sorted by most recent)
  - Returns: Dictionary of execution IDs to `CrawlStatus` objects
- `GET /api/logs/{execution_id}` - Get crawler logs
  - Returns: Plain text log content
- `GET /api/info/{execution_id}` - Get complete crawl information
  - Returns: Raw JSON from `info.json`
- `GET /api/health` - Service health check
  - Returns: `{"status": "healthy", "service": "crawly", "version": "1.0.0"}`
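Since crawls run asynchronously, a typical client submits a job and then polls the status endpoint until it reaches a final state. A sketch of that loop — the helper names are mine; only the endpoint path and the `status` field values come from the API description above:

```python
import json
import time
import urllib.request

# Final states a crawl can end in, per the CrawlStatus "status" field.
TERMINAL_STATES = {"completed", "failed", "cancelled"}

def is_terminal(status: dict) -> bool:
    """True once a crawl has reached a final state."""
    return status["status"] in TERMINAL_STATES

def get_status(base_url: str, execution_id: str) -> dict:
    """Fetch the current CrawlStatus via GET /api/status/{execution_id}."""
    with urllib.request.urlopen(f"{base_url}/api/status/{execution_id}") as resp:
        return json.loads(resp.read())

def wait_for_crawl(base_url: str, execution_id: str, poll_seconds: float = 5.0) -> dict:
    """Poll until the crawl finishes and return its final CrawlStatus."""
    while True:
        status = get_status(base_url, execution_id)
        if is_terminal(status):
            return status
        time.sleep(poll_seconds)
```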
The main object returned by the API, containing all crawl information:
```json
{
  "status": "started|running|completed|failed|cancelled",
  "execution_id": "uuid",
  "started_at": "ISO timestamp",
  "completed_at": "ISO timestamp",
  "message": "Human-readable status message",
  "config": {...},  # CrawlConfig
  "stats": {...},   # CrawlStats (pages visited, documents indexed, duration)
  "result": {...}   # CrawlResult (return code, errors, domains crawled)
}
```

```
app/
├── crawler-service/
│   ├── app/
│   │   ├── server.py                # FastAPI application
│   │   ├── crawler.py               # Crawler process management
│   │   ├── db.py                    # Persists crawls to disk
│   │   ├── models.py                # Pydantic models
│   │   └── elasticsearch_client.py
│   └── Dockerfile
├── frontend/
│   ├── src/
│   │   ├── App.jsx
│   │   └── components/
│   │       ├── CrawlRuns.jsx
│   │       └── CrawlLogs.jsx
│   └── Dockerfile
└── docker-compose.yml
```
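For reference, the `CrawlStatus` payload described earlier maps onto a model roughly like this — sketched with stdlib dataclasses so the example is self-contained; the real `models.py` defines these as Pydantic models, and the exact field types are assumptions:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class CrawlStatus:
    status: str                           # started|running|completed|failed|cancelled
    execution_id: str                     # uuid
    started_at: Optional[str] = None      # ISO timestamp
    completed_at: Optional[str] = None    # ISO timestamp
    message: Optional[str] = None         # human-readable status message
    config: dict = field(default_factory=dict)  # CrawlConfig
    stats: dict = field(default_factory=dict)   # CrawlStats
    result: dict = field(default_factory=dict)  # CrawlResult
```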
MIT © ugosan

