ugosan/elastic-crawler-control

Crawly

Web crawler service and UI built on top of Elastic Open Web Crawler.


Architecture

The application consists of two microservices:

  • Frontend (port 16700): React + Elastic UI components served by nginx
  • Crawler Service (port 8000): FastAPI backend that manages the Elastic Crawler (JRuby)

The frontend communicates with the backend via REST API, proxied through nginx.

graph TB
    subgraph Frontend["Frontend :16700"]
        UI[React 18 + Elastic UI]
        NGINX[nginx]
        UI --> NGINX
    end
    
    subgraph Backend["Crawler Service :8000"]
        API[server.py<br/>FastAPI + Pydantic]
        CrawlerMgr[crawler.py<br/>Process Manager]
        
        subgraph Processes["Spawned Processes<br/>(subprocess.Popen)"]
            C1[Elastic Open Web Crawler <br> Process 1]
            C2[Elastic Open Web Crawler <br> Process 2]
            C3[Elastic Open Web Crawler <br> Process 3]
        end
        
        DB[db.py<br/>Abstraction Layer]
        Models[models.py<br/>Pydantic Models]
        
        API --> CrawlerMgr
        CrawlerMgr -->|spawn/manage| C1
        CrawlerMgr -->|spawn/manage| C2
        CrawlerMgr -->|spawn/manage| C3
        API --> DB
        API --> Models
        CrawlerMgr --> DB
    end
    
    subgraph Storage["Data Storage"]
        Files[/crawler/crawls/]
        DB --> Files
    end
    
    ES[Elasticsearch]
        C1 --> ES
        C2 --> ES
        C3 --> ES
    
    NGINX -->|REST API| API
    
    style Frontend fill:#1893FF
    style Backend fill:#FFDF56
    style Storage fill:#F990C6
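As the diagram notes, crawler.py launches each Elastic Open Web Crawler run as a child process via subprocess.Popen. A minimal sketch of that pattern (the command line below is a placeholder, not the real crawler invocation; function names are illustrative):

```python
import subprocess
import uuid

def spawn_crawler(config_path: str) -> tuple[str, subprocess.Popen]:
    """Spawn one crawl run as a child process; return its id and handle.

    The command here is a stand-in -- the actual Elastic Open Web Crawler
    executable and flags live in crawler.py.
    """
    execution_id = str(uuid.uuid4())
    proc = subprocess.Popen(
        ["echo", "crawl", config_path],  # placeholder command for the sketch
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
    )
    return execution_id, proc

# Cancelling a run then amounts to terminating the child process:
#   proc.terminate(); proc.wait(timeout=10)
```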

Data Persistence

Crawl data is persisted to the filesystem in /crawler/crawls/{execution_id}/:

  • info.json - Contains crawl configuration, status, statistics, and results
  • logs.log - Raw crawler execution logs

The db.py module provides an abstraction layer that can be refactored to use Elasticsearch or another database without affecting the rest of the application.
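A hypothetical sketch of what that abstraction boundary looks like — the function names and layout are illustrative, not the actual db.py API. The point is that the rest of the app only calls save/load, so swapping JSON files for Elasticsearch means reimplementing just these functions:

```python
import json
from pathlib import Path

# /crawler/crawls/ in the container; relative here for the sketch.
CRAWLS_DIR = Path("crawler/crawls")

def save_status(execution_id: str, status: dict) -> None:
    """Persist a crawl's status to crawls/{execution_id}/info.json."""
    run_dir = CRAWLS_DIR / execution_id
    run_dir.mkdir(parents=True, exist_ok=True)
    (run_dir / "info.json").write_text(json.dumps(status, indent=2))

def load_status(execution_id: str) -> dict:
    """Read a crawl's status back from its info.json."""
    return json.loads((CRAWLS_DIR / execution_id / "info.json").read_text())
```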

Key Components

Backend:

  • server.py - FastAPI application with REST endpoints
  • crawler.py - Manages crawler process execution and lifecycle
  • db.py - Database abstraction layer (currently file-based JSON)
  • models.py - Pydantic models for type-safe data handling
  • elasticsearch_client.py - Elasticsearch integration utilities

Frontend:

  • App.jsx - Main application with crawl configuration form
  • CrawlRuns.jsx - Lists all crawl executions with status
  • CrawlLogs.jsx - Modal for viewing crawler logs

Features

  • Asynchronous Crawling: Submit crawl jobs that run in the background
  • Real-time Monitoring: Track crawl progress with live status updates
  • Process Management: Cancel running crawls
  • Execution History: View all past crawls with configuration and results
  • Log Streaming: Access detailed crawler logs for each execution
  • Flexible Configuration: Customize crawl depth, URL limits, and extraction rules
  • Elasticsearch Integration: Index crawled content directly to Elasticsearch

Stack

  • Frontend: React 18 + Elastic UI + Vite + nginx
  • Backend: FastAPI (Python 3.11) + Pydantic
  • Crawler: Elastic Open Web Crawler (JRuby)
  • Data Storage: File-based JSON (with abstraction for future DB migration)

Quick Start

Development

This starts the Crawler Service and the Frontend with a Vite dev proxy, so changes are reflected immediately:

docker-compose up --build

Production

The production setup performs a full build of the frontend and serves it with nginx instead of the Vite proxy.

docker-compose -f docker-compose.prod.yml up --build

Access the UI at: http://localhost:16700

Configuration

Environment Variables

ES_URL=https://your-elasticsearch-url
ES_API_KEY=your-api-key
PORT=8000
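A sketch of how the service might read these variables at startup (the real code may handle configuration differently):

```python
import os

def load_settings() -> dict:
    """Read service configuration from the environment, with the
    documented defaults. Key names mirror the variables above."""
    return {
        "es_url": os.environ.get("ES_URL", ""),
        "es_api_key": os.environ.get("ES_API_KEY", ""),
        "port": int(os.environ.get("PORT", "8000")),
    }
```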

Crawl Configuration

Configure crawls via JSON with the following options:

{
  "domains": [
    {
      "url": "https://example.com",
      "seed_urls": ["https://example.com/start"]
    }
  ],
  "output_index": "my-crawl-index",
  "max_crawl_depth": 3,
  "max_unique_url_count": 500,
  "max_duration_seconds": 3600
}

You can specify Elasticsearch credentials per-crawl or use the default environment variables.
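For example, a crawl can be started from Python by POSTing this config to /api/crawl (helper names are illustrative; stdlib urllib is used to avoid extra dependencies):

```python
import json
import urllib.request

def build_crawl_request(config: dict,
                        base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build the POST /api/crawl request carrying a CrawlConfig body."""
    return urllib.request.Request(
        f"{base_url}/api/crawl",
        data=json.dumps(config).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def start_crawl(config: dict, base_url: str = "http://localhost:8000") -> dict:
    """Send the request and return the initial CrawlStatus as a dict."""
    with urllib.request.urlopen(build_crawl_request(config, base_url)) as resp:
        return json.loads(resp.read())
```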

API Endpoints

Crawl Management

  • POST /api/crawl - Start a new crawl job

    • Request body: CrawlConfig JSON
    • Returns: Initial CrawlStatus
  • POST /api/cancel/{execution_id} - Cancel a running crawl

    • Returns: Updated CrawlStatus

Status & Monitoring

  • GET /api/status/{execution_id} - Get crawl status

    • Returns: Complete CrawlStatus with config, stats, and result
  • GET /api/crawls - List all crawls (sorted by most recent)

    • Returns: Dictionary of execution IDs to CrawlStatus objects
  • GET /api/logs/{execution_id} - Get crawler logs

    • Returns: Plain text log content
  • GET /api/info/{execution_id} - Get complete crawl information

    • Returns: Raw JSON from info.json
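The status endpoint lends itself to simple polling. A sketch that waits for a crawl to reach a terminal state, with the HTTP call injected as a callable so it can be pointed at GET /api/status/{execution_id} (or at a stub in tests):

```python
import time
from typing import Callable

TERMINAL_STATES = {"completed", "failed", "cancelled"}

def wait_for_crawl(execution_id: str,
                   fetch_status: Callable[[str], dict],
                   interval: float = 2.0,
                   timeout: float = 3600.0) -> dict:
    """Poll the crawl's status until it reaches a terminal state.

    fetch_status should GET /api/status/{execution_id} and return the
    decoded CrawlStatus dict.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = fetch_status(execution_id)
        if status.get("status") in TERMINAL_STATES:
            return status
        time.sleep(interval)
    raise TimeoutError(f"crawl {execution_id} still running after {timeout}s")
```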

Health

  • GET /api/health - Service health check
    • Returns: {"status": "healthy", "service": "crawly", "version": "1.0.0"}

Data Models

CrawlStatus

The main object returned by the API, containing all crawl information:

{
  "status": "started|running|completed|failed|cancelled",
  "execution_id": "uuid",
  "started_at": "ISO timestamp",
  "completed_at": "ISO timestamp",
  "message": "Human-readable status message",
  "config": {...},  # CrawlConfig
  "stats": {...},   # CrawlStats (pages visited, documents indexed, duration)
  "result": {...}   # CrawlResult (return code, errors, domains crawled)
}
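For reference, the same shape expressed as a dependency-free Python dataclass — the real models.py defines this as a Pydantic model with validation, so treat this as a sketch of the fields only:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class CrawlStatus:
    """Sketch of the CrawlStatus shape returned by the API."""
    status: str                          # started|running|completed|failed|cancelled
    execution_id: str                    # uuid
    started_at: Optional[str] = None     # ISO timestamp
    completed_at: Optional[str] = None   # ISO timestamp
    message: str = ""                    # human-readable status message
    config: dict = field(default_factory=dict)  # CrawlConfig
    stats: dict = field(default_factory=dict)   # pages visited, docs indexed, duration
    result: dict = field(default_factory=dict)  # return code, errors, domains crawled
```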

Development

Project Structure

app/
├── crawler-service/
│   ├── app/
│   │   ├── server.py         # FastAPI application
│   │   ├── crawler.py        # Crawler process management
│   │   ├── db.py             # Persists crawls to disk
│   │   ├── models.py         # Pydantic models
│   │   └── elasticsearch_client.py
│   └── Dockerfile
├── frontend/
│   ├── src/
│   │   ├── App.jsx
│   │   └── components/
│   │       ├── CrawlRuns.jsx
│   │       └── CrawlLogs.jsx
│   └── Dockerfile
└── docker-compose.yml

📄 License

MIT © ugosan
