High-throughput atomic scraping and stateful interactive browser sessions with LLM orchestration.
- Stateless Scraper: Fast atomic scraping and Google search transformation (Serper-compatible).
- Stateful Sessions: Interactive browser sessions via DSL over WebSockets with Taskiq Actors.
- AI Integration: Omni-Parser for UI grounding (SoM approach) and Jina Reader for structured markdown extraction.
- Resource Management: Automatic 10-minute inactivity timeout for stateful sessions.
- Modular Design: Clean architecture with layers for API, Domain, Infrastructure, and Actions.
- Language: Python 3.11+
- Framework: FastAPI
- Async Logic: Playwright, Taskiq (Redis Broker)
- AI Tools: Flexible OpenAI-compatible configuration (LM Studio, OpenAI, etc.), Jina Reader V2, Omni-Parser
- Infrastructure: Redis (Pub/Sub and Task Queue), Docker
Detailed directory layout and layer responsibilities are documented in STRUCTURE.md.
- Python 3.11+
- Docker & Docker Compose
- AI Providers: LM Studio (local), OpenAI (cloud), or any OpenAI-compatible API.
- Clone and Setup:
uv init . uv sync - Environment Variables:
Create a
.envfile from.env.example. This service uses separate configurations for Extraction (e.g., Jina Reader LM) and Orchestration (e.g., Reasoning/Navigation):# Extraction Settings (e.g., Local LM Studio) EXTRACTION_API_BASE=http://localhost:1234/v1 EXTRACTION_API_KEY=lm-studio EXTRACTION_MODEL_NAME=jina-reader-lm # Orchestration Settings (e.g., OpenAI) ORCHESTRATION_API_BASE=https://api.openai.com/v1 ORCHESTRATION_API_KEY=sk-... ORCHESTRATION_MODEL_NAME=gpt-4o
- Start Infrastructure:
docker-compose up -d
To run the service with PM2, including the background workers:
pm2 start ecosystem.config.jsYou can use the provided test_ws.py to verify the WebSocket session and DSL commands:
- Ensure the server is running.
- Install test dependencies:
pip install websockets httpx. - Run the test:
python test_ws.py.
Run the full test suite (including contract and integration tests):
uv run python -m pytest testsThe project includes an MCP (Model Context Protocol) server that exposes all web interactions as tools for LLMs.
Session tools (session_goto, session_screenshot, etc.) communicate with the backend via POST /sessions/{session_id}/command — a regular HTTP endpoint — instead of WebSocket. This is intentional: MCP runs over stdio and cannot maintain a persistent WS connection. The WebSocket endpoint (/ws/{session_id}) remains available for other clients.
To run the MCP server manually from any directory:
cd C:/[repo_path]/Atomic-Scraper-Service && uv run python -m src.mcp_serverAdd this to your claude_desktop_config.json (or equivalent MCP config):
{
"mcpServers": {
"atomic-scraper": {
"command": "uv",
"args": [
"run",
"--project",
"C:/[repo_path]/Atomic-Scraper-Service",
"python",
"C:/[repo_path]/Atomic-Scraper-Service/src/mcp_server.py"
],
"env": {
"API_KEY": "your_internal_key"
}
}
}
}- Stateless:
scrape,search,omni_parse,jina_extract - Session Management:
create_session,delete_session - Interactive (DSL):
session_goto,session_scroll,session_click,session_type,session_screenshot,session_click_omni,session_extract_jina
The JSON DSL controls browser sessions. Commands can be sent two ways:
- Start a Session:
POST /sessions(requiresX-API-Key) -> returnssession_id. - Send Commands:
POST /sessions/{session_id}/commandwith body{"type": "<command>", "params": { ... }}. - Result is returned directly in the HTTP response (blocks up to 60 s).
- Start a Session:
POST /sessions-> returnssession_id. - Connect:
ws://localhost:8000/ws/{session_id}. - Send Commands: Send JSON frames
{"type": "<command>", "params": { ... }}. Results arrive as push frames on the same connection.
goto:{"url": "https://..."}- Navigate to a URL.scroll:{"direction": "down", "amount": 500}- Scroll the page.click_coord:{"x": 0.5, "y": 0.2}- Click relative coordinates.type:{"selector": "input", "text": "hello"}- Type into a selector.screenshot:{}- Capture base64 screenshot.
click_omni:{"element_description": "the login button"}- AI-based grounding.extract_jina:{"extraction_schema": {...}}- Structured extraction.