Atomic Scraper Service

High-throughput atomic scraping and stateful interactive browser sessions with LLM orchestration.

Features

Stateless Scraper: Fast atomic scraping and Google search transformation (Serper-compatible).
Stateful Sessions: Interactive browser sessions via DSL over WebSockets with Taskiq Actors.
AI Integration: Omni-Parser for UI grounding (SoM approach) and Jina Reader for structured markdown extraction.
Resource Management: Automatic 10-minute inactivity timeout for stateful sessions.
Modular Design: Clean architecture with layers for API, Domain, Infrastructure, and Actions.

Tech Stack

Language: Python 3.11+
Framework: FastAPI
Async Logic: Playwright, Taskiq (Redis Broker)
AI Tools: Flexible OpenAI-compatible configuration (LM Studio, OpenAI, etc.), Jina Reader V2, Omni-Parser
Infrastructure: Redis (Pub/Sub and Task Queue), Docker

Project Structure

Detailed directory layout and layer responsibilities are documented in STRUCTURE.md.

Quickstart

Prerequisites

Python 3.11+
Docker & Docker Compose
AI Providers: LM Studio (local), OpenAI (cloud), or any OpenAI-compatible API.

Installation

Clone and Setup:
```
uv init .
uv sync
```

Environment Variables: Create a .env file from .env.example. This service uses separate configurations for Extraction (e.g., Jina Reader LM) and Orchestration (e.g., Reasoning/Navigation):

# Extraction Settings (e.g., Local LM Studio)
EXTRACTION_API_BASE=http://localhost:1234/v1
EXTRACTION_API_KEY=lm-studio
EXTRACTION_MODEL_NAME=jina-reader-lm

# Orchestration Settings (e.g., OpenAI)
ORCHESTRATION_API_BASE=https://api.openai.com/v1
ORCHESTRATION_API_KEY=sk-...
ORCHESTRATION_MODEL_NAME=gpt-4o

Start Infrastructure:
```
docker-compose up -d
```

Run API (PM2)

To run the service with PM2, including the background workers:

pm2 start ecosystem.config.js

Manual Testing

You can use the provided test_ws.py to verify the WebSocket session and DSL commands:

Ensure the server is running.
Install test dependencies: pip install websockets httpx.
Run the test: python test_ws.py.

Testing

Run the full test suite (including contract and integration tests):

uv run python -m pytest tests

MCP Server

The project includes an MCP (Model Context Protocol) server that exposes all web interactions as tools for LLMs.

Session tools (session_goto, session_screenshot, etc.) communicate with the backend via POST /sessions/{session_id}/command — a regular HTTP endpoint — instead of WebSocket. This is intentional: MCP runs over stdio and cannot maintain a persistent WS connection. The WebSocket endpoint (/ws/{session_id}) remains available for other clients.

Running the MCP Server

To run the MCP server manually from any directory:

cd C:/[repo_path]/Atomic-Scraper-Service && uv run python -m src.mcp_server

Claude Desktop / OpenCode Configuration

Add this to your claude_desktop_config.json (or equivalent MCP config):

{
  "mcpServers": {
    "atomic-scraper": {
      "command": "uv",
      "args": [
        "run",
        "--project",
        "C:/[repo_path]/Atomic-Scraper-Service",
        "python",
        "C:/[repo_path]/Atomic-Scraper-Service/src/mcp_server.py"
      ],
      "env": {
        "API_KEY": "your_internal_key"
      }
    }
  }
}

Available Tools

Stateless: scrape, search, omni_parse, jina_extract
Session Management: create_session, delete_session
Interactive (DSL): session_goto, session_scroll, session_click, session_type, session_screenshot, session_click_omni, session_extract_jina

JSON DSL Guide

The JSON DSL controls browser sessions. Commands can be sent two ways:

Option A: HTTP (recommended for MCP / programmatic clients)

Start a Session: POST /sessions (requires X-API-Key) -> returns session_id.
Send Commands: POST /sessions/{session_id}/command with body {"type": "<command>", "params": { ... }}.
Result is returned directly in the HTTP response (blocks up to 60 s).

Option B: WebSocket (recommended for interactive / streaming clients)

Start a Session: POST /sessions -> returns session_id.
Connect: ws://localhost:8000/ws/{session_id}.
Send Commands: Send JSON frames {"type": "<command>", "params": { ... }}. Results arrive as push frames on the same connection.

Core Commands

goto: {"url": "https://..."} - Navigate to a URL.
scroll: {"direction": "down", "amount": 500} - Scroll the page.
click_coord: {"x": 0.5, "y": 0.2} - Click relative coordinates.
type: {"selector": "input", "text": "hello"} - Type into a selector.
screenshot: {} - Capture base64 screenshot.

AI-Enhanced

click_omni: {"element_description": "the login button"} - AI-based grounding.
extract_jina: {"extraction_schema": {...}} - Structured extraction.

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
.opencode/command		.opencode/command
.specify		.specify
specs/009-smart-scraping-llm-api		specs/009-smart-scraping-llm-api
src		src
tests		tests
.dockerignore		.dockerignore
.env.example		.env.example
.gitignore		.gitignore
.python-version		.python-version
AGENTS.md		AGENTS.md
LICENSE		LICENSE
README.md		README.md
STRUCTURE.md		STRUCTURE.md
docker-compose.yml		docker-compose.yml
ecosystem.config.js		ecosystem.config.js
kill_mcp.py		kill_mcp.py
main.py		main.py
playwright_orchestrator_plan_2.md		playwright_orchestrator_plan_2.md
pyproject.toml		pyproject.toml
restart_mcp.py		restart_mcp.py
test_all_interactions.py		test_all_interactions.py
test_keys.py		test_keys.py
test_ws.py		test_ws.py
uv.lock		uv.lock
web_interactions.md		web_interactions.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Atomic Scraper Service

Features

Tech Stack

Project Structure

Quickstart

Prerequisites

Installation

Run API (PM2)

Manual Testing

Testing

MCP Server

Running the MCP Server

Claude Desktop / OpenCode Configuration

Available Tools

JSON DSL Guide

Option A: HTTP (recommended for MCP / programmatic clients)

Option B: WebSocket (recommended for interactive / streaming clients)

Core Commands

AI-Enhanced

Documentation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Atomic Scraper Service

Features

Tech Stack

Project Structure

Quickstart

Prerequisites

Installation

Run API (PM2)

Manual Testing

Testing

MCP Server

Running the MCP Server

Claude Desktop / OpenCode Configuration

Available Tools

JSON DSL Guide

Option A: HTTP (recommended for MCP / programmatic clients)

Option B: WebSocket (recommended for interactive / streaming clients)

Core Commands

AI-Enhanced

Documentation

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages