Web Scraping API for AI Agents. One binary. Zero dependencies. Built-in MCP server.
The Go alternative to Firecrawl / Crawl4AI — no Python, no Node.js, no Redis. Just docker compose up and go.
```bash
# Start Purify
docker compose up -d

# Scrape a page
curl -s -X POST http://localhost:8080/api/v1/scrape \
  -H "Content-Type: application/json" \
  -d '{"url": "https://news.ycombinator.com"}' | jq .
```

```json
{
  "success": true,
  "content": "# Hacker News\n\n1. Show HN: ...",
  "tokens": {
    "original_estimate": 11708,
    "cleaned_estimate": 5572,
    "savings_percent": 52.4
  },
  "timing": {
    "total_ms": 400
  }
}
```

Two commands. Clean Markdown back in under a second.
| | Purify | Firecrawl | Crawl4AI | Jina Reader |
|---|---|---|---|---|
| Language | Go | TypeScript | Python | N/A (cloud) |
| Self-host | Single binary | 5+ containers (Redis, PG, Playwright…) | pip + Playwright | No self-host docs |
| MCP server | Built-in (5 tools) | Community-maintained | No | No |
| Token savings | 52–99% | ~70–80% | ~75–85% | ~60–70% |
| Recursive crawling | Yes | Yes | Yes | No |
| Batch scrape | Yes | Yes | No | No |
| License | Apache 2.0 | AGPL-3.0 | Apache 2.0 | Partial open source |
| Price (50k req/mo) | $29/mo | $49/mo | Free (local) | $49/mo |
**Option A: Docker**

```bash
docker compose up -d
```

**Option B: Build from source**

```bash
git clone https://github.com/Easonliuliang/purify.git
cd purify && make build
PURIFY_AUTH_ENABLED=false ./bin/purify
```

**Option C: Hosted API (no setup)**

```bash
curl -s -X POST https://purify.verifly.pro/api/v1/scrape \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://news.ycombinator.com"}' | jq .content
```

Get a free API key at purify.verifly.pro — 1,000 requests/month, no credit card.
Measured with tiktoken (GPT-4 tokenizer). Purify strips navigation, ads, scripts, and styling — your LLM only sees the content.
| Website | Raw HTML (tokens) | After Purify (tokens) | Savings | Latency |
|---|---|---|---|---|
| GitHub repo page | 99,181 | 1,370 | 98.6% | 1.1s |
| New York Times | 103,744 | 2,130 | 98.0% | 1.1s |
| Anthropic API Docs | 129,066 | 4,837 | 96.3% | 1.8s |
| Next.js blog (React SPA) | 87,231 | 4,271 | 95.1% | 5.0s |
| BBC News homepage | 97,540 | 6,969 | 92.9% | 2.5s |
| arXiv paper (DeepSeek-R1) | 26,684 | 3,129 | 88.3% | 0.5s |
| Wikipedia (LLM) | 245,276 | 76,325 | 68.9% | 1.5s |
| Hacker News | 11,708 | 5,572 | 52.4% | 0.4s |
| sspai.com | 32,895 | 187 | 99.4% | 1.2s |
| Xiaohongshu (RedNote) | 158,742 | 353 | 99.8% | 1.0s |
Low-savings sites (Hacker News, paulgraham.com) are already minimal — almost pure text with no cruft to remove. That's a feature, not a bug.
- AI agents — Give your agent web access via MCP or REST API
- RAG pipelines — Scrape docs, get clean Markdown, embed into your vector DB
- Trading bots — Scrape prediction markets and news with sub-500ms latency
- Research assistants — Read and summarize any web page
Purify includes a built-in MCP server with 5 tools:
| Tool | Description |
|---|---|
| `scrape_url` | Scrape a single page, return clean content |
| `batch_scrape` | Scrape multiple URLs in parallel |
| `crawl_site` | Recursively crawl a website (BFS) |
| `map_site` | Discover all URLs on a site |
| `extract_data` | Extract structured data with LLM (BYOK) |
Add to your Claude Desktop config (`claude_desktop_config.json`):

```json
{
  "mcpServers": {
    "purify": {
      "command": "purify-mcp",
      "env": {
        "PURIFY_API_URL": "https://purify.verifly.pro",
        "PURIFY_API_KEY": "your-api-key"
      }
    }
  }
}
```

For self-hosted instances, set `PURIFY_API_URL` to `http://localhost:8080`.
Then ask Claude:
- "Scrape https://paulgraham.com/greatwork.html and summarize it."
- "Crawl the Next.js docs site, max 20 pages."
- "Extract the product name and price from this page: ..."
Scrape a single page and return cleaned content. Supports JSON response or SSE streaming.
```json
{
  "url": "https://example.com/article",
  "output_format": "markdown",
  "extract_mode": "readability"
}
```

| Parameter | Type | Default | Description |
|---|---|---|---|
| `url` | string | required | Target URL |
| `output_format` | string | `markdown` | `markdown`, `html`, `text`, or `markdown_citations` |
| `extract_mode` | string | `readability` | `readability`, `raw`, `pruning`, or `auto` |
| `timeout` | int | `30` | Timeout in seconds (1–120) |
| `stealth` | bool | `false` | Anti-detection mode |
| `headers` | object | — | Custom HTTP headers |
| `cookies` | array | — | Cookies to set before navigation |
| `actions` | array | — | Browser interactions (click, scroll, wait, etc.) |
| `include_tags` | array | — | CSS selectors to keep |
| `exclude_tags` | array | — | CSS selectors to remove |
| `css_selector` | string | — | Extract only matching elements |
| `max_age` | int | `0` | Cache max age in ms (0 = no cache) |
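An illustrative request combining several of the options above — the target URL, selector values, and cache age here are placeholders, not defaults:

```json
{
  "url": "https://example.com/pricing",
  "output_format": "markdown",
  "stealth": true,
  "timeout": 60,
  "css_selector": "main.pricing",
  "exclude_tags": [".cookie-banner", "nav"],
  "max_age": 600000
}
```

This scrapes with anti-detection enabled, keeps only elements matching `main.pricing`, strips the cookie banner and navigation, and allows a cached result up to 10 minutes old.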
Response:

```json
{
  "success": true,
  "status_code": 200,
  "final_url": "https://example.com/article",
  "content": "# Article Title\n\nClean markdown content...",
  "metadata": {
    "title": "Article Title",
    "author": "Author Name",
    "language": "en",
    "source_url": "https://example.com/article",
    "fetch_method": "http"
  },
  "links": {
    "internal": [{"href": "/about", "text": "About"}],
    "external": [{"href": "https://github.com/...", "text": "GitHub"}]
  },
  "images": [{"src": "https://example.com/hero.jpg", "alt": "Hero"}],
  "tokens": {
    "original_estimate": 32895,
    "cleaned_estimate": 187,
    "savings_percent": 99.43
  },
  "timing": {
    "total_ms": 1172,
    "navigation_ms": 1162,
    "cleaning_ms": 9
  },
  "engine_used": "http"
}
```

Add an `Accept: text/event-stream` header to receive Server-Sent Events instead of JSON:

```bash
curl -X POST https://purify.verifly.pro/api/v1/scrape \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -H "Accept: text/event-stream" \
  -d '{"url": "https://example.com"}'
```

Events: `scrape.started` → `scrape.navigated` → `scrape.completed` (or `scrape.error`).
Use "output_format": "markdown_citations" to convert inline links to academic-style references:
See [Google][1] and [GitHub][2]
---
[1]: https://google.com
[2]: https://github.comScrape multiple URLs in parallel. Returns a job ID for async polling.
```json
{
  "urls": ["https://a.com", "https://b.com", "https://c.com"],
  "options": {"output_format": "markdown"},
  "webhook_url": "https://your-server.com/callback",
  "webhook_secret": "your-hmac-secret"
}
```

Poll status: `GET /api/v1/batch/:id`
Recursively crawl a website starting from a URL.
```json
{
  "url": "https://docs.example.com",
  "max_depth": 3,
  "max_pages": 100,
  "scope": "subdomain",
  "webhook_url": "https://your-server.com/callback",
  "webhook_secret": "your-hmac-secret"
}
```

Poll status: `GET /api/v1/crawl/:id`
Discover all URLs on a site without scraping content.
{"url": "https://example.com"}Structured data extraction using your own LLM key (BYOK).
```bash
curl -X POST https://purify.verifly.pro/api/v1/extract \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/product",
    "schema": {
      "name": "string",
      "price": "number",
      "features": ["string"]
    },
    "llm_api_key": "your-openai-key"
  }'
```

Batch and Crawl endpoints support webhook notifications. When a job completes, Purify sends a POST request to your `webhook_url` with an HMAC-SHA256 signature in the `X-Purify-Signature` header.
Events: batch.completed, crawl.page, crawl.completed, crawl.failed
Verify the signature:
```
HMAC-SHA256(webhook_secret, request_body) == X-Purify-Signature (sha256=<hex>)
```
Returns server status and uptime.
All configuration via environment variables:
| Variable | Default | Description |
|---|---|---|
| `PURIFY_HOST` | `0.0.0.0` | Listen address |
| `PURIFY_PORT` | `8080` | Listen port |
| `PURIFY_AUTH_ENABLED` | `true` | Enable API key authentication |
| `PURIFY_API_KEYS` | — | Comma-separated valid API keys |
| `PURIFY_MAX_PAGES` | `10` | Max concurrent browser tabs |
| `PURIFY_DEFAULT_TIMEOUT` | `30s` | Default scrape timeout |
| `PURIFY_RATE_RPS` | `5` | Rate limit (requests/sec/key) |
| `PURIFY_RATE_BURST` | `10` | Rate limit burst |
| `PURIFY_LOG_LEVEL` | `info` | `debug`, `info`, `warn`, `error` |
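For Docker deployments, the same variables go in your compose file. A sketch — the image name and chosen values are placeholders; only the variable names come from the table above:

```yaml
# Hypothetical docker-compose.yml for a self-hosted instance.
services:
  purify:
    image: purify:latest   # placeholder image name
    ports:
      - "8080:8080"
    environment:
      PURIFY_AUTH_ENABLED: "true"
      PURIFY_API_KEYS: "key-one,key-two"
      PURIFY_MAX_PAGES: "20"
      PURIFY_RATE_RPS: "10"
      PURIFY_LOG_LEVEL: "info"
```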
Purify is a single Go binary. No Docker required, no Redis, no database.
```bash
# Local development (no auth)
PURIFY_AUTH_ENABLED=false ./bin/purify

# Production (with API key)
PURIFY_API_KEYS=your-secret-key ./bin/purify
```

Runs on any $5/month VPS. No usage limits when self-hosted.
- Any Linux, macOS, or Windows machine
- ~15 MB disk space
- ~30 MB RAM idle
| | Free | Pro |
|---|---|---|
| Price | $0/mo | $29/mo |
| Requests | 1,000/mo | 50,000/mo |
| Concurrent | 2 | 10 |
| MCP server | ✓ | ✓ |
| Structured extraction | ✓ | ✓ |
Contributions welcome. Please open an issue first to discuss what you'd like to change.
Apache 2.0 — use it however you want, commercially or otherwise. No AGPL restrictions.