Purify

Web Scraping API for AI Agents. One binary. Zero dependencies. Built-in MCP server.

The Go alternative to Firecrawl / Crawl4AI — no Python, no Node.js, no Redis. Just docker compose up and go.

Get Free API Key · API Docs · MCP Server · Self-Host

See it work

# Start Purify
docker compose up -d

# Scrape a page
curl -s -X POST http://localhost:8080/api/v1/scrape \
  -H "Content-Type: application/json" \
  -d '{"url": "https://news.ycombinator.com"}' | jq .

{
  "success": true,
  "content": "# Hacker News\n\n1. Show HN: ...",
  "tokens": {
    "original_estimate": 11708,
    "cleaned_estimate": 5572,
    "savings_percent": 52.4
  },
  "timing": {
    "total_ms": 400
  }
}

Two commands. Clean Markdown back in under a second.

How it compares

	Purify	Firecrawl	Crawl4AI	Jina Reader
Language	Go	TypeScript	Python	N/A (cloud)
Self-host	Single binary	5+ containers (Redis, PG, Playwright…)	pip + Playwright	No self-host docs
MCP server	Built-in (5 tools)	Community-maintained	No	No
Token savings	52–99%	~70–80%	~75–85%	~60–70%
Recursive crawling	Yes	Yes	Yes	No
Batch scrape	Yes	Yes	No	No
License	Apache 2.0	AGPL-3.0	Apache 2.0	Partial open source
Price (50k req/mo)	$29/mo	$49/mo	Free (local)	$49/mo

Quick start

Option A: Docker

docker compose up -d

Option B: Build from source

git clone https://github.com/Easonliuliang/purify.git
cd purify && make build
PURIFY_AUTH_ENABLED=false ./bin/purify

Option C: Hosted API (no setup)

curl -s -X POST https://purify.verifly.pro/api/v1/scrape \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://news.ycombinator.com"}' | jq .content

Get a free API key at purify.verifly.pro — 1,000 requests/month, no credit card.

Token savings — real numbers

Measured with tiktoken (GPT-4 tokenizer). Purify strips navigation, ads, scripts, and styling — your LLM only sees the content.

Website	Raw HTML	After Purify	Savings	Latency
GitHub repo page	99,181	1,370	98.6%	1.1s
New York Times	103,744	2,130	98.0%	1.1s
Anthropic API Docs	129,066	4,837	96.3%	1.8s
Next.js blog (React SPA)	87,231	4,271	95.1%	5.0s
BBC News homepage	97,540	6,969	92.9%	2.5s
arXiv paper (DeepSeek-R1)	26,684	3,129	88.3%	0.5s
Wikipedia (LLM)	245,276	76,325	68.9%	1.5s
Hacker News	11,708	5,572	52.4%	0.4s
sspai.com	32,895	187	99.4%	1.2s
Xiaohongshu (RedNote)	158,742	353	99.8%	1.0s

Low-savings sites (Hacker News, paulgraham.com) are already minimal — almost pure text with no cruft to remove. That's a feature, not a bug.

Use cases

AI agents — Give your agent web access via MCP or REST API
RAG pipelines — Scrape docs, get clean Markdown, embed into your vector DB
Trading bots — Scrape prediction markets and news with sub-500ms latency
Research assistants — Read and summarize any web page

MCP server

Purify includes a built-in MCP server with 5 tools:

Tool	Description
`scrape_url`	Scrape a single page, return clean content
`batch_scrape`	Scrape multiple URLs in parallel
`crawl_site`	Recursively crawl a website (BFS)
`map_site`	Discover all URLs on a site
`extract_data`	Extract structured data with LLM (BYOK)

Setup

Add to your Claude Desktop config (claude_desktop_config.json):

{
  "mcpServers": {
    "purify": {
      "command": "purify-mcp",
      "env": {
        "PURIFY_API_URL": "https://purify.verifly.pro",
        "PURIFY_API_KEY": "your-api-key"
      }
    }
  }
}

For self-hosted instances, set PURIFY_API_URL to http://localhost:8080.

Then ask Claude:

"Scrape https://paulgraham.com/greatwork.html and summarize it."
"Crawl the Next.js docs site, max 20 pages."
"Extract the product name and price from this page: ..."

API

POST /api/v1/scrape

Scrape a single page and return cleaned content. Supports JSON response or SSE streaming.

{
  "url": "https://example.com/article",
  "output_format": "markdown",
  "extract_mode": "readability"
}

Parameter	Type	Default	Description
`url`	string	required	Target URL
`output_format`	string	`markdown`	`markdown`, `html`, `text`, or `markdown_citations`
`extract_mode`	string	`readability`	`readability`, `raw`, `pruning`, or `auto`
`timeout`	int	`30`	Timeout in seconds (1–120)
`stealth`	bool	`false`	Anti-detection mode
`headers`	object	—	Custom HTTP headers
`cookies`	array	—	Cookies to set before navigation
`actions`	array	—	Browser interactions (click, scroll, wait, etc.)
`include_tags`	array	—	CSS selectors to keep
`exclude_tags`	array	—	CSS selectors to remove
`css_selector`	string	—	Extract only matching elements
`max_age`	int	`0`	Cache max age in ms (0 = no cache)

Response:

{
  "success": true,
  "status_code": 200,
  "final_url": "https://example.com/article",
  "content": "# Article Title\n\nClean markdown content...",
  "metadata": {
    "title": "Article Title",
    "author": "Author Name",
    "language": "en",
    "source_url": "https://example.com/article",
    "fetch_method": "http"
  },
  "links": {
    "internal": [{"href": "/about", "text": "About"}],
    "external": [{"href": "https://github.com/...", "text": "GitHub"}]
  },
  "images": [{"src": "https://example.com/hero.jpg", "alt": "Hero"}],
  "tokens": {
    "original_estimate": 32895,
    "cleaned_estimate": 187,
    "savings_percent": 99.43
  },
  "timing": {
    "total_ms": 1172,
    "navigation_ms": 1162,
    "cleaning_ms": 9
  },
  "engine_used": "http"
}

SSE streaming

Add Accept: text/event-stream header to receive Server-Sent Events instead of JSON:

curl -X POST https://purify.verifly.pro/api/v1/scrape \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -H "Accept: text/event-stream" \
  -d '{"url": "https://example.com"}'

Events: scrape.started → scrape.navigated → scrape.completed (or scrape.error).

Citation format

Use "output_format": "markdown_citations" to convert inline links to academic-style references:

See [Google][1] and [GitHub][2]

---
[1]: https://google.com
[2]: https://github.com

POST /api/v1/batch/scrape

Scrape multiple URLs in parallel. Returns a job ID for async polling.

{
  "urls": ["https://a.com", "https://b.com", "https://c.com"],
  "options": {"output_format": "markdown"},
  "webhook_url": "https://your-server.com/callback",
  "webhook_secret": "your-hmac-secret"
}

Poll status: GET /api/v1/batch/:id

POST /api/v1/crawl

Recursively crawl a website starting from a URL.

{
  "url": "https://docs.example.com",
  "max_depth": 3,
  "max_pages": 100,
  "scope": "subdomain",
  "webhook_url": "https://your-server.com/callback",
  "webhook_secret": "your-hmac-secret"
}

Poll status: GET /api/v1/crawl/:id

POST /api/v1/map

Discover all URLs on a site without scraping content.

{"url": "https://example.com"}

POST /api/v1/extract

Structured data extraction using your own LLM key (BYOK).

curl -X POST https://purify.verifly.pro/api/v1/extract \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/product",
    "schema": {
      "name": "string",
      "price": "number",
      "features": ["string"]
    },
    "llm_api_key": "your-openai-key"
  }'

Webhook callbacks

Batch and Crawl endpoints support webhook notifications. When a job completes, Purify sends a POST request to your webhook_url with HMAC-SHA256 signature in the X-Purify-Signature header.

Events: batch.completed, crawl.page, crawl.completed, crawl.failed

Verify the signature:

HMAC-SHA256(webhook_secret, request_body) == X-Purify-Signature (sha256=<hex>)

GET /api/v1/health

Returns server status and uptime.

Configuration

All configuration via environment variables:

Variable	Default	Description
`PURIFY_HOST`	`0.0.0.0`	Listen address
`PURIFY_PORT`	`8080`	Listen port
`PURIFY_AUTH_ENABLED`	`true`	Enable API key authentication
`PURIFY_API_KEYS`	—	Comma-separated valid API keys
`PURIFY_MAX_PAGES`	`10`	Max concurrent browser tabs
`PURIFY_DEFAULT_TIMEOUT`	`30s`	Default scrape timeout
`PURIFY_RATE_RPS`	`5`	Rate limit (requests/sec/key)
`PURIFY_RATE_BURST`	`10`	Rate limit burst
`PURIFY_LOG_LEVEL`	`info`	`debug`, `info`, `warn`, `error`

Self-hosting

Purify is a single Go binary. No Docker required, no Redis, no database.

# Local development (no auth)
PURIFY_AUTH_ENABLED=false ./bin/purify

# Production (with API key)
PURIFY_API_KEYS=your-secret-key ./bin/purify

Runs on any $5/month VPS. No usage limits when self-hosted.

System requirements

Any Linux, macOS, or Windows machine
~15 MB disk space
~30 MB RAM idle

Pricing

	Free	Pro
Price	$0/mo	$29/mo
Requests	1,000/mo	50,000/mo
Concurrent	2	10
MCP server	✓	✓
Structured extraction	✓	✓

Get started free →

Contributing

Contributions welcome. Please open an issue first to discuss what you'd like to change.

License

Apache 2.0 — use it however you want, commercially or otherwise. No AGPL restrictions.

Name		Name	Last commit message	Last commit date
Latest commit History 35 Commits
.github/workflows		.github/workflows
api		api
cache		cache
cleaner		cleaner
cmd		cmd
config		config
engine		engine
llm		llm
models		models
proxy		proxy
scraper		scraper
scripts/benchmark		scripts/benchmark
simhash		simhash
webhook		webhook
.dockerignore		.dockerignore
.env.example		.env.example
.gitignore		.gitignore
Dockerfile		Dockerfile
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
deploy.sh		deploy.sh
docker-compose.yml		docker-compose.yml
go.mod		go.mod
go.sum		go.sum

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Purify

See it work

How it compares

Quick start

Token savings — real numbers

Use cases

MCP server

Setup

API

POST /api/v1/scrape

SSE streaming

Citation format

POST /api/v1/batch/scrape

POST /api/v1/crawl

POST /api/v1/map

POST /api/v1/extract

Webhook callbacks

GET /api/v1/health

Configuration

Self-hosting

System requirements

Pricing

Contributing

License

About

Uh oh!

Releases 1

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Purify

See it work

How it compares

Quick start

Token savings — real numbers

Use cases

MCP server

Setup

API

POST /api/v1/scrape

SSE streaming

Citation format

POST /api/v1/batch/scrape

POST /api/v1/crawl

POST /api/v1/map

POST /api/v1/extract

Webhook callbacks

GET /api/v1/health

Configuration

Self-hosting

System requirements

Pricing

Contributing

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages