Serve clean Markdown to AI crawlers. Reduce tokens by 60-80%.
Quick Start · How It Works · Deploy
When AI assistants (Claude, ChatGPT, etc.) crawl your documentation sites, they consume HTML bloat:
- Navigation, footers, ads = 60-80% wasted tokens
- Harder to parse, slower responses
- Expensive API costs
HTML-to-MD AI automatically detects AI crawlers and serves them clean, optimized Markdown instead.
| Feature | Description |
|---|---|
| Edge Deployment | Runs on Cloudflare's global network (300+ locations) |
| Smart Detection | Recognizes Claude, ChatGPT, Google AI, Bing AI, Perplexity, and more |
| Built-in Caching | Fast responses with configurable TTL (< 10ms cache hits) |
| Zero Code Changes | Deploy and forget, perfect for static sites |
| Clean Output | GitHub-flavored Markdown with frontmatter |
| Highly Configurable | Custom content selectors, AI patterns, caching rules |
- Cloudflare account (free tier works)
- Domain managed by Cloudflare
- Node.js v18+
```bash
# Clone the repo
git clone https://github.com/thekevinm/html-to-md-ai.git
cd html-to-md-ai/packages/cloudflare-worker

# Install dependencies
npm install

# Configure
cp wrangler.example.toml wrangler.toml
# Edit wrangler.toml - set your domain route

# Log in to Cloudflare
npx wrangler login

# Deploy
npm run deploy
```

That's it! Your site now serves Markdown to AI crawlers automatically.
Edit `wrangler.toml`:
```toml
name = "html-to-md-ai-worker"
route = "docs.yoursite.com/*"

[vars]
ORIGIN_URL = ""       # Leave empty to use the request URL
CACHE_TTL = "3600"    # 1-hour cache
ENABLE_CACHE = "true" # Enable caching
DEBUG = "false"       # Debug logging
```

Give each site a unique worker name:
```toml
# Site 1
name = "html-to-md-site1"
route = "docs.site1.com/*"
```

```toml
# Site 2 (separate wrangler.toml)
name = "html-to-md-site2"
route = "docs.site2.com/*"
```

Deploy each separately:

```bash
wrangler deploy -c wrangler.toml.site1
wrangler deploy -c wrangler.toml.site2
```

AI coding assistants like Cursor and Windsurf fetch docs through generic HTTP libraries (axios, got, node-fetch, undici) rather than AI-specific user-agents. To detect these:
```toml
[vars]
# Detect generic HTTP client user-agents
DETECT_HTTP_CLIENTS = "true"
```

When to enable:
- You control the domain and know your traffic patterns
- Your docs site has minimal programmatic API access
- You want to optimize for AI coding assistants like Cursor

When NOT to enable:

- Your site is accessed by scripts/bots using axios/fetch
- You have webhooks or API integrations
- You're unsure about your traffic patterns
Detected patterns when enabled:
- `axios/*` - Axios HTTP client
- `node-fetch/*` - Node.js fetch
- `got/*` - Got HTTP client
- `undici/*` - Undici HTTP client
Best practice: start with `DETECT_HTTP_CLIENTS = "false"` (the default), then enable it after analyzing your traffic logs.
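The check behind `DETECT_HTTP_CLIENTS` can be sketched as a small user-agent matcher. The pattern list below mirrors the documented clients; the function name is illustrative, not the worker's actual export:

```typescript
// Hypothetical sketch of the generic-HTTP-client check; the real
// pattern list lives in the worker source and may differ.
const HTTP_CLIENT_PATTERNS: RegExp[] = [
  /^axios\//i,       // Axios
  /^node-fetch\//i,  // node-fetch
  /^got\//i,         // Got
  /^undici\//i,      // Undici
];

function isGenericHttpClient(userAgent: string): boolean {
  // Anchored at the start: these libraries put their name first in the UA.
  return HTTP_CLIENT_PATTERNS.some((p) => p.test(userAgent));
}
```

Anchoring the patterns at the start of the string avoids false positives on browsers whose long UA strings merely mention a library name somewhere.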
```mermaid
graph LR
    A[AI Crawler] -->|User-Agent: ClaudeBot| B[Edge Worker]
    B -->|Check Cache| C{Cached?}
    C -->|Yes| D[Return Markdown]
    C -->|No| E[Fetch HTML]
    E --> F[Convert to MD]
    F --> G[Cache Result]
    G --> D
```
- Detect - Check User-Agent for AI patterns
- Fetch - Get HTML from origin server
- Convert - Transform to clean Markdown
- Cache - Store result at edge
- Serve - Return optimized content
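The five steps above can be sketched as a worker-style handler. `isAiBot`, `fetchHtml`, `htmlToMarkdown`, and the cache helpers are stand-ins for the real detector/converter exports (injected here so the whole flow is visible in one place):

```typescript
// Illustrative request flow; dependency names are assumptions, not the
// project's actual exports.
type Env = { CACHE_TTL: string; ENABLE_CACHE: string };

type Deps = {
  isAiBot: (ua: string) => boolean;
  fetchHtml: (url: string) => Promise<string>;
  htmlToMarkdown: (html: string) => string;
  cacheGet: (key: string) => Promise<string | null>;
  cachePut: (key: string, body: string, ttl: number) => Promise<void>;
};

async function handleRequest(request: Request, env: Env, deps: Deps): Promise<Response> {
  const ua = request.headers.get("User-Agent") ?? "";
  // 1. Detect: non-AI traffic passes through to the origin untouched.
  if (!deps.isAiBot(ua)) {
    return fetch(request);
  }
  // 2. Cache lookup keyed by URL.
  if (env.ENABLE_CACHE === "true") {
    const hit = await deps.cacheGet(request.url);
    if (hit !== null) {
      return new Response(hit, { headers: { "Content-Type": "text/markdown" } });
    }
  }
  // 3-5. Fetch origin HTML, convert to Markdown, cache, serve.
  const html = await deps.fetchHtml(request.url);
  const md = deps.htmlToMarkdown(html);
  if (env.ENABLE_CACHE === "true") {
    await deps.cachePut(request.url, md, Number(env.CACHE_TTL));
  }
  return new Response(md, { headers: { "Content-Type": "text/markdown" } });
}
```

Passing the detector and converter in as dependencies also makes the flow easy to unit-test without a live origin.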
| AI Tool | User-Agent Patterns | Status |
|---|---|---|
| Anthropic Claude | `ClaudeBot`, `Claude-Web`, `claude-cli/*` | Supported |
| OpenAI ChatGPT | `GPTBot`, `ChatGPT-User`, `ChatGPT` | Supported |
| Google AI | `Google-Extended`, `Gemini`, `GoogleOther` | Supported |
| Microsoft Bing AI | `BingPreview`, `Copilot`, `Bing.*AI` | Supported |
| Perplexity | `PerplexityBot`, `Perplexity` | Supported |
| Meta AI | `Meta-ExternalAgent`, `FacebookBot` | Supported |
| Apple Intelligence | `Applebot-Extended` | Supported |
| You.com | `YouBot` | Supported |
| Cohere | `Cohere-AI` | Supported |
```bash
# Regular browser (should get HTML)
curl https://your-site.com/docs

# AI crawler (should get Markdown)
curl -H "User-Agent: ClaudeBot" https://your-site.com/docs

# Check response headers
curl -I -H "User-Agent: ClaudeBot" https://your-site.com/docs
# Expected: Content-Type: text/markdown; X-AI-Bot: Claude
```

```bash
cd packages/cloudflare-worker
npm run dev

# Test in another terminal
curl -H "User-Agent: claude-cli/1.0" http://localhost:8787/
```

| Metric | Value |
|---|---|
| Cache Hit | < 10ms |
| Cache Miss | 100-200ms |
| Token Reduction | 60-80% |
| Edge Locations | 300+ globally |
| Cost | Free tier (100K requests/day) |
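The cache-hit numbers above come from storing converted Markdown at the edge and expiring it after `CACHE_TTL` seconds. An in-memory sketch of those TTL semantics (the worker itself uses Cloudflare's edge cache, not a Map) looks like:

```typescript
// Illustrative TTL cache matching the CACHE_TTL behavior described above.
class TtlCache {
  private entries = new Map<string, { value: string; expiresAt: number }>();

  constructor(private ttlSeconds: number) {}

  // `now` is injectable so expiry is testable without real clocks.
  get(key: string, now = Date.now()): string | null {
    const entry = this.entries.get(key);
    if (!entry) return null;
    if (now >= entry.expiresAt) {
      this.entries.delete(key); // expired: treat as a miss
      return null;
    }
    return entry.value;
  }

  put(key: string, value: string, now = Date.now()): void {
    this.entries.set(key, { value, expiresAt: now + this.ttlSeconds * 1000 });
  }
}
```

A hit returns the stored Markdown without touching the origin, which is why cached responses stay under ~10ms; a miss pays the origin fetch plus conversion once, then repopulates the cache.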
```
html-to-md-ai/
├── packages/
│   ├── core/                   # Detection & conversion library
│   │   ├── src/
│   │   │   ├── detector.ts     # AI user-agent detection
│   │   │   ├── converter.ts    # HTML → Markdown
│   │   │   └── index.ts
│   │   └── package.json
│   │
│   └── cloudflare-worker/      # Main edge function
│       ├── src/
│       │   └── index.ts        # Worker implementation
│       ├── wrangler.toml       # Your config (gitignored)
│       ├── wrangler.example.toml
│       └── README.md
│
├── examples/
│   └── docusaurus/             # Docusaurus integration
│
├── README.md                   # This file
├── CHANGELOG.md                # Version history
└── LICENSE                     # CC BY-NC 4.0
```
**Worker not intercepting requests**

- Check that `route` in `wrangler.toml` matches your domain
- Verify DNS points to Cloudflare (orange cloud)
- Confirm deployment: `wrangler deployments list`
- Check worker status in the Cloudflare dashboard

**Incomplete Markdown output**

- Adjust `contentSelectors` in `src/index.ts` to match your HTML structure
- Check that `removeSelectors` aren't too aggressive
- Enable `DEBUG=true` and check logs: `wrangler tail`

**Cache issues**

- Set `ENABLE_CACHE=false` for debugging
- Reduce `CACHE_TTL` for frequently updated content
- Purge the cache in the Cloudflare dashboard
This project is licensed under Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0).
You are free to:
- Share, copy, and redistribute
- Adapt, remix, and build upon this work

Under these terms:
- Attribution: give appropriate credit
- NonCommercial: no commercial use without permission
For commercial licensing, please contact [kevinmcgahey1114 @ gmail.com].
See LICENSE for full details.
- Turndown - HTML to Markdown conversion
- Cloudflare Workers - Edge computing platform
- Inspired by the need to optimize AI assistant interactions
Made for AI assistants, by developers who use AI assistants.