🤖✨ HTML-to-MD AI

Serve clean Markdown to AI crawlers. Reduce tokens by 60-80%.

💡 Why?

When AI assistants (Claude, ChatGPT, etc.) crawl your documentation sites, they consume HTML bloat:

Navigation, footers, ads = 60-80% wasted tokens
Harder to parse, slower responses
Expensive API costs

HTML-to-MD AI automatically detects AI crawlers and serves them clean, optimized Markdown instead.

✨ Features

Feature	Description
🚀 Edge Deployment	Runs on Cloudflare's global network (300+ locations)
🤖 Smart Detection	Recognizes Claude, ChatGPT, Google AI, Bing AI, Perplexity, and more
💾 Built-in Caching	Fast responses with configurable TTL (< 10ms cache hits)
🎯 Zero Code Changes	Deploy and forget - perfect for static sites
📝 Clean Output	GitHub-flavored Markdown with frontmatter
⚙️ Highly Configurable	Custom content selectors, AI patterns, caching rules

🚀 Quick Start

Prerequisites

Cloudflare account (free tier works)
Domain managed by Cloudflare
Node.js v18+

Installation

# Clone the repo
git clone https://github.com/thekevinm/html-to-md-ai.git
cd html-to-md-ai/packages/cloudflare-worker

# Install dependencies
npm install

# Configure
cp wrangler.example.toml wrangler.toml
# Edit wrangler.toml - set your domain route

# Login to Cloudflare
npx wrangler login

# Deploy
npm run deploy

That's it! 🎉 Your site now serves Markdown to AI crawlers automatically.

⚙️ Configuration

Edit wrangler.toml:

name = "html-to-md-ai-worker"
route = "docs.yoursite.com/*"

[vars]
ORIGIN_URL = ""          # Leave empty to use request URL
CACHE_TTL = "3600"       # 1 hour cache
ENABLE_CACHE = "true"    # Enable caching
DEBUG = "false"          # Debug logging

Multiple Sites?

Give each site a unique worker name:

# Site 1
name = "html-to-md-site1"
route = "docs.site1.com/*"

# Site 2 (separate wrangler.toml)
name = "html-to-md-site2"
route = "docs.site2.com/*"

Deploy each separately:

wrangler deploy -c wrangler.toml.site1
wrangler deploy -c wrangler.toml.site2

🔧 Advanced: Detecting AI Coding Assistants

AI coding assistants like Cursor and Windsurf use generic HTTP libraries (axios, got, node-fetch, undici) instead of specific AI user-agents. To detect these:

[vars]
# Detect generic HTTP client user-agents
DETECT_HTTP_CLIENTS = "true"

⚠️ WARNING: Generic HTTP clients are also used by regular applications!

When to enable:

✅ You control the domain and know your traffic patterns
✅ Your docs site has minimal programmatic API access
✅ You want to optimize for AI coding assistants like Cursor

When NOT to enable:

❌ Your site is accessed by scripts/bots using axios/fetch
❌ You have webhooks or API integrations
❌ You're unsure about your traffic patterns

Detected patterns when enabled:

axios/* - Axios HTTP client
node-fetch/* - Node.js fetch
got/* - Got HTTP client
undici/* - Undici HTTP client

Best practice: Start with DETECT_HTTP_CLIENTS = "false" (default), then enable after analyzing your traffic logs.

🔍 How It Works

graph LR
    A[AI Crawler] -->|User-Agent: ClaudeBot| B[Edge Worker]
    B -->|Check Cache| C{Cached?}
    C -->|Yes| D[Return Markdown]
    C -->|No| E[Fetch HTML]
    E --> F[Convert to MD]
    F --> G[Cache Result]
    G --> D

Detect - Check User-Agent for AI patterns
Fetch - Get HTML from origin server
Convert - Transform to clean Markdown
Cache - Store result at edge
Serve - Return optimized content

🤖 Supported AI Crawlers

AI Tool	User-Agent Patterns	Status
Anthropic Claude	`ClaudeBot`, `Claude-Web`, `claude-cli/*`	✅ Supported
OpenAI ChatGPT	`GPTBot`, `ChatGPT-User`, `ChatGPT`	✅ Supported
Google AI	`Google-Extended`, `Gemini`, `GoogleOther`	✅ Supported
Microsoft Bing AI	`BingPreview`, `Copilot`, `Bing.*AI`	✅ Supported
Perplexity	`PerplexityBot`, `Perplexity`	✅ Supported
Meta AI	`Meta-ExternalAgent`, `FacebookBot`	✅ Supported
Apple Intelligence	`Applebot-Extended`	✅ Supported
You.com	`YouBot`	✅ Supported
Cohere	`Cohere-AI`	✅ Supported

🧪 Testing

Test AI Detection

# Regular browser (should get HTML)
curl https://your-site.com/docs

# AI crawler (should get Markdown)
curl -H "User-Agent: ClaudeBot" https://your-site.com/docs

# Check response headers
curl -I -H "User-Agent: ClaudeBot" https://your-site.com/docs
# Expected: Content-Type: text/markdown; X-AI-Bot: Claude

Local Development

cd packages/cloudflare-worker
npm run dev

# Test in another terminal
curl -H "User-Agent: claude-cli/1.0" http://localhost:8787/

📊 Performance

Metric	Value
Cache Hit	< 10ms
Cache Miss	100-200ms
Token Reduction	60-80%
Edge Locations	300+ globally
Cost	Free tier (100K requests/day)

📁 Project Structure

html-to-md-ai/
├── packages/
│   ├── core/                    # Detection & conversion library
│   │   ├── src/
│   │   │   ├── detector.ts      # AI user-agent detection
│   │   │   ├── converter.ts     # HTML → Markdown
│   │   │   └── index.ts
│   │   └── package.json
│   │
│   └── cloudflare-worker/       # Main edge function ⭐
│       ├── src/
│       │   └── index.ts         # Worker implementation
│       ├── wrangler.toml        # Your config (gitignored)
│       ├── wrangler.example.toml
│       └── README.md
│
├── examples/
│   └── docusaurus/              # Docusaurus integration
│
├── README.md                    # This file
├── CHANGELOG.md                 # Version history
└── LICENSE                      # CC BY-NC 4.0

🛠️ Troubleshooting

Worker not intercepting requests

✅ Check route in wrangler.toml matches your domain
✅ Verify DNS points to Cloudflare (orange cloud ☁️)
✅ Confirm deployment: wrangler deployments list
✅ Check worker status in Cloudflare dashboard

Incomplete Markdown output

✅ Adjust contentSelectors in src/index.ts to match your HTML structure
✅ Check removeSelectors aren't too aggressive
✅ Enable DEBUG=true and check logs: wrangler tail

Cache issues

✅ Set ENABLE_CACHE=false for debugging
✅ Reduce CACHE_TTL for frequently updated content
✅ Purge cache in Cloudflare dashboard

📜 License

This project is licensed under Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0).

You are free to:

✅ Share, copy, and redistribute
✅ Adapt, remix, and build upon this work

Under these terms:

📝 Attribution — Give appropriate credit
🚫 NonCommercial — No commercial use without permission

For commercial licensing, please contact [kevinmcgahey1114 @ gmail.com].

See LICENSE for full details.

🙏 Acknowledgments

Turndown - HTML to Markdown conversion
Cloudflare Workers - Edge computing platform
Inspired by the need to optimize AI assistant interactions

Made for AI assistants, by developers who use AI assistants 🤖✨

Report Bug • Request Feature

Name		Name	Last commit message	Last commit date
Latest commit History 19 Commits
.github		.github
examples		examples
packages		packages
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
LICENSE		LICENSE
README.md		README.md
package.json		package.json
test-deployment.sh		test-deployment.sh
test-headers.js		test-headers.js
test-manual.js		test-manual.js

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

🤖✨ HTML-to-MD AI

💡 Why?

✨ Features

🚀 Quick Start

Prerequisites

Installation

⚙️ Configuration

Multiple Sites?

🔧 Advanced: Detecting AI Coding Assistants

🔍 How It Works

🤖 Supported AI Crawlers

🧪 Testing

Test AI Detection

Local Development

📊 Performance

📁 Project Structure

🛠️ Troubleshooting

📜 License

🙏 Acknowledgments

About

Uh oh!

Releases

Packages

Languages

License

thekevinm/HTML-to-MD-AI

Folders and files

Latest commit

History

Repository files navigation

🤖✨ HTML-to-MD AI

💡 Why?

✨ Features

🚀 Quick Start

Prerequisites

Installation

⚙️ Configuration

Multiple Sites?

🔧 Advanced: Detecting AI Coding Assistants

🔍 How It Works

🤖 Supported AI Crawlers

🧪 Testing

Test AI Detection

Local Development

📊 Performance

📁 Project Structure

🛠️ Troubleshooting

📜 License

🙏 Acknowledgments

About

Resources

License

Code of conduct

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages