Skip to content

Lightweight tool that converts traditional HTML pages to Markdown when being invoked by AI Agents to reduce token usage and improve readability.

License

Notifications You must be signed in to change notification settings

thekevinm/HTML-to-MD-AI

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

19 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

πŸ€–βœ¨ HTML-to-MD AI

Serve clean Markdown to AI crawlers. Reduce tokens by 60-80%.

Quick Start β€’ How It Works β€’ Deploy

License: CC BY-NC 4.0


πŸ’‘ Why?

When AI assistants (Claude, ChatGPT, etc.) crawl your documentation sites, they consume HTML bloat:

  • Navigation, footers, ads = 60-80% wasted tokens
  • Harder to parse, slower responses
  • Expensive API costs

HTML-to-MD AI automatically detects AI crawlers and serves them clean, optimized Markdown instead.


✨ Features

Feature Description
πŸš€ Edge Deployment Runs on Cloudflare's global network (300+ locations)
πŸ€– Smart Detection Recognizes Claude, ChatGPT, Google AI, Bing AI, Perplexity, and more
πŸ’Ύ Built-in Caching Fast responses with configurable TTL (< 10ms cache hits)
🎯 Zero Code Changes Deploy and forget - perfect for static sites
πŸ“ Clean Output GitHub-flavored Markdown with frontmatter
βš™οΈ Highly Configurable Custom content selectors, AI patterns, caching rules

πŸš€ Quick Start

Prerequisites

  • Cloudflare account (free tier works)
  • Domain managed by Cloudflare
  • Node.js v18+

Installation

# Clone the repo
git clone https://github.com/thekevinm/html-to-md-ai.git
cd html-to-md-ai/packages/cloudflare-worker

# Install dependencies
npm install

# Configure
cp wrangler.example.toml wrangler.toml
# Edit wrangler.toml - set your domain route

# Login to Cloudflare
npx wrangler login

# Deploy
npm run deploy

That's it! πŸŽ‰ Your site now serves Markdown to AI crawlers automatically.


βš™οΈ Configuration

Edit wrangler.toml:

name = "html-to-md-ai-worker"
route = "docs.yoursite.com/*"

[vars]
ORIGIN_URL = ""          # Leave empty to use request URL
CACHE_TTL = "3600"       # 1 hour cache
ENABLE_CACHE = "true"    # Enable caching
DEBUG = "false"          # Debug logging

Multiple Sites?

Give each site a unique worker name:

# Site 1
name = "html-to-md-site1"
route = "docs.site1.com/*"

# Site 2 (separate wrangler.toml)
name = "html-to-md-site2"
route = "docs.site2.com/*"

Deploy each separately:

wrangler deploy -c wrangler.toml.site1
wrangler deploy -c wrangler.toml.site2

πŸ”§ Advanced: Detecting AI Coding Assistants

AI coding assistants like Cursor and Windsurf use generic HTTP libraries (axios, got, node-fetch, undici) instead of specific AI user-agents. To detect these:

[vars]
# Detect generic HTTP client user-agents
DETECT_HTTP_CLIENTS = "true"

⚠️ WARNING: Generic HTTP clients are also used by regular applications!

When to enable:

  • βœ… You control the domain and know your traffic patterns
  • βœ… Your docs site has minimal programmatic API access
  • βœ… You want to optimize for AI coding assistants like Cursor

When NOT to enable:

  • ❌ Your site is accessed by scripts/bots using axios/fetch
  • ❌ You have webhooks or API integrations
  • ❌ You're unsure about your traffic patterns

Detected patterns when enabled:

  • axios/* - Axios HTTP client
  • node-fetch/* - Node.js fetch
  • got/* - Got HTTP client
  • undici/* - Undici HTTP client

Best practice: Start with DETECT_HTTP_CLIENTS = "false" (default), then enable after analyzing your traffic logs.


πŸ” How It Works

graph LR
    A[AI Crawler] -->|User-Agent: ClaudeBot| B[Edge Worker]
    B -->|Check Cache| C{Cached?}
    C -->|Yes| D[Return Markdown]
    C -->|No| E[Fetch HTML]
    E --> F[Convert to MD]
    F --> G[Cache Result]
    G --> D
Loading
  1. Detect - Check User-Agent for AI patterns
  2. Fetch - Get HTML from origin server
  3. Convert - Transform to clean Markdown
  4. Cache - Store result at edge
  5. Serve - Return optimized content

πŸ€– Supported AI Crawlers

AI Tool User-Agent Patterns Status
Anthropic Claude ClaudeBot, Claude-Web, claude-cli/* βœ… Supported
OpenAI ChatGPT GPTBot, ChatGPT-User, ChatGPT βœ… Supported
Google AI Google-Extended, Gemini, GoogleOther βœ… Supported
Microsoft Bing AI BingPreview, Copilot, Bing.*AI βœ… Supported
Perplexity PerplexityBot, Perplexity βœ… Supported
Meta AI Meta-ExternalAgent, FacebookBot βœ… Supported
Apple Intelligence Applebot-Extended βœ… Supported
You.com YouBot βœ… Supported
Cohere Cohere-AI βœ… Supported

πŸ§ͺ Testing

Test AI Detection

# Regular browser (should get HTML)
curl https://your-site.com/docs

# AI crawler (should get Markdown)
curl -H "User-Agent: ClaudeBot" https://your-site.com/docs

# Check response headers
curl -I -H "User-Agent: ClaudeBot" https://your-site.com/docs
# Expected: Content-Type: text/markdown; X-AI-Bot: Claude

Local Development

cd packages/cloudflare-worker
npm run dev

# Test in another terminal
curl -H "User-Agent: claude-cli/1.0" http://localhost:8787/

πŸ“Š Performance

Metric Value
Cache Hit < 10ms
Cache Miss 100-200ms
Token Reduction 60-80%
Edge Locations 300+ globally
Cost Free tier (100K requests/day)

πŸ“ Project Structure

html-to-md-ai/
β”œβ”€β”€ packages/
β”‚   β”œβ”€β”€ core/                    # Detection & conversion library
β”‚   β”‚   β”œβ”€β”€ src/
β”‚   β”‚   β”‚   β”œβ”€β”€ detector.ts      # AI user-agent detection
β”‚   β”‚   β”‚   β”œβ”€β”€ converter.ts     # HTML β†’ Markdown
β”‚   β”‚   β”‚   └── index.ts
β”‚   β”‚   └── package.json
β”‚   β”‚
β”‚   └── cloudflare-worker/       # Main edge function ⭐
β”‚       β”œβ”€β”€ src/
β”‚       β”‚   └── index.ts         # Worker implementation
β”‚       β”œβ”€β”€ wrangler.toml        # Your config (gitignored)
β”‚       β”œβ”€β”€ wrangler.example.toml
β”‚       └── README.md
β”‚
β”œβ”€β”€ examples/
β”‚   └── docusaurus/              # Docusaurus integration
β”‚
β”œβ”€β”€ README.md                    # This file
β”œβ”€β”€ CHANGELOG.md                 # Version history
└── LICENSE                      # CC BY-NC 4.0

πŸ› οΈ Troubleshooting

Worker not intercepting requests
  • βœ… Check route in wrangler.toml matches your domain
  • βœ… Verify DNS points to Cloudflare (orange cloud ☁️)
  • βœ… Confirm deployment: wrangler deployments list
  • βœ… Check worker status in Cloudflare dashboard
Incomplete Markdown output
  • βœ… Adjust contentSelectors in src/index.ts to match your HTML structure
  • βœ… Check removeSelectors aren't too aggressive
  • βœ… Enable DEBUG=true and check logs: wrangler tail
Cache issues
  • βœ… Set ENABLE_CACHE=false for debugging
  • βœ… Reduce CACHE_TTL for frequently updated content
  • βœ… Purge cache in Cloudflare dashboard

πŸ“œ License

This project is licensed under Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0).

You are free to:

  • βœ… Share, copy, and redistribute
  • βœ… Adapt, remix, and build upon this work

Under these terms:

  • πŸ“ Attribution β€” Give appropriate credit
  • 🚫 NonCommercial β€” No commercial use without permission

For commercial licensing, please contact [kevinmcgahey1114 @ gmail.com].

See LICENSE for full details.


πŸ™ Acknowledgments

  • Turndown - HTML to Markdown conversion
  • Cloudflare Workers - Edge computing platform
  • Inspired by the need to optimize AI assistant interactions

Made for AI assistants, by developers who use AI assistants πŸ€–βœ¨

Report Bug β€’ Request Feature

About

Lightweight tool that converts traditional HTML pages to Markdown when being invoked by AI Agents to reduce token usage and improve readability.

Resources

License

Code of conduct

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published