Crawl any site. Save as markdown.
A fast, simple web scraper that saves crawled content as individual markdown files with frontmatter.
Note: Not affiliated with that helpful paperclip from your childhood. This one just grabs web pages.
npm install -g @tremendous.dev/clippy# Crawl a website and save as markdown files
clippy https://react.dev
# Specify output directory
clippy https://docs.python.org -o python-docs
# Crawl a GitHub repo
clippy https://github.com/user/repo -o repo-docs
# Crawl local codebase
clippy . -o my-code- Crawls websites with configurable depth and concurrency
- Extracts clean content using Mozilla Readability
- Converts to markdown automatically
- Saves individual files — one
.mdfile per page with YAML frontmatter - Handles JavaScript sites — automatically falls back to browser mode when needed
- Supports authentication — crawl sites that require login using persistent sessions
- Respects robots.txt and rate limits by default
- Works with Git repos — can crawl GitHub repos or local directories
Each crawled page is saved as a separate markdown file with frontmatter:
---
title: "Page Title"
url: "https://example.com/page"
author: "Author Name"
date: "2024-01-01"
crawled: "2024-01-15T10:30:00.000Z"
---
# Page Title
Page content in clean markdown format...# Crawl with default settings (depth=2, max=150 pages)
clippy https://example.com
# Customize depth and page limit
clippy https://example.com --depth 3 --max-pages 500
# Multiple sources into one directory
clippy https://react.dev https://nextjs.org -o frontend-docs# Specify output directory
clippy https://example.com -o my-docs
# Default output: ./clippy-output
clippy https://example.com# Control concurrency and rate limiting
clippy https://example.com -c 5 -r 2
# Include/exclude patterns
clippy https://example.com --include "docs/.*" --exclude ".*\.pdf"
# Disable sitemap discovery
clippy https://example.com --no-sitemap
# Ignore robots.txt (use responsibly!)
clippy https://example.com --no-robots# Force browser mode (for JS-heavy sites)
clippy https://spa-site.com --browser
# Force stealth mode (bypass anti-bot)
clippy https://protected-site.com --stealthCrawl sites that require login using persistent browser sessions:
# Login once (opens browser for manual authentication)
clippy auth login https://example.com
# Crawl authenticated site (automatically uses saved session)
clippy https://example.com/private-docs
# Manage sessions
clippy auth list # List all stored sessions
clippy auth logout <url> # Remove a session
clippy auth clear # Clear all sessions
# Disable auth for a specific crawl
clippy https://example.com --no-auth# See what's available via sitemap
clippy preview https://example.com| Flag | Description | Default |
|---|---|---|
-o, --output <dir> |
Output directory for markdown files | ./clippy-output |
-d, --depth <n> |
Crawl depth (0 = single page only) | 2 |
-m, --max-pages <n> |
Maximum pages to crawl | 150 |
-c, --concurrency <n> |
Concurrent requests | 10 |
-r, --rate-limit <n> |
Max requests per second | 10 |
-t, --timeout <ms> |
Request timeout | 10000 |
--include <regex> |
Only crawl URLs matching pattern | - |
--exclude <regex> |
Skip URLs matching pattern | - |
--label <label> |
Label for crawled documents | web |
--sitemap |
Use sitemap.xml for discovery | true |
--no-sitemap |
Disable sitemap discovery | - |
--no-robots |
Ignore robots.txt | - |
--no-auth |
Disable automatic auth detection | - |
--browser |
Force browser mode | - |
--stealth |
Force stealth mode | - |
-q, --quiet |
Minimal output | - |
-v, --verbose |
Verbose output | - |
# Crawl React docs
clippy https://react.dev -o react-docs
# Crawl Python docs (large site)
clippy https://docs.python.org --depth 3 --max-pages 1000 -o python-docs
# Crawl Stripe API docs
clippy https://stripe.com/docs -o stripe-docs# Archive a blog
clippy https://paulgraham.com/articles.html -o pg-essays
# Specific article
clippy "https://example.com/article" -o articles# Crawl a GitHub repo
clippy https://github.com/user/repo -o repo-docs
# Local codebase
clippy . -o my-project-docs
clippy /path/to/project -o project-docs# Slow and steady for rate-sensitive sites
clippy https://example.com -c 2 -r 1
# Fast crawl with high concurrency
clippy https://example.com -c 20 -r 50
# Deep crawl of specific section
clippy https://example.com/docs --depth 5 --include "docs/.*"
# JavaScript-heavy SPA
clippy https://spa.example.com --browser --max-pages 50-
URL Discovery
- Starts with provided URLs
- Checks for
sitemap.xml(unless disabled) - Follows links up to specified depth
- Respects
robots.txtby default
-
Content Extraction
- Fetches pages with optimized waterfall:
fetch(fast, works for 90% of sites)playwright(real browser, for JS sites)rebrowser(stealth mode, bypasses anti-bot)
- Extracts clean content using Mozilla Readability
- Converts HTML to markdown
- Fetches pages with optimized waterfall:
-
File Output
- Sanitizes URL/title into safe filename
- Adds YAML frontmatter with metadata
- Writes individual
.mdfile per page - Handles duplicate filenames with counters
- Skips duplicate URLs automatically
- Detects locale variants (
/en/,/es/, etc.) - Identifies similar content to avoid redundancy
- Offline documentation — Read docs without internet
- Documentation archival — Preserve documentation versions
- Content backup — Archive websites before they change
- Research — Collect content for analysis
- Training data — Gather markdown content for ML
- Knowledge base — Build searchable documentation collections
import { clippy, preview } from 'clippy';
// Crawl a site
const result = await clippy(['https://example.com'], {
output: './docs',
depth: 2,
maxPages: 100,
quiet: false
});
console.log(`Crawled ${result.pages} pages`);
console.log(`Saved to: ${result.output}`);
// Preview available pages
const sitePreview = await preview('https://example.com');
console.log(`${sitePreview.totalPages} pages available`);- Default: 10 requests/second with exponential backoff
- Respects
robots.txtby default (use--no-robotsto override) - Be responsible: Don't hammer servers or bypass restrictions maliciously
- Consider: Reduce concurrency/rate-limit for smaller sites
- No form submission — Only GET requests
- Basic JavaScript — Complex SPAs may need
--browseror--stealth - Cloudflare/reCAPTCHA — Stealth mode helps but isn't perfect
Site blocks requests?
clippy https://example.com --stealthJavaScript not rendering?
clippy https://example.com --browserRate limited?
clippy https://example.com -c 2 -r 1Too many pages?
clippy https://example.com --max-pages 50 --depth 1# Clone and install
git clone https://github.com/yerffejytnac/clippy.git
cd clippy
bun install
# Build
bun run build
# Link locally (creates symlink in ~/.bun/bin)
bun link
# Use
clippy https://example.com# Install with bun
bun install
# Build
bun run build
# Link globally (creates symlink in ~/.bun/bin)
bun link
# Make sure ~/.bun/bin is in your PATH
# Use
clippy https://example.comMIT
Contributions welcome! Please open an issue or PR.
- Browser automation via Playwright and Rebrowser
- Content extraction using Mozilla Readability
- Markdown conversion with rehype-remark and unified