clippy

Crawl any site. Save as markdown.

A fast, simple web scraper that saves crawled content as individual markdown files with frontmatter.

Note: Not affiliated with that helpful paperclip from your childhood. This one just grabs web pages.

Install

npm install -g @tremendous.dev/clippy

Quick Start

# Crawl a website and save as markdown files
clippy https://react.dev

# Specify output directory
clippy https://docs.python.org -o python-docs

# Crawl a GitHub repo
clippy https://github.com/user/repo -o repo-docs

# Crawl local codebase
clippy . -o my-code

What It Does

  • Crawls websites with configurable depth and concurrency
  • Extracts clean content using Mozilla Readability
  • Converts to markdown automatically
  • Saves individual files — one .md file per page with YAML frontmatter
  • Handles JavaScript sites — automatically falls back to browser mode when needed
  • Supports authentication — crawl sites that require login using persistent sessions
  • Respects robots.txt and rate limits by default
  • Works with Git repos — can crawl GitHub repos or local directories

Output Format

Each crawled page is saved as a separate markdown file with frontmatter:

---
title: "Page Title"
url: "https://example.com/page"
author: "Author Name"
date: "2024-01-01"
crawled: "2024-01-15T10:30:00.000Z"
---

# Page Title

Page content in clean markdown format...
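
Because each page is a standalone .md file with YAML frontmatter, the output is easy to post-process with any frontmatter parser. A minimal sketch, assuming the third-party gray-matter package (not bundled with clippy):

import { readdirSync, readFileSync } from 'node:fs';
import { join } from 'node:path';
import matter from 'gray-matter'; // npm install gray-matter

// Walk the output directory and parse every crawled page.
const dir = './clippy-output';
for (const name of readdirSync(dir).filter((f) => f.endsWith('.md'))) {
  const { data, content } = matter(readFileSync(join(dir, name), 'utf8'));
  console.log(`${data.title} <${data.url}> - ${content.length} chars`);
}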

Usage

Basic Crawling

# Crawl with default settings (depth=2, max=150 pages)
clippy https://example.com

# Customize depth and page limit
clippy https://example.com --depth 3 --max-pages 500

# Multiple sources into one directory
clippy https://react.dev https://nextjs.org -o frontend-docs

Output Control

# Specify output directory
clippy https://example.com -o my-docs

# Default output: ./clippy-output
clippy https://example.com

Crawling Behavior

# Control concurrency and rate limiting
clippy https://example.com -c 5 -r 2

# Include/exclude patterns
clippy https://example.com --include "docs/.*" --exclude ".*\.pdf"

# Disable sitemap discovery
clippy https://example.com --no-sitemap

# Ignore robots.txt (use responsibly!)
clippy https://example.com --no-robots

Browser Modes

# Force browser mode (for JS-heavy sites)
clippy https://spa-site.com --browser

# Force stealth mode (bypass anti-bot)
clippy https://protected-site.com --stealth

Authentication

Crawl sites that require login using persistent browser sessions:

# Login once (opens browser for manual authentication)
clippy auth login https://example.com

# Crawl authenticated site (automatically uses saved session)
clippy https://example.com/private-docs

# Manage sessions
clippy auth list              # List all stored sessions
clippy auth logout <url>      # Remove a session
clippy auth clear             # Clear all sessions

# Disable auth for a specific crawl
clippy https://example.com --no-auth

Preview Before Crawling

# See what's available via sitemap
clippy preview https://example.com

Options

| Flag | Description | Default |
| ---- | ----------- | ------- |
| -o, --output <dir> | Output directory for markdown files | ./clippy-output |
| -d, --depth <n> | Crawl depth (0 = single page only) | 2 |
| -m, --max-pages <n> | Maximum pages to crawl | 150 |
| -c, --concurrency <n> | Concurrent requests | 10 |
| -r, --rate-limit <n> | Max requests per second | 10 |
| -t, --timeout <ms> | Request timeout in milliseconds | 10000 |
| --include <regex> | Only crawl URLs matching pattern | - |
| --exclude <regex> | Skip URLs matching pattern | - |
| --label <label> | Label for crawled documents | web |
| --sitemap | Use sitemap.xml for discovery | true |
| --no-sitemap | Disable sitemap discovery | - |
| --no-robots | Ignore robots.txt | - |
| --no-auth | Disable automatic auth detection | - |
| --browser | Force browser mode | - |
| --stealth | Force stealth mode | - |
| -q, --quiet | Minimal output | - |
| -v, --verbose | Verbose output | - |

Examples

Documentation Sites

# Crawl React docs
clippy https://react.dev -o react-docs

# Crawl Python docs (large site)
clippy https://docs.python.org --depth 3 --max-pages 1000 -o python-docs

# Crawl Stripe API docs
clippy https://stripe.com/docs -o stripe-docs

Blogs & Articles

# Archive a blog
clippy https://paulgraham.com/articles.html -o pg-essays

# Specific article
clippy "https://example.com/article" -o articles

GitHub Repositories

# Crawl a GitHub repo
clippy https://github.com/user/repo -o repo-docs

# Local codebase
clippy . -o my-project-docs
clippy /path/to/project -o project-docs

Advanced Usage

# Slow and steady for rate-sensitive sites
clippy https://example.com -c 2 -r 1

# Fast crawl with high concurrency
clippy https://example.com -c 20 -r 50

# Deep crawl of specific section
clippy https://example.com/docs --depth 5 --include "docs/.*"

# JavaScript-heavy SPA
clippy https://spa.example.com --browser --max-pages 50

How It Works

Crawling Strategy

  1. URL Discovery

    • Starts with provided URLs
    • Checks for sitemap.xml (unless disabled)
    • Follows links up to specified depth
    • Respects robots.txt by default
  2. Content Extraction

    • Fetches pages with an optimized waterfall (sketched after this list):
      • fetch (fast, works for 90% of sites)
      • playwright (real browser, for JS sites)
      • rebrowser (stealth mode, bypasses anti-bot)
    • Extracts clean content using Mozilla Readability
    • Converts HTML to markdown
  3. File Output

    • Sanitizes URL/title into safe filename
    • Adds YAML frontmatter with metadata
    • Writes individual .md file per page
    • Handles duplicate filenames with counters
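
A rough TypeScript sketch of that waterfall; renderWithPlaywright and renderWithStealth are hypothetical stand-ins for the browser engines, not clippy's actual internals:

// Simplified fetch -> playwright -> rebrowser waterfall.
// The two render* helpers below are hypothetical stand-ins.
declare function renderWithPlaywright(url: string): Promise<string>;
declare function renderWithStealth(url: string): Promise<string>;

// Crude heuristic: a tiny body built around <script> tags is likely a JS shell.
function looksLikeEmptyShell(html: string): boolean {
  return html.length < 2000 && html.includes('<script');
}

async function getPage(url: string): Promise<{ html: string; mode: string }> {
  try {
    const res = await fetch(url, { signal: AbortSignal.timeout(10000) });
    const html = await res.text();
    if (res.ok && !looksLikeEmptyShell(html)) return { html, mode: 'fetch' };
  } catch {
    // network error or timeout: escalate to a real browser
  }
  try {
    return { html: await renderWithPlaywright(url), mode: 'browser' };
  } catch {
    return { html: await renderWithStealth(url), mode: 'stealth' };
  }
}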

Deduplication

  • Skips duplicate URLs automatically
  • Detects locale variants (/en/, /es/, etc.); see the sketch after this list
  • Identifies similar content to avoid redundancy
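
For example, locale-variant detection can be approximated by stripping a leading locale segment from the URL path before comparing (a sketch, not clippy's exact rule):

// Treat /en/docs and /es/docs as the same logical page.
const LOCALE_SEGMENT = /^\/[a-z]{2}(?:-[A-Za-z]{2})?(?=\/)/;

function dedupeKey(rawUrl: string): string {
  const u = new URL(rawUrl);
  u.pathname = u.pathname.replace(LOCALE_SEGMENT, '');
  u.hash = '';
  return u.toString();
}

const seen = new Set<string>();
function isNewPage(url: string): boolean {
  const key = dedupeKey(url);
  if (seen.has(key)) return false;
  seen.add(key);
  return true;
}

isNewPage('https://example.com/en/docs'); // true
isNewPage('https://example.com/es/docs'); // false: locale variant of the same page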

Use Cases

  • Offline documentation — Read docs without internet
  • Documentation archival — Preserve documentation versions
  • Content backup — Archive websites before they change
  • Research — Collect content for analysis
  • Training data — Gather markdown content for ML
  • Knowledge base — Build searchable documentation collections

Programmatic Usage

import { clippy, preview } from '@tremendous.dev/clippy';

// Crawl a site
const result = await clippy(['https://example.com'], {
  output: './docs',
  depth: 2,
  maxPages: 100,
  quiet: false
});

console.log(`Crawled ${result.pages} pages`);
console.log(`Saved to: ${result.output}`);

// Preview available pages
const sitePreview = await preview('https://example.com');
console.log(`${sitePreview.totalPages} pages available`);
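
preview pairs naturally with clippy: check a site's size before committing to a full crawl. A sketch using only the options shown above:

// Only crawl if the site fits within a page budget.
const site = 'https://example.com';
const budget = 200;

const info = await preview(site);
if (info.totalPages <= budget) {
  const crawl = await clippy([site], { output: './docs', maxPages: budget });
  console.log(`Crawled ${crawl.pages} pages into ${crawl.output}`);
} else {
  console.log(`${info.totalPages} pages found; narrow the crawl or raise the budget.`);
}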

Rate Limiting & Ethics

  • Default: 10 requests/second with exponential backoff (sketched after this list)
  • Respects robots.txt by default (use --no-robots to override)
  • Be responsible: Don't hammer servers or bypass restrictions maliciously
  • Consider: Reduce concurrency/rate-limit for smaller sites
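
For reference, a retry loop with exponential backoff looks roughly like this (a generic sketch, not clippy's internal scheduler):

// Generic sketch: retry on 429/5xx with exponential backoff.
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function fetchWithBackoff(url: string, retries = 4): Promise<Response> {
  for (let attempt = 0; ; attempt++) {
    const res = await fetch(url);
    const retryable = res.status === 429 || res.status >= 500;
    if (res.ok || !retryable || attempt >= retries) return res;
    await sleep(2 ** attempt * 500); // 0.5s, 1s, 2s, 4s, ...
  }
}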

Limitations

  • No form submission — Only GET requests
  • Basic JavaScript — Complex SPAs may need --browser or --stealth
  • Cloudflare/reCAPTCHA — Stealth mode helps but isn't perfect

Troubleshooting

Site blocks requests?

clippy https://example.com --stealth

JavaScript not rendering?

clippy https://example.com --browser

Rate limited?

clippy https://example.com -c 2 -r 1

Too many pages?

clippy https://example.com --max-pages 50 --depth 1

Development

# Clone and install
git clone https://github.com/yerffejytnac/clippy.git
cd clippy
bun install

# Build
bun run build

# Link globally (creates a symlink in ~/.bun/bin; make sure that directory is on your PATH)
bun link

# Use
clippy https://example.com


License

MIT

Contributing

Contributions welcome! Please open an issue or PR.

Credits