A versatile web crawler and scraper that uses the Firecrawl API to crawl websites, scrape individual pages, and discover site URLs, saving content as Markdown files that mirror the site's URL hierarchy as a local directory structure.
This tool provides three main commands:
- `crawl`: Crawl entire websites starting from a URL
- `scrape`: Scrape one or more specific URLs
- `map`: Discover all URLs on a website (sitemap generation)
All commands:
- Save pages as Markdown files in a directory structure matching the site's URL hierarchy
- Transform all hyperlinks in the content to point to the corresponding local `.md` files
- Preserve the link style (absolute links remain absolute, relative links remain relative)
Prerequisites:

- Bun runtime v1.2.14 or higher
- Firecrawl API access (either self-hosted or a cloud API key)
To build and run from source:

```bash
# Clone the repository
git clone https://github.com/0xBigBoss/firecrawl-crawl.git
cd firecrawl-crawl

# Install dependencies
bun install

# Build the executable
bun run build

# Run the tool
./bin/fcrawl --version
```

Alternatively, download the latest release for your platform from GitHub Releases.
Available platforms:
- Linux x64 (`fcrawl-linux-x64.tar.gz`)
- Linux ARM64 (`fcrawl-linux-arm64.tar.gz`)
- Windows x64 (`fcrawl-windows-x64.zip`)
Note: macOS binaries are temporarily disabled until code signing is implemented. Build from source on macOS for now.
Set the following environment variables:
```bash
# Required: Firecrawl API URL (for self-hosted) or leave empty for cloud
FIRECRAWL_API_URL=http://localhost:3002

# Optional: API key if using Firecrawl cloud
FIRECRAWL_API_KEY=your-api-key-here
```

You can also pass these as CLI arguments:
```bash
./fcrawl crawl https://example.com --api-url http://localhost:3002
./fcrawl scrape https://example.com --api-key fc-YOUR_KEY
```

Crawl an entire website starting from a URL:
```bash
# Crawl a website
./fcrawl crawl https://example.com

# Limit the number of pages to crawl
./fcrawl crawl https://example.com --limit 10

# Specify output directory
./fcrawl crawl https://example.com -o ./output

# Enable verbose logging
./fcrawl crawl https://example.com -v
```

Scrape one or more specific URLs:
```bash
# Scrape a single URL
./fcrawl scrape https://example.com/page1

# Scrape multiple URLs
./fcrawl scrape https://example.com/page1 https://example.com/page2

# Scrape with specific formats (markdown, html, screenshot)
./fcrawl scrape https://example.com --formats markdown,html

# Include screenshot
./fcrawl scrape https://example.com --screenshot

# Wait for dynamic content
./fcrawl scrape https://example.com --wait-for 5000
```

Discover all URLs on a website:
```bash
# Discover URLs and save to file
./fcrawl map https://example.com

# Output to console
./fcrawl map https://example.com --output console

# Output to both console and file
./fcrawl map https://example.com --output both

# Limit number of URLs discovered
./fcrawl map https://example.com --limit 1000

# Include subdomains
./fcrawl map https://example.com --include-subdomains
```

The tool still supports the legacy direct URL syntax, which defaults to the crawl command:
```bash
# This still works but shows a deprecation warning
./fcrawl https://example.com
```

Project structure:

```
firecrawl-crawl/
├── src/
│   ├── index.ts          # Main entry point with command routing
│   ├── cli.ts            # CLI argument parsing with subcommand support
│   ├── crawler.ts        # Crawl command implementation
│   ├── scraper.ts        # Scrape command implementation
│   ├── mapper.ts         # Map command implementation
│   ├── storage.ts        # File saving logic
│   ├── transform.ts      # Link transformation
│   ├── logger.ts         # Debug logging utilities
│   ├── utils/
│   │   └── url.ts        # URL utilities
│   └── tests/
│       ├── cli.test.ts
│       ├── transform.test.ts
│       ├── url.test.ts
│       └── integration.test.ts
├── index.ts              # Entry loader
├── build.ts              # Build script
├── package.json
├── tsconfig.json
├── CLAUDE.md             # AI assistant guidelines
└── README.md             # This file
```
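For orientation, here is a simplified, hypothetical sketch of how the entry point could route subcommands; the function names are illustrative and the actual `src/index.ts` and `src/cli.ts` handle flag parsing, help output, and more:

```typescript
// index.ts — a simplified, hypothetical sketch of subcommand routing.
// The real src/index.ts / src/cli.ts are more involved.
import { runCrawl } from "./crawler";
import { runScrape } from "./scraper";
import { runMap } from "./mapper";

const [command, ...rest] = process.argv.slice(2);

switch (command) {
  case "crawl":
    await runCrawl(rest);
    break;
  case "scrape":
    await runScrape(rest);
    break;
  case "map":
    await runMap(rest);
    break;
  default:
    // Legacy syntax: a bare URL defaults to the crawl command.
    if (command?.startsWith("http")) {
      console.warn("Deprecated: use a subcommand, e.g. `fcrawl crawl <url>`");
      await runCrawl([command, ...rest]);
    } else {
      console.error("Usage: fcrawl <crawl|scrape|map> <url> [options]");
      process.exit(1);
    }
}
```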
All commands save content to the `crawls/` directory (gitignored) with the following structure:

```
crawls/
└── example.com/
    ├── index.md        # Homepage (markdown)
    ├── index.html      # Homepage (HTML, if requested)
    ├── index.png       # Homepage (screenshot, if requested)
    ├── about.md        # /about page
    ├── sitemap.json    # URL map (from map command)
    ├── sitemap.txt     # URL list (from map command)
    └── docs/
        ├── index.md    # /docs/ page
        └── guide.md    # /docs/guide page
```
The tool converts URLs to file paths following these rules:
- `https://example.com/` → `crawls/example.com/index.md`
- `https://example.com/about` → `crawls/example.com/about.md`
- `https://example.com/docs/guide` → `crawls/example.com/docs/guide.md`
- `https://example.com/page.html` → `crawls/example.com/page.md`
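These rules boil down to a few lines of TypeScript. The sketch below is illustrative only; the helper name `urlToFilePath` is hypothetical, and the real logic lives in `src/storage.ts` and `src/utils/url.ts`:

```typescript
// urlToFilePath.ts — a simplified sketch of the URL-to-path rules above.
import { join } from "node:path";

export function urlToFilePath(rawUrl: string, outputDir = "crawls"): string {
  const url = new URL(rawUrl);
  // Use only the path; query strings and hashes are ignored here.
  let pathname = decodeURIComponent(url.pathname);

  // A trailing slash (or empty path) maps to an index file.
  if (pathname === "" || pathname.endsWith("/")) {
    pathname += "index";
  }

  // Strip an existing extension such as .html before adding .md.
  pathname = pathname.replace(/\.[a-zA-Z0-9]+$/, "");

  return join(outputDir, url.hostname, `${pathname}.md`);
}

// urlToFilePath("https://example.com/docs/guide") -> "crawls/example.com/docs/guide.md"
// urlToFilePath("https://example.com/page.html")  -> "crawls/example.com/page.md"
```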
All hyperlinks in the Markdown content are transformed to reference local files:
- Internal links (same domain) are converted to relative `.md` paths
- External links (different domain) are preserved unchanged
- Anchor links (starting with `#`) are preserved
- Hash fragments on internal links are preserved
Example transformations from `/docs/api/reference.md`:

- `[Home](https://example.com/)` → `[Home](../../index.md)`
- `[Guide](/docs/guide)` → `[Guide](../guide.md)`
- `[Section](#section)` → `[Section](#section)` (unchanged)
- `[External](https://other.com)` → `[External](https://other.com)` (unchanged)
The transformer also handles:
- Bare URLs (e.g., `Visit https://example.com/about`)
- Protocol-relative URLs (e.g., `//example.com/page`)
- Query parameters (stripped from internal links)
- Empty link text and links in parentheses
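As a rough illustration of the internal-link rewrite, the sketch below resolves a link against the site origin, maps it to the local `.md` path, and computes a relative path from the current page's file. The helper names are hypothetical; the actual implementation is in `src/transform.ts`:

```typescript
// transformLink.ts — a simplified sketch of the internal-link rewrite.
import { dirname, relative } from "node:path";
import { urlToFilePath } from "./urlToFilePath"; // sketch from the previous section

// Rewrites a single link target found in a page whose local file is `fromFile`
// (e.g. "crawls/example.com/docs/api/reference.md").
export function transformLink(href: string, fromFile: string, siteOrigin: string): string {
  // Anchor-only links are left untouched.
  if (href.startsWith("#")) return href;

  // Resolve relative and protocol-relative hrefs against the site origin.
  const resolved = new URL(href, siteOrigin);

  // Links to other domains are preserved unchanged.
  if (resolved.hostname !== new URL(siteOrigin).hostname) return href;

  // Internal link: map to the local .md file, drop the query, keep the hash.
  const targetFile = urlToFilePath(resolved.href);
  return relative(dirname(fromFile), targetFile) + resolved.hash;
}

// transformLink("/docs/guide", "crawls/example.com/docs/api/reference.md", "https://example.com")
//   -> "../guide.md"
```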
The map command generates two files:
- `sitemap.json`: Structured JSON with metadata:

```jsonc
{
  "source": "https://example.com",
  "timestamp": "2025-06-04T18:20:00.000Z",
  "totalUrls": 42,
  "includeSubdomains": false,
  "urls": [
    { "url": "https://example.com/" },
    { "url": "https://example.com/about" },
    // ...
  ]
}
```

- `sitemap.txt`: Simple text list of URLs:

```
# URL Map for https://example.com
# Generated: 2025-06-04T18:20:00.000Z
# Total URLs: 42

https://example.com/
https://example.com/about
...
```
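For programmatic consumers, the JSON above can be described with a small TypeScript type. The field names are taken directly from the example; the project itself may define equivalent types in `src/mapper.ts`:

```typescript
// sitemap.ts — a type for the sitemap.json shown above (illustrative).
export interface SitemapFile {
  source: string;            // URL the map command started from
  timestamp: string;         // ISO 8601 generation time
  totalUrls: number;
  includeSubdomains: boolean;
  urls: { url: string }[];
}

// Example: load a generated map with Bun and print the discovered URLs.
const sitemap: SitemapFile = await Bun.file("crawls/example.com/sitemap.json").json();
for (const { url } of sitemap.urls) {
  console.log(url);
}
```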
This tool is designed to mirror the Firecrawl API options and references the official Firecrawl types from the Firecrawl repository to ensure compatibility.

The CLI options map directly to the Firecrawl API parameters (see the sketch after this list), including:

- `includePaths` / `excludePaths` - Path filtering
- `maxDepth` / `maxDiscoveryDepth` - Crawl depth control
- `ignoreRobotsTxt` - Robots.txt handling
- `deduplicateSimilarURLs` - URL deduplication
- `ignoreQueryParameters` - Query parameter handling
- `regexOnFullURL` - Regex pattern matching
- `delay` - Request throttling
- All other `crawlerOptions` defined in the Firecrawl API
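As a rough sketch of that mapping, the snippet below assembles a request body for a Firecrawl crawl request from parsed CLI options. The option names come from the list above, but the exact request shape is defined by the Firecrawl API and by `src/cli.ts`/`src/crawler.ts`, so treat this as illustrative rather than authoritative:

```typescript
// crawlOptions.ts — a hedged sketch of turning CLI flags into crawl parameters.
interface CliCrawlOptions {
  limit?: number;
  includePaths?: string[];
  excludePaths?: string[];
  maxDepth?: number;
  ignoreRobotsTxt?: boolean;
  deduplicateSimilarURLs?: boolean;
  ignoreQueryParameters?: boolean;
  delay?: number;
}

// Fields left undefined are dropped when the body is JSON-serialized,
// so the API falls back to its own defaults.
export function buildCrawlBody(url: string, opts: CliCrawlOptions) {
  return {
    url,
    limit: opts.limit,
    includePaths: opts.includePaths,
    excludePaths: opts.excludePaths,
    maxDepth: opts.maxDepth,
    ignoreRobotsTxt: opts.ignoreRobotsTxt,
    deduplicateSimilarURLs: opts.deduplicateSimilarURLs,
    ignoreQueryParameters: opts.ignoreQueryParameters,
    delay: opts.delay,
  };
}
```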
Run the test suite:

```bash
# Run all tests
bun test

# Run tests in watch mode
bun test:watch

# Run tests with coverage
bun test:coverage

# Test link transformation specifically
bun test:links

# TypeScript type checking
bun typecheck
```

For local development:

```bash
# Install dependencies
bun install

# Build the executable
bun run build

# Run tests
bun test

# TypeScript checking
bun tsc --noEmit
```

Enable verbose logging to debug issues:
```bash
# Using CLI flag
./fcrawl crawl https://example.com -v

# Using environment variable
NODE_DEBUG=fcrawl:* ./fcrawl crawl https://example.com

# Debug specific modules
NODE_DEBUG=fcrawl:crawler ./fcrawl crawl https://example.com
NODE_DEBUG=fcrawl:cli,fcrawl:storage ./fcrawl scrape https://example.com
```

Available debug namespaces:
- `fcrawl:cli` - CLI argument parsing
- `fcrawl:crawler` - Crawl operations
- `fcrawl:storage` - File system operations
- `fcrawl:transform` - Link transformation
- `fcrawl:main` - Main application flow
- `fcrawl:error` - Error logging
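Namespaced logging like this is commonly built on `node:util`'s `debuglog`, which only prints when `NODE_DEBUG` matches the namespace. A minimal sketch, assuming that approach (the actual `src/logger.ts` may be implemented differently):

```typescript
// logger.ts — a minimal sketch of NODE_DEBUG-gated logging via node:util debuglog.
import { debuglog } from "node:util";

// Each namespace only prints when NODE_DEBUG includes it
// (wildcards such as "fcrawl:*" are supported).
export const log = {
  cli: debuglog("fcrawl:cli"),
  crawler: debuglog("fcrawl:crawler"),
  storage: debuglog("fcrawl:storage"),
  transform: debuglog("fcrawl:transform"),
  main: debuglog("fcrawl:main"),
  error: debuglog("fcrawl:error"),
};

// Printed only when run with NODE_DEBUG=fcrawl:crawler (or fcrawl:*).
log.crawler("starting crawl of %s with limit %d", "https://example.com", 10);
```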
Planned features:

- Resume interrupted crawls
- Parallel processing for faster crawling
- Custom include/exclude patterns
- Different output formats (HTML, PDF)
- Link validation and broken link reporting
- Incremental updates (only crawl changed pages)
- Progress bar and better logging
- Configuration file support
- Batch scraping from file input
- Export to different formats (EPUB, PDF)
- v1.1.0 - Added scrape and map commands, subcommand support
- v1.0.0 - Initial release with crawl functionality
License: MIT