Markdowner Local 🔖

A fully local, Docker-based tool to convert any website into LLM-ready markdown. Fork of supermemoryai/markdowner, refactored to run locally with Puppeteer and optional Bright Data proxy integration.

Features

  • Two extraction methods:
    • html - Direct HTTP fetch (fast, for server-side rendered pages)
    • hydration - Full browser rendering with Puppeteer (for JavaScript-heavy SPAs)
  • Bright Data proxy integration - Rotate IPs for scraping at scale
  • Local file-based caching - Avoid redundant fetches
  • Subpage crawling - Recursively convert up to 10 linked pages
  • Docker-ready - Easy deployment with Docker Compose

Quick Start

Using Docker Compose (Recommended)

  1. Clone and configure:

    git clone https://github.com/your-repo/markdowner.git
    cd markdowner
    cp env.example .env
    # Edit .env with your settings (proxy is optional)
  2. Build and run:

    docker-compose up -d
  3. Convert a URL:

    # Simple conversion (hydration method - renders JavaScript)
    curl "http://localhost:3000/convert?url=https://example.com"
    
    # Fast conversion (html method - direct fetch, for SSR pages)
    curl "http://localhost:3000/convert?url=https://example.com&method=html"

Local Development

  1. Install dependencies:

    npm install
  2. Run in development mode:

    npm run dev
  3. Build and run:

    npm run build
    npm start

API Reference

GET /convert

Convert a URL to markdown.

Parameter               Type                   Default      Description
----------------------  ---------------------  -----------  --------------------------------------------------------
url                     string                 required     The website URL to convert
method                  'html' | 'hydration'   'hydration'  Extraction method
enableDetailedResponse  boolean                false        Include full page content instead of article extraction
crawlSubpages           boolean                false        Also convert linked subpages (max 10)
useProxy                boolean                false        Use Bright Data proxy for requests

Content Negotiation (via the Accept request header):

  • Accept: application/json → Returns JSON with metadata
  • Accept: text/plain (default) → Returns raw markdown

Examples:

# Fast SSR conversion (no browser needed)
curl "http://localhost:3000/convert?url=https://example.com&method=html"

# Full page with JSON response
curl -H "Accept: application/json" \
  "http://localhost:3000/convert?url=https://example.com&enableDetailedResponse=true"

# With Bright Data proxy (requires configuration)
curl "http://localhost:3000/convert?url=https://example.com&useProxy=true"

# Crawl subpages (returns JSON array)
curl "http://localhost:3000/convert?url=https://example.com&crawlSubpages=true"

GET /health

Health check endpoint.

curl http://localhost:3000/health
# {"status":"ok","timestamp":"2026-01-10T..."}

GET /cache/stats

Get cache statistics.

curl http://localhost:3000/cache/stats
# {"entries":42,"sizeBytes":125000}

DELETE /cache

Clear all cached entries.

curl -X DELETE http://localhost:3000/cache
# {"cleared":42,"message":"Cleared 42 cache entries"}

Extraction Methods

html (Fast)

  • Best for: Server-side rendered pages, blogs, documentation sites
  • How it works: Direct HTTP fetch using axios
  • Speed: Very fast (~100-500ms)
  • Limitations: Won't capture JavaScript-rendered content

hydration (Full Rendering)

  • Best for: SPAs, JavaScript-heavy sites, dynamic content
  • How it works: Full Chromium browser rendering via Puppeteer
  • Speed: Slower (~2-10s depending on page complexity)
  • Capabilities: Captures all dynamically loaded content
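
A practical pattern is to try the fast html method first and fall back to hydration when the result looks like an empty JavaScript shell. A sketch using a naive length heuristic (the 200-character threshold is an arbitrary assumption):

// fallback.ts - try the cheap html method first; if the markdown is
// suspiciously short (likely a JS-rendered shell page), retry with
// full browser rendering via the hydration method.
async function convertWithFallback(url: string): Promise<string> {
  const base = 'http://localhost:3000/convert';
  const fast = await fetch(`${base}?url=${encodeURIComponent(url)}&method=html`);
  const markdown = await fast.text();
  if (fast.ok && markdown.trim().length > 200) return markdown;

  const full = await fetch(`${base}?url=${encodeURIComponent(url)}&method=hydration`);
  if (!full.ok) throw new Error(`conversion failed: HTTP ${full.status}`);
  return full.text();
}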

Bright Data Proxy Configuration

For scraping at scale or bypassing geo-restrictions, configure Bright Data:

  1. Get credentials from Bright Data Dashboard

  2. Set environment variables:

    BRIGHTDATA_USERNAME=your-zone-username
    BRIGHTDATA_PASSWORD=your-zone-password
    BRIGHTDATA_PROXY=brd.superproxy.io:22225
  3. Use the proxy:

    curl "http://localhost:3000/convert?url=https://example.com&useProxy=true"

The proxy uses session rotation: a random session ID (session-rand{N}) is appended to the proxy username, so each request exits through a different IP.
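
In code, that session rotation boils down to building a proxy URL with a fresh random suffix per request. A sketch of the idea (the exact username syntax depends on your Bright Data zone configuration):

// proxy-url.ts - build a Bright Data proxy URL with a per-request
// "-session-rand{N}" suffix so each request exits through a new IP.
// Verify the exact suffix format against your zone's settings.
function brightDataProxyUrl(): string {
  const user = process.env.BRIGHTDATA_USERNAME!; // e.g. your zone username
  const pass = process.env.BRIGHTDATA_PASSWORD!;
  const host = process.env.BRIGHTDATA_PROXY!;    // e.g. brd.superproxy.io:22225
  const session = `rand${Math.floor(Math.random() * 1_000_000)}`;
  return `http://${user}-session-${session}:${encodeURIComponent(pass)}@${host}`;
}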

Environment Variables

Variable              Default  Description
--------------------  -------  -------------------------------------------
PORT                  3000     Server port
CACHE_ENABLED         true     Enable file-based caching
CACHE_TTL_SECONDS     3600     Cache TTL in seconds (default: 1 hour)
CACHE_DIR             ./cache  Cache directory
BROWSER_HEADLESS      true     Run browser in headless mode
BROWSER_TIMEOUT       30000    Page load timeout (ms)
RATE_LIMIT_WINDOW_MS  60000    Rate limit window in ms (default: 1 minute)
RATE_LIMIT_MAX        30       Max requests per window
BRIGHTDATA_USERNAME   -        Bright Data username
BRIGHTDATA_PASSWORD   -        Bright Data password
BRIGHTDATA_PROXY      -        Bright Data proxy host:port
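
As a rough illustration of how these variables map to runtime settings, a config-loading sketch with the defaults from the table (the actual loader in src/ may be organized differently):

// config-sketch.ts - env parsing with defaults mirroring the table above.
const env = (key: string, fallback: string): string =>
  process.env[key] ?? fallback;

export const config = {
  port: Number(env('PORT', '3000')),
  cacheEnabled: env('CACHE_ENABLED', 'true') === 'true',
  cacheTtlSeconds: Number(env('CACHE_TTL_SECONDS', '3600')),
  cacheDir: env('CACHE_DIR', './cache'),
  browserHeadless: env('BROWSER_HEADLESS', 'true') === 'true',
  browserTimeout: Number(env('BROWSER_TIMEOUT', '30000')),
  rateLimitWindowMs: Number(env('RATE_LIMIT_WINDOW_MS', '60000')),
  rateLimitMax: Number(env('RATE_LIMIT_MAX', '30')),
};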

Docker Commands

# Build the image
docker build -t markdowner .

# Run with environment file
docker run -p 3000:3000 --env-file .env markdowner

# Run with Docker Compose
docker-compose up -d

# View logs
docker-compose logs -f

# Stop
docker-compose down

Architecture

┌─────────────────────────────────────────────────────────────┐
│                     Express Server                          │
│                    (src/server.ts)                          │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐      │
│  │   /convert  │    │   /health   │    │   /cache    │      │
│  └──────┬──────┘    └─────────────┘    └─────────────┘      │
│         │                                                   │
│         ▼                                                   │
│  ┌─────────────────────────────────────────────────────┐    │
│  │              Converter (src/converter.ts)           │    │
│  ├──────────────────────┬──────────────────────────────┤    │
│  │   method='html'      │   method='hydration'         │    │
│  │   ┌────────────┐     │   ┌────────────────────┐     │    │
│  │   │   Axios    │     │   │     Puppeteer      │     │    │
│  │   │  (HTTP)    │     │   │  (Chromium browser)│     │    │
│  │   └────────────┘     │   └────────────────────┘     │    │
│  └──────────────────────┴──────────────────────────────┘    │
│         │                         │                         │
│         ▼                         ▼                         │
│  ┌─────────────────────────────────────────────────────┐    │
│  │        Readability + Turndown (HTML → Markdown)     │    │
│  └─────────────────────────────────────────────────────┘    │
│         │                                                   │
│         ▼                                                   │
│  ┌─────────────────────────────────────────────────────┐    │
│  │              File Cache (src/cache.ts)              │    │
│  └─────────────────────────────────────────────────────┘    │
│                                                             │
└─────────────────────────────────────────────────────────────┘
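
To make the diagram concrete, a heavily simplified sketch of how the pieces might wire together (the real src/server.ts differs; this only mirrors the flow shown above):

// server-sketch.ts - illustrative skeleton of the request flow only.
import express from 'express';

const app = express();

app.get('/convert', async (req, res) => {
  const { url, method = 'hydration' } = req.query as Record<string, string>;
  // 1. Check the file cache (src/cache.ts) for a fresh entry.
  // 2. Fetch HTML via axios (method=html) or Puppeteer (method=hydration).
  // 3. Extract the article with Readability, convert it with Turndown.
  // 4. Store the markdown in the cache, then respond as text or JSON
  //    depending on the Accept header.
  res.type('text/plain').send(`# converted ${url} via ${method}`);
});

app.listen(Number(process.env.PORT ?? 3000));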

License

MIT

About

A fast tool to convert any website into LLM-ready markdown data. Built by https://supermemory.ai, adapted by https://hirebase.org.
