@happyvertical/spider

Purpose and Responsibilities

The spider package provides web scraping and content extraction through multiple adapters optimized for different use cases. It follows the standardized provider pattern to give developers the right tool for each scraping scenario, from simple static HTML to complex JavaScript-heavy pages.

Key Features

Provider Pattern: Four adapters for different use cases (Simple, DOM, Crawlee, Crawl4ai)
Standardized Interface: All adapters implement ISpiderAdapter
Built-in Caching: Automatic response caching via @happyvertical/cache
Navigation Expansion: Crawlee adapter auto-expands accordions/dropdowns to discover hidden links
Link Extraction: Automatic extraction of all page links
Markdown Conversion: Crawl4ai adapter provides AI-friendly markdown output
Environment Variable Configuration: HAVE_SPIDER_* pattern for easy setup

Architecture Overview

getSpider(options)
    ↓
Adapter Selection
    ├── Simple (fast HTTP + cheerio)
    ├── DOM (happy-dom processing + cheerio)
    ├── Crawlee (Playwright browser automation)
    └── Crawl4ai (remote server with markdown)
    ↓
ISpiderAdapter Interface
    ├── fetch(url, options) → Page
    ├── Built-in caching (file-based)
    └── Standardized output (url, content, links, raw, markdown?)

Key APIs

Basic Usage

import { getSpider } from '@happyvertical/spider';

// Simple adapter (fast HTTP)
const spider = await getSpider({ adapter: 'simple' });
const page = await spider.fetch('https://example.com');

console.log(page.url);      // Final URL after redirects
console.log(page.content);  // HTML content
console.log(page.links);    // Extracted links

WordPress Download Manager Detection

The scrapeDocument() function automatically detects WordPress Download Manager pages and extracts the actual download URLs:

import { scrapeDocument } from '@happyvertical/spider';

// Automatically detects WordPress download pages
const doc = await scrapeDocument('https://example.com/download/meeting-minutes/');

if (doc.metadata.strategy === 'wordpress-pdf-link') {
  console.log('WordPress download detected');
  console.log('Download URL:', doc.url); // Extracted wpdmdl URL
  console.log('Is PDF:', doc.metadata.isPdf); // true
  console.log('Complete:', doc.metadata.complete); // false - needs separate fetch
}

Important Caching Behavior (Fix for sdk#440):

WordPress Download Manager URLs often return HTML tracking/analytics pages before redirecting to the actual file. The spider includes defensive checks to prevent infinite loops:

First Fetch: HTML page with /download/ in URL → Detects WordPress → Extracts ?wpdmdl= link
Second Fetch: URL with ?wpdmdl= parameter → Skips WordPress detection (prevents loop)
Result: If ?wpdmdl= URL returns HTML, it's treated as HTML (not marked as PDF)

Debug Logging:

Enable debug logging to troubleshoot WordPress detection:

export HAVE_DEBUG=true
# or
export DEBUG=spider

This will log:

When WordPress pages are detected
When WordPress detection is skipped (wpdmdl URLs)
What download URLs are extracted

Adapter Selection

// Simple: Fast HTTP with cheerio (static content)
const simple = await getSpider({
  adapter: 'simple',
  cacheDir: '.cache/spider',
});

// DOM: happy-dom processing (complex HTML)
const dom = await getSpider({
  adapter: 'dom',
  cacheDir: '.cache/spider',
});

// Crawlee: Playwright browser automation (dynamic content)
const crawlee = await getSpider({
  adapter: 'crawlee',
  headless: true,
  userAgent: 'MyBot/1.0 (+https://mysite.com/bot)',
  cacheDir: '.cache/spider',
});

// Crawl4ai: Remote server with markdown conversion (AI-friendly)
const crawl4ai = await getSpider({
  adapter: 'crawl4ai',
  baseUrl: 'http://crawl4ai.default.svc:11235', // or use HAVE_SPIDER_CRAWL4AI_URL
  cacheDir: '.cache/spider',
});

// Crawl4ai provides markdown in addition to HTML
const page = await crawl4ai.fetch('https://example.com');
console.log(page.content);   // HTML content
console.log(page.markdown);  // Markdown conversion (AI-friendly)

Fetch Options

const page = await spider.fetch('https://example.com', {
  headers: { 'User-Agent': 'MyBot/1.0' },
  timeout: 30000,      // 30 seconds
  cache: true,         // Enable caching
  cacheExpiry: 300000, // 5 minutes
});

Environment Variable Configuration

# Configure via environment variables
export HAVE_SPIDER_TIMEOUT=60000
export HAVE_SPIDER_USER_AGENT="MyBot/1.0 (+https://mysite.com/bot)"
export HAVE_SPIDER_MAX_REQUESTS=100

# Crawl4ai server URL (for crawl4ai adapter)
export HAVE_SPIDER_CRAWL4AI_URL="http://crawl4ai.default.svc:11235"

// Env vars are merged with options (user options take precedence)
process.env.HAVE_SPIDER_TIMEOUT = '45000';

const spider = await getSpider({ adapter: 'simple' });
await spider.fetch(url); // Uses 45000ms timeout from env

await spider.fetch(url, { timeout: 30000 }); // Overrides to 30000ms

Page Object Structure

interface Page {
  url: string;       // Final URL after redirects
  content: string;   // Full HTML content
  links: Link[];     // Extracted links with metadata
  raw: any;          // Adapter-specific raw response
  markdown?: string; // Markdown content (crawl4ai adapter only)
}

Adapter Characteristics

Simple Adapter

Speed: ~200ms first fetch, ~5ms cached
Use Case: Static HTML, high volume scraping
Dependencies: undici, cheerio
Best For: Fast content extraction, minimal resource usage

DOM Adapter

Speed: ~500ms first fetch, ~5ms cached
Use Case: Complex/malformed HTML needing normalization
Dependencies: happy-dom, cheerio
Best For: DOM manipulation without full browser

Crawlee Adapter

Speed: ~8000ms first fetch, ~5ms cached
Use Case: JavaScript-rendered content, dynamic loading
Dependencies: crawlee, playwright
Best For: Accordion navigation, hidden links, AJAX content
Special Feature: Auto-expands [aria-expanded="false"], accordions, <details> tags

Crawl4ai Adapter

Speed: Variable (depends on remote server), ~5ms cached
Use Case: AI-friendly content extraction, markdown conversion
Dependencies: undici (HTTP client to remote server)
Best For: LLM/AI pipelines, content summarization, remote scraping infrastructure
Special Features:
- Markdown conversion built-in (page.markdown)
- Structured link extraction from server
- Offloads browser overhead to Kubernetes cluster
- Configurable via HAVE_SPIDER_CRAWL4AI_URL

Dependencies

Internal:
- @happyvertical/cache - Caching infrastructure
- @happyvertical/utils - Error types, validation
External:
- cheerio - HTML parsing (all adapters)
- undici - HTTP client (Simple adapter)
- happy-dom - DOM implementation (DOM adapter)
- crawlee - Browser automation framework (Crawlee adapter)
- playwright - Browser engine (Crawlee dependency)

Development Guidelines

All adapters must implement ISpiderAdapter interface completely
Cache keys prefixed by adapter type (simple:, dom:, crawlee:, crawl4ai:)
Default cache expiry: 5 minutes (300,000ms)
Default timeout: 30 seconds (30,000ms) for local adapters, 60 seconds for crawl4ai
Links should be absolute URLs (relative URLs converted)
Handle errors with ValidationError and NetworkError from @happyvertical/utils
Respect robots.txt and use descriptive User-Agent strings

Expert Agent Expertise

When working with spider:

Adapter Selection: Start with Simple, fallback to Crawlee if content missing, use Crawl4ai for AI pipelines
Caching Strategy: Remote adapters (Crawlee, Crawl4ai) benefit most from caching, always enable
Navigation Expansion: Crawlee runs up to 3 iterations clicking expandable elements
Performance: Simple is 10x faster than DOM, 40x faster than Crawlee (uncached)
Error Handling: NetworkError for HTTP failures, ValidationError for bad inputs
Crawl4ai for AI: When building LLM pipelines, prefer Crawl4ai for its markdown output
Real-World Testing: Integration tests use Bentley town council site (accordion navigation)

Common Patterns

// PDF discovery with navigation expansion
import { getSpider } from '@happyvertical/spider';

const spider = await getSpider({
  adapter: 'crawlee',
  headless: true,
});

const page = await spider.fetch(
  'https://townofbentley.ca/town-office/council/meetings-agendas/',
  { timeout: 60000, cache: true }
);

const pdfLinks = page.links.filter(link => link.endsWith('.pdf'));
console.log(`Found ${pdfLinks.length} PDFs`);

// Fallback strategy for resilience
async function robustFetch(url: string) {
  try {
    // Try Crawlee first for best quality
    const spider = await getSpider({ adapter: 'crawlee' });
    return await spider.fetch(url, { timeout: 30000 });
  } catch (error) {
    // Fallback to simple adapter
    console.warn('Crawlee failed, using simple adapter');
    const spider = await getSpider({ adapter: 'simple' });
    return await spider.fetch(url, { timeout: 15000 });
  }
}

// Batch processing with caching
const spider = await getSpider({ adapter: 'simple' });
const urls = ['https://example.com/1', 'https://example.com/2'];

const pages = await Promise.all(
  urls.map(url => spider.fetch(url, {
    cache: true,
    cacheExpiry: 600000, // 10 minutes
  }))
);

// Integration with AI for content analysis (using simple adapter)
import { getAI } from '@happyvertical/ai';

const spider = await getSpider({ adapter: 'simple' });
const page = await spider.fetch('https://news.example.com/article');

const $ = cheerio.load(page.content);
const articleText = $('article').text();

const ai = await getAI({ type: 'anthropic' });
const summary = await ai.chat([
  { role: 'user', content: `Summarize: ${articleText}` }
]);

// Integration with AI using crawl4ai (markdown ready)
const crawl4ai = await getSpider({ adapter: 'crawl4ai' });
const page = await crawl4ai.fetch('https://news.example.com/article');

// No cheerio needed - markdown is already extracted
const ai = await getAI({ type: 'anthropic' });
const summary = await ai.chat([
  { role: 'user', content: `Summarize: ${page.markdown}` }
]);

Related Packages

@happyvertical/cache: Powers built-in caching with file-based storage
@happyvertical/utils: Provides error types and validation
@happyvertical/pdf: Often used with spider to download extracted PDF links
@happyvertical/documents: Uses spider for web page processing
@happyvertical/ai: Commonly paired for content analysis
@happyvertical/content: Uses spider for content mirroring

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

@happyvertical/spider

Purpose and Responsibilities

Key Features

Architecture Overview

Key APIs

Basic Usage

WordPress Download Manager Detection

Adapter Selection

Fetch Options

Environment Variable Configuration

Page Object Structure

Adapter Characteristics

Simple Adapter

DOM Adapter

Crawlee Adapter

Crawl4ai Adapter

Dependencies

Development Guidelines

Expert Agent Expertise

Common Patterns

Related Packages

FilesExpand file tree

CLAUDE.md

Latest commit

History

CLAUDE.md

File metadata and controls

@happyvertical/spider

Purpose and Responsibilities

Key Features

Architecture Overview

Key APIs

Basic Usage

WordPress Download Manager Detection

Adapter Selection

Fetch Options

Environment Variable Configuration

Page Object Structure

Adapter Characteristics

Simple Adapter

DOM Adapter

Crawlee Adapter

Crawl4ai Adapter

Dependencies

Development Guidelines

Expert Agent Expertise

Common Patterns

Related Packages