AllWebCrawler API

AllWebCrawler is a robust, secure, and feature-rich website crawler API service built with Node.js and Express. This service provides comprehensive web scraping capabilities with built-in security measures, CORS handling, rate limiting, and proxy support.

Features

🔒 Security & Safety

  • Helmet.js security headers
  • CORS configuration with origin validation
  • Rate limiting and request throttling
  • Input validation with Joi schemas
  • User agent rotation to avoid detection
  • Robots.txt compliance checking
  • Proxy support (HTTP/HTTPS/SOCKS5)

🕷️ Crawling Capabilities

  • Single URL crawling
  • Batch URL crawling with concurrency control
  • Session management for persistent crawling
  • HTML parsing with Cheerio
  • Custom CSS selector extraction
  • Metadata extraction
  • Image and link extraction
  • Text content extraction

📊 Monitoring & Logging

  • Comprehensive logging with Winston
  • Health check endpoints
  • Performance metrics
  • Error tracking and reporting
  • Request/response monitoring

Quick Start

Installation

# Clone the repository
git clone https://github.com/[your-username]/AllWebCrawler.git
cd AllWebCrawler

# Install dependencies
npm install

# Copy environment configuration
cp .env.example .env   # on Windows: copy .env.example .env

# Start the development server
npm run dev

Basic Usage

# Test the service
curl http://localhost:3000/api/health

# Crawl a single URL
curl -X POST http://localhost:3000/api/crawler/crawl \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'

📖 Wikipedia Example

Here's a practical example of crawling a Wikipedia page to extract structured information:

Using cURL

# Crawl Wikipedia Education page with custom selectors
curl -X POST http://localhost:3000/api/crawler/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://en.wikipedia.org/wiki/Education",
    "timeout": 30000,
    "extractHeadings": true,
    "extractLinks": true,
    "extractImages": true,
    "selectors": {
      "pageTitle": "#firstHeading",
      "summary": "#mw-content-text > div.mw-parser-output > p:first-of-type",
      "infoboxTitle": ".infobox-title",
      "infoboxData": ".infobox tr",
      "tableOfContents": "#toc .toctext",
      "mainCategories": "#catlinks .mw-normal-catlinks ul li a",
      "lastModified": "#footer-info-lastmod",
      "languages": ".interlanguage-link-target"
    }
  }'

Using Node.js/JavaScript

const axios = require('axios');

async function crawlWikipedia() {
  try {
    const response = await axios.post('http://localhost:3000/api/crawler/crawl', {
      url: 'https://en.wikipedia.org/wiki/Education',
      timeout: 30000,
      extractHeadings: true,
      extractLinks: true,
      extractImages: true,
      extractText: false, // Skip full text to reduce response size
      selectors: {
        pageTitle: '#firstHeading',
        summary: '#mw-content-text > div.mw-parser-output > p:first-of-type',
        infoboxTitle: '.infobox-title',
        tableOfContents: '#toc .toctext',
        categories: '#catlinks .mw-normal-catlinks ul li a',
        lastModified: '#footer-info-lastmod',
        citationCount: '.citation',
        externalLinks: '#External_links + ul li a',
        seeAlso: '#See_also + ul li a'
      }
    });

    const data = response.data;
    
    console.log('Wikipedia Page Analysis:');
    console.log('=====================');
    console.log(`Page Title: ${data.html.custom.pageTitle}`);
    console.log(`Summary: ${data.html.custom.summary?.substring(0, 200)}...`);
    console.log(`Headings Found: ${Object.values(data.html.headings).flat().length}`);
    console.log(`Links Found: ${data.html.links?.length || 0}`);
    console.log(`Images Found: ${data.html.images?.length || 0}`);
    console.log(`Table of Contents: ${data.html.custom.tableOfContents?.length || 0} sections`);
    console.log(`Categories: ${data.html.custom.categories?.length || 0}`);
    console.log(`Load Time: ${data.duration}ms`);
    
    return data;
  } catch (error) {
    console.error('Error crawling Wikipedia:', error.response?.data || error.message);
  }
}

crawlWikipedia();

Expected Response Structure

{
  "success": true,
  "requestId": "uuid-here",
  "url": "https://en.wikipedia.org/wiki/Education",
  "status": 200,
  "duration": 2347,
  "timestamp": "2025-06-29T10:00:00.000Z",
  "html": {
    "title": "Education - Wikipedia",
    "meta": {
      "description": "Education is the transmission of knowledge, skills, and character traits...",
      "keywords": "Education, learning, teaching, school, university"
    },
    "headings": {
      "h1": ["Education"],
      "h2": ["Etymology", "History", "Formal education", "Informal education", ...],
      "h3": ["Early history", "Ancient civilizations", "Medieval period", ...]
    },
    "links": [
      {
        "text": "learning",
        "href": "/wiki/Learning",
        "title": "Learning"
      },
      ...
    ],
    "images": [
      {
        "src": "//upload.wikimedia.org/wikipedia/commons/thumb/...",
        "alt": "Students in a classroom",
        "title": null
      },
      ...
    ],
    "custom": {
      "pageTitle": "Education",
      "summary": "Education is the transmission of knowledge, skills, and character traits...",
      "tableOfContents": [
        "Etymology",
        "History", 
        "Formal education",
        "Informal education",
        ...
      ],
      "categories": [
        "Education",
        "Learning",
        "Pedagogy",
        ...
      ],
      "lastModified": "This page was last edited on 28 June 2025, at 15:30 (UTC)."
    }
  },
  "metadata": {
    "statusCode": 200,
    "statusText": "OK",
    "contentLength": 245678,
    "lastModified": "Wed, 28 Jun 2025 15:30:00 GMT",
    "server": "nginx",
    "encoding": "gzip"
  }
}

Advanced Wikipedia Crawling with Sessions

// Create a session for multiple Wikipedia pages
const session = await axios.post('http://localhost:3000/api/crawler/session', {
  userAgent: 'Educational Research Bot 1.0',
  description: 'Wikipedia education research session'
});

// Crawl multiple related pages
const educationTopics = [
  'https://en.wikipedia.org/wiki/Education',
  'https://en.wikipedia.org/wiki/Higher_education',
  'https://en.wikipedia.org/wiki/Primary_education',
  'https://en.wikipedia.org/wiki/Educational_technology'
];

const batchResult = await axios.post('http://localhost:3000/api/crawler/batch', {
  urls: educationTopics,
  sessionId: session.data.sessionId,
  concurrency: 2,
  delay: 1000, // Be respectful to Wikipedia servers
  selectors: {
    pageTitle: '#firstHeading',
    summary: '#mw-content-text > div.mw-parser-output > p:first-of-type',
    categories: '#catlinks .mw-normal-catlinks ul li a',
    wordCount: '#mw-content-text'
  }
});

console.log(`Crawled ${batchResult.data.summary.successful} Wikipedia pages successfully`);

API Documentation

Base URL

http://localhost:3000/api

Endpoints

Health Check

  • GET /health - Basic health check
  • GET /health/detailed - Detailed system information
  • GET /health/readiness - Readiness probe
  • GET /health/liveness - Liveness probe
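
The basic and detailed health checks can be queried directly with curl; the exact fields in the detailed response depend on the deployment, so treat the second call as illustrative:

# Basic health check
curl http://localhost:3000/api/health

# Detailed system information (uptime, memory, etc.)
curl http://localhost:3000/api/health/detailed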

Crawler Service

  • GET /crawler/test - Test crawler endpoint
  • POST /crawler/crawl - Crawl a single URL
  • POST /crawler/batch - Crawl multiple URLs
  • POST /crawler/session - Create a crawling session
  • GET /crawler/session/:id - Get session information
  • DELETE /crawler/session/:id - Delete a session
  • GET /crawler/status - Get service status
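
A typical session lifecycle with these endpoints looks like the following; <session-id> is a placeholder for the sessionId returned when the session is created:

# Create a session
curl -X POST http://localhost:3000/api/crawler/session \
  -H "Content-Type: application/json" \
  -d '{"userAgent": "Custom Bot 1.0", "description": "My crawling session"}'

# Inspect the session
curl http://localhost:3000/api/crawler/session/<session-id>

# Delete the session when finished
curl -X DELETE http://localhost:3000/api/crawler/session/<session-id>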

Request Examples

Single URL Crawl

POST /api/crawler/crawl
{
  "url": "https://example.com",
  "timeout": 30000,
  "extractHeadings": true,
  "extractLinks": true,
  "extractImages": true,
  "selectors": {
    "title": "h1",
    "description": "meta[name='description']"
  }
}

Batch Crawl

POST /api/crawler/batch
{
  "urls": [
    "https://example.com",
    "https://google.com",
    "https://github.com"
  ],
  "concurrency": 3,
  "delay": 1000,
  "extractText": true
}

Create Session

POST /api/crawler/session
{
  "userAgent": "Custom Bot 1.0",
  "description": "My crawling session"
}

Response Format

Success Response

{
  "success": true,
  "requestId": "uuid",
  "url": "https://example.com",
  "status": 200,
  "duration": 1234,
  "timestamp": "2025-06-29T10:00:00.000Z",
  "html": {
    "title": "Example Domain",
    "meta": {...},
    "headings": {...},
    "links": [...],
    "images": [...],
    "custom": {...}
  },
  "metadata": {...}
}

Error Response

{
  "success": false,
  "error": {
    "message": "Error description",
    "statusCode": 400
  }
}
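
Because both shapes carry a success flag, clients can branch on it. A minimal sketch, assuming result already holds the parsed JSON body of either response:

function handleCrawlResult(result) {
  if (result.success) {
    console.log(`Crawled ${result.url} in ${result.duration}ms`);
  } else {
    console.error(`Crawl failed (${result.error.statusCode}): ${result.error.message}`);
  }
}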

Configuration

Environment Variables

Variable                  Default      Description
PORT                      3000         Server port
NODE_ENV                  development  Environment mode
DEFAULT_TIMEOUT           30000        Request timeout (ms)
RATE_LIMIT_MAX_REQUESTS   100          Max requests per window
RATE_LIMIT_WINDOW_MS      900000       Rate limit window (ms)
USER_AGENT_ROTATION       true         Enable user agent rotation
PROXY_ENABLED             false        Enable proxy support
CORS_ORIGIN               *            Allowed CORS origins
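
A minimal .env using the documented defaults looks like this:

PORT=3000
NODE_ENV=development
DEFAULT_TIMEOUT=30000
RATE_LIMIT_MAX_REQUESTS=100
RATE_LIMIT_WINDOW_MS=900000
USER_AGENT_ROTATION=true
PROXY_ENABLED=false
CORS_ORIGIN=*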

Proxy Configuration

PROXY_ENABLED=true
PROXY_HOST=proxy.example.com
PROXY_PORT=8080
PROXY_USERNAME=username
PROXY_PASSWORD=password
PROXY_TYPE=http
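
Since SOCKS5 proxies are also supported (see Features), the same block can point at a SOCKS5 endpoint; the socks5 value for PROXY_TYPE is an assumption based on that feature list:

PROXY_ENABLED=true
PROXY_HOST=socks-proxy.example.com
PROXY_PORT=1080
PROXY_TYPE=socks5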

Security Features

Rate Limiting

  • 100 requests per 15 minutes per IP
  • Configurable rate limits
  • Slow-down after threshold
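
In an Express service this combination of a hard limit plus slow-down is commonly built with the express-rate-limit and express-slow-down packages. The sketch below shows that pattern; it is an assumption about the wiring, not necessarily this project's exact implementation:

const express = require('express');
const rateLimit = require('express-rate-limit');
const slowDown = require('express-slow-down');

const app = express();

// Hard limit: reject requests once the per-window quota is used up
const limiter = rateLimit({
  windowMs: Number(process.env.RATE_LIMIT_WINDOW_MS) || 15 * 60 * 1000,
  max: Number(process.env.RATE_LIMIT_MAX_REQUESTS) || 100
});

// Soft limit: add a growing delay once a lower threshold is crossed
const speedLimiter = slowDown({
  windowMs: 15 * 60 * 1000,
  delayAfter: 50,      // start slowing down after 50 requests in the window
  delayMs: () => 500   // add 500 ms of delay per request beyond the threshold
});

app.use('/api/', limiter, speedLimiter);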

CORS Protection

  • Origin validation
  • Configurable allowed origins
  • Preflight request handling
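
Origin validation against the CORS_ORIGIN setting is typically handled by the cors middleware. A minimal sketch, assuming CORS_ORIGIN is a comma-separated list (the exact format used by this project may differ):

const express = require('express');
const cors = require('cors');

const app = express();
const allowedOrigins = (process.env.CORS_ORIGIN || '*').split(',');

app.use(cors({
  origin: (origin, callback) => {
    // Allow non-browser requests (no Origin header), wildcard, and whitelisted origins
    if (!origin || allowedOrigins.includes('*') || allowedOrigins.includes(origin)) {
      return callback(null, true);
    }
    callback(new Error('Origin not allowed by CORS'));
  }
}));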

Security Headers

  • Content Security Policy
  • HTTP Strict Transport Security
  • X-Frame-Options
  • X-XSS-Protection
  • And more...
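
With Helmet.js, which this project uses, these headers come from a single middleware call. The directives below are illustrative rather than the project's actual configuration:

const express = require('express');
const helmet = require('helmet');

const app = express();

app.use(helmet({
  contentSecurityPolicy: {
    directives: {
      defaultSrc: ["'self'"],
      imgSrc: ["'self'", 'data:']
    }
  },
  hsts: { maxAge: 31536000 } // one year of HTTP Strict Transport Security
}));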

Input Validation

  • URL validation
  • Request size limits
  • Schema validation with Joi
  • SQL injection prevention
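
A Joi schema for the single-URL crawl request could look roughly like this, using the field names from the request examples in this document (the defaults and limits are assumptions):

const Joi = require('joi');

const crawlSchema = Joi.object({
  url: Joi.string().uri({ scheme: ['http', 'https'] }).required(),
  timeout: Joi.number().integer().min(1000).max(120000).default(30000),
  extractHeadings: Joi.boolean().default(false),
  extractLinks: Joi.boolean().default(false),
  extractImages: Joi.boolean().default(false),
  extractText: Joi.boolean().default(false),
  selectors: Joi.object().pattern(Joi.string(), Joi.string()) // name -> CSS selector
});

const { error, value } = crawlSchema.validate({ url: 'https://example.com', timeout: 30000 });
if (error) {
  console.error('Invalid crawl request:', error.message);
} else {
  console.log('Validated request:', value);
}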

Deployment

Docker Deployment

FROM node:18-alpine
WORKDIR /app
COPY package*.json ./
RUN npm ci --only=production
COPY src/ ./src/
EXPOSE 3000
CMD ["npm", "start"]
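
Building and running the image follows the standard Docker workflow; the image name is only an example:

# Build the image
docker build -t allwebcrawler-api .

# Run it with your environment file, exposing port 3000
docker run --env-file .env -p 3000:3000 allwebcrawler-api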

Production Considerations

  1. Environment Configuration

    • Set NODE_ENV=production
    • Configure proper CORS origins
    • Enable security headers
    • Set appropriate rate limits
  2. Monitoring

    • Use log aggregation (ELK Stack, Splunk)
    • Set up health check monitoring
    • Configure alerts for errors
  3. Scaling

    • Use load balancers
    • Implement session clustering
    • Consider Redis for session storage

Development

Scripts

npm start        # Start production server
npm run dev      # Start development server with nodemon
npm test         # Run tests
npm run lint     # Run ESLint
npm run security-audit  # Run security audit

Testing

# Run all tests
npm test

# Test API functionality
npm run test:api

# Run Wikipedia crawling example
npm run example:wikipedia

# Run general client examples
npm run example:client

# Test specific endpoint
curl http://localhost:3000/api/crawler/test

Troubleshooting

Common Issues

  1. CORS Errors

    • Check CORS_ORIGIN configuration
    • Verify allowed headers
    • Ensure preflight requests are handled
  2. Rate Limiting

    • Adjust rate limit settings
    • Implement API key authentication
    • Use different IP addresses
  3. Timeout Issues

    • Increase DEFAULT_TIMEOUT
    • Check target website response times
    • Verify proxy configuration
  4. Memory Issues

    • Monitor memory usage
    • Implement request size limits
    • Clean up sessions regularly

Logs

Logs are stored in the logs/ directory:

  • crawler.log - General application logs
  • error.log - Error logs only
  • exceptions.log - Uncaught exceptions
  • rejections.log - Unhandled promise rejections
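
A Winston logger that produces this file layout is typically configured along these lines; treat it as a sketch of the pattern rather than the project's exact setup:

const winston = require('winston');

const logger = winston.createLogger({
  level: 'info',
  format: winston.format.combine(winston.format.timestamp(), winston.format.json()),
  transports: [
    new winston.transports.File({ filename: 'logs/crawler.log' }),
    new winston.transports.File({ filename: 'logs/error.log', level: 'error' })
  ],
  exceptionHandlers: [new winston.transports.File({ filename: 'logs/exceptions.log' })],
  rejectionHandlers: [new winston.transports.File({ filename: 'logs/rejections.log' })]
});

logger.info('Logger initialised');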

License

MIT License - see LICENSE file for details.

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Commit your changes
  4. Push to the branch
  5. Create a Pull Request

Support

For support and questions:

  • Create an issue on GitHub
  • Check the documentation
  • Review the logs for error details
