# AllWebCrawler

AllWebCrawler is a robust, secure, and feature-rich website crawler API service built with Node.js and Express. It provides comprehensive web scraping capabilities with built-in security measures, CORS handling, rate limiting, and proxy support.
## Features

### Security
- Helmet.js security headers
- CORS configuration with origin validation
- Rate limiting and request throttling
- Input validation with Joi schemas
- User agent rotation to avoid detection
- Robots.txt compliance checking (see the sketch after this list)
- Proxy support (HTTP/HTTPS/SOCKS5)
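
As a rough illustration of the user-agent rotation and robots.txt items above, the snippet below pairs a rotating user agent with a robots.txt check. The `robots-parser` package and the helper names here are assumptions for the sketch, not necessarily what AllWebCrawler uses internally:

```javascript
// Sketch: rotate user agents and honor robots.txt before crawling.
// robots-parser and these helper names are illustrative assumptions.
const axios = require('axios');
const robotsParser = require('robots-parser');

const USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36'
];

// Pick a different user agent per request to avoid naive blocking.
const randomUserAgent = () =>
  USER_AGENTS[Math.floor(Math.random() * USER_AGENTS.length)];

// Fetch the site's robots.txt and ask whether this URL may be crawled.
async function isCrawlAllowed(targetUrl, userAgent) {
  const robotsUrl = new URL('/robots.txt', targetUrl).href;
  const { data } = await axios.get(robotsUrl, { headers: { 'User-Agent': userAgent } });
  return robotsParser(robotsUrl, data).isAllowed(targetUrl, userAgent);
}
```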
### Crawling
- Single URL crawling
- Batch URL crawling with concurrency control
- Session management for persistent crawling
### Content Extraction
- HTML parsing with Cheerio
- Custom CSS selector extraction
- Metadata extraction
- Image and link extraction
- Text content extraction
### Monitoring & Logging
- Comprehensive logging with Winston
- Health check endpoints
- Performance metrics
- Error tracking and reporting
- Request/response monitoring
## Quick Start

```bash
# Clone the repository
git clone https://github.com/[your-username]/AllWebCrawler.git
cd AllWebCrawler
# Install dependencies
npm install
# Copy environment configuration
cp .env.example .env
# Start the development server
npm run dev
```

```bash
# Test the service
curl http://localhost:3000/api/health
# Crawl a single URL
curl -X POST http://localhost:3000/api/crawler/crawl \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com"}'
```

## Wikipedia Crawling Example

Here's a practical example of crawling a Wikipedia page to extract structured information:

```bash
# Crawl Wikipedia Education page with custom selectors
curl -X POST http://localhost:3000/api/crawler/crawl \
-H "Content-Type: application/json" \
-d '{
"url": "https://en.wikipedia.org/wiki/Education",
"timeout": 30000,
"extractHeadings": true,
"extractLinks": true,
"extractImages": true,
"selectors": {
"pageTitle": "#firstHeading",
"summary": "#mw-content-text > div.mw-parser-output > p:first-of-type",
"infoboxTitle": ".infobox-title",
"infoboxData": ".infobox tr",
"tableOfContents": "#toc .toctext",
"mainCategories": "#catlinks .mw-normal-catlinks ul li a",
"lastModified": "#footer-info-lastmod",
"languages": ".interlanguage-link-target"
}
}'
```

The same request from Node.js using axios:

```javascript
const axios = require('axios');
async function crawlWikipedia() {
try {
const response = await axios.post('http://localhost:3000/api/crawler/crawl', {
url: 'https://en.wikipedia.org/wiki/Education',
timeout: 30000,
extractHeadings: true,
extractLinks: true,
extractImages: true,
extractText: false, // Skip full text to reduce response size
selectors: {
pageTitle: '#firstHeading',
summary: '#mw-content-text > div.mw-parser-output > p:first-of-type',
infoboxTitle: '.infobox-title',
tableOfContents: '#toc .toctext',
categories: '#catlinks .mw-normal-catlinks ul li a',
lastModified: '#footer-info-lastmod',
citationCount: '.citation',
externalLinks: '#External_links + ul li a',
seeAlso: '#See_also + ul li a'
}
});
const data = response.data;
console.log('Wikipedia Page Analysis:');
console.log('=====================');
console.log(`Page Title: ${data.html.custom.pageTitle}`);
console.log(`Summary: ${data.html.custom.summary?.substring(0, 200)}...`);
console.log(`Headings Found: ${Object.values(data.html.headings).flat().length}`);
console.log(`Links Found: ${data.html.links?.length || 0}`);
console.log(`Images Found: ${data.html.images?.length || 0}`);
console.log(`Table of Contents: ${data.html.custom.tableOfContents?.length || 0} sections`);
console.log(`Categories: ${data.html.custom.categories?.length || 0}`);
console.log(`Load Time: ${data.duration}ms`);
return data;
} catch (error) {
console.error('Error crawling Wikipedia:', error.response?.data || error.message);
}
}
crawlWikipedia();
```

An example response (truncated with `...` for brevity):

```json
{
"success": true,
"requestId": "uuid-here",
"url": "https://en.wikipedia.org/wiki/Education",
"status": 200,
"duration": 2347,
"timestamp": "2025-06-29T10:00:00.000Z",
"html": {
"title": "Education - Wikipedia",
"meta": {
"description": "Education is the transmission of knowledge, skills, and character traits...",
"keywords": "Education, learning, teaching, school, university"
},
"headings": {
"h1": ["Education"],
"h2": ["Etymology", "History", "Formal education", "Informal education", ...],
"h3": ["Early history", "Ancient civilizations", "Medieval period", ...]
},
"links": [
{
"text": "learning",
"href": "/wiki/Learning",
"title": "Learning"
},
...
],
"images": [
{
"src": "//upload.wikimedia.org/wikipedia/commons/thumb/...",
"alt": "Students in a classroom",
"title": null
},
...
],
"custom": {
"pageTitle": "Education",
"summary": "Education is the transmission of knowledge, skills, and character traits...",
"tableOfContents": [
"Etymology",
"History",
"Formal education",
"Informal education",
...
],
"categories": [
"Education",
"Learning",
"Pedagogy",
...
],
"lastModified": "This page was last edited on 28 June 2025, at 15:30 (UTC)."
}
},
"metadata": {
"statusCode": 200,
"statusText": "OK",
"contentLength": 245678,
"lastModified": "Wed, 28 Jun 2025 15:30:00 GMT",
"server": "nginx",
"encoding": "gzip"
}
}
```

### Batch Crawling with Sessions

```javascript
// Create a session for multiple Wikipedia pages
const session = await axios.post('http://localhost:3000/api/crawler/session', {
userAgent: 'Educational Research Bot 1.0',
description: 'Wikipedia education research session'
});
// Crawl multiple related pages
const educationTopics = [
'https://en.wikipedia.org/wiki/Education',
'https://en.wikipedia.org/wiki/Higher_education',
'https://en.wikipedia.org/wiki/Primary_education',
'https://en.wikipedia.org/wiki/Educational_technology'
];
const batchResult = await axios.post('http://localhost:3000/api/crawler/batch', {
urls: educationTopics,
sessionId: session.data.sessionId,
concurrency: 2,
delay: 1000, // Be respectful to Wikipedia servers
selectors: {
pageTitle: '#firstHeading',
summary: '#mw-content-text > div.mw-parser-output > p:first-of-type',
categories: '#catlinks .mw-normal-catlinks ul li a',
wordCount: '#mw-content-text'
}
});
console.log(`Crawled ${batchResult.data.summary.successful} Wikipedia pages successfully`);
```

## API Reference

All endpoints are prefixed with the base URL:

```
http://localhost:3000/api
```
### Health Endpoints
- `GET /health` - Basic health check
- `GET /health/detailed` - Detailed system information
- `GET /health/readiness` - Readiness probe
- `GET /health/liveness` - Liveness probe
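
For example, a quick readiness probe from Node.js might look like this (a minimal sketch; the exact response shape is whatever the service returns):

```javascript
// Sketch: poll the readiness endpoint and print the result.
const axios = require('axios');

async function checkReadiness() {
  const { data } = await axios.get('http://localhost:3000/api/health/readiness');
  console.log('Readiness:', data);
}

checkReadiness().catch((err) => console.error('Service unreachable:', err.message));
```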
### Crawler Endpoints
- `GET /crawler/test` - Test crawler endpoint
- `POST /crawler/crawl` - Crawl a single URL
- `POST /crawler/batch` - Crawl multiple URLs
- `POST /crawler/session` - Create a crawling session
- `GET /crawler/session/:id` - Get session information
- `DELETE /crawler/session/:id` - Delete a session
- `GET /crawler/status` - Get service status
### Crawl a Single URL

`POST /api/crawler/crawl`

```json
{
"url": "https://example.com",
"timeout": 30000,
"extractHeadings": true,
"extractLinks": true,
"extractImages": true,
"selectors": {
"title": "h1",
"description": "meta[name='description']"
}
}
```

### Crawl Multiple URLs

`POST /api/crawler/batch`

```json
{
"urls": [
"https://example.com",
"https://google.com",
"https://github.com"
],
"concurrency": 3,
"delay": 1000,
"extractText": true
}
```

### Create a Session

`POST /api/crawler/session`

```json
{
"userAgent": "Custom Bot 1.0",
"description": "My crawling session"
}
```

### Success Response Format

```json
{
"success": true,
"requestId": "uuid",
"url": "https://example.com",
"status": 200,
"duration": 1234,
"timestamp": "2025-06-29T10:00:00.000Z",
"html": {
"title": "Example Domain",
"meta": {...},
"headings": {...},
"links": [...],
"images": [...],
"custom": {...}
},
"metadata": {...}
}
```

### Error Response Format

```json
{
"success": false,
"error": {
"message": "Error description",
"statusCode": 400
}
}
```

## Configuration

### Environment Variables

| Variable | Default | Description |
|---|---|---|
| `PORT` | `3000` | Server port |
| `NODE_ENV` | `development` | Environment mode |
| `DEFAULT_TIMEOUT` | `30000` | Request timeout (ms) |
| `RATE_LIMIT_MAX_REQUESTS` | `100` | Max requests per window |
| `RATE_LIMIT_WINDOW_MS` | `900000` | Rate limit window (ms) |
| `USER_AGENT_ROTATION` | `true` | Enable user agent rotation |
| `PROXY_ENABLED` | `false` | Enable proxy support |
| `CORS_ORIGIN` | `*` | Allowed CORS origins |
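
For reference, these variables might be consumed roughly as follows (a sketch assuming `dotenv`; the project's actual config loader may differ):

```javascript
// Sketch: load and normalize the environment variables from the table above.
require('dotenv').config();

module.exports = {
  port: parseInt(process.env.PORT, 10) || 3000,
  env: process.env.NODE_ENV || 'development',
  defaultTimeout: parseInt(process.env.DEFAULT_TIMEOUT, 10) || 30000,
  rateLimitMax: parseInt(process.env.RATE_LIMIT_MAX_REQUESTS, 10) || 100,
  rateLimitWindowMs: parseInt(process.env.RATE_LIMIT_WINDOW_MS, 10) || 900000,
  userAgentRotation: process.env.USER_AGENT_ROTATION !== 'false',
  proxyEnabled: process.env.PROXY_ENABLED === 'true',
  corsOrigin: process.env.CORS_ORIGIN || '*'
};
```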
### Proxy Configuration

```env
PROXY_ENABLED=true
PROXY_HOST=proxy.example.com
PROXY_PORT=8080
PROXY_USERNAME=username
PROXY_PASSWORD=password
PROXY_TYPE=http
```

## Security

### Rate Limiting
- 100 requests per 15 minutes per IP
- Configurable rate limits
- Gradual slow-down after a threshold is crossed (see the sketch below)
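
A setup along these lines, using `express-rate-limit` and `express-slow-down`, would produce the behavior described above (an assumption about the wiring, shown for illustration; the numbers match the documented defaults):

```javascript
// Sketch: 100 requests / 15 min per IP, with gradual slow-down past a threshold.
const rateLimit = require('express-rate-limit');
const slowDown = require('express-slow-down');

const limiter = rateLimit({
  windowMs: 15 * 60 * 1000, // 15-minute window (RATE_LIMIT_WINDOW_MS)
  max: 100                  // per-IP cap (RATE_LIMIT_MAX_REQUESTS)
});

const speedLimiter = slowDown({
  windowMs: 15 * 60 * 1000,
  delayAfter: 50,                      // first 50 requests run at full speed
  delayMs: (used) => (used - 50) * 250 // then each extra request adds 250 ms (v2-style)
});

// app.use('/api/', limiter, speedLimiter);
```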
### CORS
- Origin validation
- Configurable allowed origins
- Preflight request handling (see the sketch below)
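
The origin validation could be wired up along these lines with the `cors` package (a sketch, assuming `CORS_ORIGIN` holds `*` or a comma-separated whitelist):

```javascript
// Sketch: validate request origins against CORS_ORIGIN.
const express = require('express');
const cors = require('cors');

const app = express();
const allowed = (process.env.CORS_ORIGIN || '*').split(',');

app.use(cors({
  origin: (origin, callback) => {
    // Allow non-browser requests (no Origin header), wildcard, or whitelisted origins.
    if (!origin || allowed.includes('*') || allowed.includes(origin)) {
      return callback(null, true);
    }
    callback(new Error('Origin not allowed by CORS'));
  }
}));
```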
### Security Headers
- Content Security Policy
- HTTP Strict Transport Security
- X-Frame-Options
- X-XSS-Protection
- And more...
### Input Validation
- URL validation
- Request size limits
- Schema validation with Joi (see the sketch after this list)
- SQL injection prevention
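
To illustrate the Joi-based schema validation, a crawl request schema might look like this (a minimal sketch; the field names are taken from the request examples above, and the project's actual schema may differ):

```javascript
// Sketch: Joi schema for the single-URL crawl request body (assumed shape).
const Joi = require('joi');

const crawlSchema = Joi.object({
  url: Joi.string().uri({ scheme: ['http', 'https'] }).required(),
  timeout: Joi.number().integer().min(1000).max(120000).default(30000),
  extractHeadings: Joi.boolean().default(false),
  extractLinks: Joi.boolean().default(false),
  extractImages: Joi.boolean().default(false),
  extractText: Joi.boolean().default(false),
  // Arbitrary name -> CSS selector pairs.
  selectors: Joi.object().pattern(Joi.string(), Joi.string())
});

// Reject invalid input before any network request is made.
const { error, value } = crawlSchema.validate({ url: 'not-a-url' });
if (error) console.error('Validation failed:', error.message);
```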
## Deployment

### Docker

```dockerfile
FROM node:18-alpine
WORKDIR /app
COPY package*.json ./
RUN npm ci --only=production
COPY src/ ./src/
EXPOSE 3000
CMD ["npm", "start"]
```
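
With this Dockerfile, the image can be built and run locally with, for example, `docker build -t allwebcrawler .` followed by `docker run -p 3000:3000 --env-file .env allwebcrawler` (the `allwebcrawler` tag is just an example name).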
### Production Checklist

1. **Environment Configuration**
   - Set `NODE_ENV=production`
   - Configure proper CORS origins
   - Enable security headers
   - Set appropriate rate limits
2. **Monitoring**
   - Use log aggregation (ELK Stack, Splunk)
   - Set up health check monitoring
   - Configure alerts for errors
3. **Scaling**
   - Use load balancers
   - Implement session clustering
   - Consider Redis for session storage
## Scripts

```bash
npm start              # Start production server
npm run dev # Start development server with nodemon
npm test # Run tests
npm run lint # Run ESLint
npm run security-audit # Run security audit
```

## Testing

```bash
# Run all tests
npm test
# Test API functionality
npm run test:api
# Run Wikipedia crawling example
npm run example:wikipedia
# Run general client examples
npm run example:client
# Test specific endpoint
curl -X POST http://localhost:3000/api/crawler/test
```

## Troubleshooting
1. **CORS Errors**
   - Check `CORS_ORIGIN` configuration
   - Verify allowed headers
   - Ensure preflight requests are handled
2. **Rate Limiting**
   - Adjust rate limit settings
   - Implement API key authentication
   - Use different IP addresses
3. **Timeout Issues**
   - Increase `DEFAULT_TIMEOUT`
   - Check target website response times
   - Verify proxy configuration
4. **Memory Issues**
   - Monitor memory usage
   - Implement request size limits
   - Clean up sessions regularly
## Logging

Logs are stored in the `logs/` directory:

- `crawler.log` - General application logs
- `error.log` - Error logs only
- `exceptions.log` - Uncaught exceptions
- `rejections.log` - Unhandled promise rejections
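
A Winston configuration producing this file layout could look roughly like the following (an assumed setup; the project's actual transports and formats may differ):

```javascript
// Sketch: Winston logger writing to the files listed above.
const winston = require('winston');

const logger = winston.createLogger({
  level: 'info',
  format: winston.format.combine(winston.format.timestamp(), winston.format.json()),
  transports: [
    new winston.transports.File({ filename: 'logs/crawler.log' }),
    new winston.transports.File({ filename: 'logs/error.log', level: 'error' })
  ],
  exceptionHandlers: [
    new winston.transports.File({ filename: 'logs/exceptions.log' })
  ],
  rejectionHandlers: [
    new winston.transports.File({ filename: 'logs/rejections.log' })
  ]
});

module.exports = logger;
```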
## License

MIT License - see the LICENSE file for details.
## Contributing

1. Fork the repository
2. Create a feature branch
3. Commit your changes
4. Push to the branch
5. Create a Pull Request
## Support

For support and questions:
- Create an issue on GitHub
- Check the documentation
- Review the logs for error details