A complete job search management system with tools for scraping LinkedIn job postings and browsing them through a modern web interface.
A TypeScript CLI application for scraping LinkedIn job postings with advanced filtering capabilities and local database storage.
Features:
- π Advanced filtering (position, location, experience level, employment type, date posted, remote work)
- πΎ Local SQLite database storage with full-text search
- π‘οΈ Enterprise anti-detection system (6 phases, fingerprinting, behavioral simulation, ML patterns)
- π Real-time progress tracking with terminal UI
- π― Multi-strategy parsing (JSON-LD, HTML fallback, job ID fallback)
- π Automatic retry logic with exponential backoff and debug artifacts
- π Refresh mode to update existing job postings
- π Export to JSON/CSV
- β 119 unit tests with 100% pass rate
Tech Stack: TypeScript, Node.js, Playwright, SQLite, Drizzle ORM, Vitest
A full-stack web application for browsing and filtering scraped LinkedIn job postings.
Features:
- π Modern React UI with Tailwind CSS
- π Full-text search and multi-criteria filtering
- π Beautiful HTML rendering with XSS protection (DOMPurify)
- πΎ Persistent filter state with localStorage
- π Real-time filter counts and statistics
- π¨ Dark mode support
- π± Responsive design
- π Direct links to LinkedIn job postings
- π View original LinkedIn HTML and raw JSON data
Tech Stack: React, TypeScript, Vite, TanStack Query, Zustand, Express, Tailwind CSS
- Node.js >= 18.0.0 (managed via fnm)
- npm
cd linkedin-scraper
# Install dependencies
npm install
# Install Playwright browser
npx playwright install chromium
# Build the project
npm run build
# Interactive scraping
npm run dev scrape --interactive
# CLI mode with filters
npm run dev scrape \
--position "Software Engineer" \
--location "San Francisco, CA" \
--limit 50
# Refresh existing jobs
npm run dev scrape --position "Software Engineer" --refresh
# List scraped jobs
npm run dev list --limit 20
# Export to JSON
npm run dev export --format json --output jobs.json
# Run tests
npm testcd job-browser
# Install dependencies
npm install
# Start both client and server
npm run dev
# Visit http://localhost:5173 in browserAlternative commands:
npm run dev:client # Client only (port 5173)
npm run dev:server # Server only (port 3000)
npm run build # Build both
npm run start # Production server-
scrape- Scrape LinkedIn job postings with filters--position <string>- Job position/title--location <string>- Location--experience-level <levels...>- Experience levels (Internship, Entry Level, etc.)--employment-type <types...>- Employment types (Full-time, Part-time, etc.)--date-posted <date>- Date posted filter--remote-option <option>- Remote work (On-site, Remote, Hybrid)--limit <number>- Max jobs to scrape (default: 50)--refresh- Force update existing jobs--interactive- Interactive mode with prompts--log-browser-errors- Log browser console errors
-
list- View scraped jobs in formatted table--search-id <id>- Filter by search ID--company <name>- Filter by company--location <location>- Filter by location--date-from <date>- Jobs posted after date--limit <number>- Max results (default: 20)
-
export- Export jobs to JSON or CSV--search-id <id>- Export specific search--format <format>- json or csv (default: json)--output <file>- Output filename
-
clear- Clear database--yes- Skip confirmation
Access the web interface at http://localhost:5173
Features:
- Search jobs by keyword
- Filter by company, location, employment type, seniority level, industry
- Filter by salary range
- Filter by posted date and scraped date
- Sort by posted date, scraped date, salary, title
- Pagination
- View job details with formatted HTML
- See original LinkedIn HTML
- Export raw JSON data
- CLAUDE.md - Complete developer guide for working with this repository
- linkedin-scraper/README.md - Detailed user guide with examples
- linkedin-scraper/SETUP.md - Platform-specific installation instructions
- job-browser/README.md - Web application setup and usage
- job-browser/QUICKSTART.md - Quick start guide with examples
- how-to-scrape-linkedin.md - Research on LinkedIn scraping methods
The LinkedIn scraper is for educational purposes only. LinkedIn's Terms of Service prohibit automated scraping. Users are responsible for ensuring compliance with:
- LinkedIn's Terms of Service
- Local data protection laws (GDPR, CCPA, etc.)
- Computer Fraud and Abuse Act (CFAA)
- Website scraping regulations in their jurisdiction
For production use, consider:
- Official LinkedIn APIs or licensed third-party data providers
- Reasonable rate limits (max 50-100 jobs per session)
- Respecting robots.txt and Terms of Service
This project uses fnm (Fast Node Manager):
# fnm automatically loads Node.js v24.11.1 when entering directories
cd linkedin-scraper
node --version # v24.11.1LinkedIn Scraper (119 unit tests):
cd linkedin-scraper
npm test # Run all tests
npm run test:watch # Watch mode
npm run test:ui # Interactive UI
npm run test:coverage # With coverageJob Browser: Start with small limits to test:
cd linkedin-scraper
npm run dev scrape --position "Test" --limit 10
cd ../job-browser
npm run devMonitor logs at linkedin-scraper/logs/scraper.log for errors.
Shared Database:
Both applications use the same SQLite database located at linkedin-scraper/data/linkedin-jobs.db.
Data Flow:
- LinkedIn Scraper β Scrapes jobs β SQLite database
- Job Browser β Reads from database β Displays in web UI
Key Features:
- SQLite with WAL mode for concurrency
- Full-text search with FTS5
- Automatic migrations
- Comprehensive error logging
- Debug artifacts (screenshots + HTML on failures)
The LinkedIn scraper includes an enterprise-grade anti-detection system implemented across 6 phases with 16 TypeScript modules (~4,500+ lines):
Phase 1: Core Fingerprinting
- Browser fingerprint randomization (viewport, user-agent, locale, timezone, platform)
- Stealth patches to hide automation (WebDriver, automation flags)
- Screen resolution and pixel ratio randomization
Phase 2: Behavioral Simulation
- Human-like mouse movement with BΓ©zier curves
- Natural scrolling patterns with variable speeds
- Reading time simulation based on content length
- Hover behavior before interactions
Phase 3: Network Management
- HTTP header randomization
- Cookie and session persistence
- Storage data management
Phase 4: Advanced Fingerprinting
- Canvas fingerprinting with noise injection
- WebGL parameter randomization
- Audio context fingerprinting
- Font enumeration randomization
Phase 5: Detection Monitoring
- CAPTCHA detection (reCAPTCHA, hCaptcha)
- Rate limit detection (HTTP 429, retry-after headers)
- Block detection (Cloudflare, anti-bot services)
- Adaptive delay multipliers (1.2x-5.0x based on detection)
- Automatic abort on critical detections
Phase 6: Machine Learning Patterns
- Timing Analysis: Predicts optimal delays using statistical methods (mean + 0.5ΓstdDev)
- Behavior Recognition: Uses Markov chains to model human-like action sequences
- Anomaly Detection: Ensemble methods (z-score, IQR, MAD) to detect unusual responses
- Training Data: Persists learning data to disk for continuous improvement across sessions
- Zero ML Dependencies: Pure TypeScript statistical algorithms (no TensorFlow/PyTorch)
Performance:
- Prediction latency: <1ms
- Memory: ~50-100 MB for ML data
- Training data: Auto-saves every 10 minutes to
./data/training/
Configuration:
- Default "balanced" mode enables all features
- Configurable via environment variables
- Optional async initialization for ML components
See CLAUDE.md for complete technical documentation.
Rate limiting/blocking:
- Increase delays:
REQUEST_DELAY_MIN=5000in.env - Reduce concurrency:
MAX_CONCURRENT_REQUESTS=1 - Lower limit:
--limit 20 - Wait 24 hours before retrying
Parsing failures:
- Check
logs/for debug artifacts (auto-saved on failures) - Update selectors in
src/scraper/core/Parser.tsif LinkedIn changed HTML - Parser has multiple fallback strategies
Browser not initialized:
- Run:
npx playwright install chromium - Check browser dependencies on Linux
Empty job list:
- Run linkedin-scraper first to populate database
- Check
DATABASE_PATHinserver/.env - Verify database file exists
CORS errors:
- Ensure server is running on port 3000
- Check CORS configuration in
server/.env
Filters not working:
- Clear browser localStorage
- Check filter state in browser DevTools
- Verify API response includes filter options
LinkedIn Scraper:
- ~4-6 jobs per minute
- ~150-200 MB memory
- ~1-2 KB per job in database
Job Browser:
- <50ms filter response time
- ~50-100 MB memory (server)
- ~200-300 KB client bundle (gzipped)
LinkedIn Scraper:
- Email notifications for new jobs
- Scheduled scraping with cron
- ML-based job matching and recommendations
- Multi-platform support (Indeed, Glassdoor)
- Cloud deployment
- Enhanced ML patterns (neural networks for deeper pattern recognition)
Job Browser:
- Application tracking system
- Job analytics and trends visualization
- Resume keyword optimization
- Saved search filters
- Mobile app
MIT License - Copyright (c) 2025 Mohamadmehdi Kharatizadeh