
Web Crawler

Description

Concurrent single-domain web crawler in Java that efficiently discovers and maps internal site structure starting from a seed URL.

Starting from a seed URL, it:

  • Fetches the page HTML
  • Extracts links
  • Canonicalises them
  • Deduplicates & validates (same host, allowed by robots, valid http/https, not PDF, ...)
  • Enqueues new URLs
  • Repeats until no new pages remain
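The canonicalise and validate steps above might be sketched like this. This is an illustrative Java sketch, not the project's actual code; the class and method names are assumptions:

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;
import java.util.Optional;

// Hypothetical sketch of canonicalisation + validation; query strings and
// ports are ignored here to keep the example short.
public class UrlCanonicaliser {

    /** Lowercases scheme/host, drops the fragment, and strips a trailing slash. */
    public static Optional<String> canonicalise(String raw) {
        try {
            URI uri = new URI(raw.trim()).normalize();
            if (uri.getScheme() == null || uri.getHost() == null) return Optional.empty();
            String scheme = uri.getScheme().toLowerCase(Locale.ROOT);
            String host = uri.getHost().toLowerCase(Locale.ROOT);
            String path = (uri.getPath() == null || uri.getPath().isEmpty()) ? "/" : uri.getPath();
            if (path.length() > 1 && path.endsWith("/")) path = path.substring(0, path.length() - 1);
            return Optional.of(scheme + "://" + host + path);
        } catch (URISyntaxException e) {
            return Optional.empty();   // malformed URL: drop it
        }
    }

    /** Keeps only http/https URLs on the seed's host that do not point at a PDF. */
    public static boolean isCrawlable(String canonical, String seedHost) {
        return (canonical.startsWith("http://") || canonical.startsWith("https://"))
                && seedHost.equals(URI.create(canonical).getHost())
                && !canonical.toLowerCase(Locale.ROOT).endsWith(".pdf");
    }

    public static void main(String[] args) {
        canonicalise("HTTPS://Example.com/About/#team")
                .ifPresent(u -> System.out.println(u + " crawlable=" + isCrawlable(u, "example.com")));
    }
}
```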

When the crawl finishes, it prints summary statistics to the terminal and generates two files:

  • sitemap.json: the full sitemap store, including each URL's metadata
  • sitemap.txt: a listing of all URLs and their children
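The exact fields depend on the sitemap store, but the JSON output might look roughly like this (illustrative shape only, not the actual schema):

```json
{
  "https://example.com/": {
    "status": 200,
    "children": ["https://example.com/about", "https://example.com/blog"]
  }
}
```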

Architecture

Diagram

Crawler architecture diagram

Design summary

The crawler is designed as a two-stage pipeline separating I/O latency (fetching) from CPU work (page processing).

  • Concurrency: producer–consumer with two blocking queues providing backpressure
  • State: metadata of processed pages stored in a ConcurrentHashMap keyed by canonical URL
  • Distribution: separate thread pools for the UrlFetcher and the page processor, so each stage can be sized independently
  • Politeness: fixed inter-request delay
  • Completion: atomic counter of in-flight URLs plus a poison message for shutdown

inFlight is an AtomicInteger tracking how many URLs are currently being fetched or processed. It increments when we discover a new unique URL, and decrements once that URL has been fully handled. When it reaches 0 and both queues are empty, there is no active work left, i.e. crawling is complete.
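The completion-detection idea can be sketched as follows. This is a minimal, single-worker illustration of the counter-plus-poison-message pattern, not the project's code; the names and the POISON constant are assumptions:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// inFlight counts URLs discovered but not yet fully handled; when it hits 0,
// a poison message is enqueued so blocked consumers wake up and exit.
public class CompletionSketch {
    static final String POISON = "__POISON__";

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> urlQueue = new LinkedBlockingQueue<>();
        AtomicInteger inFlight = new AtomicInteger();

        // Discovering a new unique URL: increment BEFORE enqueueing,
        // so the counter can never drop to 0 while work is pending.
        inFlight.incrementAndGet();
        urlQueue.put("https://example.com/");

        Thread worker = new Thread(() -> {
            try {
                while (true) {
                    String url = urlQueue.take();
                    if (POISON.equals(url)) return;       // shutdown signal
                    // ... fetch and process url here ...
                    if (inFlight.decrementAndGet() == 0) {
                        urlQueue.put(POISON);              // no work left: trigger shutdown
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.start();
        worker.join();
        System.out.println("crawl complete");
    }
}
```

Incrementing on discovery (rather than on dequeue) is what makes the zero-check safe: the counter only reaches 0 once every discovered URL has been fully handled.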

Running

Requirements

  • Java 11
  • Maven 3.9+

Build

mvn clean package

Run locally

java -jar target/web-crawler-1.0-SNAPSHOT.jar

or run the CrawlingApp main class from your IDE.

Test

mvn test

Config

Config lives in /src/main/resources/application.properties
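A typical entry might look like the following; these keys are hypothetical examples, not the actual property names (check application.properties for the real ones):

```properties
# Illustrative only; actual keys are defined in application.properties
seed.url=https://example.com
fetcher.threads=4
processor.threads=2
request.delay.ms=200
```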