
Web Crawler

Description

Concurrent single-domain web crawler in Java that efficiently discovers and maps internal site structure starting from a seed URL.

Starting from a seed URL, it:

  • Fetches the page HTML
  • Extracts links
  • Canonicalises them
  • Deduplicates & validates (same host, allowed by robots, valid http/https, not PDF, ...)
  • Enqueues new URLs
  • Repeats until no new pages remain
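The canonicalise and validate steps above might be sketched like this. This is an illustrative Java sketch, not the project's actual code; the class and method names are assumptions:

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;
import java.util.Optional;

// Hypothetical sketch of canonicalisation + validation; query strings and
// ports are ignored here to keep the example short.
public class UrlCanonicaliser {

    /** Lowercases scheme/host, drops the fragment, and strips a trailing slash. */
    public static Optional<String> canonicalise(String raw) {
        try {
            URI uri = new URI(raw.trim()).normalize();
            if (uri.getScheme() == null || uri.getHost() == null) return Optional.empty();
            String scheme = uri.getScheme().toLowerCase(Locale.ROOT);
            String host = uri.getHost().toLowerCase(Locale.ROOT);
            String path = (uri.getPath() == null || uri.getPath().isEmpty()) ? "/" : uri.getPath();
            if (path.length() > 1 && path.endsWith("/")) path = path.substring(0, path.length() - 1);
            return Optional.of(scheme + "://" + host + path);
        } catch (URISyntaxException e) {
            return Optional.empty();   // malformed URL: drop it
        }
    }

    /** Keeps only http/https URLs on the seed's host that do not point at a PDF. */
    public static boolean isCrawlable(String canonical, String seedHost) {
        return (canonical.startsWith("http://") || canonical.startsWith("https://"))
                && seedHost.equals(URI.create(canonical).getHost())
                && !canonical.toLowerCase(Locale.ROOT).endsWith(".pdf");
    }

    public static void main(String[] args) {
        canonicalise("HTTPS://Example.com/About/#team")
                .ifPresent(u -> System.out.println(u + " crawlable=" + isCrawlable(u, "example.com")));
    }
}
```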

When the crawl finishes, it prints summary statistics to the terminal and generates two files:

  • sitemap.json: the full sitemap store, including each URL's metadata
  • sitemap.txt: a listing of all URLs and their children
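The exact fields depend on the sitemap store, but the JSON output might look roughly like this (illustrative shape only, not the actual schema):

```json
{
  "https://example.com/": {
    "status": 200,
    "children": ["https://example.com/about", "https://example.com/blog"]
  }
}
```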

Architecture

Diagram

Crawler architecture diagram

Design summary

The crawler is designed as a two-stage pipeline separating I/O latency (fetching) from CPU work (page processing).

  • Concurrency: producer–consumer with two blocking queues providing backpressure
  • State: metadata of processed pages stored in a ConcurrentHashMap keyed by canonical URL
  • Distribution: separate thread pools for the UrlFetcher and the page processor, so each stage can be sized independently
  • Politeness: fixed inter-request delay
  • Completion: atomic counter of in-flight URLs plus a poison message for shutdown

inFlight is an AtomicInteger tracking how many URLs are currently being fetched or processed. It increments when we discover a new unique URL, and decrements once that URL has been fully handled. When it reaches 0 and both queues are empty, there is no active work left, i.e. crawling is complete.
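The completion-detection idea can be sketched as follows. This is a minimal, single-worker illustration of the counter-plus-poison-message pattern, not the project's code; the names and the POISON constant are assumptions:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// inFlight counts URLs discovered but not yet fully handled; when it hits 0,
// a poison message is enqueued so blocked consumers wake up and exit.
public class CompletionSketch {
    static final String POISON = "__POISON__";

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> urlQueue = new LinkedBlockingQueue<>();
        AtomicInteger inFlight = new AtomicInteger();

        // Discovering a new unique URL: increment BEFORE enqueueing,
        // so the counter can never drop to 0 while work is pending.
        inFlight.incrementAndGet();
        urlQueue.put("https://example.com/");

        Thread worker = new Thread(() -> {
            try {
                while (true) {
                    String url = urlQueue.take();
                    if (POISON.equals(url)) return;       // shutdown signal
                    // ... fetch and process url here ...
                    if (inFlight.decrementAndGet() == 0) {
                        urlQueue.put(POISON);              // no work left: trigger shutdown
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.start();
        worker.join();
        System.out.println("crawl complete");
    }
}
```

Incrementing on discovery (rather than on dequeue) is what makes the zero-check safe: the counter only reaches 0 once every discovered URL has been fully handled.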

Running

Requirements

  • Java 11
  • Maven 3.9+

Build

mvn clean package

Run locally

java -jar target/web-crawler-1.0-SNAPSHOT.jar

or run the CrawlingApp main class from your IDE.

Test

mvn test

Config

Config lives in /src/main/resources/application.properties
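A typical entry might look like the following; these keys are hypothetical examples, not the actual property names (check application.properties for the real ones):

```properties
# Illustrative only; actual keys are defined in application.properties
seed.url=https://example.com
fetcher.threads=4
processor.threads=2
request.delay.ms=200
```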