A command-line web crawler built with Node.js that recursively extracts and analyzes internal links from a given website.
It produces a detailed report showing how many times each internal page was found, helping you understand site structure and link connectivity.
- 🕸️ Recursive Crawling: Traverses all internal links within the same domain.
- 🧩 HTML Parsing: Extracts links efficiently using the JSDOM library.
- 🔗 URL Normalization: Converts relative and absolute links into a consistent format (see the sketch after this list).
- ⚙️ Asynchronous Fetching: Uses async/await for non-blocking network requests.
- 🧠 Error Handling: Handles broken links, non-HTML responses, and invalid URLs.
- 📊 Reporting Module: Displays link frequency and page stats sorted by count.
- 🧱 Modular Design: Cleanly divided into crawling logic, reporting, and CLI execution files.
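URL normalization is what keeps the counts consistent: https://Example.com/about/ and https://example.com/about should be treated as the same page. The snippet below is a minimal sketch of such a helper; the function name normalizeURL and its exact rules are assumptions, not necessarily what crawl.js does.

```js
// Hypothetical normalizeURL helper; the real logic in crawl.js may differ.
function normalizeURL(urlString) {
  const url = new URL(urlString);              // throws on invalid input
  let path = `${url.hostname}${url.pathname}`; // URL already lowercases the hostname
  if (path.endsWith('/')) {
    path = path.slice(0, -1);                  // treat /about and /about/ as the same page
  }
  return path;
}

// Both spellings map to the same key:
console.log(normalizeURL('https://Example.com/about/')); // "example.com/about"
console.log(normalizeURL('https://example.com/about'));  // "example.com/about"

module.exports = { normalizeURL };
```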
- Language: JavaScript (ES6)
- Runtime: Node.js
- Libraries: JSDOM, Fetch API
web-crawler-cli/
│
├── crawl.js → Core crawler logic (fetches and parses internal links)
├── report.js → Generates and prints crawl reports
└── main.js → Entry point for CLI execution
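To give a concrete picture of the reporting module listed above, here is a minimal sketch of what report.js could look like, assuming the crawler returns a plain object mapping each normalized URL to the number of times it was found (the function name printReport is an assumption):

```js
// Hypothetical report.js; the actual function names and output format may differ.
function printReport(pages) {
  console.log('===== Crawl report =====');
  // Sort pages by how often they were linked to, highest count first.
  const sorted = Object.entries(pages).sort(([, a], [, b]) => b - a);
  for (const [url, count] of sorted) {
    console.log(`Found ${count} internal link(s) to ${url}`);
  }
  console.log('===== End of report =====');
}

module.exports = { printReport };
```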
- Takes a base URL as a command-line argument.
- Recursively fetches internal pages from that URL.
- Parses HTML content using JSDOM to extract all links (sketched below).
- Generates a report showing how many times each page was found.
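The sketch below shows one plausible shape for this crawl step, assuming Node 18+ (for the global fetch) and the hypothetical names getURLsFromHTML and crawlPage. It reuses the normalizeURL helper sketched earlier and is illustrative rather than the repository's exact code.

```js
// Illustrative crawl logic; names and structure are assumptions, not crawl.js verbatim.
const { JSDOM } = require('jsdom');
const { normalizeURL } = require('./normalize'); // the helper sketched above

// Extract every <a href> from an HTML string, resolved against the page URL.
function getURLsFromHTML(htmlBody, baseURL) {
  const dom = new JSDOM(htmlBody);
  const urls = [];
  for (const anchor of dom.window.document.querySelectorAll('a')) {
    const href = anchor.getAttribute('href');
    if (!href) continue;
    try {
      urls.push(new URL(href, baseURL).href); // resolves relative and absolute links
    } catch (err) {
      console.log(`Skipping invalid URL: ${href}`);
    }
  }
  return urls;
}

// Recursively visit pages on the same host, counting hits per normalized URL.
async function crawlPage(baseURL, currentURL, pages) {
  if (new URL(currentURL).hostname !== new URL(baseURL).hostname) {
    return pages; // external link: do not count or follow
  }
  const key = normalizeURL(currentURL);
  if (pages[key] > 0) {
    pages[key]++;   // already crawled: just bump the count
    return pages;
  }
  pages[key] = 1;

  try {
    const response = await fetch(currentURL);
    const contentType = response.headers.get('content-type') || '';
    if (!response.ok || !contentType.includes('text/html')) {
      return pages; // broken link or non-HTML response
    }
    const html = await response.text();
    for (const nextURL of getURLsFromHTML(html, baseURL)) {
      await crawlPage(baseURL, nextURL, pages); // depth-first recursion
    }
  } catch (err) {
    console.log(`Error fetching ${currentURL}: ${err.message}`);
  }
  return pages;
}

module.exports = { crawlPage, getURLsFromHTML };
```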
- Clone the Repository
  git clone https://github.com/namitaa15/web_crawlerhttp.git
  cd web_crawlerhttp

- Install Dependencies
  npm install

- Run the Application
You can run the crawler in two ways:
🟢 Using npm script → npm start https://example.com
🟣 Using Node directly → node main.js https://example.com
Replace https://example.com with any valid URL to analyze its internal link structure.
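For completeness, a minimal sketch of what the main.js entry point needs to do: read the base URL from process.argv, run the crawl, and print the report. The exported names are assumed here; the real file may differ.

```js
// Hypothetical main.js; assumes crawl.js exports crawlPage and report.js exports printReport.
const { crawlPage } = require('./crawl');
const { printReport } = require('./report');

async function main() {
  // process.argv: [node, main.js, <baseURL>]
  if (process.argv.length !== 3) {
    console.log('Usage: node main.js <baseURL>');
    process.exit(1);
  }
  const baseURL = process.argv[2];
  console.log(`Starting crawl of ${baseURL} ...`);
  const pages = await crawlPage(baseURL, baseURL, {});
  printReport(pages);
}

main();
```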