A Python web crawler that crawls websites and generates a hierarchical tree visualization of the URL structure. The crawler saves both the list of crawled URLs and a tree diagram showing the site's structure.
If you already have Python, pip, and Graphviz installed, you can start using the crawler immediately:
- Activate the virtual environment:

  ```bash
  source venv/bin/activate
  ```

- Run the crawler with any website URL:

  ```bash
  python3 crawler.py https://example.com
  ```

  Or with a custom output filename:

  ```bash
  python3 crawler.py https://example.com --output my_crawl.txt
  ```

The crawler will generate two files in your current directory:

- A text file with all crawled URLs (e.g., `crawled_urls_20240404_143022.txt`)
- A tree visualization of the site structure (e.g., `crawled_urls_20240404_143022_tree.pdf`)
- Crawls websites within a specified domain
- Excludes media files and documents (images, PDFs, etc.)
- Generates timestamped output files
- Creates a hierarchical tree visualization of the URL structure with color-coded status
- Shows crawling progress in real-time
- Handles relative and absolute URLs
- Prevents duplicate crawling
- Status code tracking: Separates successful (200) and failed (non-200) pages
- Sitemap detection: Automatically detects and counts URLs from sitemap.xml
- Sitemap-based crawling: Uses sitemap URLs as seed URLs to crawl JavaScript SPAs and sites with dynamic content
- Trailing slash preservation: Optional flag to preserve trailing slashes for directory paths
- Configurable timeout: Adjust request timeout for slow servers
- Enhanced reporting: Detailed breakdown of crawl results with status code statistics
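The domain filtering, media-file exclusion, and duplicate prevention listed above can be sketched roughly as follows. This is a minimal, stdlib-only illustration; the function names and the exact extension list are assumptions, not the crawler's actual API:

```python
from urllib.parse import urljoin, urldefrag, urlparse

# Hypothetical extension list -- the real crawler may exclude more types.
MEDIA_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".pdf", ".zip")

def resolve_link(base_url: str, href: str) -> str:
    """Turn a (possibly relative) href into an absolute URL, dropping fragments."""
    absolute, _fragment = urldefrag(urljoin(base_url, href))
    return absolute

def should_crawl(url: str, domain: str, visited: set) -> bool:
    """Crawl only in-domain, non-media URLs that haven't been seen yet."""
    parsed = urlparse(url)
    if parsed.netloc != domain:
        return False                      # stay within the specified domain
    if parsed.path.lower().endswith(MEDIA_EXTENSIONS):
        return False                      # skip images, PDFs, etc.
    return url not in visited             # prevent duplicate crawling
```

`resolve_link` handles both relative and absolute hrefs because `urljoin` leaves already-absolute URLs untouched.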
- Python 3.6 or higher
- pip (Python package installer)
- Graphviz (for tree visualization)
- Clone or download this repository:

  ```bash
  git clone <repository-url>
  cd web_crawler
  ```

- Create and activate a virtual environment:

  ```bash
  # On macOS/Linux
  python3 -m venv venv
  source venv/bin/activate

  # On Windows
  python -m venv venv
  .\venv\Scripts\activate
  ```

- Install required Python packages:

  ```bash
  pip install -r requirements.txt
  ```

- Install Graphviz (required for tree visualization):

  On macOS:

  ```bash
  brew install graphviz
  ```

  On Ubuntu/Debian:

  ```bash
  sudo apt-get install graphviz
  ```

  On Windows: download and install from the Graphviz website
Basic usage:

```bash
python crawler.py https://example.com
```

Options:

- `--output`, `-o`: Custom output filename (default: `crawled_urls.txt`)

  ```bash
  python crawler.py https://example.com --output my_crawl.txt
  ```
- `--trailing-slash`: Preserve trailing slashes for directory paths (not files)

  ```bash
  python crawler.py https://example.com --trailing-slash
  ```

  This preserves trailing slashes for paths like `/api/users/` but removes them for files like `/page.html`.
- `--timeout`: Set the request timeout in seconds (default: 10)

  ```bash
  python crawler.py https://example.com --timeout 30
  ```

  Useful for slow servers or when experiencing connection timeouts.
Options can be combined:

```bash
# Use multiple options together
python crawler.py https://example.com --output my_crawl.txt --trailing-slash --timeout 30
```

For each crawl, the script generates two files with matching timestamps:
- URL List File:
  - Format: `crawled_urls_YYYYMMDD_HHMMSS.txt`
  - Contains crawled URLs organized by status code:
    - Successful pages (200): All URLs that returned HTTP 200 status
    - Failed pages (non-200): URLs grouped by status code (404, 403, 500, etc.)
  - Example: `crawled_urls_20240404_143022.txt`
  - Format example (the sitemap count line only appears if a sitemap was found):

    ```
    URLs in sitemap: 150

    === SUCCESSFUL PAGES (200) ===
    https://example.com/page1
    https://example.com/page2

    === FAILED PAGES (NON-200) ===

    --- Status 404 ---
    https://example.com/broken-page

    --- Status 403 ---
    https://example.com/forbidden-page
    ```
- Tree Visualization:
  - Format: `crawled_urls_YYYYMMDD_HHMMSS_tree.pdf` (or `.png`)
  - Shows the hierarchical structure of the crawled URLs with color coding:
    - Blue boxes: Successful pages (HTTP 200)
    - Red boxes with white text: Failed pages (non-200 status codes)
  - Displays the sitemap URL count at the top (if a sitemap is found)
  - Example: `crawled_urls_20240404_143022_tree.pdf`
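The hierarchy in the visualization follows the URL path structure: each path segment becomes a node under its parent. A minimal sketch of that grouping step, using only the standard library (the real script then renders the resulting structure with Graphviz; the function name here is illustrative):

```python
from urllib.parse import urlparse

def build_url_tree(urls):
    """Group crawled URLs into a nested dict keyed by path segment.

    Each nested dict corresponds to one box in the tree diagram.
    (Illustrative sketch -- rendering with Graphviz is omitted.)
    """
    tree = {}
    for url in urls:
        parsed = urlparse(url)
        node = tree.setdefault(parsed.netloc, {})
        for segment in parsed.path.strip("/").split("/"):
            if segment:                  # skip empty segments from "/" or "//"
                node = node.setdefault(segment, {})
    return tree
```

For example, `https://example.com/a/b` and `https://example.com/a/c` share the `a` node, producing the branching shown in the PDF.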
```bash
# Activate virtual environment (if not already activated)
source venv/bin/activate

# Run the crawler
python crawler.py https://support.loqate.com/

# With options
python crawler.py https://support.loqate.com/ --trailing-slash --timeout 30
```

The crawler provides real-time progress and detailed statistics:
```
Starting crawl at: https://example.com
Domain: example.com

Checking for sitemap...
Found sitemap with 150 URLs
Using sitemap URLs as seed URLs for crawling

Starting crawl...
Crawling: https://example.com
Found 25 new links

Crawl completed!
Total pages crawled: 100
Total time: 0:02:30

Breakdown:
✅ Successful pages (200): 85
❌ Failed pages (non-200): 15

Failed pages by status code:
404: 10 pages
403: 3 pages
500: 2 pages

Completed. Found 85 successful URLs.
URLs in sitemap: 150
Results have been written to crawled_urls_20240404_143022.txt
```
- The crawler respects robots.txt and follows standard web crawling practices
- Large websites may take significant time to crawl
- The tree visualization works best with hierarchical website structures
- PDF output is preferred for better quality, but PNG is used as a fallback
- Sitemap detection: The crawler automatically checks for sitemap.xml, sitemap_index.xml, or sitemap-index.xml at the root URL
- Sitemap-based crawling: When a sitemap is found, all URLs from the sitemap are used as seed URLs for crawling. This is especially useful for:
- JavaScript Single Page Applications (SPAs) that don't have links in the initial HTML
- Sites with dynamically loaded content
- Ensuring comprehensive coverage of all pages listed in the sitemap
- Sitemap count in output: The sitemap URL count is displayed at the top of the output file and in the PDF visualization
- Status code handling: Pages with non-200 status codes (404, 403, 500, etc.) are tracked separately and displayed in red in the visualization
- Trailing slashes: When using `--trailing-slash`, only directory paths preserve the slash; files with extensions will have trailing slashes removed
- Timeout handling: Connection timeouts are tracked as "Error" status and the crawler continues with other pages
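The trailing-slash rule above (keep the slash on directory-style paths, drop it on paths whose last segment has a file extension) might look roughly like this. A minimal sketch, assuming a stdlib-only implementation; the function name and exact edge-case handling are illustrative, not the crawler's actual code:

```python
import os.path
from urllib.parse import urlparse, urlunparse

def normalize_trailing_slash(url: str, preserve: bool = False) -> str:
    """Sketch of the --trailing-slash behaviour described above."""
    parsed = urlparse(url)
    path = parsed.path
    _, ext = os.path.splitext(path.rstrip("/"))
    if ext:                       # looks like a file (/page.html): strip the slash
        path = path.rstrip("/")
    elif not preserve:            # directory path without --trailing-slash
        path = path.rstrip("/") or "/"
    return urlunparse(parsed._replace(path=path))
```

With `preserve=True`, `/api/users/` keeps its slash while `/page.html/` is normalized to `/page.html`.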
- If you get a "command not found: python" error:
  - Make sure you're in the virtual environment
  - Try using `python3` instead of `python`
- If tree visualization fails:
  - Verify Graphviz is installed correctly
  - Check if the output directory is writable
  - Try running with a smaller website first
- If crawling is too slow:
  - The crawler uses a depth-first search approach
  - Consider adding a delay between requests for large sites
  - Use the `--timeout` flag to increase the timeout for slow servers: `--timeout 30`
- If you see connection timeout errors:
  - The default timeout is 10 seconds
  - Increase it with the `--timeout` flag: `python crawler.py https://example.com --timeout 30`
  - Timeout errors are tracked and displayed in the failed pages section
  - The crawler continues crawling other pages even if some time out
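The "track the error and keep going" behaviour described above can be sketched as below. This uses only the standard library for illustration (the real script may use a third-party HTTP client, and the function name is hypothetical):

```python
import urllib.error
import urllib.request

def fetch_status(url: str, timeout: float = 10):
    """Return (status, body) for a URL, or ("Error", None) on timeouts
    and connection failures, so the crawl can continue with other pages.
    (Illustrative sketch of the behaviour described above.)
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as e:
        return e.code, None           # non-200 pages are tracked by status code
    except (urllib.error.URLError, TimeoutError):
        return "Error", None          # timeouts don't abort the whole crawl
```

Because every outcome is returned rather than raised, the crawl loop can record the status and move on to the next URL.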
- If a sitemap is not detected:
  - The crawler checks for `/sitemap.xml`, `/sitemap_index.xml`, and `/sitemap-index.xml`
  - If no sitemap is found, the crawler will still function normally by crawling links found on pages
  - The sitemap count will simply show as 0 in the output
  - Note: If your site is a JavaScript SPA and has no sitemap, the crawler may only find the starting page, since links aren't in the initial HTML
- If the crawler only finds the starting page (JavaScript SPAs):
  - The crawler uses sitemap URLs as seed URLs when available
  - If a sitemap exists, all URLs from it will be crawled automatically
  - For sites without sitemaps, consider adding one to improve crawl coverage
  - The crawler will still follow any links it finds in the HTML, but SPAs often load content dynamically
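Extracting seed URLs from a sitemap amounts to parsing the `<loc>` entries of the sitemap XML. A minimal sketch of that step (fetching `/sitemap.xml` itself is omitted, and the function name is illustrative):

```python
import xml.etree.ElementTree as ET

# The sitemap protocol namespace, used by both <urlset> and <sitemapindex>.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def parse_sitemap(xml_text: str):
    """Extract the URLs listed in a sitemap document.

    These become the seed URLs for the crawl when a sitemap is found.
    """
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]
```

Since `<loc>` is matched anywhere in the tree, the same function works on a `<sitemapindex>` document, where the entries point to child sitemaps rather than pages.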
The repository also includes `count_mdx.py`, a utility script for counting Markdown files (both `.md` and `.mdx`) in a directory and its subdirectories.
`count_mdx.py` is a command-line utility that recursively searches a specified directory to find and count all files with the `.md` and `.mdx` extensions. These files are commonly used in modern web development frameworks like Next.js and Gatsby for content management, as well as in documentation projects and static site generators.
- Recursively searches directories and subdirectories
- Counts all files with `.md` and `.mdx` extensions (case-insensitive)
- Option to list all found Markdown files with their full paths
- Simple command-line interface with argument parsing
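The core of the script is a recursive walk with a case-insensitive extension check. A minimal sketch under those assumptions (the real script wraps this in an `argparse`-based CLI, and the function name here is illustrative):

```python
import os

def count_markdown_files(directory: str, list_files: bool = False) -> int:
    """Recursively count .md and .mdx files under a directory.

    (Illustrative sketch of count_mdx.py's core logic using os.walk.)
    """
    matches = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            if name.lower().endswith((".md", ".mdx")):   # case-insensitive match
                matches.append(os.path.join(root, name))
    matches.sort()                                       # alphabetical listing
    if list_files:
        for path in matches:
            print(f"  - {path}")
    return len(matches)
```

`os.walk` visits every subdirectory, so nothing extra is needed for recursion, and lowercasing the filename before the check makes `README.MD` count as well.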
Basic usage (count only):

```bash
python count_mdx.py /path/to/directory
```

Count and list all Markdown files:

```bash
python count_mdx.py /path/to/directory --list
```

Or using the short flag:

```bash
python count_mdx.py /path/to/directory -l
```

More examples:

```bash
# Count Markdown files in current directory
python count_mdx.py .

# Count and list Markdown files in a project directory
python count_mdx.py /Users/username/projects/my-blog --list

# Count Markdown files in a specific folder
python count_mdx.py ./content/blog
```

The script provides:
- Summary: Total count of Markdown files found
- File List (with `--list` flag): Complete list of all Markdown files with their full paths, sorted alphabetically
Example output:

```
Found 15 Markdown files in /path/to/directory

Markdown files found:
  - /path/to/directory/blog/post-1.md
  - /path/to/directory/blog/post-2.mdx
  - /path/to/directory/docs/getting-started.md
  - /path/to/directory/docs/installation.mdx
```
- Content Audits: Quickly assess the size of Markdown-based content repositories
- Migration Planning: Count Markdown files before migrating between frameworks
- Project Analysis: Understand the scope of content in documentation or blog projects
- Quality Assurance: Verify that all expected Markdown files are present in a build
- Python 3.6 or higher
- No additional dependencies (uses only standard library modules: `os` and `argparse`)