A Python web crawler that crawls websites and generates a hierarchical tree visualization of the URL structure. The crawler saves both the list of crawled URLs and a tree diagram showing the site's structure.
If you already have Python, pip, and Graphviz installed, you can start using the crawler immediately:
- Activate the virtual environment:

  ```bash
  source venv/bin/activate
  ```

- Run the crawler with any website URL:

  ```bash
  python3 crawler.py https://example.com
  ```

  Or with a custom output filename:

  ```bash
  python3 crawler.py https://example.com --output my_crawl.txt
  ```

The crawler will generate two files in your current directory:

- A text file with all crawled URLs (e.g., `crawled_urls_20240404_143022.txt`)
- A tree visualization of the site structure (e.g., `crawled_urls_20240404_143022_tree.pdf`)
- Crawls websites within a specified domain
- Excludes media files and documents (images, PDFs, etc.)
- Generates timestamped output files
- Creates a hierarchical tree visualization of the URL structure with color-coded status
- Shows crawling progress in real-time
- Handles relative and absolute URLs
- Prevents duplicate crawling
- Status code tracking: Separates successful (200) and failed (non-200) pages
- Sitemap detection: Automatically detects and counts URLs from sitemap.xml
- Sitemap-based crawling: Uses sitemap URLs as seed URLs to crawl JavaScript SPAs and sites with dynamic content
- Trailing slash preservation: Optional flag to preserve trailing slashes for directory paths
- Configurable timeout: Adjust request timeout for slow servers
- Enhanced reporting: Detailed breakdown of crawl results with status code statistics
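The domain filtering, media-file exclusion, and duplicate prevention listed above can be sketched roughly as follows. This is a minimal, stdlib-only illustration; the function names and the exact extension list are assumptions, not the crawler's actual API:

```python
from urllib.parse import urljoin, urldefrag, urlparse

# Hypothetical extension list -- the real crawler may exclude more types.
MEDIA_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".pdf", ".zip")

def resolve_link(base_url: str, href: str) -> str:
    """Turn a (possibly relative) href into an absolute URL, dropping fragments."""
    absolute, _fragment = urldefrag(urljoin(base_url, href))
    return absolute

def should_crawl(url: str, domain: str, visited: set) -> bool:
    """Crawl only in-domain, non-media URLs that haven't been seen yet."""
    parsed = urlparse(url)
    if parsed.netloc != domain:
        return False                      # stay within the specified domain
    if parsed.path.lower().endswith(MEDIA_EXTENSIONS):
        return False                      # skip images, PDFs, etc.
    return url not in visited             # prevent duplicate crawling
```

`resolve_link` handles both relative and absolute hrefs because `urljoin` leaves already-absolute URLs untouched.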
- Python 3.6 or higher
- pip (Python package installer)
- Graphviz (for tree visualization)
- Clone or download this repository:

  ```bash
  git clone <repository-url>
  cd web_crawler
  ```

- Create and activate a virtual environment:

  ```bash
  # On macOS/Linux
  python3 -m venv venv
  source venv/bin/activate

  # On Windows
  python -m venv venv
  .\venv\Scripts\activate
  ```

- Install required Python packages:

  ```bash
  pip install -r requirements.txt
  ```

- Install Graphviz (required for tree visualization):

  On macOS:

  ```bash
  brew install graphviz
  ```

  On Ubuntu/Debian:

  ```bash
  sudo apt-get install graphviz
  ```

  On Windows: download and install from the Graphviz website
Basic usage:

```bash
python crawler.py https://example.com
```

Options:

- `--output`, `-o`: Custom output filename (default: `crawled_urls.txt`)

  ```bash
  python crawler.py https://example.com --output my_crawl.txt
  ```
- `--trailing-slash`: Preserve trailing slashes for directory paths (not files)

  ```bash
  python crawler.py https://example.com --trailing-slash
  ```

  This preserves trailing slashes for paths like `/api/users/` but removes them for files like `/page.html`.
- `--timeout`: Set the request timeout in seconds (default: 10)

  ```bash
  python crawler.py https://example.com --timeout 30
  ```

  Useful for slow servers or when experiencing connection timeouts.
Options can be combined:

```bash
# Use multiple options together
python crawler.py https://example.com --output my_crawl.txt --trailing-slash --timeout 30
```

For each crawl, the script generates two files with matching timestamps:
- URL List File:
  - Format: `crawled_urls_YYYYMMDD_HHMMSS.txt`
  - Contains crawled URLs organized by status code:
    - Successful pages (200): All URLs that returned HTTP 200 status
    - Failed pages (non-200): URLs grouped by status code (404, 403, 500, etc.)
  - Example: `crawled_urls_20240404_143022.txt`
  - Format example (the sitemap count line only appears if a sitemap was found):

    ```
    URLs in sitemap: 150

    === SUCCESSFUL PAGES (200) ===
    https://example.com/page1
    https://example.com/page2

    === FAILED PAGES (NON-200) ===

    --- Status 404 ---
    https://example.com/broken-page

    --- Status 403 ---
    https://example.com/forbidden-page
    ```
- Tree Visualization:
  - Format: `crawled_urls_YYYYMMDD_HHMMSS_tree.pdf` (or `.png`)
  - Shows the hierarchical structure of the crawled URLs with color coding:
    - Blue boxes: Successful pages (HTTP 200)
    - Red boxes with white text: Failed pages (non-200 status codes)
  - Displays the sitemap URL count at the top (if a sitemap is found)
  - Example: `crawled_urls_20240404_143022_tree.pdf`
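The hierarchy in the visualization follows the URL path structure: each path segment becomes a node under its parent. A minimal sketch of that grouping step, using only the standard library (the real script then renders the resulting structure with Graphviz; the function name here is illustrative):

```python
from urllib.parse import urlparse

def build_url_tree(urls):
    """Group crawled URLs into a nested dict keyed by path segment.

    Each nested dict corresponds to one box in the tree diagram.
    (Illustrative sketch -- rendering with Graphviz is omitted.)
    """
    tree = {}
    for url in urls:
        parsed = urlparse(url)
        node = tree.setdefault(parsed.netloc, {})
        for segment in parsed.path.strip("/").split("/"):
            if segment:                  # skip empty segments from "/" or "//"
                node = node.setdefault(segment, {})
    return tree
```

For example, `https://example.com/a/b` and `https://example.com/a/c` share the `a` node, producing the branching shown in the PDF.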
```bash
# Activate virtual environment (if not already activated)
source venv/bin/activate

# Run the crawler
python crawler.py https://support.loqate.com/

# With options
python crawler.py https://support.loqate.com/ --trailing-slash --timeout 30
```

The crawler provides real-time progress and detailed statistics:
```
Starting crawl at: https://example.com
Domain: example.com

Checking for sitemap...
Found sitemap with 150 URLs
Using sitemap URLs as seed URLs for crawling

Starting crawl...
Crawling: https://example.com
Found 25 new links

Crawl completed!
Total pages crawled: 100
Total time: 0:02:30

Breakdown:
✅ Successful pages (200): 85
❌ Failed pages (non-200): 15

Failed pages by status code:
404: 10 pages
403: 3 pages
500: 2 pages

Completed. Found 85 successful URLs.
URLs in sitemap: 150
Results have been written to crawled_urls_20240404_143022.txt
```
- The crawler respects robots.txt and follows standard web crawling practices
- Large websites may take significant time to crawl
- The tree visualization works best with hierarchical website structures
- PDF output is preferred for better quality, but PNG is used as a fallback
- Sitemap detection: The crawler automatically checks for sitemap.xml, sitemap_index.xml, or sitemap-index.xml at the root URL
- Sitemap-based crawling: When a sitemap is found, all URLs from the sitemap are used as seed URLs for crawling. This is especially useful for:
- JavaScript Single Page Applications (SPAs) that don't have links in the initial HTML
- Sites with dynamically loaded content
- Ensuring comprehensive coverage of all pages listed in the sitemap
- Sitemap count in output: The sitemap URL count is displayed at the top of the output file and in the PDF visualization
- Status code handling: Pages with non-200 status codes (404, 403, 500, etc.) are tracked separately and displayed in red in the visualization
- Trailing slashes: When using `--trailing-slash`, only directory paths preserve the slash; files with extensions will have trailing slashes removed
- Timeout handling: Connection timeouts are tracked as "Error" status and the crawler continues with other pages
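The trailing-slash rule above (keep the slash on directory-style paths, drop it on paths whose last segment has a file extension) might look roughly like this. A minimal sketch, assuming a stdlib-only implementation; the function name and exact edge-case handling are illustrative, not the crawler's actual code:

```python
import os.path
from urllib.parse import urlparse, urlunparse

def normalize_trailing_slash(url: str, preserve: bool = False) -> str:
    """Sketch of the --trailing-slash behaviour described above."""
    parsed = urlparse(url)
    path = parsed.path
    _, ext = os.path.splitext(path.rstrip("/"))
    if ext:                       # looks like a file (/page.html): strip the slash
        path = path.rstrip("/")
    elif not preserve:            # directory path without --trailing-slash
        path = path.rstrip("/") or "/"
    return urlunparse(parsed._replace(path=path))
```

With `preserve=True`, `/api/users/` keeps its slash while `/page.html/` is normalized to `/page.html`.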
- If you get a "command not found: python" error:
  - Make sure you're in the virtual environment
  - Try using `python3` instead of `python`
- If tree visualization fails:
  - Verify Graphviz is installed correctly
  - Check if the output directory is writable
  - Try running with a smaller website first
- If crawling is too slow:
  - The crawler uses a depth-first search approach
  - Consider adding a delay between requests for large sites
  - Use the `--timeout` flag to increase the timeout for slow servers: `--timeout 30`
- If you see connection timeout errors:
  - The default timeout is 10 seconds
  - Increase it with the `--timeout` flag: `python crawler.py https://example.com --timeout 30`
  - Timeout errors are tracked and displayed in the failed pages section
  - The crawler continues crawling other pages even if some time out
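The "track the error and keep going" behaviour described above can be sketched as below. This uses only the standard library for illustration (the real script may use a third-party HTTP client, and the function name is hypothetical):

```python
import urllib.error
import urllib.request

def fetch_status(url: str, timeout: float = 10):
    """Return (status, body) for a URL, or ("Error", None) on timeouts
    and connection failures, so the crawl can continue with other pages.
    (Illustrative sketch of the behaviour described above.)
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as e:
        return e.code, None           # non-200 pages are tracked by status code
    except (urllib.error.URLError, TimeoutError):
        return "Error", None          # timeouts don't abort the whole crawl
```

Because every outcome is returned rather than raised, the crawl loop can record the status and move on to the next URL.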
- If a sitemap is not detected:
  - The crawler checks for `/sitemap.xml`, `/sitemap_index.xml`, and `/sitemap-index.xml`
  - If no sitemap is found, the crawler will still function normally by crawling links found on pages
  - The sitemap count will simply show as 0 in the output
  - Note: If your site is a JavaScript SPA and has no sitemap, the crawler may only find the starting page, since links aren't in the initial HTML
- If the crawler only finds the starting page (JavaScript SPAs):
  - The crawler uses sitemap URLs as seed URLs when available
  - If a sitemap exists, all URLs from it will be crawled automatically
  - For sites without sitemaps, consider adding one to improve crawl coverage
  - The crawler will still follow any links it finds in the HTML, but SPAs often load content dynamically
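Extracting seed URLs from a sitemap amounts to parsing the `<loc>` entries of the sitemap XML. A minimal sketch of that step (fetching `/sitemap.xml` itself is omitted, and the function name is illustrative):

```python
import xml.etree.ElementTree as ET

# The sitemap protocol namespace, used by both <urlset> and <sitemapindex>.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def parse_sitemap(xml_text: str):
    """Extract the URLs listed in a sitemap document.

    These become the seed URLs for the crawl when a sitemap is found.
    """
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]
```

Since `<loc>` is matched anywhere in the tree, the same function works on a `<sitemapindex>` document, where the entries point to child sitemaps rather than pages.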
The repository also includes `count_mdx.py`, a utility script for counting Markdown files (both `.md` and `.mdx`) in a directory and its subdirectories.
`count_mdx.py` is a command-line utility that recursively searches a specified directory to find and count all files with the `.md` and `.mdx` extensions. These files are commonly used in modern web development frameworks like Next.js and Gatsby for content management, as well as in documentation projects and static site generators.
- Recursively searches directories and subdirectories
- Counts all files with `.md` and `.mdx` extensions (case-insensitive)
- Option to list all found Markdown files with their full paths
- Simple command-line interface with argument parsing
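The core of the script is a recursive walk with a case-insensitive extension check. A minimal sketch under those assumptions (the real script wraps this in an `argparse`-based CLI, and the function name here is illustrative):

```python
import os

def count_markdown_files(directory: str, list_files: bool = False) -> int:
    """Recursively count .md and .mdx files under a directory.

    (Illustrative sketch of count_mdx.py's core logic using os.walk.)
    """
    matches = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            if name.lower().endswith((".md", ".mdx")):   # case-insensitive match
                matches.append(os.path.join(root, name))
    matches.sort()                                       # alphabetical listing
    if list_files:
        for path in matches:
            print(f"  - {path}")
    return len(matches)
```

`os.walk` visits every subdirectory, so nothing extra is needed for recursion, and lowercasing the filename before the check makes `README.MD` count as well.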
Basic usage (count only):

```bash
python count_mdx.py /path/to/directory
```

Count and list all Markdown files:

```bash
python count_mdx.py /path/to/directory --list
```

Or using the short flag:

```bash
python count_mdx.py /path/to/directory -l
```

More examples:

```bash
# Count Markdown files in current directory
python count_mdx.py .

# Count and list Markdown files in a project directory
python count_mdx.py /Users/username/projects/my-blog --list

# Count Markdown files in a specific folder
python count_mdx.py ./content/blog
```

The script provides:
- Summary: Total count of Markdown files found
- File List (with `--list` flag): Complete list of all Markdown files with their full paths, sorted alphabetically
Example output:

```
Found 15 Markdown files in /path/to/directory

Markdown files found:
  - /path/to/directory/blog/post-1.md
  - /path/to/directory/blog/post-2.mdx
  - /path/to/directory/docs/getting-started.md
  - /path/to/directory/docs/installation.mdx
```
- Content Audits: Quickly assess the size of Markdown-based content repositories
- Migration Planning: Count Markdown files before migrating between frameworks
- Project Analysis: Understand the scope of content in documentation or blog projects
- Quality Assurance: Verify that all expected Markdown files are present in a build
- Python 3.6 or higher
- No additional dependencies (uses only standard library modules: `os` and `argparse`)