Website Link Checker

A fast CLI link checker that recursively crawls internal pages of a site and checks external links. It canonicalizes URLs (including removing tracking parameters), can optionally filter social share links, uses retry/backoff for HTTP checks, and prints localized output (EN/DE).

Warning

Use at your own risk. The author of this script assumes no responsibility or liability for any damage, data loss, downtime, or other issues caused by its use.

Features

  • Recursive crawl of internal pages (links from <a>; optionally assets from <img>, <script>, <link>)
  • External link checks with configurable concurrency (ThreadPool)
  • URL canonicalization: lowercase host/scheme, drop default ports, normalize index.html → /, sanitize query parameters (see the sketch after this list)
  • Remove known tracking parameters (utm_*, fbclid, gclid, …) or remove the entire query
  • Optional filtering of typical social share links (Twitter/X, Facebook, LinkedIn, WhatsApp, Telegram, …)
  • Robust HTTP checks (HEAD/GET with fallback), retries and exponential backoff
  • Localized output: automatic language detection (EN/DE) or via --lang
  • Compact summary and exit code 0/1 depending on broken links
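
The following is a minimal, illustrative sketch of the canonicalization steps described above, using only the standard library. It is not the script's actual code; the tracking-parameter lists are assumptions for the example.

from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

TRACKING_PARAMS = {"fbclid", "gclid"}        # exact names (illustrative)
TRACKING_PREFIXES = ("utm_",)                # name prefixes (illustrative)
DEFAULT_PORTS = {"http": 80, "https": 443}

def canonicalize(url: str) -> str:
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Drop the default port for the scheme (http:80, https:443).
    netloc = host
    if parts.port and parts.port != DEFAULT_PORTS.get(scheme):
        netloc = f"{host}:{parts.port}"
    # Normalize index files to the directory URL.
    path = parts.path
    if path.endswith("/index.html"):
        path = path[: -len("index.html")]
    # Drop tracking parameters and sort the rest for a stable form;
    # the fragment is discarded.
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k not in TRACKING_PARAMS and not k.startswith(TRACKING_PREFIXES)]
    return urlunsplit((scheme, netloc, path, urlencode(sorted(query)), ""))

print(canonicalize("HTTPS://Example.com:443/blog/index.html?utm_source=x&b=2&a=1"))
# -> https://example.com/blog/?a=1&b=2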

Requirements

  • Python 3.8+
  • Dependencies: requests, urllib3 (see requirements.txt)

Installation (optional):

pip install -r requirements.txt

Alternatively on Debian-based systems:

sudo apt install python3-requests python3-urllib3

Usage

Get the script and make it executable (Linux):

curl -O https://raw.githubusercontent.com/cryMG/website-link-checker/main/website-link-checker.py
chmod +x website-link-checker.py

Run the script directly:

./website-link-checker.py https://example.com

Example output:

Allowed schemes: http, https | Fetchable: http, https
Internal pages (final, unique, canonical): 81
Discovered internal link targets (unique, canonical): 81
External links (unique, canonical, after filters): 20
Total links found (before filters): 2634 | discarded: 60 | social filtered: 384 | scheme not allowed: 60 | scheme not supported: 0
Social filter (by service): facebook:48, linkedin:48, mastodon:48, pinterest:48, reddit:48, telegram:48, twitter:48, whatsapp:48

=== Results ===
No broken internal pages found.

No broken external links found.

Summary: Internal pages: 81 | Internal link targets: 81 | External links checked: 20 | Errors: 0

Tip

Use ./website-link-checker.py --help to see all options and their descriptions.

All parameters

  • start_url (positional)
    • Start URL, e.g., https://example.com
  • --include-assets
    • Also collect assets (img/script/link) (non-recursive). Default: off
  • --no-recursive-assets
    • When --include-assets is enabled, disable recursive asset scanning (e.g., CSS url() and @import). Default: recursive asset scan ON.
  • --max-workers INT
    • Concurrency for external link checks. Default: 16
  • --timeout INT
    • Per-request timeout in seconds. Default: 10
  • --retries INT
    • Automatic retries on errors. Default: 2
  • --backoff FLOAT
    • Exponential backoff factor. Default: 0.3
  • --sleep FLOAT
    • Short pause between internal page fetches in seconds. Default: 0.0
  • --user-agent STRING
    • User-Agent string (default is a sensible UA for this tool)
  • --insecure
    • Disable SSL certificate verification (only if necessary). Default: off
  • --ext-method {auto,head,get}
    • Method for external links. Default: auto (HEAD with GET fallback)
  • --verbose
    • Verbose output (fetch/redirects/errors). Default: off
  • --debug-links
    • Log every discovered link (source, raw, normalized, classification). Default: off
  • --strip-query {none,tracking,all}
    • Query parameter canonicalization. Default: tracking
  • --no-normalize-index
    • Do NOT normalize index files to directory URLs. Default: off (i.e., normalization is active)
  • --no-filter-social
    • Disable filtering of social share links. Default: filtering is ON
  • --schemes LIST
    • Comma-separated list of allowed URI schemes. Default: http,https. Only http/https are actively checked; other allowed schemes are counted/skipped.
  • --lang {auto,en,de}
    • Language for output. Default: auto. Note: help text follows the environment language (LANG, LC_ALL, …).
  • --report-file PATH
    • Optional path to write a report (console output still printed).
  • --report-format {text,json,csv}
    • Report format. Default: inferred from file extension, or text.
  • --progress
    • Show a simple progress indicator during external link checks.
  • --respect-robots
    • Respect robots.txt (Disallow rules); if Crawl-delay is set for your User-Agent, it is used as a minimum sleep (see the sketch after the Examples section).
  • --sitemap PATH_OR_URL
    • Optional sitemap (file path or URL). URLs found are used as crawl seeds (same host only).
  • --exclude REGEX (repeatable)
    • Exclude URL patterns (regular expressions). Can be used multiple times to ignore certain pages/paths during crawling and checking. Patterns match against canonical absolute URLs (see the sketch after this list).
  • --auth user:password
    • Use HTTP Basic Authentication for all requests.
  • --header "Name: value" (repeatable)
    • Add custom HTTP headers. You can pass multiple --header flags.
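
To illustrate --exclude, here is a hypothetical sketch of how such patterns could be applied. The README only states that patterns match against canonical absolute URLs, so the use of re.search (substring matching) is an assumption.

import re

# Compile the patterns once, then test each canonical URL against all of them.
excludes = [re.compile(p) for p in (r".*/privacy", r"https://example.com/(old|legacy)/.*")]

def is_excluded(canonical_url: str) -> bool:
    return any(p.search(canonical_url) for p in excludes)

print(is_excluded("https://example.com/legal/privacy"))  # True
print(is_excluded("https://example.com/old/post-1"))     # True
print(is_excluded("https://example.com/blog/"))          # False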

Examples

  • Basic crawl with verbose output:
./website-link-checker.py --verbose https://example.com
  • Include assets, strip tracking params (default), social filter enabled (default):
./website-link-checker.py --include-assets https://example.com
  • Remove all query parameters and force English output:
./website-link-checker.py --strip-query=all --lang=en https://example.com
  • Restrict schemes (non-http/https are counted, not checked):
./website-link-checker.py --schemes=http,https,mailto,tel https://example.com
  • Write a JSON report and show progress:
./website-link-checker.py --progress --report-file report.json https://example.com
  • Use a sitemap to seed the crawl:
./website-link-checker.py --sitemap https://example.com/sitemap.xml https://example.com
  • Exclude pages/paths by regex (repeat --exclude):
./website-link-checker.py --exclude ".*/privacy" --exclude "https://example.com/(old|legacy)/.*" https://example.com
  • Use Basic Auth and custom headers:
./website-link-checker.py --auth user:secret \
  --header "Accept-Language: de-DE" \
  --header "X-Debug: 1" \
  https://example.com
  • Respect robots.txt and use a custom User-Agent:
./website-link-checker.py --respect-robots --user-agent "MyChecker/1.0" https://example.com
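
As referenced in the --respect-robots description, the robots.txt behavior can be pictured roughly as in this standard-library sketch; the script's internals may differ.

from urllib.robotparser import RobotFileParser

UA = "MyChecker/1.0"
rp = RobotFileParser("https://example.com/robots.txt")
rp.read()

if rp.can_fetch(UA, "https://example.com/some/page"):
    # Crawl-delay (if set for this User-Agent) becomes a minimum sleep.
    delay = rp.crawl_delay(UA) or 0.0
    print(f"allowed; sleep at least {delay}s between fetches")
else:
    print("disallowed by robots.txt")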

Output and exit codes

  • Prints diagnostics (counts of internal pages, external links, filter statistics) and result lists.
  • Exit code: 0 if no broken links were found, 1 if at least one broken link was found (example below).
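
For CI pipelines, the exit code can be acted on programmatically. A minimal sketch, assuming the script sits in the working directory:

import subprocess, sys

result = subprocess.run(["./website-link-checker.py", "https://example.com"])
if result.returncode != 0:
    print("broken links found", file=sys.stderr)
sys.exit(result.returncode)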

Details

  • Only http/https are actively fetched; other allowed schemes are skipped and counted.
  • Language: detection via LC_ALL, LC_MESSAGES, LANG, LANGUAGE; override with --lang.
  • HTTP strategy: auto uses HEAD with a GET fallback (e.g., on 405/501). Accepted warning statuses: {401, 403, 999} (see the sketch after this list).
  • SSL: certificate verification is on by default; --insecure disables it (use only if necessary).
  • URL canonicalization: remove tracking params; sort remaining query params.
  • Compatibility: fallback for older urllib3 (method_whitelist vs allowed_methods).
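
The HTTP strategy and the urllib3 compatibility fallback can be sketched as follows. This is illustrative only: the retry defaults and accepted warning statuses are taken from this README, while the retryable status list (429/5xx) is an assumption.

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

WARN_OK = {401, 403, 999}  # accepted warning statuses

def make_session(retries: int = 2, backoff: float = 0.3) -> requests.Session:
    kwargs = dict(total=retries, backoff_factor=backoff,
                  status_forcelist=(429, 500, 502, 503, 504))
    try:
        retry = Retry(allowed_methods=None, **kwargs)   # urllib3 >= 1.26
    except TypeError:
        retry = Retry(method_whitelist=None, **kwargs)  # older urllib3
    session = requests.Session()
    session.mount("http://", HTTPAdapter(max_retries=retry))
    session.mount("https://", HTTPAdapter(max_retries=retry))
    return session

def check(session: requests.Session, url: str, timeout: int = 10) -> bool:
    # "auto": try HEAD first, fall back to GET if the server rejects HEAD.
    resp = session.head(url, allow_redirects=True, timeout=timeout)
    if resp.status_code in (405, 501):
        resp = session.get(url, allow_redirects=True, timeout=timeout, stream=True)
    return resp.status_code < 400 or resp.status_code in WARN_OK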

Tests

Using unittest:

python3 -m unittest -v

  • Unit tests: canonicalization, classification, HTML detection, social-share detection, HTTP check (fake session), i18n.
  • CLI tests: help (via LANG) and error paths.

Troubleshooting

  • Wrong language? Set LANG=en_US.UTF-8 or LANG=de_DE.UTF-8 when invoking.
  • SSL/rate limit issues: try --insecure (for testing only), tune --timeout/--retries/--backoff, or reduce --max-workers.

License

Copyright (c) 2025 cryeffect Media Group https://crymg.de

This project is licensed under the GNU General Public License v3 (GPLv3).
