A fast CLI link checker that recursively crawls internal pages of a site and checks external links. It canonicalizes URLs (incl. removing tracking params), can optionally filter social share links, uses retry/backoff for HTTP checks, and prints localized output (EN/DE).
**Warning:** Use at your own risk. The author of this script assumes no responsibility or liability for any damage, data loss, downtime, or other issues caused by its use.
- Recursive crawl of internal pages (links from `<a>`; optionally assets from `<img>`, `<script>`, `<link>`)
- External link checks with configurable concurrency (ThreadPool)
- URL canonicalization: lowercase host/scheme, drop default ports, `index.html` → `/`, sanitize query parameters (see the sketch after this list)
- Remove known tracking parameters (`utm_*`, `fbclid`, `gclid`, …) or remove the entire query
- Optional filtering of typical social share links (Twitter/X, Facebook, LinkedIn, WhatsApp, Telegram, …)
- Robust HTTP checks (HEAD/GET with fallback), retries and exponential backoff
- Localized output: automatic language detection (EN/DE) or via `--lang`
- Compact summary and exit code 0/1 depending on broken links
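The canonicalization and query-sanitizing rules above can be pictured with a short sketch. This is illustrative Python only, not the script's actual implementation; the tracking-parameter list here is abbreviated.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Abbreviated tracking-parameter list for illustration; the script's own list is longer.
TRACKING_PARAMS = {"fbclid", "gclid"}

def canonicalize(url: str, strip_query: str = "tracking") -> str:
    """Lowercase scheme/host, drop default ports, map index.html to the directory URL,
    and sanitize query parameters (mirrors the feature list above, not the real code)."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    # Keep only non-default ports (80 for http, 443 for https).
    if port and not ((scheme == "http" and port == 80) or (scheme == "https" and port == 443)):
        host = f"{host}:{port}"
    path = parts.path or "/"
    if path.endswith(("/index.html", "/index.htm")):
        path = path.rsplit("/", 1)[0] + "/"
    if strip_query == "all":
        query = ""
    else:
        pairs = parse_qsl(parts.query, keep_blank_values=True)
        if strip_query == "tracking":
            pairs = [(k, v) for k, v in pairs
                     if not k.lower().startswith("utm_") and k.lower() not in TRACKING_PARAMS]
        query = urlencode(sorted(pairs))  # sort remaining params for a stable canonical form
    return urlunsplit((scheme, host, path, query, ""))  # fragment is dropped

print(canonicalize("HTTP://Example.com:80/blog/index.html?utm_source=x&b=2&a=1"))
# -> http://example.com/blog/?a=1&b=2
```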
- Python 3.8+
- Dependencies: `requests`, `urllib3` (see `requirements.txt`)
Installation (optional):

```bash
pip install -r requirements.txt
```

Alternatively, on Debian-based systems:

```bash
sudo apt install python3-requests python3-urllib3
```

Get the script and make it executable (Linux):

```bash
curl -O https://raw.githubusercontent.com/cryMG/website-link-checker/main/website-link-checker.py
chmod +x website-link-checker.py
```

Run the script directly:

```bash
./website-link-checker.py https://example.com
```

Example output:

```text
Allowed schemes: http, https | Fetchable: http, https
Internal pages (final, unique, canonical): 81
Discovered internal link targets (unique, canonical): 81
External links (unique, canonical, after filters): 20
Total links found (before filters): 2634 | discarded: 60 | social filtered: 384 | scheme not allowed: 60 | scheme not supported: 0
Social filter (by service): facebook:48, linkedin:48, mastodon:48, pinterest:48, reddit:48, telegram:48, twitter:48, whatsapp:48
=== Results ===
No broken internal pages found.
No broken external links found.
Summary: Internal pages: 81 | Internal link targets: 81 | External links checked: 20 | Errors: 0
```
**Tip:** Use `./website-link-checker.py --help` to see all options and their descriptions.
- `start_url` (positional): Start URL, e.g., `https://example.com`
- `--include-assets`: Also collect assets (img/script/link) (non-recursive). Default: off
- `--no-recursive-assets`: When `--include-assets` is enabled, disable recursive asset scanning (e.g., CSS `url()` and `@import`). Default: recursive asset scan ON.
- `--max-workers INT`: Concurrency for external link checks. Default: 16
- `--timeout INT`: Per-request timeout in seconds. Default: 10
- `--retries INT`: Automatic retries on errors. Default: 2
- `--backoff FLOAT`: Exponential backoff factor. Default: 0.3
- `--sleep FLOAT`: Short pause between internal page fetches in seconds. Default: 0.0
- `--user-agent STRING`: User-Agent string (default is a sensible UA for this tool)
- `--insecure`: Disable SSL certificate verification (only if necessary). Default: off
- `--ext-method {auto,head,get}`: Method for external links. Default: `auto` (HEAD with GET fallback)
- `--verbose`: Verbose output (fetch/redirects/errors). Default: off
- `--debug-links`: Log every discovered link (source, raw, normalized, classification). Default: off
- `--strip-query {none,tracking,all}`: Query parameter canonicalization. Default: `tracking`
- `--no-normalize-index`: Do NOT normalize index files to directory URLs. Default: off (i.e., normalization is active)
- `--no-filter-social`: Disable filtering of social share links. Default: filtering is ON
- `--schemes LIST`: Comma-separated list of allowed URI schemes. Default: `http,https`. Only http/https are actively checked; other allowed schemes are counted/skipped.
- `--lang {auto,en,de}`: Language for output. Default: `auto`. Note: the help text follows the environment language (`LANG`, `LC_ALL`, …).
- `--report-file PATH`: Optional path to write a report (console output is still printed).
- `--report-format {text,json,csv}`: Report format. Default: inferred from the file extension, or `text`.
- `--progress`: Show a simple progress indicator during external link checks.
- `--respect-robots`: Respect robots.txt (Disallow rules); if a Crawl-delay is set for your User-Agent, it is used as a minimum sleep (see the sketch after this list).
- `--sitemap PATH_OR_URL`: Optional sitemap (file path or URL). URLs found there are used as crawl seeds (same host only).
- `--exclude REGEX` (repeatable): Exclude URL patterns (regular expressions). Can be used multiple times to ignore certain pages/paths during crawling and checking. Patterns match against canonical absolute URLs.
- `--auth user:password`: Use HTTP Basic Authentication for all requests.
- `--header "Name: value"` (repeatable): Add custom HTTP headers. You can pass multiple `--header` flags.
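As a rough illustration of what `--respect-robots` implies, the sketch below uses Python's standard library; it is not the script's actual code, and the URLs and User-Agent are placeholders.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "MyChecker/1.0"  # placeholder; the tool's own UA (or --user-agent) would be used

# Fetch and parse the site's robots.txt.
rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# Disallow rules: pages the site does not want crawled are skipped.
allowed = rp.can_fetch(USER_AGENT, "https://example.com/private/page.html")

# Crawl-delay: if set for this User-Agent, treat it as a minimum pause between fetches.
delay = rp.crawl_delay(USER_AGENT)  # None when robots.txt defines no Crawl-delay
min_sleep = delay if delay is not None else 0.0

print(f"allowed={allowed}, minimum sleep={min_sleep}s")
```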
- Basic crawl with verbose output:

  ```bash
  ./website-link-checker.py --verbose https://example.com
  ```

- Include assets, strip tracking params (default), social filter enabled (default):

  ```bash
  ./website-link-checker.py --include-assets https://example.com
  ```

- Remove all query parameters and force English output:

  ```bash
  ./website-link-checker.py --strip-query=all --lang=en https://example.com
  ```

- Restrict schemes (non-http/https are counted, not checked):

  ```bash
  ./website-link-checker.py --schemes=http,https,mailto,tel https://example.com
  ```

- Write a JSON report and show progress:

  ```bash
  ./website-link-checker.py --progress --report-file report.json https://example.com
  ```

- Use a sitemap to seed the crawl:

  ```bash
  ./website-link-checker.py --sitemap https://example.com/sitemap.xml https://example.com
  ```

- Exclude pages/paths by regex (repeat `--exclude`):

  ```bash
  ./website-link-checker.py --exclude ".*/privacy" --exclude "https://example.com/(old|legacy)/.*" https://example.com
  ```

- Use Basic Auth and custom headers:

  ```bash
  ./website-link-checker.py --auth user:secret \
    --header "Accept-Language: de-DE" \
    --header "X-Debug: 1" \
    https://example.com
  ```

- Respect robots.txt and use a custom User-Agent:

  ```bash
  ./website-link-checker.py --respect-robots --user-agent "MyChecker/1.0" https://example.com
  ```

- Diagnostics (number of internal pages, external links, stats) and result lists.
- Exit code: `0` = no broken links, `1` = at least one broken link.
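For CI pipelines, that exit code can be consumed directly. A minimal wrapper sketch (hypothetical, assuming the invocation shown in the examples above):

```python
# Hypothetical CI wrapper: run the checker and fail the build on broken links.
import subprocess
import sys

result = subprocess.run(
    ["./website-link-checker.py", "--report-file", "report.json", "https://example.com"]
)
if result.returncode != 0:  # 1 = at least one broken link
    print("Broken links detected; see report.json", file=sys.stderr)
sys.exit(result.returncode)
```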
- Only `http`/`https` are actively fetched; other allowed schemes are skipped and counted.
- Language: detection via `LC_ALL`, `LC_MESSAGES`, `LANG`, `LANGUAGE`; override with `--lang`.
- HTTP strategy: `auto` uses HEAD with a GET fallback (e.g., on 405/501). Accepted warning statuses: `{401, 403, 999}`.
- SSL: certificate verification is on; `--insecure` disables it (use only if necessary).
- URL canonicalization: remove tracking params; sort remaining query params.
- Compatibility: fallback for older `urllib3` (`method_whitelist` vs. `allowed_methods`); see the sketch below.
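A minimal sketch of the retry/backoff and HEAD-with-GET-fallback strategy described in these notes, including the `urllib3` compatibility shim. This is illustrative only; the script's internals (retry status codes, method set, etc.) may differ.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def build_session(retries: int = 2, backoff: float = 0.3) -> requests.Session:
    """Session with automatic retries and exponential backoff (defaults mirror the CLI)."""
    methods = frozenset(["HEAD", "GET", "OPTIONS"])
    kwargs = dict(total=retries, backoff_factor=backoff,
                  status_forcelist=(429, 500, 502, 503, 504))
    try:
        retry = Retry(allowed_methods=methods, **kwargs)   # urllib3 >= 1.26
    except TypeError:
        retry = Retry(method_whitelist=methods, **kwargs)  # older urllib3
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session

def check_external(session: requests.Session, url: str, timeout: int = 10) -> int:
    """HEAD first; fall back to GET when the server rejects HEAD (e.g., 405/501)."""
    resp = session.head(url, allow_redirects=True, timeout=timeout)
    if resp.status_code in (405, 501):
        resp = session.get(url, allow_redirects=True, timeout=timeout, stream=True)
    return resp.status_code
```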
Using unittest:
```bash
python3 -m unittest -v
```

- Unit tests: canonicalization, classification, HTML detection, social-share detection, HTTP check (fake session), i18n.
- CLI tests: help (via `LANG`) and error paths.
- Wrong language? Set `LANG=en_US.UTF-8` or `LANG=de_DE.UTF-8` when invoking.
- SSL or rate-limit issues: try `--insecure` (for testing only), tune `--timeout`/`--retries`/`--backoff`, or reduce `--max-workers`.
Copyright (c) 2025 cryeffect Media Group https://crymg.de
This project is licensed under the GNU General Public License v3 (GPLv3).
- SPDX-License-Identifier: GPL-3.0-only
- Full text: LICENSE file or https://www.gnu.org/licenses/gpl-3.0.html