Website Link Checker

A fast CLI link checker that recursively crawls internal pages of a site and checks external links. It canonicalizes URLs (including removing tracking parameters), can optionally filter social share links, uses retry/backoff for HTTP checks, and prints localized output (EN/DE).

Warning

Use at your own risk. The author of this script assumes no responsibility or liability for any damage, data loss, downtime, or other issues caused by its use.

Features

  • Recursive crawl of internal pages (links from <a>; optionally assets from <img>, <script>, <link>)
  • External link checks with configurable concurrency (ThreadPool)
  • URL canonicalization: lowercase host/scheme, drop default ports, normalize index.html → /, sanitize query parameters (see the sketch after this list)
  • Remove known tracking parameters (utm_*, fbclid, gclid, …) or remove the entire query
  • Optional filtering of typical social share links (Twitter/X, Facebook, LinkedIn, WhatsApp, Telegram, …)
  • Robust HTTP checks (HEAD/GET with fallback), retries and exponential backoff
  • Localized output: automatic language detection (EN/DE) or via --lang
  • Compact summary and exit code 0/1 depending on broken links
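
The following is a minimal, illustrative sketch of the canonicalization steps described above, using only the standard library. It is not the script's actual code; the tracking-parameter lists are assumptions for the example.

from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

TRACKING_PARAMS = {"fbclid", "gclid"}        # exact names (illustrative)
TRACKING_PREFIXES = ("utm_",)                # name prefixes (illustrative)
DEFAULT_PORTS = {"http": 80, "https": 443}

def canonicalize(url: str) -> str:
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Drop the default port for the scheme (http:80, https:443).
    netloc = host
    if parts.port and parts.port != DEFAULT_PORTS.get(scheme):
        netloc = f"{host}:{parts.port}"
    # Normalize index files to the directory URL.
    path = parts.path
    if path.endswith("/index.html"):
        path = path[: -len("index.html")]
    # Drop tracking parameters and sort the rest for a stable form;
    # the fragment is discarded.
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k not in TRACKING_PARAMS and not k.startswith(TRACKING_PREFIXES)]
    return urlunsplit((scheme, netloc, path, urlencode(sorted(query)), ""))

print(canonicalize("HTTPS://Example.com:443/blog/index.html?utm_source=x&b=2&a=1"))
# -> https://example.com/blog/?a=1&b=2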

Requirements

  • Python 3.8+
  • Dependencies: requests, urllib3 (see requirements.txt)

Installation (optional):

pip install -r requirements.txt

Alternatively on Debian-based systems:

sudo apt install python3-requests python3-urllib3

Usage

Get the script and make it executable (Linux):

curl -O https://raw.githubusercontent.com/cryMG/website-link-checker/main/website-link-checker.py
chmod +x website-link-checker.py

Run the script directly:

./website-link-checker.py https://example.com

Example output:

Allowed schemes: http, https | Fetchable: http, https
Internal pages (final, unique, canonical): 81
Discovered internal link targets (unique, canonical): 81
External links (unique, canonical, after filters): 20
Total links found (before filters): 2634 | discarded: 60 | social filtered: 384 | scheme not allowed: 60 | scheme not supported: 0
Social filter (by service): facebook:48, linkedin:48, mastodon:48, pinterest:48, reddit:48, telegram:48, twitter:48, whatsapp:48

=== Results ===
No broken internal pages found.

No broken external links found.

Summary: Internal pages: 81 | Internal link targets: 81 | External links checked: 20 | Errors: 0

Tip

Use ./website-link-checker.py --help to see all options and their descriptions.

All parameters

  • start_url (positional)
    • Start URL, e.g., https://example.com
  • --include-assets
    • Also collect assets (img/script/link) (non-recursive). Default: off
  • --no-recursive-assets
    • When --include-assets is enabled, disable recursive asset scanning (e.g., CSS url() and @import). Default: recursive asset scan ON.
  • --max-workers INT
    • Concurrency for external link checks. Default: 16
  • --timeout INT
    • Per-request timeout in seconds. Default: 10
  • --retries INT
    • Automatic retries on errors. Default: 2
  • --backoff FLOAT
    • Exponential backoff factor. Default: 0.3
  • --sleep FLOAT
    • Short pause between internal page fetches in seconds. Default: 0.0
  • --user-agent STRING
    • User-Agent string (default is a sensible UA for this tool)
  • --insecure
    • Disable SSL certificate verification (only if necessary). Default: off
  • --ext-method {auto,head,get}
    • Method for external links. Default: auto (HEAD with GET fallback)
  • --verbose
    • Verbose output (fetch/redirects/errors). Default: off
  • --debug-links
    • Log every discovered link (source, raw, normalized, classification). Default: off
  • --strip-query {none,tracking,all}
    • Query parameter canonicalization. Default: tracking
  • --no-normalize-index
    • Do NOT normalize index files to directory URLs. Default: off (i.e., normalization is active)
  • --no-filter-social
    • Disable filtering of social share links. Default: filtering is ON
  • --schemes LIST
    • Comma-separated list of allowed URI schemes. Default: http,https. Only http/https are actively checked; other allowed schemes are counted/skipped.
  • --lang {auto,en,de}
    • Language for output. Default: auto. Note: help text follows the environment language (LANG, LC_ALL, …).
  • --report-file PATH
    • Optional path to write a report (console output still printed).
  • --report-format {text,json,csv}
    • Report format. Default: inferred from file extension, or text.
  • --progress
    • Show a simple progress indicator during external link checks.
  • --respect-robots
    • Respect robots.txt (Disallow rules); if Crawl-delay is set for your User-Agent, it is used as a minimum sleep (see the sketch after the Examples section).
  • --sitemap PATH_OR_URL
    • Optional sitemap (file path or URL). URLs found are used as crawl seeds (same host only).
  • --exclude REGEX (repeatable)
    • Exclude URL patterns (regular expressions). Can be used multiple times to ignore certain pages/paths during crawling and checking. Patterns match against canonical absolute URLs (see the sketch after this list).
  • --auth user:password
    • Use HTTP Basic Authentication for all requests.
  • --header "Name: value" (repeatable)
    • Add custom HTTP headers. You can pass multiple --header flags.
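
To illustrate --exclude, here is a hypothetical sketch of how such patterns could be applied. The README only states that patterns match against canonical absolute URLs, so the use of re.search (substring matching) is an assumption.

import re

# Compile the patterns once, then test each canonical URL against all of them.
excludes = [re.compile(p) for p in (r".*/privacy", r"https://example.com/(old|legacy)/.*")]

def is_excluded(canonical_url: str) -> bool:
    return any(p.search(canonical_url) for p in excludes)

print(is_excluded("https://example.com/legal/privacy"))  # True
print(is_excluded("https://example.com/old/post-1"))     # True
print(is_excluded("https://example.com/blog/"))          # False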

Examples

  • Basic crawl with verbose output:
./website-link-checker.py --verbose https://example.com
  • Include assets, strip tracking params (default), social filter enabled (default):
./website-link-checker.py --include-assets https://example.com
  • Remove all query parameters and force English output:
./website-link-checker.py --strip-query=all --lang=en https://example.com
  • Restrict schemes (non-http/https are counted, not checked):
./website-link-checker.py --schemes=http,https,mailto,tel https://example.com
  • Write a JSON report and show progress:
./website-link-checker.py --progress --report-file report.json https://example.com
  • Use a sitemap to seed the crawl:
./website-link-checker.py --sitemap https://example.com/sitemap.xml https://example.com
  • Exclude pages/paths by regex (repeat --exclude):
./website-link-checker.py --exclude ".*/privacy" --exclude "https://example.com/(old|legacy)/.*" https://example.com
  • Use Basic Auth and custom headers:
./website-link-checker.py --auth user:secret \
  --header "Accept-Language: de-DE" \
  --header "X-Debug: 1" \
  https://example.com
  • Respect robots.txt and use a custom User-Agent:
./website-link-checker.py --respect-robots --user-agent "MyChecker/1.0" https://example.com
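
As referenced in the --respect-robots description, the robots.txt behavior can be pictured roughly as in this standard-library sketch; the script's internals may differ.

from urllib.robotparser import RobotFileParser

UA = "MyChecker/1.0"
rp = RobotFileParser("https://example.com/robots.txt")
rp.read()

if rp.can_fetch(UA, "https://example.com/some/page"):
    # Crawl-delay (if set for this User-Agent) becomes a minimum sleep.
    delay = rp.crawl_delay(UA) or 0.0
    print(f"allowed; sleep at least {delay}s between fetches")
else:
    print("disallowed by robots.txt")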

Output and exit codes

  • Prints diagnostics (counts of internal pages, external links, filter statistics) and result lists.
  • Exit code: 0 if no broken links were found, 1 if at least one broken link was found (example below).
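
For CI pipelines, the exit code can be acted on programmatically. A minimal sketch, assuming the script sits in the working directory:

import subprocess, sys

result = subprocess.run(["./website-link-checker.py", "https://example.com"])
if result.returncode != 0:
    print("broken links found", file=sys.stderr)
sys.exit(result.returncode)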

Details

  • Only http/https are actively fetched; other allowed schemes are skipped and counted.
  • Language: detection via LC_ALL, LC_MESSAGES, LANG, LANGUAGE; override with --lang.
  • HTTP strategy: auto uses HEAD with a GET fallback (e.g., on 405/501). Accepted warning statuses: {401, 403, 999} (see the sketch after this list).
  • SSL: certificate verification is on by default; --insecure disables it (use only if necessary).
  • URL canonicalization: remove tracking params; sort remaining query params.
  • Compatibility: fallback for older urllib3 (method_whitelist vs allowed_methods).
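
The HTTP strategy and the urllib3 compatibility fallback can be sketched as follows. This is illustrative only: the retry defaults and accepted warning statuses are taken from this README, while the retryable status list (429/5xx) is an assumption.

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

WARN_OK = {401, 403, 999}  # accepted warning statuses

def make_session(retries: int = 2, backoff: float = 0.3) -> requests.Session:
    kwargs = dict(total=retries, backoff_factor=backoff,
                  status_forcelist=(429, 500, 502, 503, 504))
    try:
        retry = Retry(allowed_methods=None, **kwargs)   # urllib3 >= 1.26
    except TypeError:
        retry = Retry(method_whitelist=None, **kwargs)  # older urllib3
    session = requests.Session()
    session.mount("http://", HTTPAdapter(max_retries=retry))
    session.mount("https://", HTTPAdapter(max_retries=retry))
    return session

def check(session: requests.Session, url: str, timeout: int = 10) -> bool:
    # "auto": try HEAD first, fall back to GET if the server rejects HEAD.
    resp = session.head(url, allow_redirects=True, timeout=timeout)
    if resp.status_code in (405, 501):
        resp = session.get(url, allow_redirects=True, timeout=timeout, stream=True)
    return resp.status_code < 400 or resp.status_code in WARN_OK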

Tests

Using unittest:

python3 -m unittest -v

  • Unit tests: canonicalization, classification, HTML detection, social-share detection, HTTP check (fake session), i18n.
  • CLI tests: help (via LANG) and error paths.

Troubleshooting

  • Wrong language? Set LANG=en_US.UTF-8 or LANG=de_DE.UTF-8 when invoking.
  • SSL/rate limit issues: try --insecure (for testing only), tune --timeout/--retries/--backoff, or reduce --max-workers.

License

Copyright (c) 2025 cryeffect Media Group https://crymg.de

This project is licensed under the GNU General Public License v3 (GPLv3).
