🕷️ Sitemap SEO Crawler

A fast, threaded Python crawler that audits an entire website for SEO issues using its sitemap. It parses the sitemap index, crawls every URL in parallel, and produces a fully formatted 7-sheet Excel report covering titles, meta descriptions, canonicals, structured data, duplicate content, thin pages, coming-soon pages, URL issues, and more.


✨ Features

  • No config editing required — enter your site URL when prompted at runtime
  • Sitemap-driven — reads the sitemap index and all sub-sitemaps automatically
  • Multithreaded — crawls multiple URLs simultaneously for speed
  • Crash-safe — saves a plain-text URL backup before crawling begins
  • Bilingual detection — flags "coming soon" pages in both French and English
  • Duplicate detection — identifies pages sharing identical titles or meta descriptions
  • Rich Excel output — colour-coded, filterable, 7-sheet workbook
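The multithreaded crawl described above follows a standard worker-pool pattern. A minimal sketch of the idea — `crawl_url` here is a stand-in for the script's real per-page audit, and the `fetch` parameter is added only to make the sketch self-contained, not part of the actual code:

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.request import urlopen

def crawl_url(url, delay=0.1, timeout=30):
    """Fetch one page and return (url, status, elapsed_ms) — stand-in for the real audit."""
    start = time.monotonic()
    with urlopen(url, timeout=timeout) as resp:
        status = resp.status
        resp.read()
    time.sleep(delay)  # per-thread politeness delay
    return url, status, int((time.monotonic() - start) * 1000)

def crawl_all(urls, fetch=crawl_url, threads=5):
    """Crawl every URL in parallel, collecting successes and errors separately."""
    results, errors = [], []
    with ThreadPoolExecutor(max_workers=threads) as pool:
        futures = {pool.submit(fetch, u): u for u in urls}
        for fut in as_completed(futures):
            try:
                results.append(fut.result())
            except Exception as exc:
                errors.append((futures[fut], str(exc)))
    return results, errors
```

Collecting errors instead of raising is what lets a single dead URL show up as a CRITICAL row in the report rather than aborting the whole crawl.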

📋 Requirements

Python 3.8 or higher

Install all required packages with:

pip install requests beautifulsoup4 lxml openpyxl
| Package | Purpose |
| --- | --- |
| requests | HTTP requests and redirect chain tracking |
| beautifulsoup4 | HTML / XML parsing |
| lxml | Fast XML/HTML parser backend |
| openpyxl | Excel workbook generation and styling |

🚀 Step-by-Step Usage Guide

Step 1 — Clone or download the repository

git clone https://github.com/your-username/sitemap-seo-crawler.git
cd sitemap-seo-crawler

Step 2 — Install dependencies

pip install requests beautifulsoup4 lxml openpyxl

Step 3 — Run the crawler

python sitemap_crawler.py

The script will prompt you for a few inputs. Press Enter to accept the default value shown in brackets:

=== Sitemap SEO Crawler ===

Sitemap URL (e.g. https://example.com/sitemap.xml): https://your-site.com/sitemap.xml
Output Excel file [your_site_com_seo_crawl.xlsx]:
URL backup file  [your_site_com_urls.txt]:
Threads [5]:
| Prompt | Description | Default |
| --- | --- | --- |
| Sitemap URL | Full URL to your sitemap index | (required) |
| Output Excel file | Name of the generated .xlsx report | {domain}_seo_crawl.xlsx |
| URL backup file | Plain-text list of all discovered URLs | {domain}_urls.txt |
| Threads | Number of parallel crawl threads | 5 |
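Judging by the sample prompts, the `{domain}` slug in the defaults is derived from the sitemap URL's hostname. A plausible sketch — the helper name is hypothetical, not taken from the script:

```python
from urllib.parse import urlparse

def default_names(sitemap_url):
    """Derive default output filenames from the sitemap's hostname (hypothetical helper)."""
    host = urlparse(sitemap_url).netloc.lower()
    slug = host.replace("-", "_").replace(".", "_")
    return f"{slug}_seo_crawl.xlsx", f"{slug}_urls.txt"

# default_names("https://your-site.com/sitemap.xml")
# → ("your_site_com_seo_crawl.xlsx", "your_site_com_urls.txt")
```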

Step 4 — Watch the live progress

📋 Fetching sitemap index: https://your-site.com/sitemap.xml
  → Found 4 sub-sitemaps
  ✓ sitemap-posts.xml       →  1,240 URLs
  ✓ sitemap-pages.xml       →     88 URLs
  ...
✅ Total URLs discovered: 1,328

🕷  Starting crawl: 1,328 URLs | 5 threads | 0.1s delay
  [   100/1,328]   7.5% | errors: 0 | coming soon: 3
  ...
✅ Crawl complete: 1,328 pages | 2 errors | 5 coming soon pages

📊 Excel saved: your_site_com_seo_crawl.xlsx

Step 5 — Open the Excel report

The output .xlsx file will appear in the same folder as the script. Open it with Excel, LibreOffice Calc, or Google Sheets.


📊 Output Files

{domain}_seo_crawl.xlsx — Main report (7 sheets)

| Sheet | Contents |
| --- | --- |
| 📊 Summary | Headline KPIs: total pages, error count, missing titles/metas, thin content, coming-soon pages |
| 🔍 All Pages | One row per URL — all SEO fields (see full column list below) |
| 🚧 Coming Soon | Pages where "coming soon", "bientôt disponible", or similar patterns were detected |
| ⚠ Issues | Prioritised issue log (CRITICAL / HIGH / MEDIUM) for every problem found |
| 🔗 URL Issues | Pages with spaces, uppercase letters, or session IDs in their URLs |
| 🗺 Sitemap | Every URL from the sitemap with its lastmod, changefreq, priority, and HTTP status |
| 🔁 Duplicates | Grouped list of pages sharing the same title or meta description |

{domain}_urls.txt — URL backup

A plain-text list of every URL found in the sitemap, written before crawling begins. Useful for resuming or debugging after a crash.
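A sitemap index and its sub-sitemaps share the standard `<sitemapindex>` / `<urlset>` structure defined by the Sitemaps protocol. The script uses beautifulsoup4 with the lxml backend, but the same structure can be walked with the stdlib alone — a minimal sketch (fetching omitted):

```python
import xml.etree.ElementTree as ET

# Standard Sitemaps protocol namespace
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def parse_sitemap(xml_text):
    """Return ('index', [sub-sitemap URLs]) or ('urlset', [page URLs])."""
    root = ET.fromstring(xml_text)
    if root.tag.endswith("sitemapindex"):
        return "index", [loc.text.strip() for loc in root.iterfind("sm:sitemap/sm:loc", NS)]
    return "urlset", [loc.text.strip() for loc in root.iterfind("sm:url/sm:loc", NS)]
```

An index yields sub-sitemap URLs, each of which is fetched and parsed in turn to produce the final page list written to the backup file.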


📑 Excel Column Reference — All Pages sheet

| Column | Description |
| --- | --- |
| URL | Full page URL |
| Status | HTTP status code (200, 301, 404, etc.) |
| Response (ms) | Server response time in milliseconds |
| Redirect Chain | Intermediate redirect steps, e.g. 301→https://... |
| Sitemap Source | Name of the sub-sitemap file this URL came from |
| Last Modified | lastmod value from the sitemap |
| Priority | priority value from the sitemap |
| Change Freq | changefreq value from the sitemap |
| Title | Page <title> text |
| Title Len | Character count of the title |
| Meta Description | Content of the <meta name="description"> tag |
| Desc Len | Character count of the meta description |
| H1 Status | OK, MISSING, or MULTIPLE (n) |
| H1 Text | Text content of the H1 tag(s) |
| Word Count | Visible body word count (scripts/nav/footer excluded) |
| Thin Content | Yes if word count < 300, otherwise OK |
| Coming Soon | ⚠ YES if a coming-soon pattern was detected |
| Coming Soon Text | Snippet of text that triggered the detection |
| Has Schema | Yes / No — JSON-LD structured data present |
| Schema Types | Comma-separated list of @type values found |
| Canonical Type | self-referencing, points elsewhere, or MISSING |
| Canonical URL | Value of the <link rel="canonical"> tag |
| Robots Meta | Content of <meta name="robots"> |
| Noindex | Yes if the robots meta contains noindex |
| Has Hreflang | Yes / No |
| Hreflang Langs | Comma-separated list of declared hreflang values |
| Meta Keywords | Content of <meta name="keywords"> (legacy) |
| OG Title | og:title Open Graph value |
| OG Description | og:description Open Graph value |
| OG Image | og:image Open Graph value |
| Total Images | Number of <img> tags on the page |
| Missing Alt | Count of images with no or empty alt attribute |
| Internal Links | Links pointing to the same domain |
| External Links | Links pointing to other domains |
| URL Issues | OK, or description of problems (spaces, uppercase, session ID) |
| Dup Title | Dup if this title appears on more than one page |
| Dup Meta Desc | Dup if this meta description appears on more than one page |
| H2s (first 5) | First 5 H2 headings, pipe-separated |
| Error | Crawl error message if the request failed |
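The Dup Title and Dup Meta Desc columns amount to counting each value across all crawled pages and flagging any value seen more than once. A stdlib sketch of the idea — the field names are illustrative, not the script's internal keys:

```python
from collections import Counter

def flag_duplicates(pages, field):
    """Mark each page 'Dup' when its value for `field` appears on more than one page."""
    counts = Counter(p[field] for p in pages if p.get(field))
    return [
        {**p, f"dup_{field}": "Dup" if counts.get(p.get(field), 0) > 1 else ""}
        for p in pages
    ]
```

Pages with an empty value are skipped when counting, so two pages that both lack a title are not reported as duplicates of each other.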

⚠️ Issue Severity Reference

| Severity | Colour | Triggered by |
| --- | --- | --- |
| CRITICAL | 🔴 Red | Crawl errors, HTTP 4xx/5xx, missing H1, missing title, coming-soon pages |
| HIGH | 🟠 Orange | Multiple H1s, missing meta description, duplicate titles/metas, no canonical, no schema, URL issues, thin content, slow response (>3 s) |
| MEDIUM | 🟡 Yellow | Title too long (>60 chars) or too short (<30), meta description too long (>160) or too short (<70), noindex tag, missing alt text, no hreflang |
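The MEDIUM-severity length thresholds are simple range checks. A sketch mirroring the table above — the function name is hypothetical, not the script's actual helper:

```python
def length_issues(title, meta_desc):
    """Return MEDIUM-severity length issues per the thresholds in the table above."""
    issues = []
    if len(title) > 60:
        issues.append("Title too long (>60 chars)")
    elif len(title) < 30:
        issues.append("Title too short (<30 chars)")
    if len(meta_desc) > 160:
        issues.append("Meta description too long (>160 chars)")
    elif len(meta_desc) < 70:
        issues.append("Meta description too short (<70 chars)")
    return issues
```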

🔧 Customisation

  • Coming-soon patterns — extend COMING_SOON_PATTERNS at the top of the file with your own keywords
  • Thin content threshold — change the < 300 word count check inside crawl_url()
  • Request delay — adjust DELAY (seconds between requests per thread) to be more or less aggressive
  • Timeout / retries — adjust TIMEOUT and MAX_RETRIES for slow or unreliable servers
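Extending COMING_SOON_PATTERNS might look like this. The pattern list shown is illustrative — the script's actual patterns may differ — but it demonstrates the bilingual, case-insensitive matching the feature list describes:

```python
import re

# Illustrative bilingual patterns — the script's actual COMING_SOON_PATTERNS may differ
COMING_SOON_PATTERNS = [
    r"coming\s+soon",
    r"under\s+construction",
    r"bient[oô]t\s+disponible",
    r"en\s+construction",
]

def detect_coming_soon(text):
    """Return the first matching snippet (for the Coming Soon Text column), or None."""
    for pattern in COMING_SOON_PATTERNS:
        m = re.search(pattern, text, re.IGNORECASE)
        if m:
            return m.group(0)
    return None
```

Returning the matched snippet rather than a bare boolean is what lets the report show *which* phrase triggered the detection.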

📄 License

MIT License — free to use, modify, and distribute.
