A fast, threaded Python crawler that audits an entire website for SEO issues using its sitemap. It parses the sitemap index, crawls every URL in parallel, and produces a fully formatted 7-sheet Excel report covering titles, meta descriptions, canonicals, structured data, duplicate content, thin pages, coming-soon pages, URL issues, and more.
- No config editing required — enter your site URL when prompted at runtime
- Sitemap-driven — reads the sitemap index and all sub-sitemaps automatically
- Multithreaded — crawls multiple URLs simultaneously for speed
- Crash-safe — saves a plain-text URL backup before crawling begins
- Bilingual detection — flags "coming soon" pages in both French and English
- Duplicate detection — identifies pages sharing identical titles or meta descriptions
- Rich Excel output — colour-coded, filterable, 7-sheet workbook
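The parallel crawl maps naturally onto `concurrent.futures`; here is a minimal sketch of the pattern (function and field names are illustrative, not the script's actual API):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

import requests

def crawl_url(url, timeout=10):
    """Fetch one URL; return the fields a report row needs (sketch)."""
    try:
        resp = requests.get(url, timeout=timeout, allow_redirects=True)
        return {"url": url, "status": resp.status_code, "error": ""}
    except requests.RequestException as exc:
        return {"url": url, "status": None, "error": str(exc)}

def crawl_all(urls, threads=5):
    """Fan the URL list out over a thread pool, collecting results as they finish."""
    results = []
    with ThreadPoolExecutor(max_workers=threads) as pool:
        futures = [pool.submit(crawl_url, u) for u in urls]
        for future in as_completed(futures):
            results.append(future.result())
    return results
```

Because results arrive via `as_completed`, slow pages never block the rest of the queue.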
Python 3.8 or higher
Install all required packages with:
```bash
pip install requests beautifulsoup4 lxml openpyxl
```

| Package | Purpose |
|---|---|
| `requests` | HTTP requests and redirect chain tracking |
| `beautifulsoup4` | HTML / XML parsing |
| `lxml` | Fast XML/HTML parser backend |
| `openpyxl` | Excel workbook generation and styling |
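As a sketch of how these libraries fit together for the sitemap discovery step (the `"xml"` feature of `BeautifulSoup` is backed by `lxml`; function names here are illustrative):

```python
import requests
from bs4 import BeautifulSoup

def extract_locs(xml_text):
    """Pull every <loc> value out of a sitemap or sitemap-index document."""
    soup = BeautifulSoup(xml_text, "xml")  # "xml" parser uses the lxml backend
    return [loc.get_text(strip=True) for loc in soup.find_all("loc")]

def discover_urls(index_url, timeout=10):
    """Fetch the sitemap index, then every sub-sitemap it points at."""
    session = requests.Session()
    urls = []
    for entry in extract_locs(session.get(index_url, timeout=timeout).text):
        if entry.endswith(".xml"):  # sub-sitemap: fetch and expand it
            urls.extend(extract_locs(session.get(entry, timeout=timeout).text))
        else:                       # plain sitemap: the entry is a page URL
            urls.append(entry)
    return urls
```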
```bash
git clone https://github.com/your-username/sitemap-seo-crawler.git
cd sitemap-seo-crawler
pip install requests beautifulsoup4 lxml openpyxl
python sitemap_crawler.py
```

The script will prompt you for a few inputs. Press Enter to accept the default value shown in brackets:
```text
=== Sitemap SEO Crawler ===
Sitemap URL (e.g. https://example.com/sitemap.xml): https://your-site.com/sitemap.xml
Output Excel file [your_site_com_seo_crawl.xlsx]:
URL backup file [your_site_com_urls.txt]:
Threads [5]:
```
| Prompt | Description | Default |
|---|---|---|
| Sitemap URL | Full URL to your sitemap index | (required) |
| Output Excel file | Name of the generated `.xlsx` report | `{domain}_seo_crawl.xlsx` |
| URL backup file | Plain-text list of all discovered URLs | `{domain}_urls.txt` |
| Threads | Number of parallel crawl threads | 5 |
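The prompt-with-default pattern and the `{domain}` filename derivation are both small; a sketch of each (the injectable `reader` argument is only there to make the helper testable):

```python
from urllib.parse import urlparse

def ask(prompt, default=None, reader=input):
    """Prompt on stdin; an empty answer falls back to the default."""
    suffix = f" [{default}]" if default is not None else ""
    answer = reader(f"{prompt}{suffix}: ").strip()
    return answer or default

def default_names(sitemap_url):
    """Turn a sitemap URL into the {domain}_... default filenames."""
    domain = urlparse(sitemap_url).netloc.replace(".", "_").replace("-", "_")
    return f"{domain}_seo_crawl.xlsx", f"{domain}_urls.txt"
```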
```text
📋 Fetching sitemap index: https://your-site.com/sitemap.xml
   → Found 4 sub-sitemaps
   ✓ sitemap-posts.xml → 1,240 URLs
   ✓ sitemap-pages.xml → 88 URLs
   ...
✅ Total URLs discovered: 1,328
🕷 Starting crawl: 1,328 URLs | 5 threads | 0.1s delay
[ 100/1,328] 7.5% | errors: 0 | coming soon: 3
...
✅ Crawl complete: 1,328 pages | 2 errors | 5 coming soon pages
📊 Excel saved: your_site_com_seo_crawl.xlsx
```
The output .xlsx file will appear in the same folder as the script. Open it with Excel, LibreOffice Calc, or Google Sheets.
| Sheet | Contents |
|---|---|
| 📊 Summary | Headline KPIs: total pages, error count, missing titles/metas, thin content, coming-soon pages |
| 🔍 All Pages | One row per URL — all SEO fields (see full column list below) |
| 🚧 Coming Soon | Pages where "coming soon", "bientôt disponible", or similar patterns were detected |
| ⚠ Issues | Prioritised issue log (CRITICAL / HIGH / MEDIUM) for every problem found |
| 🔗 URL Issues | Pages with spaces, uppercase letters, or session IDs in their URLs |
| 🗺 Sitemap | Every URL from the sitemap with its lastmod, changefreq, priority, and HTTP status |
| 🔁 Duplicates | Grouped list of pages sharing the same title or meta description |
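As a sketch of how one such sheet is produced with `openpyxl` (a cut-down "All Pages" writer with a styled, filterable header; the column subset and colours here are illustrative, not the script's exact styling):

```python
from openpyxl import Workbook
from openpyxl.styles import Font, PatternFill

def write_all_pages(pages, path):
    """Write a minimal 'All Pages' sheet with a styled, filterable header."""
    wb = Workbook()
    ws = wb.active
    ws.title = "All Pages"
    ws.append(["URL", "Status", "Title", "Title Len"])  # subset of columns
    for cell in ws[1]:                                  # bold white-on-blue header
        cell.font = Font(bold=True, color="FFFFFF")
        cell.fill = PatternFill("solid", fgColor="4472C4")
    for page in pages:
        title = page.get("title") or ""
        ws.append([page.get("url"), page.get("status"), title, len(title)])
    ws.auto_filter.ref = ws.dimensions                  # enable column filters
    wb.save(path)
```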
A plain-text list of every URL found in the sitemap, written before crawling begins. Useful for resuming or debugging after a crash.
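The backup itself is just one URL per line, so saving and reloading it is trivial; a sketch of the round trip:

```python
def save_url_backup(urls, path):
    """Write one URL per line before the crawl starts."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.writelines(url + "\n" for url in urls)

def load_url_backup(path):
    """Reload the backup to resume without re-fetching every sitemap."""
    with open(path, encoding="utf-8") as fh:
        return [line.strip() for line in fh if line.strip()]
```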
| Column | Description |
|---|---|
| URL | Full page URL |
| Status | HTTP status code (200, 301, 404, etc.) |
| Response (ms) | Server response time in milliseconds |
| Redirect Chain | Intermediate redirect steps, e.g. `301→https://...` |
| Sitemap Source | Name of the sub-sitemap file this URL came from |
| Last Modified | `lastmod` value from the sitemap |
| Priority | `priority` value from the sitemap |
| Change Freq | `changefreq` value from the sitemap |
| Title | Page `<title>` text |
| Title Len | Character count of the title |
| Meta Description | Content of the `<meta name="description">` tag |
| Desc Len | Character count of the meta description |
| H1 Status | `OK`, `MISSING`, or `MULTIPLE (n)` |
| H1 Text | Text content of the H1 tag(s) |
| Word Count | Visible body word count (scripts/nav/footer excluded) |
| Thin Content | `Yes` if word count < 300, otherwise `OK` |
| Coming Soon | `⚠ YES` if a coming-soon pattern was detected |
| Coming Soon Text | Snippet of text that triggered the detection |
| Has Schema | Yes / No — JSON-LD structured data present |
| Schema Types | Comma-separated list of `@type` values found |
| Canonical Type | `self-referencing`, `points elsewhere`, or `MISSING` |
| Canonical URL | Value of the `<link rel="canonical">` tag |
| Robots Meta | Content of `<meta name="robots">` |
| Noindex | `Yes` if the robots meta contains `noindex` |
| Has Hreflang | Yes / No |
| Hreflang Langs | Comma-separated list of declared `hreflang` values |
| Meta Keywords | Content of `<meta name="keywords">` (legacy) |
| OG Title | `og:title` Open Graph value |
| OG Description | `og:description` Open Graph value |
| OG Image | `og:image` Open Graph value |
| Total Images | Number of `<img>` tags on the page |
| Missing Alt | Count of images with no or empty `alt` attribute |
| Internal Links | Links pointing to the same domain |
| External Links | Links pointing to other domains |
| URL Issues | `OK`, or a description of the problems found (spaces, uppercase, session ID) |
| Dup Title | `Dup` if this title appears on more than one page |
| Dup Meta Desc | `Dup` if this meta description appears on more than one page |
| H2s (first 5) | First 5 H2 headings, pipe-separated |
| Error | Crawl error message if the request failed |
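Duplicate flagging (the Dup Title / Dup Meta Desc columns) only needs a frequency count over all crawled rows; a sketch using hypothetical row dictionaries:

```python
from collections import Counter

def mark_duplicates(pages, field):
    """Flag every row whose `field` value appears on more than one page."""
    counts = Counter(p[field] for p in pages if p.get(field))
    for p in pages:
        value = p.get(field)
        p[f"dup_{field}"] = "Dup" if value and counts[value] > 1 else ""
    return pages
```

Empty values are skipped so that many pages with a missing title are not reported as duplicates of each other.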
| Severity | Colour | Triggered by |
|---|---|---|
| CRITICAL | 🔴 Red | Crawl errors, HTTP 4xx/5xx, missing H1, missing title, coming-soon pages |
| HIGH | 🟠 Orange | Multiple H1s, missing meta description, duplicate titles/metas, no canonical, no schema, URL issues, thin content, slow response (>3 s) |
| MEDIUM | 🟡 Yellow | Title too long (>60 chars) or too short (<30), meta description too long (>160) or too short (<70), noindex tag, missing alt text, no hreflang |
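A few of these rules can be sketched as a classifier over a crawled row; this covers only a subset of the triggers above, and the field names are assumptions, not the script's actual keys:

```python
def classify_issues(page):
    """Map one crawled row to (severity, message) pairs (partial sketch)."""
    issues = []
    if page.get("error") or (page.get("status") or 0) >= 400:
        issues.append(("CRITICAL", "crawl error or HTTP 4xx/5xx"))
    if not page.get("title"):
        issues.append(("CRITICAL", "missing title"))
    if not page.get("meta_description"):
        issues.append(("HIGH", "missing meta description"))
    title_len = len(page.get("title") or "")
    if title_len and not 30 <= title_len <= 60:
        issues.append(("MEDIUM", "title length outside 30-60 characters"))
    return issues
```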
- Coming-soon patterns — extend `COMING_SOON_PATTERNS` at the top of the file with your own keywords
- Thin content threshold — change the `< 300` word count check inside `crawl_url()`
- Request delay — adjust `DELAY` (seconds between requests per thread) to be more or less aggressive
- Timeout / retries — adjust `TIMEOUT` and `MAX_RETRIES` for slow or unreliable servers
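Taken together, these knobs are module-level settings; the values below are plausible defaults (only the 0.1 s delay and the 300-word threshold are documented above, the rest are assumptions):

```python
# Module-level tuning knobs; values are illustrative defaults, not
# necessarily what ships in sitemap_crawler.py.
COMING_SOON_PATTERNS = [
    "coming soon",
    "bientôt disponible",   # French variant for the bilingual detection
]
THIN_CONTENT_WORDS = 300    # lifting the inline `< 300` check into a constant
DELAY = 0.1                 # seconds each thread waits between requests
TIMEOUT = 10                # per-request timeout in seconds (assumed value)
MAX_RETRIES = 2             # retry budget for flaky servers (assumed value)
```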
MIT License — free to use, modify, and distribute.