🕷️ Sitemap SEO Crawler

A fast, threaded Python crawler that audits an entire website for SEO issues using its sitemap. It parses the sitemap index, crawls every URL in parallel, and produces a fully formatted 7-sheet Excel report covering titles, meta descriptions, canonicals, structured data, duplicate content, thin pages, coming-soon pages, URL issues, and more.


✨ Features

  • No config editing required — enter your site URL when prompted at runtime
  • Sitemap-driven — reads the sitemap index and all sub-sitemaps automatically
  • Multithreaded — crawls multiple URLs simultaneously for speed
  • Crash-safe — saves a plain-text URL backup before crawling begins
  • Bilingual detection — flags "coming soon" pages in both French and English
  • Duplicate detection — identifies pages sharing identical titles or meta descriptions
  • Rich Excel output — colour-coded, filterable, 7-sheet workbook
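The multithreaded crawl described above follows a standard worker-pool pattern. A minimal sketch of the idea — `crawl_url` here is a stand-in for the script's real per-page audit, and the `fetch` parameter is added only to make the sketch self-contained, not part of the actual code:

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.request import urlopen

def crawl_url(url, delay=0.1, timeout=30):
    """Fetch one page and return (url, status, elapsed_ms) — stand-in for the real audit."""
    start = time.monotonic()
    with urlopen(url, timeout=timeout) as resp:
        status = resp.status
        resp.read()
    time.sleep(delay)  # per-thread politeness delay
    return url, status, int((time.monotonic() - start) * 1000)

def crawl_all(urls, fetch=crawl_url, threads=5):
    """Crawl every URL in parallel, collecting successes and errors separately."""
    results, errors = [], []
    with ThreadPoolExecutor(max_workers=threads) as pool:
        futures = {pool.submit(fetch, u): u for u in urls}
        for fut in as_completed(futures):
            try:
                results.append(fut.result())
            except Exception as exc:
                errors.append((futures[fut], str(exc)))
    return results, errors
```

Collecting errors instead of raising is what lets a single dead URL show up as a CRITICAL row in the report rather than aborting the whole crawl.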

📋 Requirements

Python 3.8 or higher

Install all required packages with:

pip install requests beautifulsoup4 lxml openpyxl
| Package | Purpose |
| --- | --- |
| requests | HTTP requests and redirect chain tracking |
| beautifulsoup4 | HTML / XML parsing |
| lxml | Fast XML/HTML parser backend |
| openpyxl | Excel workbook generation and styling |

🚀 Step-by-Step Usage Guide

Step 1 — Clone or download the repository

git clone https://github.com/your-username/sitemap-seo-crawler.git
cd sitemap-seo-crawler

Step 2 — Install dependencies

pip install requests beautifulsoup4 lxml openpyxl

Step 3 — Run the crawler

python sitemap_crawler.py

The script will prompt you for a few inputs. Press Enter to accept the default value shown in brackets:

=== Sitemap SEO Crawler ===

Sitemap URL (e.g. https://example.com/sitemap.xml): https://your-site.com/sitemap.xml
Output Excel file [your_site_com_seo_crawl.xlsx]:
URL backup file  [your_site_com_urls.txt]:
Threads [5]:
| Prompt | Description | Default |
| --- | --- | --- |
| Sitemap URL | Full URL to your sitemap index | (required) |
| Output Excel file | Name of the generated .xlsx report | {domain}_seo_crawl.xlsx |
| URL backup file | Plain-text list of all discovered URLs | {domain}_urls.txt |
| Threads | Number of parallel crawl threads | 5 |
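Judging by the sample prompts, the `{domain}` slug in the defaults is derived from the sitemap URL's hostname. A plausible sketch — the helper name is hypothetical, not taken from the script:

```python
from urllib.parse import urlparse

def default_names(sitemap_url):
    """Derive default output filenames from the sitemap's hostname (hypothetical helper)."""
    host = urlparse(sitemap_url).netloc.lower()
    slug = host.replace("-", "_").replace(".", "_")
    return f"{slug}_seo_crawl.xlsx", f"{slug}_urls.txt"

# default_names("https://your-site.com/sitemap.xml")
# → ("your_site_com_seo_crawl.xlsx", "your_site_com_urls.txt")
```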

Step 4 — Watch the live progress

📋 Fetching sitemap index: https://your-site.com/sitemap.xml
  → Found 4 sub-sitemaps
  ✓ sitemap-posts.xml       →  1,240 URLs
  ✓ sitemap-pages.xml       →     88 URLs
  ...
✅ Total URLs discovered: 1,328

🕷  Starting crawl: 1,328 URLs | 5 threads | 0.1s delay
  [   100/1,328]   7.5% | errors: 0 | coming soon: 3
  ...
✅ Crawl complete: 1,328 pages | 2 errors | 5 coming soon pages

📊 Excel saved: your_site_com_seo_crawl.xlsx

Step 5 — Open the Excel report

The output .xlsx file will appear in the same folder as the script. Open it with Excel, LibreOffice Calc, or Google Sheets.


📊 Output Files

{domain}_seo_crawl.xlsx — Main report (7 sheets)

| Sheet | Contents |
| --- | --- |
| 📊 Summary | Headline KPIs: total pages, error count, missing titles/metas, thin content, coming-soon pages |
| 🔍 All Pages | One row per URL — all SEO fields (see full column list below) |
| 🚧 Coming Soon | Pages where "coming soon", "bientôt disponible", or similar patterns were detected |
| ⚠ Issues | Prioritised issue log (CRITICAL / HIGH / MEDIUM) for every problem found |
| 🔗 URL Issues | Pages with spaces, uppercase letters, or session IDs in their URLs |
| 🗺 Sitemap | Every URL from the sitemap with its lastmod, changefreq, priority, and HTTP status |
| 🔁 Duplicates | Grouped list of pages sharing the same title or meta description |

{domain}_urls.txt — URL backup

A plain-text list of every URL found in the sitemap, written before crawling begins. Useful for resuming or debugging after a crash.
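A sitemap index and its sub-sitemaps share the standard `<sitemapindex>` / `<urlset>` structure defined by the Sitemaps protocol. The script uses beautifulsoup4 with the lxml backend, but the same structure can be walked with the stdlib alone — a minimal sketch (fetching omitted):

```python
import xml.etree.ElementTree as ET

# Standard Sitemaps protocol namespace
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def parse_sitemap(xml_text):
    """Return ('index', [sub-sitemap URLs]) or ('urlset', [page URLs])."""
    root = ET.fromstring(xml_text)
    if root.tag.endswith("sitemapindex"):
        return "index", [loc.text.strip() for loc in root.iterfind("sm:sitemap/sm:loc", NS)]
    return "urlset", [loc.text.strip() for loc in root.iterfind("sm:url/sm:loc", NS)]
```

An index yields sub-sitemap URLs, each of which is fetched and parsed in turn to produce the final page list written to the backup file.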


📑 Excel Column Reference — All Pages sheet

| Column | Description |
| --- | --- |
| URL | Full page URL |
| Status | HTTP status code (200, 301, 404, etc.) |
| Response (ms) | Server response time in milliseconds |
| Redirect Chain | Intermediate redirect steps, e.g. 301→https://... |
| Sitemap Source | Name of the sub-sitemap file this URL came from |
| Last Modified | lastmod value from the sitemap |
| Priority | priority value from the sitemap |
| Change Freq | changefreq value from the sitemap |
| Title | Page <title> text |
| Title Len | Character count of the title |
| Meta Description | Content of the <meta name="description"> tag |
| Desc Len | Character count of the meta description |
| H1 Status | OK, MISSING, or MULTIPLE (n) |
| H1 Text | Text content of the H1 tag(s) |
| Word Count | Visible body word count (scripts/nav/footer excluded) |
| Thin Content | Yes if word count < 300, otherwise OK |
| Coming Soon | ⚠ YES if a coming-soon pattern was detected |
| Coming Soon Text | Snippet of text that triggered the detection |
| Has Schema | Yes / No — JSON-LD structured data present |
| Schema Types | Comma-separated list of @type values found |
| Canonical Type | self-referencing, points elsewhere, or MISSING |
| Canonical URL | Value of the <link rel="canonical"> tag |
| Robots Meta | Content of <meta name="robots"> |
| Noindex | Yes if the robots meta contains noindex |
| Has Hreflang | Yes / No |
| Hreflang Langs | Comma-separated list of declared hreflang values |
| Meta Keywords | Content of <meta name="keywords"> (legacy) |
| OG Title | og:title Open Graph value |
| OG Description | og:description Open Graph value |
| OG Image | og:image Open Graph value |
| Total Images | Number of <img> tags on the page |
| Missing Alt | Count of images with no or empty alt attribute |
| Internal Links | Links pointing to the same domain |
| External Links | Links pointing to other domains |
| URL Issues | OK, or description of problems (spaces, uppercase, session ID) |
| Dup Title | Dup if this title appears on more than one page |
| Dup Meta Desc | Dup if this meta description appears on more than one page |
| H2s (first 5) | First 5 H2 headings, pipe-separated |
| Error | Crawl error message if the request failed |
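The Dup Title and Dup Meta Desc columns amount to counting each value across all crawled pages and flagging any value seen more than once. A stdlib sketch of the idea — the field names are illustrative, not the script's internal keys:

```python
from collections import Counter

def flag_duplicates(pages, field):
    """Mark each page 'Dup' when its value for `field` appears on more than one page."""
    counts = Counter(p[field] for p in pages if p.get(field))
    return [
        {**p, f"dup_{field}": "Dup" if counts.get(p.get(field), 0) > 1 else ""}
        for p in pages
    ]
```

Pages with an empty value are skipped when counting, so two pages that both lack a title are not reported as duplicates of each other.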

⚠️ Issue Severity Reference

| Severity | Colour | Triggered by |
| --- | --- | --- |
| CRITICAL | 🔴 Red | Crawl errors, HTTP 4xx/5xx, missing H1, missing title, coming-soon pages |
| HIGH | 🟠 Orange | Multiple H1s, missing meta description, duplicate titles/metas, no canonical, no schema, URL issues, thin content, slow response (>3 s) |
| MEDIUM | 🟡 Yellow | Title too long (>60 chars) or too short (<30), meta description too long (>160) or too short (<70), noindex tag, missing alt text, no hreflang |
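The MEDIUM-severity length thresholds are simple range checks. A sketch mirroring the table above — the function name is hypothetical, not the script's actual helper:

```python
def length_issues(title, meta_desc):
    """Return MEDIUM-severity length issues per the thresholds in the table above."""
    issues = []
    if len(title) > 60:
        issues.append("Title too long (>60 chars)")
    elif len(title) < 30:
        issues.append("Title too short (<30 chars)")
    if len(meta_desc) > 160:
        issues.append("Meta description too long (>160 chars)")
    elif len(meta_desc) < 70:
        issues.append("Meta description too short (<70 chars)")
    return issues
```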

🔧 Customisation

  • Coming-soon patterns — extend COMING_SOON_PATTERNS at the top of the file with your own keywords
  • Thin content threshold — change the < 300 word count check inside crawl_url()
  • Request delay — adjust DELAY (seconds between requests per thread) to be more or less aggressive
  • Timeout / retries — adjust TIMEOUT and MAX_RETRIES for slow or unreliable servers
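Extending COMING_SOON_PATTERNS might look like this. The pattern list shown is illustrative — the script's actual patterns may differ — but it demonstrates the bilingual, case-insensitive matching the feature list describes:

```python
import re

# Illustrative bilingual patterns — the script's actual COMING_SOON_PATTERNS may differ
COMING_SOON_PATTERNS = [
    r"coming\s+soon",
    r"under\s+construction",
    r"bient[oô]t\s+disponible",
    r"en\s+construction",
]

def detect_coming_soon(text):
    """Return the first matching snippet (for the Coming Soon Text column), or None."""
    for pattern in COMING_SOON_PATTERNS:
        m = re.search(pattern, text, re.IGNORECASE)
        if m:
            return m.group(0)
    return None
```

Returning the matched snippet rather than a bare boolean is what lets the report show *which* phrase triggered the detection.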

📄 License

MIT License — free to use, modify, and distribute.
