When crawling a site with a large number of URLs, the process gets stuck during the duplication detection step and never reaches completed status.
What happens
- The crawl finishes normally ("No more URLs to crawl")
- The process starts "Running duplication detection..."
- It appears to hang at this stage and the UI remains in the "running" state
- The user has to manually stop the process to see the results
Cause
detect_duplication_issues in issue_detector.py compares every page with every other page using SequenceMatcher.
This is an O(n²) operation: n pages require n(n−1)/2 comparisons, so a crawl of 10,000 pages means roughly 50 million SequenceMatcher calls, which can take hours.
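A minimal sketch of the pattern described above (function name, signature, and threshold are assumptions for illustration; the real detect_duplication_issues may differ):

```python
from difflib import SequenceMatcher
from itertools import combinations

def detect_duplicate_pages(pages, threshold=0.9):
    """Pairwise near-duplicate detection (hypothetical sketch).

    `pages` maps URL -> page text. Every pair is compared, so the
    number of comparisons is n*(n-1)/2 -- quadratic in the page count,
    which is why large crawls stall at this step.
    """
    duplicates = []
    for (url_a, text_a), (url_b, text_b) in combinations(pages.items(), 2):
        ratio = SequenceMatcher(None, text_a, text_b).ratio()
        if ratio >= threshold:
            duplicates.append((url_a, url_b, ratio))
    return duplicates
```

Because each SequenceMatcher.ratio() call is itself expensive on long page bodies, the quadratic pair count compounds quickly; common mitigations are cheap pre-filters (content hashing, shingling/MinHash) so that full comparisons run only on candidate pairs.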