
Duplication detection is prohibitively slow on large crawls #45

@rypptc

Description

When crawling a site with a large number of URLs, the process gets stuck in the duplication detection step and never reaches the completed status.

What happens

  • The crawl itself finishes normally ("No more URLs to crawl")
  • The process then starts "Running duplication detection..."
  • It hangs at this stage and the UI stays in the running state
  • The user has to stop the process manually to see any results

Cause

detect_duplication_issues in issue_detector.py compares every page with every other page using SequenceMatcher.

This is an O(n²) operation: with n pages it performs n(n−1)/2 comparisons, so on large crawls that means millions of SequenceMatcher calls, which can take hours.
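A minimal sketch of the pattern described above. This is not the actual code from issue_detector.py; the function signature, the `pages` structure, and the `threshold` value are assumptions for illustration:

```python
from difflib import SequenceMatcher
from itertools import combinations

def detect_duplicates(pages, threshold=0.9):
    """Pairwise comparison of page bodies: n(n-1)/2 SequenceMatcher calls.

    pages: list of (url, text) tuples (assumed shape, for illustration).
    Each ratio() call is itself expensive on long texts, which is why
    this approach becomes prohibitively slow on large crawls.
    """
    duplicates = []
    # combinations(pages, 2) yields every unordered pair exactly once,
    # so the loop body runs n(n-1)/2 times: quadratic in the page count.
    for (url_a, text_a), (url_b, text_b) in combinations(pages, 2):
        ratio = SequenceMatcher(None, text_a, text_b).ratio()
        if ratio >= threshold:
            duplicates.append((url_a, url_b, ratio))
    return duplicates
```

With 10,000 pages this loop already runs roughly 50 million times, consistent with the hang reported above.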

Metadata

Assignees

No one assigned

Labels

No labels

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests