Changed to a persistent queue from a per-level queue, added MaxPages configuration option and tweaked final output #33
This change moves the concept of depth onto each queued URL, enabling continuous crawling. Previously, the crawl could only proceed as fast as the slowest URL at the current depth; for example, a single URL timing out after 30 seconds would block the rest of that level.
The per-level queue may have been a deliberate design choice, as the new approach has some clear downsides: potential queue bloat and higher RAM usage, plus a possible race condition in depth accuracy depending on how quickly individual URLs are processed. On the flip side, it dramatically improves performance on websites with slow pages.
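To illustrate the idea, here is a minimal sketch of a persistent queue whose items carry their own depth. The names (`crawler`, `queueItem`, `enqueue`, `next`) are illustrative assumptions, not the actual types or methods in this repo:

```go
package main

import (
	"fmt"
	"sync"
)

// queueItem carries its own depth, so no URL has to wait for its whole level.
type queueItem struct {
	url   string
	depth int
}

type crawler struct {
	mu       sync.Mutex
	queue    []queueItem
	seen     map[string]bool
	maxDepth int
}

// enqueue adds a URL at a given depth if it is new and within the depth limit.
func (c *crawler) enqueue(url string, depth int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if depth > c.maxDepth || c.seen[url] {
		return
	}
	c.seen[url] = true
	c.queue = append(c.queue, queueItem{url: url, depth: depth})
}

// next pops the next item; a slow URL only delays itself, not its whole level.
func (c *crawler) next() (queueItem, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if len(c.queue) == 0 {
		return queueItem{}, false
	}
	item := c.queue[0]
	c.queue = c.queue[1:]
	return item, true
}

func main() {
	c := &crawler{seen: map[string]bool{}, maxDepth: 3}
	c.enqueue("https://example.com/", 0)
	for item, ok := c.next(); ok; item, ok = c.next() {
		fmt.Printf("crawling %s at depth %d\n", item.url, item.depth)
		// fetch item.url here and enqueue discovered links at item.depth+1
	}
}
```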
Overview of changes:
- Replaced the per-level queue with a single persistent queue.
- Depth is now tracked on each queued URL rather than per crawl level.
- Tweaked the final output.
Edit:
I accidentally made these changes on master rather than on a branch; I can revert and post separate PRs if you'd prefer. I've also added a MaxPages configuration option and updated the readme to cover the new functionality. It works by checking the number of seen URLs in the merge method and skipping the merge once the seen count exceeds the MaxPages setting. This is handy for very large websites where you want to limit the scope of the crawl beyond the depth/include/exclude rules.
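A rough sketch of how that check could look, extending the hypothetical `crawler` sketch above with an assumed `maxPages` field (again, not the repo's actual merge method):

```go
// merge adds newly discovered URLs into the queue, but only while the number
// of seen URLs is still below the MaxPages limit.
// Assumes the crawler struct above gains a `maxPages int` field (0 = no limit).
func (c *crawler) merge(urls []string, depth int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for _, url := range urls {
		// Skip merging once the MaxPages limit has been reached.
		if c.maxPages > 0 && len(c.seen) >= c.maxPages {
			return
		}
		if c.seen[url] || depth > c.maxDepth {
			continue
		}
		c.seen[url] = true
		c.queue = append(c.queue, queueItem{url: url, depth: depth})
	}
}
```

Because the cap is applied when merging rather than when fetching, URLs already in the queue are still crawled; the limit only stops new pages from being added.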