Fast Scan Resume #183

@nsoft

The current fault tolerance achieves its goal, but when it resumes a very large scan it spends a period of time hashing documents only to determine that it has already seen them. We would like to provide a configuration option (via a method on the scanner's builder) to skip this and pick up where we left off without wasting as much CPU.

One possible route is to log the scanned IDs after we've reported status for the initial document, and then, on resume, load that log of IDs into a trie that can be used to check IDs directly without hashing document contents. (Hashing remains, and is still required for subsequent scans.) Completing a scan should clear the log, so that the trie is not built if the previous scan completed successfully.
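
To make the idea concrete, here is a rough sketch of what the trie lookup might look like. Everything in it is hypothetical: the class name `ScannedIdTrie`, its methods, and the assumed log format of one ID per line are illustrations only, not existing code in the project, and a real implementation would probably want a more memory-efficient node layout.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: a trie of document IDs rebuilt from the scan log on resume. */
class ScannedIdTrie {

    /** One node per character; 'terminal' marks the end of a complete ID. */
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean terminal;
    }

    private final Node root = new Node();
    private int size;

    /** Records an ID as already scanned. Adding the same ID twice is a no-op. */
    void add(String id) {
        Node node = root;
        for (int i = 0; i < id.length(); i++) {
            node = node.children.computeIfAbsent(id.charAt(i), c -> new Node());
        }
        if (!node.terminal) {
            node.terminal = true;
            size++;
        }
    }

    /** Returns true only if this exact ID was logged (a mere prefix does not match). */
    boolean contains(String id) {
        Node node = root;
        for (int i = 0; i < id.length(); i++) {
            node = node.children.get(id.charAt(i));
            if (node == null) {
                return false;
            }
        }
        return node.terminal;
    }

    /** Number of distinct IDs held in the trie. */
    int size() {
        return size;
    }

    /** Builds the trie from a log. Assumed format: one ID per line, blank lines ignored. */
    static ScannedIdTrie fromLog(Reader log) throws IOException {
        ScannedIdTrie trie = new ScannedIdTrie();
        try (BufferedReader reader = new BufferedReader(log)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (!line.isEmpty()) {
                    trie.add(line);
                }
            }
        }
        return trie;
    }
}
```

On resume the scanner would build this once from the log and call `contains(id)` before doing any hashing work. If the log is absent or empty (because the previous scan completed and cleared it), the trie is empty and every document takes the normal path.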
