USE-91-handle-empty-crawls #51
Conversation
Why these changes are being introduced:

* Empty crawls were exiting on an error code when this is an expected scenario that should be handled with a clean exit.

How this addresses that need:

* Add NoValidSeedsError exception
* Update harvest CLI command to exit on NoValidSeedsError
* Add command_exit to run after all CLI commands to account for the updated harvest CLI command
* Update _handle_subprocess_logging method to raise the NoValidSeedsError exception and add a try/except block for JSON decode errors
* Add corresponding CLI and unit tests

Side effects of this change:

* None

Relevant ticket(s):

* https://mitlibraries.atlassian.net/browse/USE-91
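To sketch the pattern these bullets describe, here is a minimal, hedged illustration; the JSON log-line format, the "No valid seeds" message text, and the helper names are assumptions, not the project's actual code:

```python
# Minimal sketch of the pattern described above; the JSON log-line format and
# the "No valid seeds" message text are assumptions, not actual crawler output.
import json
import sys


class NoValidSeedsError(Exception):
    """Raised when the crawler reports that no valid seeds were found."""


def handle_subprocess_log_line(line: str) -> None:
    """Parse one log line from the crawler subprocess (hypothetical helper)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        # Not every line a subprocess emits is valid JSON; note it and move on.
        print(f"non-JSON log line: {line.rstrip()}", file=sys.stderr)
        return
    if "No valid seeds" in record.get("message", ""):  # assumed message text
        raise NoValidSeedsError(record["message"])


def harvest(log_lines: list[str]) -> None:
    """CLI-style wrapper: treat an empty crawl as a clean exit, not an error."""
    try:
        for line in log_lines:
            handle_subprocess_log_line(line)
    except NoValidSeedsError as exc:
        print(f"Exiting cleanly: {exc}")
        sys.exit(0)
```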
```python
@main.result_callback()
@click.pass_context
def command_exit(ctx: click.Context, *_args: Any, **_kwargs: Any) -> None:  # noqa: ANN401
```
This was needed to ensure the elapsed time was still logged whether harvest exited on NoValidSeedsError or the crawl was actually performed.
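For context, here is a self-contained sketch of how this result-callback pattern fits together; the `ping` subcommand and logging setup are illustrative additions, not from this PR:

```python
# Self-contained sketch of the result_callback pattern shown above.
import logging
import time
from datetime import timedelta
from typing import Any

import click

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


@click.group()
@click.pass_context
def main(ctx: click.Context) -> None:
    ctx.ensure_object(dict)
    ctx.obj["start_time"] = time.perf_counter()


@main.result_callback()
@click.pass_context
def command_exit(ctx: click.Context, *_args: Any, **_kwargs: Any) -> None:
    # Runs after the invoked subcommand returns, so elapsed time is logged
    # whether the crawl ran or the command returned early on NoValidSeedsError.
    elapsed_time = time.perf_counter() - ctx.obj["start_time"]
    logger.info("Total elapsed: %s", timedelta(seconds=elapsed_time))


@main.command()
def ping() -> None:
    logger.info("pong")


if __name__ == "__main__":
    main()
```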
I like it! I'm reflecting, as I work on another CLI at the moment, on how many different ways there are to do things with click.
In that one, here is a snippet of the main group:
```python
@click.pass_context
def main(
    ctx: click.Context,
    *,
    verbose: bool,
) -> None:
    ctx.ensure_object(dict)
    ctx.obj["start_time"] = time.perf_counter()

    root_logger = logging.getLogger()
    logger.info(configure_logger(root_logger, verbose=verbose))
    logger.info(configure_sentry())
    logger.info("Running process")

    def _log_command_elapsed_time() -> None:  # <-- closure over ctx
        elapsed_time = time.perf_counter() - ctx.obj["start_time"]
        logger.info(
            "Total time to complete process: %s", str(timedelta(seconds=elapsed_time))
        )

    ctx.call_on_close(_log_command_elapsed_time)  # <-- runs after the command completes
```

They have the same effect of calling something after the command has completed. We could probably spend some time analyzing their pros and cons, similarities and differences, idiosyncrasies... but they both seem to work.
FWIW, I like the @main.result_callback() pattern you have used here, and am betting it's more idiomatic click. I may update timdex-embeddings to do that!
Nice work.
ghukill left a comment:
I had typed up an approval review, then deleted it, wanting to explore some edge cases locally... and have come full circle to an enthusiastic approve.
I really like the surgical parsing of crawler logs, identifying the specific error of no seeds, raising a custom exception of NoValidSeedsError, and then catching that custom exception at the CLI level. No room for ambiguity there.
This would work equally well for a crawl driven by repeating --sitemap or a more "organic" crawl with seed URLs in the YAML file.
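To make that clean exit concrete, here is a toy, self-contained sketch of catching a custom exception at the CLI level and asserting the exit code with click's CliRunner; the names are illustrative, not the project's actual code:

```python
# Toy illustration of catching a custom exception at the CLI level and
# testing the clean exit; names are illustrative, not project code.
import click
from click.testing import CliRunner


class NoValidSeedsError(Exception):
    pass


@click.command()
def harvest() -> None:
    try:
        raise NoValidSeedsError("no valid seeds found")  # stand-in for the crawl
    except NoValidSeedsError as exc:
        click.echo(f"No valid seeds, exiting cleanly: {exc}")


def test_harvest_exits_cleanly() -> None:
    result = CliRunner().invoke(harvest)
    assert result.exit_code == 0
    assert "exiting cleanly" in result.output
```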
There is a scenario where the harvester can still exit abruptly, when there are seeds / URLs to crawl, but the overall crawl fails to successfully crawl any. In this scenario, we'll see an error like this:
```
...
...
  File "/browsertrix-harvester/harvester/cli.py", line 315, in harvest
    crawl_metadata_records = parser.generate_metadata(
                             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/browsertrix-harvester/harvester/metadata.py", line 170, in generate_metadata
    websites_metadata_df = self._remove_duplicate_urls(websites_metadata_df)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/browsertrix-harvester/harvester/metadata.py", line 304, in _remove_duplicate_urls
    websites_metadata_df.sort_values("cdx_offset")
...
...
KeyError: 'cdx_offset'
```
This is worth following up on, but I think it is out of scope for this ticket + PR. That work should be more of a safety net: not trying to understand why the crawl resulted in nothing, but reporting and handling it. Maybe we find that sometimes it's okay, and we want to exit gracefully. But we might find that getting to that point and still not having URLs in the CDX file(s) is worth bubbling up an error. Unknown at this time.
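One possible shape for that safety net, as a sketch only: it assumes pandas DataFrames as in the traceback, and the EmptyCrawlError name and the "url" dedupe key are assumptions, not the project's actual code.

```python
# Sketch of a defensive check before the sort that raised KeyError above.
import pandas as pd


class EmptyCrawlError(Exception):
    """Raised when a crawl completed but produced no usable CDX records."""


def remove_duplicate_urls(websites_metadata_df: pd.DataFrame) -> pd.DataFrame:
    # Guard against the KeyError: 'cdx_offset' seen when the crawl yields nothing.
    if websites_metadata_df.empty or "cdx_offset" not in websites_metadata_df.columns:
        raise EmptyCrawlError("Crawl produced no CDX records; cannot deduplicate URLs.")
    return websites_metadata_df.sort_values("cdx_offset").drop_duplicates(
        subset="url", keep="last"  # assumed dedupe key
    )
```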
Nice work on this!! Really like the approach, and appreciate the scenarios in the PR; with web crawls, it can be a little tricky to recreate crawl scenarios.
Purpose and background context
Update the harvest command to exit gracefully when no valid seeds are discovered.

How can a reviewer manually see the effects of these changes?

command_exit:

Includes new or updated dependencies?
NO
Changes expectations for external applications?
NO
What are the relevant tickets?

USE-91: https://mitlibraries.atlassian.net/browse/USE-91
Code review