
Conversation

@ehanson8
Contributor

Purpose and background context

Update harvest command to exit gracefully when no valid seeds are discovered.

How can a reviewer manually see the effects of these changes?

  1. Build image:
make dist-dev
  2. Set env var:
export CRAWL_NAME="use-91"
  3. Run a crawl with no valid seeds:
docker run -it -v $(PWD)/output/crawls:/crawls browsertrix-harvester-dev:latest \
    --verbose \
    harvest \
    --crawl-name="${CRAWL_NAME}" \
    --config-yaml-file="/crawls/configs/mitlibwebsite.yaml" \
    --metadata-output-file="/crawls/collections/${CRAWL_NAME}/${CRAWL_NAME}-extracted-records-to-index.jsonl" \
    --num-workers 16 \
    --include-fulltext \
    --sitemap=https://libraries.mit.edu/sitemap.xml \
    --sitemap-from-date="2025-12-31" \
    --sitemap-to-date="2025-12-31" \
    --btrix-args-json='{}'
  4. Run a crawl with seeds to see that the elapsed time is still logged after being shifted to command_exit:
docker run -it -v $(PWD)/output/crawls:/crawls browsertrix-harvester-dev:latest \
    --verbose \
    harvest \
    --crawl-name="${CRAWL_NAME}" \
    --config-yaml-file="/crawls/configs/mitlibwebsite.yaml" \
    --metadata-output-file="/crawls/collections/${CRAWL_NAME}/${CRAWL_NAME}-extracted-records-to-index.jsonl" \
    --num-workers 16 \
    --include-fulltext \
    --sitemap=https://libraries.mit.edu/sitemap.xml \
    --sitemap-from-date="2025-10-29" \
    --btrix-args-json='{}'

Includes new or updated dependencies?

NO

Changes expectations for external applications?

NO

What are the relevant tickets?

Code review

  • Code review best practices are documented here and you are encouraged to have a constructive dialogue with your reviewers about their preferences and expectations.

Why these changes are being introduced:
* Empty crawls were exiting with an error code, but this is an expected scenario that should be handled with a clean exit.

How this addresses that need:
* Add NoValidSeedsError exception
* Update harvest CLI command to exit cleanly on NoValidSeedsError (see the sketch after this list)
* Add command_exit to run after all CLI commands to account for updated harvest CLI command
* Update _handle_subprocess_logging method to raise NoValidSeedsError exception and add a try/except block for JSON decode errors
* Add corresponding CLI and unit tests
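
For reviewers who want a quick mental model, here is a rough, standalone sketch of the pattern described above. It is not the actual implementation: the crawler message text, JSON field names, and the body of harvest below are assumptions for illustration only.

import json
import logging

logger = logging.getLogger(__name__)


class NoValidSeedsError(Exception):
    """Raised when the crawler reports that no valid seeds were discovered."""


def _handle_subprocess_logging(log_line: str) -> None:
    """Parse one crawler log line, raising NoValidSeedsError on the no-seeds message."""
    try:
        parsed = json.loads(log_line)
    except json.JSONDecodeError:
        # Not every line the crawler emits is JSON; log it verbatim and move on.
        logger.debug("Non-JSON crawler output: %s", log_line)
        return
    message = parsed.get("message", "")
    if "No valid seeds" in message:  # assumed wording of the crawler error
        raise NoValidSeedsError(message)
    logger.info(message)


def harvest(crawler_log_lines: list[str]) -> None:
    """Hypothetical harvest body: catch the custom exception and return cleanly."""
    try:
        for line in crawler_log_lines:
            _handle_subprocess_logging(line)
    except NoValidSeedsError:
        logger.warning("No valid seeds found, nothing to crawl; exiting cleanly.")
        return
    # ...continue with WACZ parsing and metadata generation...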

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/USE-91
@ehanson8 ehanson8 requested a review from a team as a code owner October 31, 2025 13:52
Comment on lines +334 to +337

@main.result_callback()
@click.pass_context
def command_exit(ctx: click.Context, *_args: Any, **_kwargs: Any) -> None: # noqa: ANN401
@ehanson8 (Contributor, Author)

This was needed to ensure the elapsed time was still logged whether harvest exited on NoValidSeedsError or the crawl was actually performed.
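
Roughly, the pattern looks like this (a minimal standalone sketch, not the exact code from this PR; the ctx.obj key, the no-op harvest subcommand, and the log wording are assumptions):

import logging
import time
from datetime import timedelta
from typing import Any

import click

logger = logging.getLogger(__name__)


@click.group()
@click.pass_context
def main(ctx: click.Context) -> None:
    ctx.ensure_object(dict)
    ctx.obj["start_time"] = time.perf_counter()  # assumed storage location


@main.result_callback()
@click.pass_context
def command_exit(ctx: click.Context, *_args: Any, **_kwargs: Any) -> None:  # noqa: ANN401
    # Runs after any subcommand returns, so elapsed time is logged whether
    # harvest completed a crawl or returned early on NoValidSeedsError.
    elapsed = time.perf_counter() - ctx.obj["start_time"]
    logger.info("Total time elapsed: %s", timedelta(seconds=elapsed))


@main.command()
def harvest() -> None:
    """No-op stand-in for the real harvest command."""
    logger.info("Running harvest")


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    main()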

@ghukill (Contributor)

I like it! I'm reflecting, as I work on another CLI at the moment, on how many different ways there are to do things with click.

In that one, here is a snippet of the main group:

@click.pass_context
def main(
    ctx: click.Context,
    *,
    verbose: bool,
) -> None:
    ctx.ensure_object(dict)
    ctx.obj["start_time"] = time.perf_counter()

    root_logger = logging.getLogger()
    logger.info(configure_logger(root_logger, verbose=verbose))
    logger.info(configure_sentry())
    logger.info("Running process")

    def _log_command_elapsed_time() -> None:   #<-------------------
        elapsed_time = time.perf_counter() - ctx.obj["start_time"]
        logger.info(
            "Total time to complete process: %s", str(timedelta(seconds=elapsed_time))
        )

    ctx.call_on_close(_log_command_elapsed_time)  #<------------------------

They have the same effect of calling something after the command has completed. We could probably spend some time analyzing their pros and cons, similarities and differences, idiosyncrasies... but they both seem to work.

FWIW, I like the @main.result_callback() pattern you have used here, and am betting it's more idiomatic click. I may update timdex-embeddings to do that!

Nice work.

@ghukill ghukill self-assigned this Oct 31, 2025
@ghukill ghukill left a comment (Contributor)

I had typed up an approval review, then deleted it, wanting to explore some edge cases locally... and have come full circle to an enthusiastic approve.

I really like the surgical parsing of crawler logs, identifying the specific error of no seeds, raising a custom exception of NoValidSeedsError, and then catching that custom exception at the CLI level. No room for ambiguity there.

This would work equally well for a crawl driven by repeating --sitemap or a more "organic" crawl with seed URLs in the YAML file.

There is a scenario where the harvester can still exit abruptly: when there are seeds/URLs to crawl, but the crawl fails to successfully capture any of them. In that scenario, we'll see an error like this:

...
...
  File "/browsertrix-harvester/harvester/cli.py", line 315, in harvest
    crawl_metadata_records = parser.generate_metadata(
                             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/browsertrix-harvester/harvester/metadata.py", line 170, in generate_metadata
    websites_metadata_df = self._remove_duplicate_urls(websites_metadata_df)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/browsertrix-harvester/harvester/metadata.py", line 304, in _remove_duplicate_urls
    websites_metadata_df.sort_values("cdx_offset")
...
...
KeyError: 'cdx_offset'

This is worth following up on, but I think it is out of scope for this ticket + PR. That work should be a bit more of a safety net: not trying to understand why the crawl resulted in nothing, but reporting and handling that. Maybe we find that sometimes it's okay, and we want to exit gracefully. But we might find that getting to that point and still not having URLs in the CDX file(s) is worth bubbling up an error. Unknown at this time.
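
If we do pick that up later, it could be as simple as a guard before the sort. A purely illustrative sketch (NoCrawledUrlsError and the dedup column names are made up here, not taken from the codebase):

import pandas as pd


class NoCrawledUrlsError(Exception):
    """Hypothetical: raised when the crawl captured no URLs to build metadata from."""


def _remove_duplicate_urls(websites_metadata_df: pd.DataFrame) -> pd.DataFrame:
    # An empty crawl yields a DataFrame without the 'cdx_offset' column, which
    # currently surfaces as a bare KeyError; report it explicitly instead.
    if websites_metadata_df.empty or "cdx_offset" not in websites_metadata_df.columns:
        raise NoCrawledUrlsError(
            "Crawl completed but no URLs were captured; cannot generate metadata."
        )
    return (
        websites_metadata_df.sort_values("cdx_offset")
        .drop_duplicates(subset="url", keep="last")  # assumed dedup key
        .reset_index(drop=True)
    )

That keeps the failure mode explicit, without deciding yet whether it should be fatal or another graceful exit.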

Nice work on this!! Really like the approach, and appreciate the scenarios in the PR; with web crawls, it can be a little tricky to recreate crawl scenarios.

@ehanson8 ehanson8 merged commit 1e61e73 into main Oct 31, 2025
4 checks passed
@ehanson8 ehanson8 deleted the USE-91-handle-empty-crawls branch October 31, 2025 18:32