
Conversation

@ehanson8
Contributor

Purpose and background context

Update harvest command to exit gracefully when no valid seeds are discovered.

How can a reviewer manually see the effects of these changes?

  1. Build image:
make dist-dev
  2. Set env var:
export CRAWL_NAME="use-91"
  3. Run a crawl with no valid seeds:
docker run -it -v $(PWD)/output/crawls:/crawls browsertrix-harvester-dev:latest \
    --verbose \
    harvest \
    --crawl-name="${CRAWL_NAME}" \
    --config-yaml-file="/crawls/configs/mitlibwebsite.yaml" \
    --metadata-output-file="/crawls/collections/${CRAWL_NAME}/${CRAWL_NAME}-extracted-records-to-index.jsonl" \
    --num-workers 16 \
    --include-fulltext \
    --sitemap=https://libraries.mit.edu/sitemap.xml \
    --sitemap-from-date="2025-12-31" \
    --sitemap-to-date="2025-12-31" \
    --btrix-args-json='{}'
  4. Run a crawl with seeds to see that the elapsed time is still logged after being shifted to command_exit:
docker run -it -v $(PWD)/output/crawls:/crawls browsertrix-harvester-dev:latest \
    --verbose \
    harvest \
    --crawl-name="${CRAWL_NAME}" \
    --config-yaml-file="/crawls/configs/mitlibwebsite.yaml" \
    --metadata-output-file="/crawls/collections/${CRAWL_NAME}/${CRAWL_NAME}-extracted-records-to-index.jsonl" \
    --num-workers 16 \
    --include-fulltext \
    --sitemap=https://libraries.mit.edu/sitemap.xml \
    --sitemap-from-date="2025-10-29" \
    --btrix-args-json='{}'

Includes new or updated dependencies?

NO

Changes expectations for external applications?

NO

What are the relevant tickets?

Code review

  • Code review best practices are documented here and you are encouraged to have a constructive dialogue with your reviewers about their preferences and expectations.

Why these changes are being introduced:
* Empty crawls were exiting with an error code, but this is an expected scenario that should be handled with a clean exit.

How this addresses that need:
* Add NoValidSeedsError exception
* Update harvest CLI command to exit cleanly on NoValidSeedsError (see the sketch after this list)
* Add command_exit to run after all CLI commands to account for updated harvest CLI command
* Update _handle_subprocess_logging method to raise NoValidSeedsError exception and add a try/except block for JSON decode errors
* Add corresponding CLI and unit tests
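
For reviewers who want a quick mental model, here is a rough, standalone sketch of the pattern described above. It is not the actual implementation: the crawler message text, JSON field names, and the body of harvest below are assumptions for illustration only.

import json
import logging

logger = logging.getLogger(__name__)


class NoValidSeedsError(Exception):
    """Raised when the crawler reports that no valid seeds were discovered."""


def _handle_subprocess_logging(log_line: str) -> None:
    """Parse one crawler log line, raising NoValidSeedsError on the no-seeds message."""
    try:
        parsed = json.loads(log_line)
    except json.JSONDecodeError:
        # Not every line the crawler emits is JSON; log it verbatim and move on.
        logger.debug("Non-JSON crawler output: %s", log_line)
        return
    message = parsed.get("message", "")
    if "No valid seeds" in message:  # assumed wording of the crawler error
        raise NoValidSeedsError(message)
    logger.info(message)


def harvest(crawler_log_lines: list[str]) -> None:
    """Hypothetical harvest body: catch the custom exception and return cleanly."""
    try:
        for line in crawler_log_lines:
            _handle_subprocess_logging(line)
    except NoValidSeedsError:
        logger.warning("No valid seeds found, nothing to crawl; exiting cleanly.")
        return
    # ...continue with WACZ parsing and metadata generation...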

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/USE-91
@ehanson8 ehanson8 requested a review from a team as a code owner October 31, 2025 13:52
Comment on lines +334 to +337

@main.result_callback()
@click.pass_context
def command_exit(ctx: click.Context, *_args: Any, **_kwargs: Any) -> None: # noqa: ANN401
@ehanson8 (Contributor, Author)

This was needed to ensure the elapsed time was still logged whether harvest exited on NoValidSeedsError or the crawl was actually performed.
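
Roughly, the pattern looks like this (a minimal standalone sketch, not the exact code from this PR; the ctx.obj key, the no-op harvest subcommand, and the log wording are assumptions):

import logging
import time
from datetime import timedelta
from typing import Any

import click

logger = logging.getLogger(__name__)


@click.group()
@click.pass_context
def main(ctx: click.Context) -> None:
    ctx.ensure_object(dict)
    ctx.obj["start_time"] = time.perf_counter()  # assumed storage location


@main.result_callback()
@click.pass_context
def command_exit(ctx: click.Context, *_args: Any, **_kwargs: Any) -> None:  # noqa: ANN401
    # Runs after any subcommand returns, so elapsed time is logged whether
    # harvest completed a crawl or returned early on NoValidSeedsError.
    elapsed = time.perf_counter() - ctx.obj["start_time"]
    logger.info("Total time elapsed: %s", timedelta(seconds=elapsed))


@main.command()
def harvest() -> None:
    """No-op stand-in for the real harvest command."""
    logger.info("Running harvest")


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    main()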

@ghukill (Contributor)

I like it! I'm reflecting, as I work on another CLI at the moment, on how many different ways there are to do things with click.

In that one, here is a snippet of the main group:

@click.pass_context
def main(
    ctx: click.Context,
    *,
    verbose: bool,
) -> None:
    ctx.ensure_object(dict)
    ctx.obj["start_time"] = time.perf_counter()

    root_logger = logging.getLogger()
    logger.info(configure_logger(root_logger, verbose=verbose))
    logger.info(configure_sentry())
    logger.info("Running process")

    def _log_command_elapsed_time() -> None:   #<-------------------
        elapsed_time = time.perf_counter() - ctx.obj["start_time"]
        logger.info(
            "Total time to complete process: %s", str(timedelta(seconds=elapsed_time))
        )

    ctx.call_on_close(_log_command_elapsed_time)  #<------------------------

They have the same effect of calling something after the command has completed. We could probably spend some time analyzing their pros and cons, similarities and differences, idiosyncrasies... but they both seem to work.

FWIW, I like the @main.result_callback() pattern you have used here, and am betting it's more idiomatic click. I may update timdex-embeddings to do that!

Nice work.

@ghukill ghukill self-assigned this Oct 31, 2025
@ghukill ghukill left a comment (Contributor)

I had typed up an approval review, then deleted it, wanting to explore some edge cases locally... and have come full circle to an enthusiastic approve.

I really like the surgical parsing of crawler logs, identifying the specific error of no seeds, raising a custom exception of NoValidSeedsError, and then catching that custom exception at the CLI level. No room for ambiguity there.

This would work equally well for a crawl driven by repeating --sitemap or a more "organic" crawl with seed URLs in the YAML file.

There is a scenario where the harvester can still exit abruptly: when there are seeds/URLs to crawl, but the crawl fails to successfully capture any of them. In that scenario, we'll see an error like this:

...
...
  File "/browsertrix-harvester/harvester/cli.py", line 315, in harvest
    crawl_metadata_records = parser.generate_metadata(
                             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/browsertrix-harvester/harvester/metadata.py", line 170, in generate_metadata
    websites_metadata_df = self._remove_duplicate_urls(websites_metadata_df)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/browsertrix-harvester/harvester/metadata.py", line 304, in _remove_duplicate_urls
    websites_metadata_df.sort_values("cdx_offset")
...
...
KeyError: 'cdx_offset'

This is worth following up on, but I think it is out of scope for this ticket + PR. That work should be a bit more of a safety net: not trying to understand why the crawl resulted in nothing, but reporting and handling that. Maybe we find that sometimes it's okay, and we want to exit gracefully. But we might find that getting to that point and still not having URLs in the CDX file(s) is worth bubbling up an error. Unknown at this time.
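
If we do pick that up later, it could be as simple as a guard before the sort. A purely illustrative sketch (NoCrawledUrlsError and the dedup column names are made up here, not taken from the codebase):

import pandas as pd


class NoCrawledUrlsError(Exception):
    """Hypothetical: raised when the crawl captured no URLs to build metadata from."""


def _remove_duplicate_urls(websites_metadata_df: pd.DataFrame) -> pd.DataFrame:
    # An empty crawl yields a DataFrame without the 'cdx_offset' column, which
    # currently surfaces as a bare KeyError; report it explicitly instead.
    if websites_metadata_df.empty or "cdx_offset" not in websites_metadata_df.columns:
        raise NoCrawledUrlsError(
            "Crawl completed but no URLs were captured; cannot generate metadata."
        )
    return (
        websites_metadata_df.sort_values("cdx_offset")
        .drop_duplicates(subset="url", keep="last")  # assumed dedup key
        .reset_index(drop=True)
    )

That keeps the failure mode explicit, without deciding yet whether it should be fatal or another graceful exit.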

Nice work on this!! Really like the approach, and appreciate the scenarios in the PR; with web crawls, it can be a little tricky to recreate crawl scenarios.

@ehanson8 ehanson8 merged commit 1e61e73 into main Oct 31, 2025
4 checks passed
@ehanson8 ehanson8 deleted the USE-91-handle-empty-crawls branch October 31, 2025 18:32