Conversation

@ghukill ghukill commented Dec 9, 2025

Purpose and background context

NOTE: The following came directly from the git commit.


Why these changes are being introduced:

A decision was made to pivot the output of this harvester from "metadata records" to just "records". Due to the highly varying structure of websites, any attempt to parse metadata about a website -- e.g. using OpenGraph or Dublin Core tags in the head section -- was inherently opinionated toward the particular sites getting crawled. This was leaning into an unsustainable pattern: whenever we encountered new websites with metadata in different places, we'd have to hardcode that metadata extraction into the harvester so it would be available downstream.

Instead of the harvester returning metadata records about the websites captured in the crawl, the harvester now returns the full HTML for each website as part of the "record" that is written as output. This HTML can then be used by downstream applications like Transmogrifier, which are designed to be opinionated about a particular source, to extract metadata.

The implicit proposal here is that parsing metadata from messy HTML is different from parsing metadata from structured data like an EAD or METS file, but it's still the work of extracting TIMDEX metadata from some kind of source data.

How this addresses that need:

There are three major changes here:

  1. Change "metadata" language to "records" throughout the codebase. This is needed to convey that we aren't really creating "metadata" records per se, but instead returning structured "records" with the HTML content of each website. The harvester is still doing important work on top of the raw web crawl, just not parsing metadata.

  2. Include the full HTML of the website in the output record. Because HTML can have all kinds of problematic characters, we base64-encode it to ASCII and store it in a field called html_base64 (see the sketch just after this list).

  3. We remove any CLI, class, or method arguments around including fulltext or extracting keywords. Given that we are now including the full HTML in the output, which contains the fulltext of the website in a form we have more control over parsing, we don't need the harvester to extract fulltext as part of the record. This also fully completes the removal of keyword extraction, which was somewhat experimental and will likely be improved upon by embeddings.
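
As a minimal sketch of the encoding described in point 2 (hypothetical values, standard library only):

import base64

# hypothetical raw HTML bytes as retrieved from the crawl's WACZ archive
html_bytes = b"<html><head><title>MIT Libraries</title></head><body>...</body></html>"

# encode to a plain ASCII string that is safe to embed in a JSONLines field
html_base64 = base64.b64encode(html_bytes).decode("ascii")

# the round trip recovers the original HTML exactly
assert base64.b64decode(html_base64) == html_bytes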


90% of the code churn is the renaming of "metadata" --> "record", which is probably a bit difficult to check line-by-line, but an important change, I think.

How can a reviewer manually see the effects of these changes?

1- Create configuration YAML locally at output/crawls/configs/mitlibwebsite.yaml:

# General
generateCDX: true
generateWACZ: true
logExcludeContext: "recorder,pageStatus"
text: to-pages
timeout: 15
userAgentSuffix: "TIMDEXBot"

# Performance
workers: 16 # aim for x2 number of CPU cores

# Seeds and Scoping
# NOTE: It is expected that the browsertrix-harvester will be called with multiple
# --sitemap arguments that augment any seeds defined below.
seeds:
  - url: https://libguides.mit.edu/directory
    scopeType: "custom"
    include:
      - libguides.mit.edu/.*
    depth: 1
    limit: 50
    exclude:
      - ".*az\\.php.*"

# Prevent PAGES from getting crawled; scoping (regex)
exclude:
  - ".*lib\\.mit\\.edu/search/.*"
  - ".*www-ux\\.libraries\\.mit\\.edu.*"
  - ".*mit\\.primo\\.exlibrisgroup\\.com/.*"
  - ".*libraries\\.mit\\.edu/app/uploads.*"
  - ".*libraries\\.mit\\.edu/hours.*"

  # Exclude media files from being crawled as pages
  - ".*\\.mp3$"
  - ".*\\.mp4$"
  - ".*\\.wav$"
  - ".*\\.ogg$"
  - ".*\\.m4a$"
  - ".*\\.avi$"
  - ".*\\.mov$"
  - ".*\\.webm$"
  - ".*\\.mkv$"
  - ".*\\.flac$"
  - ".*\\.wma$"

  # Exclude image files from being crawled as pages
  - ".*\\.jpg$"
  - ".*\\.jpeg$"
  - ".*\\.png$"
  - ".*\\.gif$"
  - ".*\\.webp$"
  - ".*\\.svg$"
  - ".*\\.bmp$"
  - ".*\\.tiff$"
  - ".*\\.ico$"

  # Exclude other files
  - ".*\\.zip$"
  - ".*\\.tar$"
  - ".*\\.gz$"

# Prevent RESOURCES / ASSETS from getting retrieved (regex)
# Aggressive blocking to save ~20GB+ of media
blockRules:
  # Block ALL video domains - saves 14.3 GB (63.9%)
  - url: ".*googlevideo\\.com.*"
  - url: ".*youtube\\.com.*"
  - url: ".*vimeo\\.com.*"

  # Block video files anywhere in URL - saves 14.3 GB
  - url: ".*\\.mp4"
  - url: ".*\\.webm"
  - url: ".*\\.mov"
  - url: ".*\\.avi"
  - url: ".*\\.m4v"
  - url: ".*\\.mkv"
  - url: ".*\\.flv"
  - url: ".*\\.wmv"

  # Block ALL audio from cdn.libraries.mit.edu EXCEPT certain paths
  # Block the entire CDN domain for media/dissemination
  - url: "cdn\\.libraries\\.mit\\.edu/media"
  - url: "cdn\\.libraries\\.mit\\.edu/dissemination"

  # Block audio files ANYWHERE in URL - saves 6.4 GB (28.6%)
  # Pattern matches .mp3 with optional trailing content (query params, fragments, or nothing)
  - url: "\\.mp3"
  - url: "\\.wav"
  - url: "\\.ogg"
  - url: "\\.m4a"
  - url: "\\.aac"
  - url: "\\.flac"
  - url: "\\.wma"

  # Block image files ANYWHERE in URL - saves 1.3 GB (5.8%)
  - url: ".*\\.jpg"
  - url: ".*\\.jpeg"
  - url: ".*\\.png"
  - url: ".*\\.gif"
  - url: ".*\\.webp"
  - url: ".*\\.svg"
  - url: ".*\\.bmp"
  - url: ".*\\.tiff"
  - url: ".*\\.ico"

  # Block PDFs - saves 63 MB (0.3%)
  - url: ".*\\.pdf"

  # Block fonts - saves ~2 MB
  - url: ".*\\.woff2?"
  - url: ".*\\.ttf"
  - url: ".*\\.otf"
  - url: ".*\\.eot"

  # Block other large media/binary files
  - url: ".*\\.zip"
  - url: ".*\\.tar"
  - url: ".*\\.gz"
  - url: ".*\\.rar"
  - url: ".*\\.7z"
  - url: ".*\\.exe"
  - url: ".*\\.dmg"
  - url: ".*\\.iso"

# Browser settings to aggressively prevent media loading
browserArgs:
  - "--disable-images"
  - "--blink-settings=imagesEnabled=false"
  - "--autoplay-policy=document-user-activation-required"
  - "--disable-features=AudioServiceOutOfProcess"
  - "--disable-dev-shm-usage"
  - "--no-sandbox"

# Disable all behaviors to prevent media interaction
behaviors: []
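
Before building the image, a quick sanity check that the YAML parses and the regex patterns compile can save a restarted crawl. A minimal sketch, assuming PyYAML is installed and the file was saved at the path from step 1:

import re

import yaml

# path from step 1 (adjust if saved elsewhere)
with open("output/crawls/configs/mitlibwebsite.yaml") as f:
    config = yaml.safe_load(f)

# confirm the page-level excludes and resource-level block rules are valid regexes
exclude_patterns = config.get("exclude", [])
block_rules = config.get("blockRules", [])
for pattern in exclude_patterns:
    re.compile(pattern)
for rule in block_rules:
    re.compile(rule["url"])

print(f"{len(exclude_patterns)} exclude patterns and {len(block_rules)} block rules compile")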

2- Update local docker image:

make docker-build

3- Run harvest:

export CRAWL_NAME="mitlibwebsite"
docker run -it \
-v $(pwd)/output/crawls:/crawls \
browsertrix-harvester-dev \
--verbose \
harvest \
--crawl-name="${CRAWL_NAME}" \
--config-yaml-file="/crawls/configs/mitlibwebsite.yaml" \
--records-output-file="/crawls/collections/${CRAWL_NAME}/${CRAWL_NAME}-extracted-records-to-index.jsonl" \
--num-workers 8

Note some changes in the CLI args:

  • --metadata-output-file --> --records-output-file

4- Analyze the results located at output/crawls/collections/mitlibwebsite/mitlibwebsite-extracted-records-to-index.jsonl

Depending on your preferred way to view the output, the JSONLines file should now have a new html_base64 field per row, containing the base64-encoded (ASCII) HTML. And you'll notice the absence of former columns like og_title, dc_title, etc.; this is the "metadata" that has been removed.
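
For a quick look at which fields each row now carries, a minimal sketch (the exact field set beyond html_base64 isn't guaranteed here):

import json

path = "output/crawls/collections/mitlibwebsite/mitlibwebsite-extracted-records-to-index.jsonl"
with open(path) as f:
    first_record = json.loads(f.readline())

# expect html_base64 plus light metadata like the URL, and no og_title / dc_title columns
print(sorted(first_record.keys()))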

The encoded HTML can be decoded back into the original HTML with some Python like this (which is what Transmogrifier will do):

import base64
import jsonlines


with jsonlines.open('output/crawls/collections/mitlibwebsite/mitlibwebsite-extracted-records-to-index.jsonl') as reader:
    records = list(reader)
record = records[0]

html_content = base64.b64decode(record['html_base64']).decode()
print(html_content)
# output is the original HTML, decoded from the base64-encoded ASCII string
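
Continuing from the snippet above, downstream metadata extraction then becomes ordinary HTML parsing. A minimal sketch using BeautifulSoup (an assumption for illustration; not necessarily what Transmogrifier will actually use) to pull a page title:

from bs4 import BeautifulSoup

# parse the decoded HTML; any source-specific, opinionated extraction happens downstream
soup = BeautifulSoup(html_content, "html.parser")

# pulling the <title> tag is just an illustrative example
title = soup.title.string if soup.title else None
print(title)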

Includes new or updated dependencies?

YES

Changes expectations for external applications?

YES: The browsertrix-harvester now returns "records" rather than "metadata records" as its primary output for each run. The records contain some light metadata about the website itself, e.g. URL, etc., but an equally important column is the full, rendered HTML of the page. It is expected that downstream systems like Transmogrifier will be responsible for extracting metadata in a more focused and opinionated fashion.
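
To make that shape concrete, a single output record might look roughly like this (only html_base64 is certain from this change; the other field names are illustrative):

# an illustrative record shape; actual field names and ordering may differ
record = {
    "url": "https://libraries.mit.edu/",  # light metadata about the page itself
    "html_base64": "PGh0bWw+PGhlYWQ+Li4u",  # full rendered HTML, base64-encoded ASCII
}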

What are the relevant tickets?

  • https://mitlibraries.atlassian.net/browse/USE-258

Code review

  • Code review best practices are documented here and you are encouraged to have a constructive dialogue with your reviewers about their preferences and expectations.

@ghukill ghukill force-pushed the USE-258-html-vs-metadata-parsing branch from 6435e60 to 8245d1a on December 9, 2025 17:55
@ghukill ghukill marked this pull request as ready for review December 9, 2025 19:11
@ghukill ghukill requested a review from a team as a code owner December 9, 2025 19:11
@jonavellecuerdo jonavellecuerdo self-assigned this Dec 9, 2025

ghukill commented Dec 9, 2025

To whomever @MITLibraries/dataeng reviews this:

I'm realizing that given the file renaming from metadata.py to records.py, some of the line-by-line changes are lost.

I'm adding a couple of inline comments at places of importance.

Comment on lines +77 to +83
# add base64 encoded full HTML
html_content = wacz_client.get_website_content(
    str(row.filename),
    str(row.offset),
    decode=False,
)
record["html_base64"] = base64.b64encode(html_content).decode()  # type: ignore[arg-type]

This is where we base64 encode and store the page HTML in the output record. This replaces a considerable amount of logic and complexity where we parsed metadata tags from the HTML.

@ehanson8 ehanson8 left a comment

Works as expected and this is a smart and logical shift of the metadata parsing out of this repo. Great work!


ghukill commented Dec 9, 2025

Works as expected and this is a smart and logical shift of the metadata parsing out of this repo. Great work!

Thanks @ehanson8! I felt good about this fairly substantial shift, glad you are too.

@ghukill ghukill merged commit 0cce93d into main Dec 10, 2025
4 checks passed