Conversation

@ghukill ghukill commented Dec 9, 2025

Purpose and background context

NOTE: The following came directly from the git commit.


Why these changes are being introduced:

A decision was made to pivot the output of this harvester from "metadata records" to just "records". Due to the highly varying structure of websites, any attempt to parse metadata about a website -- e.g. using OpenGraph or Dublin Core tags in the head section -- was inherently opinionated toward the particular sites getting crawled. This was leaning into an unsustainable pattern: whenever we encountered new websites with metadata in different places, we'd have to hardcode that metadata extraction into the harvester so it would be available downstream.

Instead of the harvester returning metadata records about the websites captured in the crawl, the harvester now returns the full HTML for each website as part of the "record" that is written as output. This HTML can then be used by downstream applications like Transmogrifier, which are designed to be opinionated about a particular source, to extract metadata.

The implicit proposal here is that parsing metadata from messy HTML is different from parsing metadata from structured data like an EAD or METS file, but it's still the work of extracting TIMDEX metadata from some kind of source data.

How this addresses that need:

There are three major changes here:

  1. Change "metadata" language to "records" throughout the codebase. This is needed to convey that we aren't really creating "metadata" records per se, but instead returning structured "records" with the HTML content of each website. The harvester is still doing important work on top of the raw web crawl, just not parsing metadata.

  2. Include the full HTML of the website in the output record. Because HTML can have all kinds of problematic characters, we base64-encode it to ASCII and store it in a field called html_base64 (see the sketch just after this list).

  3. We remove any CLI, class, or method arguments around including fulltext or extracting keywords. Given that we are now including the full HTML in the output, which contains the fulltext of the website in a form we have more control over parsing, we don't need the harvester to extract fulltext as part of the record. This also fully completes the removal of keyword extraction, which was somewhat experimental and will likely be improved upon by embeddings.
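
As a minimal sketch of the encoding described in point 2 (hypothetical values, standard library only):

import base64

# hypothetical raw HTML bytes as retrieved from the crawl's WACZ archive
html_bytes = b"<html><head><title>MIT Libraries</title></head><body>...</body></html>"

# encode to a plain ASCII string that is safe to embed in a JSONLines field
html_base64 = base64.b64encode(html_bytes).decode("ascii")

# the round trip recovers the original HTML exactly
assert base64.b64decode(html_base64) == html_bytes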


90% of the code churn is the renaming of "metadata" --> "record", which is probably a bit difficult to check line-by-line, but an important change, I think.

How can a reviewer manually see the effects of these changes?

1- Create configuration YAML locally at output/crawls/configs/mitlibwebsite.yaml:

# General
generateCDX: true
generateWACZ: true
logExcludeContext: "recorder,pageStatus"
text: to-pages
timeout: 15
userAgentSuffix: "TIMDEXBot"

# Performance
workers: 16 # aim for x2 number of CPU cores

# Seeds and Scoping
# NOTE: It is expected that the browsertrix-harvester will be called with multiple
# --sitemap arguments that augment any seeds defined below.
seeds:
  - url: https://libguides.mit.edu/directory
    scopeType: "custom"
    include:
      - libguides.mit.edu/.*
    depth: 1
    limit: 50
    exclude:
      - ".*az\\.php.*"

# Prevent PAGES from getting crawled; scoping (regex)
exclude:
  - ".*lib\\.mit\\.edu/search/.*"
  - ".*www-ux\\.libraries\\.mit\\.edu.*"
  - ".*mit\\.primo\\.exlibrisgroup\\.com/.*"
  - ".*libraries\\.mit\\.edu/app/uploads.*"
  - ".*libraries\\.mit\\.edu/hours.*"

  # Exclude media files from being crawled as pages
  - ".*\\.mp3$"
  - ".*\\.mp4$"
  - ".*\\.wav$"
  - ".*\\.ogg$"
  - ".*\\.m4a$"
  - ".*\\.avi$"
  - ".*\\.mov$"
  - ".*\\.webm$"
  - ".*\\.mkv$"
  - ".*\\.flac$"
  - ".*\\.wma$"

  # Exclude image files from being crawled as pages
  - ".*\\.jpg$"
  - ".*\\.jpeg$"
  - ".*\\.png$"
  - ".*\\.gif$"
  - ".*\\.webp$"
  - ".*\\.svg$"
  - ".*\\.bmp$"
  - ".*\\.tiff$"
  - ".*\\.ico$"

  # Exclude other files
  - ".*\\.zip$"
  - ".*\\.tar$"
  - ".*\\.gz$"

# Prevent RESOURCES / ASSETS from getting retrieved (regex)
# Aggressive blocking to save ~20GB+ of media
blockRules:
  # Block ALL video domains - saves 14.3 GB (63.9%)
  - url: ".*googlevideo\\.com.*"
  - url: ".*youtube\\.com.*"
  - url: ".*vimeo\\.com.*"

  # Block video files anywhere in URL - saves 14.3 GB
  - url: ".*\\.mp4"
  - url: ".*\\.webm"
  - url: ".*\\.mov"
  - url: ".*\\.avi"
  - url: ".*\\.m4v"
  - url: ".*\\.mkv"
  - url: ".*\\.flv"
  - url: ".*\\.wmv"

  # Block ALL audio from cdn.libraries.mit.edu EXCEPT certain paths
  # Block the entire CDN domain for media/dissemination
  - url: "cdn\\.libraries\\.mit\\.edu/media"
  - url: "cdn\\.libraries\\.mit\\.edu/dissemination"

  # Block audio files ANYWHERE in URL - saves 6.4 GB (28.6%)
  # Pattern matches .mp3 with optional trailing content (query params, fragments, or nothing)
  - url: "\\.mp3"
  - url: "\\.wav"
  - url: "\\.ogg"
  - url: "\\.m4a"
  - url: "\\.aac"
  - url: "\\.flac"
  - url: "\\.wma"

  # Block image files ANYWHERE in URL - saves 1.3 GB (5.8%)
  - url: ".*\\.jpg"
  - url: ".*\\.jpeg"
  - url: ".*\\.png"
  - url: ".*\\.gif"
  - url: ".*\\.webp"
  - url: ".*\\.svg"
  - url: ".*\\.bmp"
  - url: ".*\\.tiff"
  - url: ".*\\.ico"

  # Block PDFs - saves 63 MB (0.3%)
  - url: ".*\\.pdf"

  # Block fonts - saves ~2 MB
  - url: ".*\\.woff2?"
  - url: ".*\\.ttf"
  - url: ".*\\.otf"
  - url: ".*\\.eot"

  # Block other large media/binary files
  - url: ".*\\.zip"
  - url: ".*\\.tar"
  - url: ".*\\.gz"
  - url: ".*\\.rar"
  - url: ".*\\.7z"
  - url: ".*\\.exe"
  - url: ".*\\.dmg"
  - url: ".*\\.iso"

# Browser settings to aggressively prevent media loading
browserArgs:
  - "--disable-images"
  - "--blink-settings=imagesEnabled=false"
  - "--autoplay-policy=document-user-activation-required"
  - "--disable-features=AudioServiceOutOfProcess"
  - "--disable-dev-shm-usage"
  - "--no-sandbox"

# Disable all behaviors to prevent media interaction
behaviors: []
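
Before building the image, a quick sanity check that the YAML parses and the regex patterns compile can save a restarted crawl. A minimal sketch, assuming PyYAML is installed and the file was saved at the path from step 1:

import re

import yaml

# path from step 1 (adjust if saved elsewhere)
with open("output/crawls/configs/mitlibwebsite.yaml") as f:
    config = yaml.safe_load(f)

# confirm the page-level excludes and resource-level block rules are valid regexes
exclude_patterns = config.get("exclude", [])
block_rules = config.get("blockRules", [])
for pattern in exclude_patterns:
    re.compile(pattern)
for rule in block_rules:
    re.compile(rule["url"])

print(f"{len(exclude_patterns)} exclude patterns and {len(block_rules)} block rules compile")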

2- Update local docker image:

make docker-build

3- Run harvest:

export CRAWL_NAME="mitlibwebsite"
docker run -it \
-v $(pwd)/output/crawls:/crawls \
browsertrix-harvester-dev \
--verbose \
harvest \
--crawl-name="${CRAWL_NAME}" \
--config-yaml-file="/crawls/configs/mitlibwebsite.yaml" \
--records-output-file="/crawls/collections/${CRAWL_NAME}/${CRAWL_NAME}-extracted-records-to-index.jsonl" \
--num-workers 8

Note some changes in the CLI args:

  • --metadata-output-file --> --records-output-file

4- Analyze the results located at output/crawls/collections/mitlibwebsite/mitlibwebsite-extracted-records-to-index.jsonl

Depending on your preferred way to view the output, the JSONLines file should now have a new html_base64 field per row, containing the base64-encoded (ASCII) HTML. And you'll notice the absence of former columns like og_title, dc_title, etc.; this is the "metadata" that has been removed.
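
For a quick look at which fields each row now carries, a minimal sketch (the exact field set beyond html_base64 isn't guaranteed here):

import json

path = "output/crawls/collections/mitlibwebsite/mitlibwebsite-extracted-records-to-index.jsonl"
with open(path) as f:
    first_record = json.loads(f.readline())

# expect html_base64 plus light metadata like the URL, and no og_title / dc_title columns
print(sorted(first_record.keys()))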

The encoded HTML can be decoded back into the original HTML with some Python like this (which is what Transmogrifier will do):

import base64
import jsonlines


with jsonlines.open('output/crawls/collections/mitlibwebsite/mitlibwebsite-extracted-records-to-index.jsonl') as reader:
    records = list(reader)
record = records[0]

html_content = base64.b64decode(record['html_base64']).decode()
print(html_content)
# output is the original HTML, decoded from the base64-encoded ASCII string
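
Continuing from the snippet above, downstream metadata extraction then becomes ordinary HTML parsing. A minimal sketch using BeautifulSoup (an assumption for illustration; not necessarily what Transmogrifier will actually use) to pull a page title:

from bs4 import BeautifulSoup

# parse the decoded HTML; any source-specific, opinionated extraction happens downstream
soup = BeautifulSoup(html_content, "html.parser")

# pulling the <title> tag is just an illustrative example
title = soup.title.string if soup.title else None
print(title)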

Includes new or updated dependencies?

YES

Changes expectations for external applications?

YES: The browsertrix-harvester now returns "records" rather than "metadata records" as its primary output for each run. The records contain some light metadata about the website itself, e.g. URL, etc., but an equally important column is the full, rendered HTML of the page. It is expected that downstream systems like Transmogrifier will be responsible for extracting metadata in a more focused and opinionated fashion.
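
To make that shape concrete, a single output record might look roughly like this (only html_base64 is certain from this change; the other field names are illustrative):

# an illustrative record shape; actual field names and ordering may differ
record = {
    "url": "https://libraries.mit.edu/",  # light metadata about the page itself
    "html_base64": "PGh0bWw+PGhlYWQ+Li4u",  # full rendered HTML, base64-encoded ASCII
}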

What are the relevant tickets?

  • https://mitlibraries.atlassian.net/browse/USE-258

Code review

  • Code review best practices are documented here and you are encouraged to have a constructive dialogue with your reviewers about their preferences and expectations.

@ghukill ghukill force-pushed the USE-258-html-vs-metadata-parsing branch from 6435e60 to 8245d1a on December 9, 2025 17:55
@ghukill ghukill marked this pull request as ready for review December 9, 2025 19:11
@ghukill ghukill requested a review from a team as a code owner December 9, 2025 19:11
@jonavellecuerdo jonavellecuerdo self-assigned this Dec 9, 2025

ghukill commented Dec 9, 2025

To whomever @MITLibraries/dataeng reviews this:

I'm realizing that given the file renaming from metadata.py to records.py, some of the line-by-line changes are lost.

I'm adding a couple of inline comments at places of importance.

Comment on lines +77 to +83
# add base64 encoded full HTML
html_content = wacz_client.get_website_content(
    str(row.filename),
    str(row.offset),
    decode=False,
)
record["html_base64"] = base64.b64encode(html_content).decode()  # type: ignore[arg-type]

This is where we base64 encode and store the page HTML in the output record. This replaces a considerable amount of logic and complexity where we parsed metadata tags from the HTML.

@ehanson8 ehanson8 left a comment

Works as expected and this is a smart and logical shift of the metadata parsing out of this repo. Great work!


ghukill commented Dec 9, 2025

Works as expected and this is a smart and logical shift of the metadata parsing out of this repo. Great work!

Thanks @ehanson8! I felt good about this fairly substantial shift, glad you are too.

@ghukill ghukill merged commit 0cce93d into main Dec 10, 2025
4 checks passed