
Conversation

@ghukill ghukill commented Dec 1, 2025

Purpose and background context

This PR performs a bit of prep work to add staff directory pages to the scope of the mitlibwebsite TIMDEX source crawl.

The most meaningful update is the ability to parse metadata from <meta name="DC:*" content="..."/> tags in the captured HTML. This builds on the previous behavior of only looking for OpenGraph <meta property="og:*" content="..."/> tags. We now effectively have two built-in "strategies" for parsing metadata from the captured HTML.
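For illustration only (the PR's actual sub-method is not reproduced here), a minimal sketch of the Dublin Core strategy, assuming BeautifulSoup is used for the HTML parsing:

from bs4 import BeautifulSoup

def parse_dublin_core_meta(html: str) -> dict[str, str]:
    """Sketch: collect <meta name="DC:*" content="..."> values from captured HTML."""
    soup = BeautifulSoup(html, "html.parser")
    fields: dict[str, str] = {}
    for tag in soup.find_all("meta"):
        name = tag.get("name") or ""
        content = (tag.get("content") or "").strip()
        # accept either "DC:" or "DC."-prefixed names, e.g. DC:Title / DC.Title
        if name.upper().startswith(("DC:", "DC.")) and content != "":
            # colons are awkward in downstream field names, so swap them for "_"
            fields[name.replace(":", "_")] = content
    return fields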

Note this paragraph from the git commit message, which acknowledges issues with continuing this pattern:

While these dedicated sub-methods help encapsulate the logic for OpenGraph
and Dublin Core tag parsing, it would be an unsustainable pattern to
keep adding sub-methods each time we encounter new information we want to
extract from the captured HTML. Noting that we may encounter a future where
this harvester should return the full, raw HTML as part of the record,
letting more opinionated, downstream contexts (e.g. Transmogrifier) extract
metadata. For now, this very generic head > meta tag parsing seems reasonable
but we should keep an eye on this.

How can a reviewer manually see the effects of these changes?

1- Build an updated docker image:

make docker-build

2- Create the directory output/crawls/configs if it does not already exist, then add a test YAML file at output/crawls/configs/mitlibwebsite.yaml (this will get mounted into the Docker container):

# General
generateCDX: true
generateWACZ: true
logExcludeContext: "recorder,pageStatus"
text: to-pages
timeout: 15
userAgentSuffix: "TIMDEXBot"

# Performance
workers: 16 # aim for 2x the number of CPU cores

# Seeds and Scoping
# NOTE: It is expected that the browsertrix-harvester will be called with multiple
# --sitemap arguments that augment any seeds defined below.
seeds:
  - url: https://libguides.mit.edu/directory
    scopeType: "custom"
    include:
      - libguides.mit.edu/.*
    depth: 1
    limit: 50
    exclude:
      - ".*az\\.php.*"

# Prevent PAGES from getting crawled; scoping (regex)
exclude:
  - ".*lib\\.mit\\.edu/search/.*"
  - ".*www-ux\\.libraries\\.mit\\.edu.*"
  - ".*mit\\.primo\\.exlibrisgroup\\.com/.*"
  - ".*libraries\\.mit\\.edu/app/uploads.*"
  - ".*libraries\\.mit\\.edu/hours.*"

  # Exclude media files from being crawled as pages
  - ".*\\.mp3$"
  - ".*\\.mp4$"
  - ".*\\.wav$"
  - ".*\\.ogg$"
  - ".*\\.m4a$"
  - ".*\\.avi$"
  - ".*\\.mov$"
  - ".*\\.webm$"
  - ".*\\.mkv$"
  - ".*\\.flac$"
  - ".*\\.wma$"

  # Exclude image files from being crawled as pages
  - ".*\\.jpg$"
  - ".*\\.jpeg$"
  - ".*\\.png$"
  - ".*\\.gif$"
  - ".*\\.webp$"
  - ".*\\.svg$"
  - ".*\\.bmp$"
  - ".*\\.tiff$"
  - ".*\\.ico$"

  # Exclude other files
  - ".*\\.zip$"
  - ".*\\.tar$"
  - ".*\\.gz$"

# Prevent RESOURCES / ASSETS from getting retrieved (regex)
# Aggressive blocking to save ~20GB+ of media
blockRules:
  # Block ALL video domains - saves 14.3 GB (63.9%)
  - url: ".*googlevideo\\.com.*"
  - url: ".*youtube\\.com.*"
  - url: ".*vimeo\\.com.*"

  # Block video files anywhere in URL - saves 14.3 GB
  - url: ".*\\.mp4"
  - url: ".*\\.webm"
  - url: ".*\\.mov"
  - url: ".*\\.avi"
  - url: ".*\\.m4v"
  - url: ".*\\.mkv"
  - url: ".*\\.flv"
  - url: ".*\\.wmv"

  # Block audio from cdn.libraries.mit.edu by blocking its media and dissemination paths
  - url: "cdn\\.libraries\\.mit\\.edu/media"
  - url: "cdn\\.libraries\\.mit\\.edu/dissemination"

  # Block audio files ANYWHERE in URL - saves 6.4 GB (28.6%)
  # Pattern matches .mp3 with optional trailing content (query params, fragments, or nothing)
  - url: "\\.mp3"
  - url: "\\.wav"
  - url: "\\.ogg"
  - url: "\\.m4a"
  - url: "\\.aac"
  - url: "\\.flac"
  - url: "\\.wma"

  # Block image files ANYWHERE in URL - saves 1.3 GB (5.8%)
  - url: ".*\\.jpg"
  - url: ".*\\.jpeg"
  - url: ".*\\.png"
  - url: ".*\\.gif"
  - url: ".*\\.webp"
  - url: ".*\\.svg"
  - url: ".*\\.bmp"
  - url: ".*\\.tiff"
  - url: ".*\\.ico"

  # Block PDFs - saves 63 MB (0.3%)
  - url: ".*\\.pdf"

  # Block fonts - saves ~2 MB
  - url: ".*\\.woff2?"
  - url: ".*\\.ttf"
  - url: ".*\\.otf"
  - url: ".*\\.eot"

  # Block other large media/binary files
  - url: ".*\\.zip"
  - url: ".*\\.tar"
  - url: ".*\\.gz"
  - url: ".*\\.rar"
  - url: ".*\\.7z"
  - url: ".*\\.exe"
  - url: ".*\\.dmg"
  - url: ".*\\.iso"

# Browser settings to aggressively prevent media loading
browserArgs:
  - "--disable-images"
  - "--blink-settings=imagesEnabled=false"
  - "--autoplay-policy=document-user-activation-required"
  - "--disable-features=AudioServiceOutOfProcess"
  - "--disable-dev-shm-usage"
  - "--no-sandbox"

# Disable all behaviors to prevent media interaction
behaviors: []

3- Run a harvest:

export CRAWL_NAME="mitlibwebsite-staff-directory"
docker run -it \
  -v $(pwd)/output/crawls:/crawls \
  browsertrix-harvester-dev \
  --verbose \
  harvest \
  --crawl-name="${CRAWL_NAME}" \
  --config-yaml-file="/crawls/configs/mitlibwebsite.yaml" \
  --metadata-output-file="/crawls/collections/${CRAWL_NAME}/${CRAWL_NAME}-extracted-records-to-index.jsonl" \
  --num-workers 16 \
  --include-fulltext \
  --btrix-args-json='{}'

The seeds section is new to this YAML; the rest is what currently powers the mitlibwebsite crawl, with the help of multiple --sitemap CLI arguments. The point of running this crawl is not to dig into that YAML configuration so much as to see how URLs crawled from libguides.mit.edu now pick up Dublin Core metadata.

4- Analyze the resulting metadata records file at output/crawls/collections/mitlibwebsite-staff-directory/mitlibwebsite-staff-directory-extracted-records-to-index.jsonl

You should see values for properties like DC.Title, DC.Description, etc., which would not have shown up prior to the changes in this PR. Note that we also still see values for og_title, og_description, etc. This duplication is known and okay; it's the responsibility of Transmog to decide which values to use.
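As a quick spot check, something like the following can summarize the Dublin Core and OpenGraph fields per record (purely illustrative; the exact record schema, including a url key, is an assumption here):

import json

path = (
    "output/crawls/collections/mitlibwebsite-staff-directory/"
    "mitlibwebsite-staff-directory-extracted-records-to-index.jsonl"
)
with open(path) as f:
    for line in f:
        record = json.loads(line)
        # pull out any Dublin Core and OpenGraph fields present on the record
        parsed = {
            key: value
            for key, value in record.items()
            if key.upper().startswith("DC") or key.startswith("og_")
        }
        print(record.get("url"), parsed)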

Includes new or updated dependencies?

YES

Changes expectations for external applications?

YES

What are the relevant tickets?

* https://mitlibraries.atlassian.net/browse/USE-240

Code review

  • Code review best practices are documented here and you are encouraged to have a constructive dialogue with your reviewers about their preferences and expectations.

Why these changes are being introduced:

One unique function of this harvester is to extract metadata about the
crawled website from the raw HTML.  The first pass extracted OpenGraph <meta>
tag values which were present on all the Wordpress pages we crawled.  Now, we
find ourselves expanding the type of sites we crawl a bit and need to parse
Dublin Core <meta> tags.

While these dedicated sub-methods help encapsulate the logic for OpenGraph
and Dublin Core <meta> tag parsing, it would be an unsustainable pattern to
keep adding sub-methods each time we encounter new information we want to
extract from the captured HTML.  Noting that we may encounter a future where
this harvester should return the full, raw HTML as part of the record,
letting more opinionated, downstream contexts (e.g. Transmogrifier) extract
metadata.  For now, this very generic head > meta tag parsing seems reasonable
but we should keep an eye on this.

How this addresses that need:

Refactors metadata parsing to dedicated sub-methods, porting the pre-existing
OpenGraph logic and adding new Dublin Core logic.

Side effects of this change:
* All records will now have metadata columns for Dublin Core tags, but those
columns may just be NULL if the fields are not present.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/USE-240
@ghukill ghukill marked this pull request as ready for review December 1, 2025 20:34
@ghukill ghukill requested a review from a team as a code owner December 1, 2025 20:34

@ehanson8 ehanson8 left a comment


Works as expected, one optional suggestion to consider but non-blocking!

if content_stripped != "":
    dc_tag_name_friendly = dc_tag_name.replace(":", "_")
    fields[dc_tag_name_friendly] = content_stripped
return fields
@ehanson8 commented on the snippet above:

Optional: since there's a lot of repetition between _parse_open_graph_meta_elements and _parse_dublin_core_meta_elements, you could consider a single method that takes a tag-list option and supports both an og and a dc option. But that's admittedly a more complicated signature, so that's why it's optional!
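For illustration, a sketch of what that consolidation might look like (the helper name, signature, and prefix handling here are assumptions, not code from this PR):

from bs4 import BeautifulSoup

def _parse_prefixed_meta_elements(
    soup: BeautifulSoup, attribute: str, prefix: str
) -> dict[str, str]:
    """Sketch: parse <meta> tags whose `attribute` value starts with `prefix`.

    e.g. attribute="property", prefix="og:" for OpenGraph,
         attribute="name", prefix="DC:" for Dublin Core.
    """
    fields: dict[str, str] = {}
    for tag in soup.find_all("meta"):
        name = tag.get(attribute) or ""
        content = (tag.get("content") or "").strip()
        if name.lower().startswith(prefix.lower()) and content != "":
            fields[name.replace(":", "_")] = content
    return fields

Callers would pass the attribute/prefix pair per strategy, which keeps both strategies in one place at the cost of the more complicated signature mentioned above.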

@ghukill (author) replied:

Yeah... I had kind of wondered / felt the same. My instinct is to leave as-is for now, with the anticipation that we'll be revisiting metadata extraction in this harvester quite a bit in the coming months.

Honestly, I'm beginning to wonder if this harvester should just produce JSONLines with some high-level metadata about the website (e.g. URL), but then include the full HTML for downstream systems like Transmog to parse metadata from. This might mean removing both of these methods.

As noted in the PR + git commit, and given the messy nature of the internet this harvester crawls, I have a strong feeling it was a misstep to have this kind of opinionated parsing in this harvester. It feels like it should be Transmogrifier extracting metadata, with per-source opinionation, from the raw HTML. If we think of the raw HTML in the same context as a large EAD or something, it's not as odd. It's decidedly not metadata like an EAD... but one could argue the HTML is a wrapper for the metadata that is somewhere in the page, and Transmog is responsible for parsing it.

In short: Transmog does and should have per-source opinionation, and this harvester likely should not. For this reason, while I think this temporary bridge is helpful, we might ultimately want to remove this kind of metadata parsing from the harvester.
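Purely for illustration, a record in that hypothetical "thin" shape might look like this (every field name is an assumption, not an agreed-upon schema):

# Illustrative only: a "thin" harvester record that defers metadata
# extraction to Transmogrifier; field names are assumptions.
thin_record = {
    "url": "https://libguides.mit.edu/directory",
    "crawl_name": "mitlibwebsite-staff-directory",
    "fulltext": "...",               # optional extracted text, as today
    "html": "<!DOCTYPE html>...",    # full captured HTML, parsed downstream
}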

@ghukill ghukill merged commit 2bc3958 into main Dec 2, 2025
4 checks passed