
Conversation

@ghukill ghukill commented Dec 1, 2025

Purpose and background context

This PR performs a bit of prep work to add staff directory pages to the scope of the mitlibwebsite TIMDEX source crawl.

The most meaningful update is the ability to parse metadata from <meta name="DC:*" content="..."/> tags in the captured HTML. This builds on the previous behavior of only looking for OpenGraph <meta property="og:*" content="..."/> tags. We now effectively have two built-in "strategies" for parsing metadata from the captured HTML.
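For illustration only (the PR's actual sub-method is not reproduced here), a minimal sketch of the Dublin Core strategy, assuming BeautifulSoup is used for the HTML parsing:

from bs4 import BeautifulSoup

def parse_dublin_core_meta(html: str) -> dict[str, str]:
    """Sketch: collect <meta name="DC:*" content="..."> values from captured HTML."""
    soup = BeautifulSoup(html, "html.parser")
    fields: dict[str, str] = {}
    for tag in soup.find_all("meta"):
        name = tag.get("name") or ""
        content = (tag.get("content") or "").strip()
        # accept either "DC:" or "DC."-prefixed names, e.g. DC:Title / DC.Title
        if name.upper().startswith(("DC:", "DC.")) and content != "":
            # colons are awkward in downstream field names, so swap them for "_"
            fields[name.replace(":", "_")] = content
    return fields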

Note this paragraph from the git commit message, which acknowledges issues with continuing this pattern:

While these dedicated sub-methods help encapsulate the logic for OpenGraph
and Dublin Core tag parsing, it would be an unsustainable pattern to
keep adding sub-methods each time we encounter new information we want to
extract from the captured HTML. Noting that we may encounter a future where
this harvester should return the full, raw HTML as part of the record,
letting more opinionated, downstream contexts (e.g. Transmogrifier) extract
metadata. For now, this very generic head > meta tag parsing seems reasonable
but we should keep an eye on this.

How can a reviewer manually see the effects of these changes?

1- Build an updated docker image:

make docker-build

2- Create the directory output/crawls/configs if it does not already exist, then add a test YAML file at output/crawls/configs/mitlibwebsite.yaml (this will get mounted into the Docker container):

# General
generateCDX: true
generateWACZ: true
logExcludeContext: "recorder,pageStatus"
text: to-pages
timeout: 15
userAgentSuffix: "TIMDEXBot"

# Performance
workers: 16 # aim for 2x the number of CPU cores

# Seeds and Scoping
# NOTE: It is expected that the browsertrix-harvester will be called with multiple
# --sitemap arguments that augment any seeds defined below.
seeds:
  - url: https://libguides.mit.edu/directory
    scopeType: "custom"
    include:
      - libguides.mit.edu/.*
    depth: 1
    limit: 50
    exclude:
      - ".*az\\.php.*"

# Prevent PAGES from getting crawled; scoping (regex)
exclude:
  - ".*lib\\.mit\\.edu/search/.*"
  - ".*www-ux\\.libraries\\.mit\\.edu.*"
  - ".*mit\\.primo\\.exlibrisgroup\\.com/.*"
  - ".*libraries\\.mit\\.edu/app/uploads.*"
  - ".*libraries\\.mit\\.edu/hours.*"

  # Exclude media files from being crawled as pages
  - ".*\\.mp3$"
  - ".*\\.mp4$"
  - ".*\\.wav$"
  - ".*\\.ogg$"
  - ".*\\.m4a$"
  - ".*\\.avi$"
  - ".*\\.mov$"
  - ".*\\.webm$"
  - ".*\\.mkv$"
  - ".*\\.flac$"
  - ".*\\.wma$"

  # Exclude image files from being crawled as pages
  - ".*\\.jpg$"
  - ".*\\.jpeg$"
  - ".*\\.png$"
  - ".*\\.gif$"
  - ".*\\.webp$"
  - ".*\\.svg$"
  - ".*\\.bmp$"
  - ".*\\.tiff$"
  - ".*\\.ico$"

  # Exclude other files
  - ".*\\.zip$"
  - ".*\\.tar$"
  - ".*\\.gz$"

# Prevent RESOURCES / ASSETS from getting retrieved (regex)
# Aggressive blocking to save ~20GB+ of media
blockRules:
  # Block ALL video domains - saves 14.3 GB (63.9%)
  - url: ".*googlevideo\\.com.*"
  - url: ".*youtube\\.com.*"
  - url: ".*vimeo\\.com.*"

  # Block video files anywhere in URL - saves 14.3 GB
  - url: ".*\\.mp4"
  - url: ".*\\.webm"
  - url: ".*\\.mov"
  - url: ".*\\.avi"
  - url: ".*\\.m4v"
  - url: ".*\\.mkv"
  - url: ".*\\.flv"
  - url: ".*\\.wmv"

  # Block audio from cdn.libraries.mit.edu by blocking its media and dissemination paths
  - url: "cdn\\.libraries\\.mit\\.edu/media"
  - url: "cdn\\.libraries\\.mit\\.edu/dissemination"

  # Block audio files ANYWHERE in URL - saves 6.4 GB (28.6%)
  # Pattern matches .mp3 with optional trailing content (query params, fragments, or nothing)
  - url: "\\.mp3"
  - url: "\\.wav"
  - url: "\\.ogg"
  - url: "\\.m4a"
  - url: "\\.aac"
  - url: "\\.flac"
  - url: "\\.wma"

  # Block image files ANYWHERE in URL - saves 1.3 GB (5.8%)
  - url: ".*\\.jpg"
  - url: ".*\\.jpeg"
  - url: ".*\\.png"
  - url: ".*\\.gif"
  - url: ".*\\.webp"
  - url: ".*\\.svg"
  - url: ".*\\.bmp"
  - url: ".*\\.tiff"
  - url: ".*\\.ico"

  # Block PDFs - saves 63 MB (0.3%)
  - url: ".*\\.pdf"

  # Block fonts - saves ~2 MB
  - url: ".*\\.woff2?"
  - url: ".*\\.ttf"
  - url: ".*\\.otf"
  - url: ".*\\.eot"

  # Block other large media/binary files
  - url: ".*\\.zip"
  - url: ".*\\.tar"
  - url: ".*\\.gz"
  - url: ".*\\.rar"
  - url: ".*\\.7z"
  - url: ".*\\.exe"
  - url: ".*\\.dmg"
  - url: ".*\\.iso"

# Browser settings to aggressively prevent media loading
browserArgs:
  - "--disable-images"
  - "--blink-settings=imagesEnabled=false"
  - "--autoplay-policy=document-user-activation-required"
  - "--disable-features=AudioServiceOutOfProcess"
  - "--disable-dev-shm-usage"
  - "--no-sandbox"

# Disable all behaviors to prevent media interaction
behaviors: []

3- Run a harvest:

export CRAWL_NAME="mitlibwebsite-staff-directory"
docker run -it \
  -v $(pwd)/output/crawls:/crawls \
  browsertrix-harvester-dev \
  --verbose \
  harvest \
  --crawl-name="${CRAWL_NAME}" \
  --config-yaml-file="/crawls/configs/mitlibwebsite.yaml" \
  --metadata-output-file="/crawls/collections/${CRAWL_NAME}/${CRAWL_NAME}-extracted-records-to-index.jsonl" \
  --num-workers 16 \
  --include-fulltext \
  --btrix-args-json='{}'

The seeds section is new to this YAML; the rest is what currently powers the mitlibwebsite crawl, with the help of multiple --sitemap CLI arguments. The point of running this crawl is not to dig into that YAML configuration so much as to see how URLs crawled from libguides.mit.edu now pick up Dublin Core metadata.

4- Analyze the resulting metadata records file at output/crawls/collections/mitlibwebsite-staff-directory/mitlibwebsite-staff-directory-extracted-records-to-index.jsonl

You should see values for properties like DC.Title, DC.Description, etc., which would not have shown up prior to the changes in this PR. Note that we also still see values for og_title, og_description, etc. This duplication is known and okay; it's the responsibility of Transmog to decide which values to use.
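As a quick spot check, something like the following can summarize the Dublin Core and OpenGraph fields per record (purely illustrative; the exact record schema, including a url key, is an assumption here):

import json

path = (
    "output/crawls/collections/mitlibwebsite-staff-directory/"
    "mitlibwebsite-staff-directory-extracted-records-to-index.jsonl"
)
with open(path) as f:
    for line in f:
        record = json.loads(line)
        # pull out any Dublin Core and OpenGraph fields present on the record
        parsed = {
            key: value
            for key, value in record.items()
            if key.upper().startswith("DC") or key.startswith("og_")
        }
        print(record.get("url"), parsed)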

Includes new or updated dependencies?

YES

Changes expectations for external applications?

YES

What are the relevant tickets?

* https://mitlibraries.atlassian.net/browse/USE-240

Code review

  • Code review best practices are documented here and you are encouraged to have a constructive dialogue with your reviewers about their preferences and expectations.

Why these changes are being introduced:

One unique function of this harvester is to extract metadata about the
crawled website from the raw HTML.  The first pass extracted OpenGraph <meta>
tag values which were present on all the Wordpress pages we crawled.  Now, we
find ourselves expanding the type of sites we crawl a bit and need to parse
Dublin Core <meta> tags.

While these dedicated sub-methods help encapsulate the logic for OpenGraph
and Dublin Core <meta> tag parsing, it would be an unsustainable pattern to
keep adding sub-methods each time we encounter new information we want to
extract from the captured HTML.  Noting that we may encounter a future where
this harvester should return the full, raw HTML as part of the record,
letting more opinionated, downstream contexts (e.g. Transmogrifier) extract
metadata.  For now, this very generic head > meta tag parsing seems reasonable
but we should keep an eye on this.

How this addresses that need:

Refactors metadata parsing to dedicated sub-methods, porting the pre-existing
OpenGraph logic and adding new Dublin Core logic.

Side effects of this change:
* All records will now have metadata columns for Dublin Core tags, but those
columns may just be NULL if the fields are not present.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/USE-240
@ghukill ghukill marked this pull request as ready for review December 1, 2025 20:34
@ghukill ghukill requested a review from a team as a code owner December 1, 2025 20:34

@ehanson8 ehanson8 left a comment


Works as expected, one optional suggestion to consider but non-blocking!

if content_stripped != "":
    dc_tag_name_friendly = dc_tag_name.replace(":", "_")
    fields[dc_tag_name_friendly] = content_stripped
return fields
@ehanson8 commented on the snippet above:

Optional: since there's a lot of repetition between _parse_open_graph_meta_elements and _parse_dublin_core_meta_elements, you could consider a single method that takes a tag-list option and supports both an og and a dc option. But that's admittedly a more complicated signature, so that's why it's optional!
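For illustration, a sketch of what that consolidation might look like (the helper name, signature, and prefix handling here are assumptions, not code from this PR):

from bs4 import BeautifulSoup

def _parse_prefixed_meta_elements(
    soup: BeautifulSoup, attribute: str, prefix: str
) -> dict[str, str]:
    """Sketch: parse <meta> tags whose `attribute` value starts with `prefix`.

    e.g. attribute="property", prefix="og:" for OpenGraph,
         attribute="name", prefix="DC:" for Dublin Core.
    """
    fields: dict[str, str] = {}
    for tag in soup.find_all("meta"):
        name = tag.get(attribute) or ""
        content = (tag.get("content") or "").strip()
        if name.lower().startswith(prefix.lower()) and content != "":
            fields[name.replace(":", "_")] = content
    return fields

Callers would pass the attribute/prefix pair per strategy, which keeps both strategies in one place at the cost of the more complicated signature mentioned above.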

@ghukill (author) replied:

Yeah... I had kind of wondered / felt the same. My instinct is to leave as-is for now, with the anticipation that we'll be revisiting metadata extraction in this harvester quite a bit in the coming months.

Honestly, I'm beginning to wonder if this harvester should just produce JSONLines with some high-level metadata about the website (e.g. URL), but then include the full HTML for downstream systems like Transmog to parse metadata from. This might mean removing both of these methods.

As noted in the PR + git commit, and given the messy nature of the internet this harvester crawls, I have a strong feeling it was a misstep to have this kind of opinionated parsing in this harvester. It feels like it should be Transmogrifier extracting metadata, with per-source opinionation, from the raw HTML. If we think of the raw HTML in the same context as a large EAD or something, it's not as odd. It's decidedly not metadata like an EAD... but one could argue the HTML is a wrapper for the metadata that is somewhere in the page, and Transmog is responsible for parsing it.

In short: Transmog does and should have per-source opinionation, and this harvester likely should not. For this reason, while I think this temporary bridge is helpful, we might ultimately want to remove this kind of metadata parsing from the harvester.
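Purely for illustration, a record in that hypothetical "thin" shape might look like this (every field name is an assumption, not an agreed-upon schema):

# Illustrative only: a "thin" harvester record that defers metadata
# extraction to Transmogrifier; field names are assumptions.
thin_record = {
    "url": "https://libguides.mit.edu/directory",
    "crawl_name": "mitlibwebsite-staff-directory",
    "fulltext": "...",               # optional extracted text, as today
    "html": "<!DOCTYPE html>...",    # full captured HTML, parsed downstream
}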

@ghukill ghukill merged commit 2bc3958 into main Dec 2, 2025
4 checks passed