USE 240 - prep work for staff directory in mitlibwebsite source #52
Conversation
Why these changes are being introduced:

One unique function of this harvester is to extract metadata about the crawled website from the raw HTML. The first pass extracted OpenGraph `<meta>` tag values, which were present on all the Wordpress pages we crawled. Now we find ourselves expanding the types of sites we crawl a bit and need to parse Dublin Core `<meta>` tags. While these dedicated sub-methods help encapsulate the logic for OpenGraph and Dublin Core `<meta>` tag parsing, it would be an unsustainable pattern to keep adding sub-methods each time we encounter new information we want to extract from the captured HTML. Noting that we may encounter a future where this harvester should return the full, raw HTML as part of the record, letting more opinionated, downstream contexts (e.g. Transmogrifier) extract metadata. For now, this very generic `head > meta` tag parsing seems reasonable, but we should keep an eye on this.

How this addresses that need:

Refactors metadata parsing to dedicated sub-methods, porting the pre-existing OpenGraph logic and adding new Dublin Core logic.

Side effects of this change:

* All records will now have metadata columns for Dublin Core tags, but they may be NULL if those fields are not present.

Relevant ticket(s):

* https://mitlibraries.atlassian.net/browse/USE-240
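As a rough sketch of the sub-method pattern described above, a dedicated Dublin Core parser might look something like the following. This is a minimal sketch assuming BeautifulSoup; aside from the names visible in the diff fragment quoted in the review below, the method signature and logic are assumptions, not the harvester's actual code.

```python
from bs4 import BeautifulSoup


def _parse_dublin_core_meta_elements(html: str) -> dict:
    """Sketch: pull <meta name="DC..."> values into a flat fields dict."""
    fields: dict = {}
    soup = BeautifulSoup(html, "html.parser")
    for element in soup.find_all("meta"):
        dc_tag_name = element.get("name") or ""
        if not dc_tag_name.lower().startswith("dc"):
            continue  # not a Dublin Core <meta> tag
        content_stripped = (element.get("content") or "").strip()
        if content_stripped != "":
            # e.g. "DC:Title" -> "DC_Title", friendlier as a record column name
            dc_tag_name_friendly = dc_tag_name.replace(":", "_")
            fields[dc_tag_name_friendly] = content_stripped
    return fields
```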
ehanson8 left a comment
Works as expected, one optional suggestion to consider but non-blocking!
```python
        if content_stripped != "":
            dc_tag_name_friendly = dc_tag_name.replace(":", "_")
            fields[dc_tag_name_friendly] = content_stripped
    return fields
```
Optional: since there's a lot of repetition between `_parse_open_graph_meta_elements` and `_parse_dublin_core_meta_elements`, you could consider a single method that takes a tag-type option with both an `og` and a `dc` choice. But that's admittedly a more complicated signature, so that's why it's optional!
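Something like the following is roughly what's being suggested; this is a minimal sketch, and the names and structure are hypothetical rather than taken from this PR:

```python
from bs4 import BeautifulSoup


def _parse_meta_elements(html: str, tag_type: str) -> dict:
    """Sketch of a single parameterized parser; tag_type is "og" or "dc"."""
    # OpenGraph tags use the "property" attribute, Dublin Core uses "name"
    attribute = {"og": "property", "dc": "name"}[tag_type]
    fields: dict = {}
    soup = BeautifulSoup(html, "html.parser")
    for element in soup.find_all("meta"):
        tag_name = element.get(attribute) or ""
        if not tag_name.lower().startswith(tag_type):
            continue
        content_stripped = (element.get("content") or "").strip()
        if content_stripped != "":
            fields[tag_name.replace(":", "_")] = content_stripped
    return fields
```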
Yeah... I had kind of wondered / felt the same. My instinct is to leave this as-is for now, with the anticipation that we'll be revisiting metadata extraction in this harvester quite a bit in the coming months.

Honestly, I'm beginning to wonder if this harvester should just produce JSONLines with some high-level metadata about the website (e.g. URL), but then include the full HTML for downstream systems like Transmog to parse metadata from. This might mean removing both of these methods; a sketch of that record shape follows below.

As noted in the PR and git commit, and given the nature of the messy internet this harvester crawls, I have a strong feeling it was a misstep to have this kind of opinionated parsing in this harvester. It feels like it should be Transmogrifier that extracts metadata from the raw HTML, with per-source opinionation. If we think of the raw HTML in the same context as a large EAD or something, it's not as odd. The HTML is decidedly not metadata like an EAD... but one could argue it is a wrapper for the metadata that is somewhere in the page, and Transmog is responsible for parsing it.

In short: Transmog does and should have per-source opinionation; this harvester likely should not. For this reason, while I think this temporary bridge is helpful, we might ultimately want to remove this kind of metadata parsing from the harvester.
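To make that idea concrete, each JSONLines record might look something like this (purely illustrative; not a settled schema):

```json
{"url": "https://libguides.mit.edu/...", "html": "<!DOCTYPE html><html>...</html>"}
```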
Purpose and background context
This PR performs a bit of prep work to add staff directory pages to the scope of the `mitlibwebsite` TIMDEX source crawl. The most meaningful update is the ability to parse metadata from `<meta name="DC:*" content="..."/>` tags in the captured HTML. This builds on the previous behavior of only looking for `<meta property="og_*">...</meta>` tags. We now effectively have two built-in "strategies" for parsing metadata from the captured HTML. Note the paragraph in this PR's git commit, quoted above, that acknowledges issues with continuing this pattern.
How can a reviewer manually see the effects of these changes?
1- Build an updated docker image:
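A sketch of what this could look like; the image tag here is an assumption, not necessarily what this repo uses:

```bash
# image tag is an assumption
docker build -t browsertrix-harvester .
```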
2- Create the directory `output/crawls/configs` if it does not exist, and add a test YAML file there at `output/crawls/configs/mitlibwebsite.yaml` (this will get mounted into the Docker container):
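A minimal sketch of such a config; only the `seeds` section is grounded in this PR, and the key structure assumes the standard browsertrix crawler config format:

```yaml
seeds:
  - url: https://libguides.mit.edu/
```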
3- Run a harvest:
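A hypothetical invocation; the image name, subcommand, and flags here are assumptions rather than the harvester's documented CLI:

```bash
docker run -it --rm \
  -v "$(pwd)/output/crawls:/crawls" \
  browsertrix-harvester harvest \
  --config-yaml-file /crawls/configs/mitlibwebsite.yaml \
  --collection mitlibwebsite-staff-directory
```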
The `seeds` section is new to this YAML, which is otherwise what currently powers `mitlibwebsite` with the help of multiple `--sitemap` CLI arguments. The point of running this crawl here is not to get into that YAML configuration so much as to see how URLs crawled from `libguides.mit.edu` are now picking up Dublin Core metadata.

4- Analyze the resulting metadata records file at
`output/crawls/collections/mitlibwebsite-staff-directory/mitlibwebsite-staff-directory-extracted-records-to-index.jsonl`.

You should see values for properties `DC.Title`, `DC.Description`, etc., which would not have shown up prior to the changes in this PR. Also note that we still see values for `og_title`, `og_description`, etc. This duplication is known and okay; it's the responsibility of Transmog to decide which values to use.
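For example, a single record might carry both sets of properties (values and exact field names illustrative only):

```json
{"url": "https://libguides.mit.edu/...", "og_title": "...", "og_description": "...", "DC.Title": "...", "DC.Description": "..."}
```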
Includes new or updated dependencies?
YES
Changes expectations for external applications?
YES
What are the relevant tickets?
* https://mitlibraries.atlassian.net/browse/USE-240
Code review