USE 258 - Rework harvester to return "records" vs "metadata records" #53
Conversation
To whomever @MITLibraries/dataeng reviews this: given the extent of the file renaming, the diff may be hard to follow, so I'm adding a couple of inline comments at places of importance.
```python
# add base64 encoded full HTML
html_content = wacz_client.get_website_content(
    str(row.filename),
    str(row.offset),
    decode=False,
)
record["html_base64"] = base64.b64encode(html_content).decode()  # type: ignore[arg-type]
```
This is where we base64 encode and store the page HTML in the output record. This replaces a considerable amount of logic and complexity where we parsed metadata tags from the HTML.
ehanson8 left a comment:
Works as expected and this is a smart and logical shift of the metadata parsing out of this repo. Great work!
Thanks @ehanson8! I felt good about this fairly substantial shift; glad you do too.
Purpose and background context
NOTE: The following came directly from the git commit.
Why these changes are being introduced:
A decision was made to pivot the output of this harvester from "metadata records" to just "records". Due to the highly varying structure of websites, any attempt to parse metadata about a website -- e.g. using OpenGraph or Dublin Core tags in the head section -- was inherently opinionated towards the particular sites being crawled. This was leaning into an unsustainable pattern: whenever we encountered new websites with metadata in different places, we'd have to hardcode that metadata extraction into the harvester so that it would be available downstream.
Instead of the harvester returning metadata records about the websites captured in the crawl, the decision has the harvester returning the full HTML for each website as part of the "record" that is written as output. This HTML can then be used by downstream applications like Transmogrifier -- which are designed to be opinionated about a particular source -- to extract metadata.
The implicit proposal here is that parsing metadata from messy HTML is different than parsing metadata from structured data like an EAD or METS file, but it's still the work of extracting TIMDEX metadata from some kind of source data.
How this addresses that need:
There are three major changes here:
Change "metadata" language to "records" through the codebase. This is needed to convey that we aren't really creating "metadata" records per se, but instead returning structured "records" with the HTML content of each website. The harvester is still doing important work on top of the raw web crawl, just not parsing metadata.
Include the full HTML of the website in the output record. Because HTML can have all kinds of problematic characters, we base64 ASCII encode it, and store in a field called
base64_html.We remove any CLI, class, or method arguments around including fulltext or extracting keywords. Given that we are now including the full HTML in the output, which contains the fulltext of the website in a form we have more control over parsing, we don't need the harvester to extract fulltext as part of the metadatda record. This also fully completes the removal of keyword extraction which were somewhat experimental and will likely be improved upon by embeddings.
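For illustration, a minimal sketch of the record shape this produces. Only the `html_base64` field comes from the code in this PR; the `url` field and the literal values are assumptions for the example.

```python
import base64

# Hypothetical example of an output record: light metadata about the site
# plus the full rendered HTML, base64-encoded so it is ASCII-safe.
html = b"<html><head><title>Example</title></head><body>...</body></html>"

record = {
    "url": "https://libraries.mit.edu",  # illustrative; real values come from the crawl
    "html_base64": base64.b64encode(html).decode(),
}
```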
90% of the code churn is the renaming of "metadata" --> "record", which is probably a bit difficult to check line-by-line, but an important change, I think.
How can a reviewer manually see the effects of these changes?
1- Create configuration YAML locally at `output/configs/mitlibwebsite.yaml`:
2- Update local docker image:
3- Run harvest:

Note some changes in the CLI args: `--metadata-output-file` --> `--records-output-file`

4- Analyze the results located at `output/crawls/collections/mitlibwebsite/mitlibwebsite-extracted-records-to-index.jsonl`

Depending on your preferred way to view the output, the JSONLines file should now have a new `html_base64` ASCII field per row. And you'll notice the absence of former columns like `og_title`, `dc_title`, etc.; this is the "metadata" that has been removed.

The encoded HTML can be decoded back into the original HTML with some python like this (which is what Transmogrifier will do):
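A minimal sketch, assuming the `html_base64` field and the output path from step 4 (the exact snippet Transmogrifier uses may differ):

```python
import base64
import json

# Read the first record from the harvest output and decode its HTML.
path = "output/crawls/collections/mitlibwebsite/mitlibwebsite-extracted-records-to-index.jsonl"

with open(path) as f:
    record = json.loads(f.readline())

# Reverse the base64 ASCII encoding to recover the original rendered HTML.
html = base64.b64decode(record["html_base64"]).decode()
print(html[:500])
```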
Includes new or updated dependencies?
YES
Changes expectations for external applications?
YES: The browsertrix-harvester now returns "records" vs "metadata records" as its primary output for each run. The records contain some light metadata about the website itself, e.g. the URL, but an equally important column is the full, rendered HTML of the page. It is expected that downstream systems like Transmogrifier will be responsible for extracting metadata in a more focused and opinionated fashion.
What are the relevant tickets?
* https://mitlibraries.atlassian.net/browse/USE-258