@ghukill ghukill commented Dec 10, 2025

Purpose and background context

Now that output records include the full, rendered HTML for downstream systems to parse and use, it would be helpful to also include the associated HTTP response headers, which provide additional information. These can sometimes indicate when the page was updated, point to alternate links, etc. They may well go unused, but it is better to err on the side of including them.

How this addresses that need:

  • During record generation, include the response headers as a new response_headers column.

The following is an example of response headers in a single output record:

{'accept-ranges': ['bytes'],
  'age': ['423908'],
  'cache-control': ['public, max-age=604800'],
  'content-length': ['123552'],
  'content-type': ['text/html; charset=UTF-8'],
  'date': ['Tue, 09 Dec 2025 14:24:39 GMT'],
  'link': ['<https://libraries.mit.edu/data-management/wp-json/>; rel="https://api.w.org/", <https://libraries.mit.edu/data-management/wp-json/wp/v2/pages/63>; rel="alternate"; title="JSON"; type="application/json", <https://libraries.mit.edu/data-management/?p=63>; rel=shortlink'],
  'permissions-policy': ['geolocation=(), microphone=(), camera=()'],
  'referrer-policy': ['no-referrer-when-downgrade'],
  'server': ['nginx'],
  'strict-transport-security': ['max-age=300'],
  'vary': ['Accept-Encoding, Cookie'],
  'via': ['1.1 varnish'],
  'x-cache': ['HIT'],
  'x-cache-hits': ['1'],
  'x-content-type-options': ['nosniff'],
  'x-frame-options': ['SAMEORIGIN'],
  'x-pantheon-styx-hostname': ['styx-fe3-b-5df8569779-qgzkq'],
  'x-pingback': ['https://libraries.mit.edu/data-management/xmlrpc.php'],
  'x-served-by': ['cache-chi-kigq8000172-CHI'],
  'x-styx-req-id': ['ce67c628-d12f-11f0-b8d0-b2d84bf0c087'],
  'x-timer': ['S1765290279.297512,VS0,VE4'],
  'x-orig-content-encoding': ['gzip']}
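
As a rough illustration of how a downstream consumer might use these values, the sketch below subtracts `age` from `date` to estimate when the cache stored the response. The helper name `estimate_cached_at` is hypothetical and not part of this PR, and the result is only a cache timestamp, not a guaranteed "last edited" time for the page.

```python
from __future__ import annotations

from datetime import datetime, timedelta
from email.utils import parsedate_to_datetime

# Subset of the example response headers above; every value is a list of strings.
response_headers = {
    "age": ["423908"],
    "date": ["Tue, 09 Dec 2025 14:24:39 GMT"],
}


def estimate_cached_at(headers: dict[str, list[str]]) -> datetime | None:
    """Hypothetical helper: response Date minus Age, or None if Date is missing."""
    date_values = headers.get("date")
    if not date_values:
        return None
    # Parse the HTTP date string into a timezone-aware datetime.
    served_at = parsedate_to_datetime(date_values[0])
    # Age is the number of seconds the response has spent in a cache.
    age_seconds = int(headers.get("age", ["0"])[0])
    return served_at - timedelta(seconds=age_seconds)


cached_at = estimate_cached_at(response_headers)
print(cached_at.isoformat())
```

For pages not served from a cache, `age` may be absent, in which case this sketch simply returns the parsed `date`.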

How can a reviewer manually see the effects of these changes?

Optionally, the instructions and crawl from a previously merged PR (PR 53) can be reused. That said, the output above is essentially what you'll see after following the instructions below.

1- Run the crawl as outlined in PR 53

2- Observe a single record:

import jsonlines

# Load all extracted records produced by the crawl
with jsonlines.open('output/crawls/collections/mitlibwebsite/mitlibwebsite-extracted-records-to-index.jsonl') as reader:
    records = list(reader)
record = records[0]

# Inspect the new response_headers column on the first record
record["response_headers"]
# dictionary output here...
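
If the alternate links are of interest, the `link` header can be split into individual entries. The following is a minimal sketch using only the standard library; `parse_link_header` is a hypothetical helper (not part of this PR), and the regex-based parsing is simplified and assumes no `<` appears inside a quoted parameter value.

```python
from __future__ import annotations

import re

# Matches one "<url>; params" entry; params run until the next "<" or end of string.
LINK_RE = re.compile(r"<([^>]+)>([^<]*)")
# Matches one "; name=value" parameter, where value is either quoted or bare.
PARAM_RE = re.compile(r';\s*([\w-]+)\s*=\s*(?:"([^"]*)"|([^;,\s]+))')


def parse_link_header(values: list[str]) -> list[dict[str, str]]:
    """Hypothetical helper: flatten Link header values into dicts with url + params."""
    links = []
    for value in values:
        for url, params in LINK_RE.findall(value):
            link = {"url": url}
            for name, quoted, bare in PARAM_RE.findall(params):
                link[name.lower()] = quoted or bare
            links.append(link)
    return links


# The "link" value from the example response headers in this PR.
link_values = [
    '<https://libraries.mit.edu/data-management/wp-json/>; rel="https://api.w.org/", '
    '<https://libraries.mit.edu/data-management/wp-json/wp/v2/pages/63>; '
    'rel="alternate"; title="JSON"; type="application/json", '
    "<https://libraries.mit.edu/data-management/?p=63>; rel=shortlink"
]

links = parse_link_header(link_values)
alternates = [link["url"] for link in links if link.get("rel") == "alternate"]
print(alternates)
```

In practice `link_values` would come from `record["response_headers"].get("link", [])`, since not every response is guaranteed to include the header.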

Includes new or updated dependencies?

NO

Changes expectations for external applications?

YES: Transmogrifier will have access to response headers if needed.

What are the relevant tickets?

  • https://mitlibraries.atlassian.net/browse/USE-272

Code review

  • Code review best practices are documented here and you are encouraged to have a constructive dialogue with your reviewers about their preferences and expectations.

Why these changes are being introduced:

Now that we include the full, rendered HTML in output records for downstream systems
to parse and use, it would be helpful to also have the associated HTTP response headers
which provide additional information.  Sometimes these can help indicate when the website
was updated, or alternate links, etc.  It may very well not get used, but it would be
beneficial to err on the side of including.

How this addresses that need:
* During record generation, include the response headers as new `response_headers`
column.

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/USE-272
@ghukill ghukill marked this pull request as ready for review December 10, 2025 18:54
@ghukill ghukill requested a review from a team as a code owner December 10, 2025 18:54
@ehanson8 ehanson8 left a comment

Works as expected and a great addition!

@ghukill ghukill merged commit 284cafe into main Dec 10, 2025
4 checks passed