
KeyError for 'content-location' and 'link' when trying to save non-HTML #65

@Kirkman

Description


I'm trying to see if I can integrate savepagenow into my election night scraping system. The idea would be to save online results files into the Wayback Machine when my system detects the results have changed.

Most of the URLs I want to save are CSV, JSON, or XML files. However, when I try to use savepagenow to save them, I often get error tracebacks like these:

Traceback (most recent call last):
  File "/Users/xxxx/.virtualenvs/yyyyy/zzzzzzz/savepagenow/api.py", line 99, in capture
    content_location = response.headers["Content-Location"]
  File "/Users/xxxx/.virtualenvs/yyyyy/zzzzzzz/requests/structures.py", line 52, in __getitem__
    return self._store[key.lower()][1]
KeyError: 'content-location'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/xxxx/.virtualenvs/yyyyy/zzzzzzz/savepagenow/api.py", line 105, in capture
    header_links = parse_header_links(response.headers["Link"])
  File "/Users/xxxx/.virtualenvs/yyyyy/zzzzzzz/requests/structures.py", line 52, in __getitem__
    return self._store[key.lower()][1]
KeyError: 'link'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "scrape-results.py", line 533, in <module>
    scrape(wks, headers, counties)
  File "scrape-results.py", line 375, in scrape
    archive_url, captured_flag = savepagenow.capture_or_cache(url, authenticate=True, user_agent="savepagenow (https://stltoday.com)")
  File "/Users/xxxx/.virtualenvs/yyyyy/zzzzzzz/savepagenow/api.py", line 148, in capture_or_cache
    capture(
  File "/Users/xxxx/.virtualenvs/yyyyy/zzzzzzz/savepagenow/api.py", line 110, in capture
    raise WaybackRuntimeError(
savepagenow.exceptions.WaybackRuntimeError: {'status_code': 200, 'headers': {'Server': 'nginx', 'Date': 'Wed, 31 Jul 2024 17:00:50 GMT', 'Content-Type': 'text/html', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'server-timing': 'captures_list;dur=0.433099, exclusion.robots;dur=0.030207, exclusion.robots.policy;dur=0.022052, esindex;dur=0.008905, cdx.remote;dur=7.628953, LoadShardBlock;dur=296.977015, PetaboxLoader3.datanode;dur=255.918508, load_resource;dur=8.860866, MISS', 'x-app-server': 'wwwb-app204', 'x-ts': '200', 'x-tr': '367', 'X-location': 'All', 'X-RL': '1', 'X-NA': '0', 'X-Page-Cache': 'MISS', 'X-NID': '-', 'Referrer-Policy': 'no-referrer-when-downgrade', 'Permissions-Policy': 'interest-cohort=()', 'Content-Encoding': 'gzip'}}
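For what it's worth, the chained KeyErrors seem to come from indexing `response.headers` directly (the traceback points at `response.headers["Content-Location"]` and `response.headers["Link"]` in api.py), so a `.get()` lookup with a fallback would avoid raising when the Wayback response omits those headers. A minimal illustration, using a plain dict to stand in for `response.headers` (requests uses a case-insensitive mapping, but the lookup semantics are the same):

```python
# Plain dict standing in for response.headers; note that neither
# Content-Location nor Link is present, matching the failing responses.
headers = {"Content-Type": "text/html", "Server": "nginx"}

# Direct indexing raises KeyError when the header is absent:
try:
    location = headers["Content-Location"]
except KeyError:
    location = None

# .get() expresses the same fallback without an exception:
location = headers.get("Content-Location")
assert location is None
```

That's just an observation from the traceback, of course; I don't know whether the missing headers are the root cause or a symptom of something else on the Wayback side.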

It's very odd. Occasionally the requests work, but most of the time they error out with this same sequence. You may be able to reproduce it with any or all of these three command-line examples:

savepagenow https://www.livevoterturnout.com/ENR/stcharlesmoenr/28/summary_28.xml
savepagenow https://extcontent.stlouisco.com/BOE/eResults/media/media.csv
savepagenow https://travisenr.blob.core.usgovcloudapi.net/prod/Current_02.json

Anyway, is this just me? Am I doing something wrong?
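In the meantime, since the captures do succeed occasionally, I've been wrapping the call in a simple retry helper on my end. A rough sketch (the helper name, `tries`, and `delay` are my own, not part of savepagenow; in my scraper I pass `savepagenow.capture_or_cache` as `capture_fn`):

```python
import time

def capture_with_retry(capture_fn, url, tries=3, delay=0.0):
    """Call capture_fn(url), retrying on any exception up to `tries` times.

    Re-raises the last exception if every attempt fails. In practice the
    exception caught here is savepagenow's WaybackRuntimeError.
    """
    last_exc = None
    for attempt in range(tries):
        try:
            return capture_fn(url)
        except Exception as exc:
            last_exc = exc
            if delay:
                time.sleep(delay)
    raise last_exc
```

It papers over the problem rather than fixing it, but it gets more of my results files archived on election night.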
