Most of the URLs I want to save are CSVs, JSON, or XML files. However, I often find that when I try to use savepagenow to save them, I get error tracebacks like this:
Traceback (most recent call last):
  File "/Users/xxxx/.virtualenvs/yyyyy/zzzzzzz/savepagenow/api.py", line 99, in capture
    content_location = response.headers["Content-Location"]
  File "/Users/xxxx/.virtualenvs/yyyyy/zzzzzzz/requests/structures.py", line 52, in __getitem__
    return self._store[key.lower()][1]
KeyError: 'content-location'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/xxxx/.virtualenvs/yyyyy/zzzzzzz/savepagenow/api.py", line 105, in capture
    header_links = parse_header_links(response.headers["Link"])
  File "/Users/xxxx/.virtualenvs/yyyyy/zzzzzzz/requests/structures.py", line 52, in __getitem__
    return self._store[key.lower()][1]
KeyError: 'link'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "scrape-results.py", line 533, in <module>
    scrape(wks, headers, counties)
  File "scrape-results.py", line 375, in scrape
    archive_url, captured_flag = savepagenow.capture_or_cache(url, authenticate=True, user_agent="savepagenow (https://stltoday.com)")
  File "/Users/xxxx/.virtualenvs/yyyyy/zzzzzzz/savepagenow/api.py", line 148, in capture_or_cache
    capture(
  File "/Users/xxxx/.virtualenvs/yyyyy/zzzzzzz/savepagenow/api.py", line 110, in capture
    raise WaybackRuntimeError(
savepagenow.exceptions.WaybackRuntimeError: {'status_code': 200, 'headers': {'Server': 'nginx', 'Date': 'Wed, 31 Jul 2024 17:00:50 GMT', 'Content-Type': 'text/html', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'server-timing': 'captures_list;dur=0.433099, exclusion.robots;dur=0.030207, exclusion.robots.policy;dur=0.022052, esindex;dur=0.008905, cdx.remote;dur=7.628953, LoadShardBlock;dur=296.977015, PetaboxLoader3.datanode;dur=255.918508, load_resource;dur=8.860866, MISS', 'x-app-server': 'wwwb-app204', 'x-ts': '200', 'x-tr': '367', 'X-location': 'All', 'X-RL': '1', 'X-NA': '0', 'X-Page-Cache': 'MISS', 'X-NID': '-', 'Referrer-Policy': 'no-referrer-when-downgrade', 'Permissions-Policy': 'interest-cohort=()', 'Content-Encoding': 'gzip'}}
It's very odd. Occasionally the requests work, but most times they error out with this same sequence. From the traceback, it looks like capture() first checks for a Content-Location response header, then falls back to parsing a Link header, and raises WaybackRuntimeError when a 200 response has neither. You may be able to reproduce with any or all of these three command-line examples:
savepagenow https://www.livevoterturnout.com/ENR/stcharlesmoenr/28/summary_28.xml
savepagenow https://extcontent.stlouisco.com/BOE/eResults/media/media.csv
savepagenow https://travisenr.blob.core.usgovcloudapi.net/prod/Current_02.json
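As a stopgap, since the failures are intermittent, I've been wrapping the call in a small retry helper. This is just a sketch of my own (the retry() helper is not part of savepagenow; only capture_or_cache and WaybackRuntimeError come from the library, as shown in the traceback above):

```python
import time

def retry(func, attempts=3, delay=5.0, exceptions=(Exception,)):
    """Call func(), retrying on the given exception types.

    Uses a simple linear backoff (0s, 5s, 10s, ...); the attempts and
    delay values are arbitrary picks for testing, not recommendations.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return func()
        except exceptions as e:
            last_error = e
            time.sleep(delay * attempt)  # no wait before the first retry
    raise last_error

# Roughly how I call it from my scraper:
# archive_url, captured = retry(
#     lambda: savepagenow.capture_or_cache(url, authenticate=True),
#     exceptions=(savepagenow.exceptions.WaybackRuntimeError,),
# )
```

That papers over the problem some of the time, but plenty of URLs still fail on every attempt.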
(For context: I'm trying to see if I can integrate savepagenow into my election night scraping system. The idea would be to save online results files into the Wayback Machine when my system detects the results have changed.)
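For reference, the change-detection side of my system is roughly this shape (all names here are illustrative, not my actual code):

```python
import hashlib
import urllib.request

# url -> fingerprint from the previous poll
last_seen = {}

def content_fingerprint(url):
    """Fetch a results file and hash its raw bytes."""
    with urllib.request.urlopen(url) as resp:
        return hashlib.sha256(resp.read()).hexdigest()

def should_capture(url, fingerprint):
    """True when the file's contents changed since the last poll."""
    changed = last_seen.get(url) != fingerprint
    last_seen[url] = fingerprint
    return changed
```

When should_capture() returns True for a URL, that's the moment I'd hand it to savepagenow.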
Anyway, is this just me? Am I doing something wrong?