
WIP: Support wpull as more effective alternative to wget #84

Draft
davidfraser wants to merge 6 commits into SimonMo88:master from inconceivableza:wpull


@davidfraser
Contributor

This is a work-in-progress effort, which I'd appreciate feedback on.

In my testing of ghost-static-site-generator on an out-of-the-box Ghost setup, I found that certain URLs (notably /about/) weren't retrieved because they were referred to using the production domain name.
This is tricky to correct using wget: it would take some complicated work to first identify the URLs that point at another domain and then retrieve them as well.
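
To illustrate the kind of extra work described above, here is a minimal sketch (not code from this PR) of how links that point at the production domain could be identified in a retrieved page and mapped back to the local instance. The host names, the `LinkCollector` class and the `production_links_as_local` helper are all hypothetical, and only the Python standard library is used.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit, urlunsplit

# Hypothetical hosts, for illustration only.
PRODUCTION_HOST = "www.example.com"
LOCAL_ORIGIN = "http://localhost:2368"


class LinkCollector(HTMLParser):
    """Collect href/src attribute values from an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def production_links_as_local(html, page_url):
    """Return local-origin URLs for links that point at the production host."""
    collector = LinkCollector()
    collector.feed(html)
    local = urlsplit(LOCAL_ORIGIN)
    found = []
    for link in collector.links:
        # Resolve relative links against the page they were found on.
        parts = urlsplit(urljoin(page_url, link))
        if parts.hostname == PRODUCTION_HOST:
            # Keep path and query, swap in the local scheme and host.
            rewritten = urlunsplit(
                (local.scheme, local.netloc, parts.path, parts.query, "")
            )
            if rewritten not in found:
                found.append(rewritten)
    return found
```

With wget, a second pass would then be needed to fetch each URL returned here, which is the extra step that makes this approach awkward.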

My solution here is to use wpull as an alternative to wget. It is a Python reimplementation of wget focused on mirroring,
and it supports plugins to adjust the process. It may not be for everybody, but it proved quite effective for this use case.

Note: The original wpull is no longer maintained, so I've been using the grab-site fork, which is in active development and has a recent release, 5.0.3. The original docs are generally still applicable.

In order to install wpull, I have the following commands in my Dockerfile, which enable building the package:

apk add --no-cache bash python3 py3-pip py3-pkgconfig python3-dev gcc musl-dev linux-headers pkgconfig libxml2-dev 
pip install --root-user-action=ignore --break-system-packages git+https://github.com/ArchiveTeam/ludios_wpull@5.0.3

It would obviously be better if a package were directly available.

In testing, I found that this is a little slower than wget (2 seconds instead of less than 1 to retrieve a site), but it retrieves the site more accurately: with fewer parameters it finds more of the valid files, and there's less need to add things to the list of items to retrieve.

It's much better to give it one invocation with multiple URLs.
This should also work better with different themes, as it finds things itself.
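
As a rough sketch of the single-invocation idea (not the code in this PR), the command line could be assembled once with every start URL appended, rather than spawning one process per URL. The `build_wpull_command` helper, the URLs and the output directory are hypothetical, and the wget-style flags shown are assumptions that should be checked against the installed wpull version.

```python
import shlex


def build_wpull_command(urls, output_dir):
    """Build one wpull argument list covering all start URLs."""
    if not urls:
        raise ValueError("at least one URL is required")
    command = [
        "wpull",
        "--recursive",         # follow links found in retrieved pages
        "--page-requisites",   # also fetch images, CSS and scripts
        "--directory-prefix",  # write output under this directory
        output_dir,
    ]
    # All start URLs go into the same invocation.
    command.extend(urls)
    return command


if __name__ == "__main__":
    # Hypothetical local Ghost URLs and output directory.
    cmd = build_wpull_command(
        ["http://localhost:2368/", "http://localhost:2368/about/"],
        "static",
    )
    print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run` as is; keeping it as a list avoids shell quoting problems with unusual URLs.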

(cherry picked from commit 0a50c3f)
(cherry picked from commit cbeac8f)
@davidfraser davidfraser marked this pull request as draft April 24, 2025 15:04
