Description
Problem
Follow-up to #8545. Currently there are 60 books in OL that have Wikisource IDs. IDs are formatted as langcode:title (e.g. en:George_Bernard_Shaw). Import Wikisource works into Open Library.
https://github.com/internetarchive/openlibrary/wiki/Developer's-Guide-to-Data-Importing
Breakdown
- @pidgezero-one Create a script which implements proposal below to get WikiSource data and coerce the data into Open Library's import format
- @cdrini Verify a sample of ~10 of the resulting books
- Manually verify the extracted books
- Both: Run bulk import
Proposal & Constraints
Hit English Wikisource's API and paginate through result sets of hits that fall under the "Validated texts" category: https://en.wikisource.org/w/api.php?action=query&generator=categorymembers&gcmtitle=Category:Validated_texts&gcmlimit=500&prop=categories|info|revisions&rvprop=content&rvslots=main&format=json&cllimit=max
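A minimal sketch of the pagination loop, assuming the standard MediaWiki continuation scheme (each response may carry a `continue` object whose keys are merged into the next request). The function names `fetch_json` and `iter_category_pages` and the User-Agent string are hypothetical. Because `prop` data can be split across batches, the same page may appear in more than one response with partial fields, so a real script would probably merge by page ID.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikisource.org/w/api.php"

# Same parameters as the URL in the proposal above.
BASE_PARAMS = {
    "action": "query",
    "generator": "categorymembers",
    "gcmtitle": "Category:Validated_texts",
    "gcmlimit": "500",
    "prop": "categories|info|revisions",
    "rvprop": "content",
    "rvslots": "main",
    "cllimit": "max",
    "format": "json",
}


def fetch_json(params):
    """Perform one API request and decode the JSON body."""
    url = API_URL + "?" + urllib.parse.urlencode(params)
    # Hypothetical User-Agent; replace with a real contact string.
    request = urllib.request.Request(
        url, headers={"User-Agent": "OpenLibraryWikisourceImport/0.1 (sketch)"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def iter_category_pages(fetch=fetch_json, base_params=BASE_PARAMS):
    """Yield page dicts from every result set until no `continue` is returned."""
    params = dict(base_params)
    while True:
        data = fetch(params)
        yield from data.get("query", {}).get("pages", {}).values()
        if "continue" not in data:
            break
        # Merge the continuation tokens into a fresh copy of the base params.
        params = {**base_params, **data["continue"]}
```

Passing `fetch` as a parameter keeps the loop testable without network access.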
The response includes documents that aren't books, and books are not flagged with a distinct category. We may also have to browse Wikisource's API and manually draft a list of categories whose members we should ignore, such as Subpages (individual chapters of books), Posters, Songs, etc.
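One way the ignore list could be applied is sketched below. It assumes each page dict carries a `categories` list of `{"title": "Category:..."}` entries, as returned by `prop=categories`. The set `IGNORED_CATEGORIES` is only a hypothetical starting point taken from the examples above, and `is_probable_book` is an illustrative name.

```python
# Hypothetical starting list; the real one would be drafted by hand.
IGNORED_CATEGORIES = {"Subpages", "Posters", "Songs"}


def category_names(page):
    """Return the page's category names without the 'Category:' prefix."""
    return {
        entry["title"].split(":", 1)[-1]
        for entry in page.get("categories", [])
    }


def is_probable_book(page, ignored=IGNORED_CATEGORIES):
    """Keep a page only if none of its categories is on the ignore list."""
    return category_names(page).isdisjoint(ignored)
```

A page with no `categories` key passes the filter, so pages whose category data arrived in a later batch are not dropped by mistake.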
The response includes most of the work's metadata (author, year, etc.) as wiki markup in the page's infobox. Consider using a library like wptools to parse it.
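To illustrate what the extraction involves, here is a deliberately naive, standard-library-only sketch. It assumes the infobox is a `{{header ...}}` template with one `| key = value` field per line, which will not hold for every page; nested templates, multi-line values, and single-line headers are not handled, which is the reason a dedicated wikitext parser is the better choice. The name `parse_header_fields` is hypothetical. With `rvslots=main`, the wikitext is expected under `revisions[0]["slots"]["main"]["*"]` in the JSON response.

```python
import re

# Matches lines such as " | author = Jane Doe".
FIELD_RE = re.compile(r"^\s*\|\s*([\w-]+)\s*=\s*(.*?)\s*$")


def parse_header_fields(wikitext, template="header"):
    """Collect flat key/value fields from the first matching template."""
    fields = {}
    inside = False
    for line in wikitext.splitlines():
        stripped = line.strip()
        if not inside:
            # Look for the line that opens the template.
            if stripped.lower().startswith("{{" + template):
                inside = True
            continue
        if stripped.startswith("}}"):
            break  # End of the template.
        match = FIELD_RE.match(line)
        if match:
            fields[match.group(1)] = match.group(2)
    return fields
```

Values are returned as raw markup, so links like `[[Author:Jane Doe|Jane Doe]]` would still need cleaning before import.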
In the future, we will want to expand this import to support other languages besides en.wikisource.org, and perhaps expand beyond the Validated texts category, so the solution to this should be extensible.
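One possible shape for that extensibility, as a sketch only: keep the language code and the category in a small config object so that adding a language or category is a data change. `WikisourceSource` and `SOURCES` are hypothetical names, and category titles differ per language edition, so each entry would need to be looked up individually.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WikisourceSource:
    """One language edition plus the category to import from (hypothetical)."""

    langcode: str
    category: str

    @property
    def api_url(self):
        return f"https://{self.langcode}.wikisource.org/w/api.php"

    def query_params(self):
        """Request parameters matching the query in the proposal."""
        return {
            "action": "query",
            "generator": "categorymembers",
            "gcmtitle": self.category,
            "gcmlimit": "500",
            "prop": "categories|info|revisions",
            "rvprop": "content",
            "rvslots": "main",
            "cllimit": "max",
            "format": "json",
        }


# Only English "Validated texts" for now; more entries can be appended later.
SOURCES = [WikisourceSource("en", "Category:Validated_texts")]
```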
A potential downside of how we derive Wikisource IDs is that we use the page title, which is modifiable, instead of the page's canonical ID. That leaves us at the mercy of Wikisource works being moved or renamed. This will likely be rare, if it happens at all, but if we ever decide to use canonical IDs instead, any Wikisource item can be obtained with curid. Example: https://en.wikisource.org/?curid=4496925 and https://en.wikisource.org/wiki/%22Red%22_Fed._Memoirs are the same page. (This is also an example of a page whose title may need to be URL-encoded in outbound links.)
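The ID and URL handling described above could look roughly like this. The helper names are hypothetical; the sketch assumes IDs follow the langcode:title format with spaces written as underscores, and uses `urllib.parse.quote` to percent-encode the title for outbound links.

```python
from urllib.parse import quote


def wikisource_id(langcode, title):
    """Build a langcode:title ID, using underscores in place of spaces."""
    return f"{langcode}:{title.replace(' ', '_')}"


def page_url(ws_id):
    """Build an outbound link from a langcode:title ID, encoding the title."""
    langcode, title = ws_id.split(":", 1)
    return f"https://{langcode}.wikisource.org/wiki/{quote(title, safe='/:')}"


def curid_url(langcode, pageid):
    """Build a link from the page's canonical numeric ID instead of its title."""
    return f"https://{langcode}.wikisource.org/?curid={pageid}"
```

For example, `page_url('en:"Red"_Fed._Memoirs')` yields `https://en.wikisource.org/wiki/%22Red%22_Fed._Memoirs`, and `curid_url("en", 4496925)` yields `https://en.wikisource.org/?curid=4496925`.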