Import Wikisource trusted book provider data #9671

@pidgezero-one

Problem

Follow-up to #8545. Currently there are 60 books in Open Library (OL) that have Wikisource IDs. IDs are formatted as langcode:title (e.g. en:George_Bernard_Shaw). The goal of this issue is to import Wikisource works into Open Library.

https://github.com/internetarchive/openlibrary/wiki/Developer's-Guide-to-Data-Importing

Breakdown

  • @pidgezero-one: Create a script that implements the proposal below to fetch Wikisource data and coerce it into Open Library's import format
  • @cdrini: Verify a sample of ~10 of the resulting books
  • Manually verify the extracted books
  • Both: Run bulk import

Proposal & Constraints

Query English Wikisource's API and paginate through the result sets of pages that fall under the "Validated texts" category: https://en.wikisource.org/w/api.php?action=query&generator=categorymembers&gcmtitle=Category:Validated_texts&gcmlimit=500&prop=categories|info|revisions&rvprop=content&rvslots=main&format=json&cllimit=max
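A minimal sketch of the pagination loop, assuming only the standard library. The query parameters are copied from the URL above; `fetch_json`, `iter_batches`, and the User-Agent string are hypothetical names for illustration. The loop relies on the MediaWiki convention of merging the response's `continue` object into the next request.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Iterator

API_URL = "https://en.wikisource.org/w/api.php"

# Parameters taken from the query URL in the proposal.
BASE_PARAMS = {
    "action": "query",
    "generator": "categorymembers",
    "gcmtitle": "Category:Validated_texts",
    "gcmlimit": "500",
    "prop": "categories|info|revisions",
    "rvprop": "content",
    "rvslots": "main",
    "cllimit": "max",
    "format": "json",
}


def fetch_json(params: dict) -> dict:
    """Perform one GET request against the API and decode the JSON body."""
    url = API_URL + "?" + urllib.parse.urlencode(params)
    # Placeholder User-Agent (an assumption); replace with real contact details.
    request = urllib.request.Request(
        url, headers={"User-Agent": "ol-wikisource-import-sketch/0.1"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def iter_batches(fetch: Callable[[dict], dict] = fetch_json) -> Iterator[dict]:
    """Yield the `pages` mapping of every response until no `continue` is left."""
    continuation: dict = {}
    while True:
        data = fetch({**BASE_PARAMS, **continuation})
        yield data.get("query", {}).get("pages", {})
        if "continue" not in data:
            break
        # Merge every continuation key (e.g. gcmcontinue) into the next request.
        continuation = data["continue"]
```

Because `prop` modules continue independently of the generator, the same page ID can show up in more than one batch with partial data, so a caller would probably need to merge batches by page ID before parsing.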

The response includes documents that aren't books, and books are not flagged with a distinct category. We may also have to browse Wikisource's API and manually draft a list of categories whose members we should ignore, such as Subpages (individual chapters of books), Posters, Songs, etc.
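One way the ignore list could be applied is sketched below. The category titles in `IGNORED_CATEGORIES` are assumptions based on the examples named above and would need to be checked against the real category names on Wikisource; `is_probable_book` is a hypothetical helper name.

```python
# Assumed category titles; verify against Wikisource before relying on them.
IGNORED_CATEGORIES = frozenset(
    {"Category:Subpages", "Category:Posters", "Category:Songs"}
)


def is_probable_book(page: dict, ignored: frozenset = IGNORED_CATEGORIES) -> bool:
    """Return True when none of the page's categories is on the ignore list.

    `page` is one entry of the API's `query.pages` mapping, where each
    category appears as a dict with a "title" key.
    """
    titles = {category["title"] for category in page.get("categories", [])}
    return titles.isdisjoint(ignored)
```

A page with no `categories` key passes the filter, which errs on the side of keeping items for manual review rather than dropping them silently.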

The response includes most of the work's metadata (author, year, etc.) as wiki markup in the page's infobox. Consider using a library like wptools to parse it.
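To show what extracting the infobox fields involves, here is a dependency-free sketch. It is not wptools; it is a small hand-rolled parser swapped in so the example runs with the standard library only, and a dedicated wikitext parser will be more robust. The template name `header` and its `title`/`author`/`year` parameters are assumptions about how English Wikisource pages are laid out, and the sample text is made up.

```python
import re
from typing import Dict, Optional


def extract_template(wikitext: str, name: str) -> Optional[str]:
    """Return the text inside the first {{name ...}} template, or None."""
    match = re.search(
        r"\{\{\s*" + re.escape(name) + r"\s*(?=[|}])", wikitext, re.IGNORECASE
    )
    if match is None:
        return None
    depth, i = 0, match.start()
    while i < len(wikitext):
        pair = wikitext[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}":
            depth -= 1
            i += 2
            if depth == 0:
                return wikitext[match.end():i - 2]
        else:
            i += 1
    return None  # unbalanced braces


def split_params(inner: str) -> Dict[str, str]:
    """Split template text on top-level pipes and return the named parameters."""
    parts, buffer, depth, i = [], [], 0, 0
    while i < len(inner):
        pair = inner[i:i + 2]
        if pair in ("{{", "[["):
            depth += 1
            buffer.append(pair)
            i += 2
        elif pair in ("}}", "]]"):
            depth -= 1
            buffer.append(pair)
            i += 2
        elif inner[i] == "|" and depth == 0:
            parts.append("".join(buffer))
            buffer = []
            i += 1
        else:
            buffer.append(inner[i])
            i += 1
    parts.append("".join(buffer))
    params = {}
    for part in parts:
        key, separator, value = part.partition("=")
        if separator:  # positional (unnamed) parameters are skipped
            params[key.strip().lower()] = value.strip()
    return params


def parse_header(wikitext: str) -> Dict[str, str]:
    """Return the named parameters of the page's header template, if any."""
    inner = extract_template(wikitext, "header")
    return split_params(inner) if inner is not None else {}
```

Values come back as raw wikitext (links such as `[[Author:...|...]]` are left intact), so a later step would still have to strip markup before building an import record.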

In the future, we will want to expand this import to support Wikisource sites in languages other than en.wikisource.org, and perhaps expand beyond the Validated texts category, so the solution should be extensible.
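One possible shape for that extensibility is to keep everything language-specific in a small config object, sketched below. `WikisourceSource` and its fields are hypothetical names, and the only real values are the English ones from the proposal; category names for other languages are not known here and would need to be looked up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WikisourceSource:
    """Language-specific settings for one Wikisource site (hypothetical)."""

    langcode: str
    category: str
    ignored_categories: frozenset = frozenset()

    @property
    def api_url(self) -> str:
        return f"https://{self.langcode}.wikisource.org/w/api.php"

    def query_params(self) -> dict:
        """Build the category-members query for this site."""
        return {
            "action": "query",
            "generator": "categorymembers",
            "gcmtitle": self.category,
            "gcmlimit": "500",
            "prop": "categories|info|revisions",
            "rvprop": "content",
            "rvslots": "main",
            "cllimit": "max",
            "format": "json",
        }


# The only source described in this issue.
ENGLISH = WikisourceSource(langcode="en", category="Category:Validated_texts")
```

Adding a language or a second category would then mean adding another `WikisourceSource` instance rather than changing the fetching code.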

A potential downside to how we derive Wikisource IDs is that we use the page title, which is modifiable, instead of the page's canonical ID. That leaves us at the mercy of Wikisource's works being moved or renamed. This will likely be rare, if it happens at all, but if we ever decide to use canonical IDs instead, any Wikisource page can be reached by its curid. Example: https://en.wikisource.org/?curid=4496925 and https://en.wikisource.org/wiki/%22Red%22_Fed._Memoirs are the same page. (This is also an example of a page whose title may need to be URL-encoded in outbound links.)
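The two ways of addressing a page can be sketched as follows, using the langcode:title format from the Problem section and the example page above. The function names are hypothetical, and splitting on the first colon assumes the language code itself never contains one.

```python
from typing import Tuple
from urllib.parse import quote


def parse_wikisource_id(ws_id: str) -> Tuple[str, str]:
    """Split 'langcode:title' on the first colon; the title may contain colons."""
    langcode, separator, title = ws_id.partition(":")
    if not separator or not langcode or not title:
        raise ValueError(f"not a langcode:title identifier: {ws_id!r}")
    return langcode, title


def title_url(ws_id: str) -> str:
    """Build an outbound link from a title-based ID, URL-encoding the title."""
    langcode, title = parse_wikisource_id(ws_id)
    encoded = quote(title.replace(" ", "_"))
    return f"https://{langcode}.wikisource.org/wiki/{encoded}"


def curid_url(langcode: str, page_id: int) -> str:
    """Build a link from the canonical page ID, which survives page moves."""
    return f"https://{langcode}.wikisource.org/?curid={page_id}"
```

For the example page, `title_url('en:"Red"_Fed._Memoirs')` yields the encoded `/wiki/%22Red%22_Fed._Memoirs` link and `curid_url("en", 4496925)` yields the `?curid=4496925` link.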

Leads

Stakeholders

@cdrini @pidgezero-one


Metadata

Labels

  • Lead: @cdrini (Issues overseen by Drini; Staff: Team Lead & Solr, Library Explorer, i18n) [managed]
  • Needs: Response (Issues which require feedback from lead)
  • Priority: 3 (Issues that we can consider at our leisure.) [managed]
  • Theme: Trusted Book Providers
  • Type: Feature Request (Issue describes a feature or enhancement we'd like to implement.) [managed]
