@originalankur
Contributor

Extracted from the Open Library dataset.

# Books Dataset

The `books.json` file is a subset of the Open Library [books datasets](https://openlibrary.org/developers/dumps).

Contributor

I think we would need to add the CC0 1.0 Universal license here: https://openlibrary.org/help/faq/using#ownership

Contributor Author

@Haroenv To the best of my knowledge, the following rules apply under the CC0 1.0 Universal license:

  • You may use the dataset for commercial purposes.
  • No need to cite or reference the license.
  • Attribution is optional, not required.

Contributor Author

@Haroenv If you insist, I will add a copy of the license to the folder. Please advise.

Contributor

Thanks for digging in on the licensing, Ankur. Based on your research I agree with you.

@Haroenv requested review from chuckmeyer and pixelastic on July 2, 2025 13:57
@pixelastic
Contributor

Hey @originalankur, thanks for the PR.

I had a look at the content of the file, and I'm afraid some of the books might contain sensitive content (at least one suspicious case of doxxing, and mentions of child pornography) that we don't really want in our public list of data.

I cleaned the list and shrank the number of books from ~33k to ~24k (which also puts the file size at 49MB, right below the suggested 50MB GitHub limit).
You can find my clean version in the books-clean branch.

Can you pull it in to replace your version, please?
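
For reference, a cleaning pass like the one described above could be sketched roughly as follows. This is only an illustration and not necessarily how the books-clean branch was produced: it assumes `books.json` is a JSON array of objects, uses a hypothetical `FLAGGED_TERMS` blocklist (the actual criteria are not stated here), and treats the 50MB limit as 50 * 1024 * 1024 bytes.

```python
import json

# Hypothetical blocklist; the terms actually used for the clean branch are not stated.
FLAGGED_TERMS = ("example-flagged-term",)

# Size ceiling mentioned in the review, assumed here to mean 50 * 1024 * 1024 bytes.
SIZE_LIMIT_BYTES = 50 * 1024 * 1024


def record_text(book):
    """Join the string values of a record (and strings inside lists) into one lowercase blob."""
    parts = []
    for value in book.values():
        if isinstance(value, str):
            parts.append(value)
        elif isinstance(value, list):
            parts.extend(v for v in value if isinstance(v, str))
    return " ".join(parts).lower()


def is_clean(book, flagged_terms=FLAGGED_TERMS):
    """Return True when none of the flagged terms appear in the record's text."""
    text = record_text(book)
    return not any(term.lower() in text for term in flagged_terms)


def clean_books(books, flagged_terms=FLAGGED_TERMS):
    """Keep only the records that pass the keyword check."""
    return [book for book in books if is_clean(book, flagged_terms)]


def serialized_size(books):
    """Size in bytes of the list when written out as UTF-8 JSON."""
    return len(json.dumps(books, ensure_ascii=False).encode("utf-8"))


def fits_size_limit(books, limit=SIZE_LIMIT_BYTES):
    """Check the serialized list against the size ceiling."""
    return serialized_size(books) <= limit
```

To apply it, one would load the file with `json.load`, pass the list through `clean_books`, confirm `fits_size_limit` on the result, and write it back with `json.dump`. A plain keyword match will miss some cases and flag some harmless ones, so a manual review of the removed and remaining records would still be needed.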

@originalankur
Contributor Author

@pixelastic Thank you for cleaning the data; I should have thought of this. I will update the PR. Thanks, Tim.

@pixelastic
Contributor

Hey @originalankur, ping me once you've updated the PR and I'll merge it. Thanks.
