Sebastian Raschka
Last updated: 09/29/2014

A collection of links to various free and open-source datasets.


## Sections


Dataset repositories

[back to top]

  • Kaggle - a leading platform for predictive modeling competitions.

  • UCI MLR - UC Irvine Machine Learning Repository

  • google.com/publicdata - public data maintained by Google

  • Freebase - A community-curated database of well-known people, places, and things

  • mldata.org - machine learning data set repository for uploading and finding data sets

  • Infochimps - a collection of large data sets

  • Amazon Web Services - Public Data Sets on AWS provides a centralized repository of public data sets that can be seamlessly integrated into AWS cloud-based applications.

  • Databib - a searchable catalog / registry / directory / bibliography of research data repositories.

  • figshare - an online digital repository where researchers can preserve and share their research outputs, including figures, datasets, images, and videos.

  • reddit r/datasets - datasets shared on reddit

  • datahub - the free, powerful data management platform from the Open Knowledge Foundation

  • Quandl - a search engine for numerical data

  • enigma - a search engine for public records published by governments, companies and organizations.



Datasets by topics

[back to top]



Images

[back to top]



Audio

[back to top]

  • Mobio - bi-modal (audio and video) data taken from 152 people

  • Million Song Dataset - a freely available collection of audio features and metadata for a million contemporary popular music tracks.

  • Music Data Mining - A collection of research done on music analysis and links to various datasets.



Text

[back to top]

  • TechTC - Technion Repository of Text Categorization Datasets containing 300 labeled datasets with categorization difficulties indicated by baseline SVM accuracies.

  • SMS Spam Collection - A public dataset of 5572 SMS messages that are labeled as either "spam" or "ham" (not spam).

  • musiXmatch - A dataset of lyrics for the songs in the Million Song Dataset. The lyrics are pre-processed and available in "bag of words" format after stemming.

  • Google Books Ngram Viewer - The Google Books corpus as n-grams, available for quick online queries or download.



Natural sciences

[back to top]



Web, technology, and social networks

[back to top]



Historical data and human resources

[back to top]



Finance and companies

[back to top]



Government data and politics

[back to top]

  • United Nations - Data about health, environment, energy, and more for different nations

  • United States

  • Open Government Data - List of cities/counties in the U.S. with open data, plus other countries involved in open-data projects

  • Survey data from the U.S.

  • EconData - economic time series, produced by a number of U.S. Government agencies and distributed in a variety of formats and media

  • USGovXML - an index to publicly available web services and XML data sources provided by the U.S. government

  • Nominate/vote data - Datasets including all the D-NOMINATE and W-NOMINATE scores