HuggingFace Dataset Scraper

This is a Python project that scrapes dataset details from the HuggingFace website, specifically for datasets listed under huggingface.co/datasets. The script collects information such as descriptions, size, modalities, formats, tags, and libraries of various datasets.

Features

Extracts dataset descriptions, sizes, and category metadata (modalities, formats, tags, libraries) from HuggingFace dataset pages.
Processes multiple dataset link files from the links folder.
Saves both raw and cleaned dataset information in JSON format.
Prints progress updates to the console for each step.

Prerequisites

Python 3.x
Required packages listed in requirements.txt (install them with pip install -r requirements.txt)

Usage

Folder Structure

Ensure you have the following folder structure before running the scraper:

.
|-- links/
|   |-- example.txt  # Example file containing dataset links (you can create more link files here)
|-- data/
|   |-- raw/         # Folder where raw JSON output will be saved
|   |-- clean/       # Folder where cleaned JSON output will be saved
|-- huggingface_scraper.py  # Main Python script
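
If you would rather create these folders from Python than by hand, the following is a minimal sketch. The helper name ensure_layout is hypothetical and is not part of huggingface_scraper.py; it only assumes the folder names shown in the tree above.

```python
from pathlib import Path


def ensure_layout(root: Path) -> list[Path]:
    """Create links/, data/raw/ and data/clean/ under root if they are missing."""
    folders = [
        root / "links",
        root / "data" / "raw",
        root / "data" / "clean",
    ]
    for folder in folders:
        # parents=True creates data/ as needed; exist_ok=True makes reruns harmless.
        folder.mkdir(parents=True, exist_ok=True)
    return folders
```

Calling ensure_layout(Path(".")) from the repository root should leave existing folders and files untouched.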

Step-by-Step Guide

Add Dataset Links:
- Create a file in the links folder named, for example, geo.txt or text.txt.
- Add URLs of HuggingFace datasets, one per line. You can use the following format as an example:

https://huggingface.co/datasets/fka/awesome-chatgpt-prompts
https://huggingface.co/datasets/HuggingFaceFW/fineweb

You can also check the links/example.txt file for more examples.
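
To illustrate the link-file format, here is a sketch of how files with one URL per line could be read. The function names read_links and collect_links are assumptions made for this example; the actual script may read the files differently.

```python
from pathlib import Path


def read_links(link_file: Path) -> list[str]:
    """Return the non-empty lines of a link file, stripped, in file order."""
    lines = link_file.read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def collect_links(links_dir: Path) -> dict[str, list[str]]:
    """Map each .txt file's stem (for example "geo") to the URLs it lists."""
    return {
        path.stem: read_links(path)
        for path in sorted(links_dir.glob("*.txt"))
    }
```

Blank lines are skipped here, so a trailing newline at the end of a link file does not produce an empty URL.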

Run the Scraper:
- Use the command:
```
python huggingface_scraper.py
```
This will scrape all datasets listed in your link files and save the results in the data/raw and data/clean folders.

Output Files:
- data/raw/geo_raw.json: Raw dataset information for geo.txt with all extracted content.
- data/clean/geo_clean.json: Cleaned dataset information for geo.txt, with whitespace and unnecessary characters removed.
- data/raw/text_raw.json: Raw dataset information for text.txt with all extracted content.
- data/clean/text_clean.json: Cleaned dataset information for text.txt, with whitespace and unnecessary characters removed.
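
The naming pattern above (geo.txt becomes geo_raw.json and geo_clean.json) can be sketched as follows. The helpers output_paths, clean_text, and save_json are hypothetical, and clean_text only normalizes whitespace; which "unnecessary characters" the real script removes is not specified here, so that part is left out.

```python
import json
import re
from pathlib import Path


def output_paths(link_file: Path, data_dir: Path) -> tuple[Path, Path]:
    """Derive the raw and clean JSON paths from a link file's name."""
    stem = link_file.stem  # "geo" for links/geo.txt
    raw_path = data_dir / "raw" / f"{stem}_raw.json"
    clean_path = data_dir / "clean" / f"{stem}_clean.json"
    return raw_path, clean_path


def clean_text(value: str) -> str:
    """Collapse runs of whitespace into single spaces and trim both ends."""
    return re.sub(r"\s+", " ", value).strip()


def save_json(records: object, path: Path) -> None:
    """Write records as indented UTF-8 JSON, creating parent folders if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(
        json.dumps(records, indent=2, ensure_ascii=False),
        encoding="utf-8",
    )
```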

Use the Output:
- You can use the cleaned JSON output (e.g., data/clean/geo_clean.json) to assist in selecting appropriate datasets for your project.
- Load the JSON file into an AI assistant (like ChatGPT or similar) and provide a prompt describing what you need.
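
As a rough sketch of that last step, the cleaned JSON can be embedded in a prompt string before pasting it into an assistant. The function build_prompt and the prompt wording are assumptions for illustration; the code makes no assumption about the JSON's internal structure and simply re-serializes whatever the file contains.

```python
import json
from pathlib import Path


def build_prompt(clean_json: Path, task: str) -> str:
    """Combine a task description with the contents of a cleaned JSON file."""
    datasets = json.loads(clean_json.read_text(encoding="utf-8"))
    catalog = json.dumps(datasets, indent=2, ensure_ascii=False)
    return (
        f"I am working on the following task: {task}\n"
        "Below is a JSON catalog of HuggingFace datasets. "
        "Recommend the most suitable ones and explain why.\n\n"
        f"{catalog}"
    )
```

For example, build_prompt(Path("data/clean/geo_clean.json"), "land cover classification") returns a single string you can copy into the assistant. Large files may exceed the assistant's input limit, so you may need to trim the catalog first.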

Example Output

An example output file is available in the repository at data/example_output.json.

Notes

This scraper only works for dataset pages on HuggingFace (huggingface.co/datasets).
Make sure to use only valid HuggingFace dataset links to avoid any issues while fetching data.
You can add as many link files in the links folder as needed, and the script will process them all.
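
To catch invalid links before scraping, a check along these lines may help. The function is_dataset_url is a hypothetical helper, not something the script is known to provide, and it only inspects the URL's shape; it does not confirm that the dataset actually exists.

```python
from urllib.parse import urlparse


def is_dataset_url(url: str) -> bool:
    """Return True if url looks like a huggingface.co/datasets/... page."""
    parsed = urlparse(url.strip())
    if parsed.scheme not in ("http", "https"):
        return False
    if parsed.netloc != "huggingface.co":
        return False
    # Drop empty segments caused by leading, trailing, or doubled slashes.
    parts = [part for part in parsed.path.split("/") if part]
    # Require "datasets" plus at least one more segment (the dataset identifier).
    return len(parts) >= 2 and parts[0] == "datasets"
```

Filtering each link file with this check before running the scraper lets you spot typos and non-dataset links early.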

Disclaimer

Use this tool responsibly, respecting the HuggingFace website's terms of service. This script is designed for educational purposes and was thrown together in a few hours under the well-known illness named "boredom and irresponsibility". It will work in 1 out of 100 cases, if you are lucky. Good luck, human!

Zheckan/HugScraper