This is a Python project that scrapes dataset details from the HuggingFace website, specifically datasets listed under `huggingface.co/datasets`. The script collects information such as the description, size, modalities, formats, tags, and libraries of each dataset.
- Extracts dataset descriptions, sizes, and various categories from HuggingFace dataset pages.
- Processes multiple dataset links from a specified folder.
- Saves both raw and cleaned dataset information in JSON format.
- Prints user-friendly console progress updates for each step.
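To make these features concrete, here is a minimal sketch of what the scrape-then-clean flow could look like. It is not the actual `huggingface_scraper.py` implementation: it assumes `requests` and `beautifulsoup4` are available, and the `scrape_dataset_page`/`clean_text` helpers and the CSS selector are illustrative assumptions about the page markup.

```python
import re

import requests
from bs4 import BeautifulSoup


def clean_text(text: str) -> str:
    """Collapse runs of whitespace and strip the result (the 'cleaning' step)."""
    return re.sub(r"\s+", " ", text).strip()


def scrape_dataset_page(url: str) -> dict:
    """Fetch one dataset page and return a dict of raw text fields."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    description_tag = soup.find("meta", attrs={"name": "description"})
    return {
        "url": url,
        "title": soup.title.get_text(strip=True) if soup.title else "",
        "description": description_tag.get("content", "") if description_tag else "",
        # Tag pills (modalities, formats, libraries, ...) are rendered as links;
        # this selector is a guess about the markup, not the script's real logic.
        "tags": [a.get_text(strip=True) for a in soup.select("a[href*='/datasets?']")],
    }


if __name__ == "__main__":
    raw = scrape_dataset_page("https://huggingface.co/datasets/fka/awesome-chatgpt-prompts")
    clean = {k: clean_text(v) if isinstance(v, str) else v for k, v in raw.items()}
    print(clean)
```

The real script extracts more fields (size, modalities, formats, libraries) and writes both the raw and cleaned variants to disk.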
- Python 3.x
- Required packages listed in `requirements.txt` (install them with `pip install -r requirements.txt`)
Ensure you have the following folder structure before running the scraper:
.
|-- links/
| |-- example.txt # Example file containing dataset links (you can create more link files here)
|-- data/
| |-- raw/ # Folder where raw JSON output will be saved
| |-- clean/ # Folder where cleaned JSON output will be saved
|-- huggingface_scraper.py # Main Python script
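If any of these folders are missing, you can create the whole layout in one go. This is just a convenience snippet, not part of the scraper:

```python
from pathlib import Path

# Create the folders the scraper expects; exist_ok makes this safe to re-run.
for folder in ("links", "data/raw", "data/clean"):
    Path(folder).mkdir(parents=True, exist_ok=True)
```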
- Add Dataset Links:
  - Create a file in the `links` folder named, for example, `geo.txt` or `text.txt`.
  - Add URLs of HuggingFace datasets, one per line. You can use the following format as an example:
https://huggingface.co/datasets/fka/awesome-chatgpt-prompts
https://huggingface.co/datasets/HuggingFaceFW/fineweb
You can also check the `links/example.txt` file for more examples.
- Run the Scraper:
  - Use the command: `python huggingface_scraper.py`. This will scrape all datasets listed in your link files and save the results in the `data/raw` and `data/clean` folders.
- Output Files:
  - `data/raw/geo_raw.json`: Raw dataset information for `geo.txt` with all extracted content.
  - `data/clean/geo_clean.json`: Cleaned dataset information for `geo.txt`, with whitespace and unnecessary characters removed.
  - `data/raw/text_raw.json`: Raw dataset information for `text.txt` with all extracted content.
  - `data/clean/text_clean.json`: Cleaned dataset information for `text.txt`, with whitespace and unnecessary characters removed.
- Use the Output:
  - You can use the cleaned JSON output (e.g., `data/clean/geo_clean.json`) to assist in selecting appropriate datasets for your project.
  - Load the JSON file into an AI assistant (like ChatGPT or similar) and provide a prompt describing what you need (a short sketch follows this list).
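As a rough sketch of that last step, the snippet below loads a cleaned file and wraps it in a prompt for an AI assistant. The file name and the assumption that the JSON can simply be dumped into the prompt are illustrative; adapt them to the structure of your actual output.

```python
import json
from pathlib import Path

# geo_clean.json is used as an example name; point this at whichever cleaned
# file the scraper produced for you.
clean_path = Path("data/clean/geo_clean.json")
datasets = json.loads(clean_path.read_text(encoding="utf-8"))

prompt = (
    "Here is scraped metadata about several HuggingFace datasets:\n"
    + json.dumps(datasets, indent=2)
    + "\n\nWhich of these datasets best fits my project, and why?"
)
print(prompt)
```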
You can find an example of an output JSON file (`example_output.json`) in the repository under `data/example_output.json`.
- This scraper only works for dataset pages on HuggingFace (`huggingface.co/datasets`).
- Make sure to use only valid HuggingFace dataset links to avoid issues while fetching data (a quick validation sketch follows these notes).
- You can add as many link files to the `links` folder as needed, and the script will process them all.
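Because invalid links are the most common cause of fetch errors, you may want to sanity-check your link files before a run. The snippet below is an optional helper, not part of the scraper; it only flags lines that do not start with the expected dataset URL prefix.

```python
from pathlib import Path

PREFIX = "https://huggingface.co/datasets/"

# Walk every link file in the links/ folder and report suspicious lines.
for link_file in Path("links").glob("*.txt"):
    for line_number, line in enumerate(link_file.read_text(encoding="utf-8").splitlines(), start=1):
        url = line.strip()
        if url and not url.startswith(PREFIX):
            print(f"{link_file.name}, line {line_number}: does not look like a dataset link: {url}")
```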
Use this tool responsibly, respecting the HuggingFace website's terms of service. This script is designed for educational purposes and was thrown together in a few hours under the well-known illness called "boredom and irresponsibility". It will work in 1 out of 100 cases, if you are lucky. Good luck, human!