This repository contains Python scripts for downloading and tokenizing datasets from the Hugging Face Datasets library.
To use these scripts, you will need:
- Python 3.x
- The Hugging Face datasets library (installable via pip)
- A Conda environment (recommended, but not required)
- An environment variable HF_DATASETS_CACHE pointing to where the datasets cache should be stored, so that it does not exceed the space on the default disk
To install the datasets library via pip, run:
pip install datasets
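The tokenization script also relies on the Hugging Face transformers library for its tokenizers; if it is not already installed, it can be installed the same way:
pip install transformers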
Before using the scripts, you should set the HF_DATASETS_CACHE environment variable to indicate where the cache will be stored. For example, to set it to a directory called hf_datasets_cache under a directory of your choice ($YOURDIR), run:
export HF_DATASETS_CACHE="$YOURDIR/hf_datasets_cache"
To download a dataset, run the download_dataset.py script. By default, it downloads the c4 dataset to a subdirectory in the downloaded_dataset/hf_datasets_cache directory.
python download_dataset.py
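For orientation, here is a minimal sketch of the kind of download logic involved, assuming the dataset_name, subset_name, and split_name variables mentioned below; the actual download_dataset.py may differ in its details:

```python
import os
from datasets import load_dataset

# Hypothetical defaults mirroring the variables described below.
dataset_name = "c4"
subset_name = "en"
split_name = "train"

# Use HF_DATASETS_CACHE if it is set, otherwise fall back to the default location.
cache_dir = os.environ.get("HF_DATASETS_CACHE", "downloaded_dataset/hf_datasets_cache")

# Download (and cache) the requested dataset split.
dataset = load_dataset(dataset_name, subset_name, split=split_name, cache_dir=cache_dir)
print(dataset)
```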
To tokenize a downloaded dataset, run the tokenize_dataset.py script with the desired tokenization method as an argument. By default, it tokenizes the c4 dataset using the GPT-2 tokenizer from the Hugging Face transformers library, and saves the resulting tokenized dataset to a binary file called c4_en_train_gpt2 in the tokenized_bin directory.
python tokenize_dataset.py
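As a rough illustration of that workflow (not the exact contents of tokenize_dataset.py), tokenizing the text column with the GPT-2 tokenizer and appending the token ids to a binary file could look like the sketch below; the streaming mode, output path, and uint16 dtype are assumptions:

```python
import os
import numpy as np
from datasets import load_dataset
from transformers import GPT2TokenizerFast

# Sketch only: stream the c4 English training split and tokenize it with GPT-2,
# writing the token ids into a single binary file of uint16 values
# (the GPT-2 vocabulary fits within the uint16 range).
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
dataset = load_dataset("c4", "en", split="train", streaming=True)

os.makedirs("tokenized_bin", exist_ok=True)
with open("tokenized_bin/c4_en_train_gpt2", "wb") as f:
    for example in dataset:
        ids = tokenizer(example["text"])["input_ids"]
        np.asarray(ids, dtype=np.uint16).tofile(f)
```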
Note that you can specify a different dataset by modifying the dataset_name, split_name, and subset_name variables in the download_dataset.py script, and a different tokenization method by modifying the corresponding variables in the tokenize_dataset.py script.
For example usages (e.g., downloading and tokenizing openwebtext), see the code in the example folder.
The folder extract_plagiarism_sequence_from_pan11 contains the code that extracts all suspicious and source plagiarism sequence pairs. The suspicious texts are the query texts used in the project https://github.com/rutgers-db/DuplicateSearch_OPH. To construct the bin file of PAN11 for indexing in duplicate search, as described in our paper, you can use the example/tokenizePAN11.py script.
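Assuming the PAN11 data is already in place and the script takes no extra arguments, it can be run in the same way as the other scripts:
python example/tokenizePAN11.py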