This repository contains Python scripts for downloading and tokenizing datasets from the Hugging Face Datasets library.
To use these scripts, you will need:
- Python 3.x
- The Hugging Face datasets library (installable via pip)
- A Conda environment (recommended, but not required)
- An environment variable HF_DATASETS_CACHE pointing to where the datasets cache should be stored, so that it does not exceed the space on the default disk
To install the datasets library via pip, run:
pip install datasets
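The tokenization script also relies on the Hugging Face transformers library for its tokenizers; if it is not already installed, it can be installed the same way:
pip install transformers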
Before using the scripts, you should set the HF_DATASETS_CACHE environment variable to indicate where the cache will be stored. For example, to set it to a directory called hf_datasets_cache under a directory of your choice ($YOURDIR), run:
export HF_DATASETS_CACHE="$YOURDIR/hf_datasets_cache"
To download a dataset, run the download_dataset.py script. By default, it downloads the c4 dataset to a subdirectory in the downloaded_dataset/hf_datasets_cache directory.
python download_dataset.py
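For orientation, here is a minimal sketch of the kind of download logic involved, assuming the dataset_name, subset_name, and split_name variables mentioned below; the actual download_dataset.py may differ in its details:

```python
import os
from datasets import load_dataset

# Hypothetical defaults mirroring the variables described below.
dataset_name = "c4"
subset_name = "en"
split_name = "train"

# Use HF_DATASETS_CACHE if it is set, otherwise fall back to the default location.
cache_dir = os.environ.get("HF_DATASETS_CACHE", "downloaded_dataset/hf_datasets_cache")

# Download (and cache) the requested dataset split.
dataset = load_dataset(dataset_name, subset_name, split=split_name, cache_dir=cache_dir)
print(dataset)
```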
To tokenize a downloaded dataset, run the tokenize_dataset.py script with the desired tokenization method as an argument. By default, it tokenizes the c4 dataset using the GPT-2 tokenizer from the Hugging Face transformers library, and saves the resulting tokenized dataset to a binary file called c4_en_train_gpt2 in the tokenized_bin directory.
python tokenize_dataset.py
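As a rough illustration of that workflow (not the exact contents of tokenize_dataset.py), tokenizing the text column with the GPT-2 tokenizer and appending the token ids to a binary file could look like the sketch below; the streaming mode, output path, and uint16 dtype are assumptions:

```python
import os
import numpy as np
from datasets import load_dataset
from transformers import GPT2TokenizerFast

# Sketch only: stream the c4 English training split and tokenize it with GPT-2,
# writing the token ids into a single binary file of uint16 values
# (the GPT-2 vocabulary fits within the uint16 range).
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
dataset = load_dataset("c4", "en", split="train", streaming=True)

os.makedirs("tokenized_bin", exist_ok=True)
with open("tokenized_bin/c4_en_train_gpt2", "wb") as f:
    for example in dataset:
        ids = tokenizer(example["text"])["input_ids"]
        np.asarray(ids, dtype=np.uint16).tofile(f)
```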
Note that you can specify a different dataset by modifying the dataset_name, split_name, and subset_name variables in the download_dataset.py script, and a different tokenization method by modifying the corresponding variables in the tokenize_dataset.py script.
For example usages (e.g., downloading and tokenizing openwebtext), see the code in the example folder.
The folder extract_plagiarism_sequence_from_pan11 contains the code that extracts all suspicious and source plagiarism sequence pairs. The suspicious texts are the query texts used in the project https://github.com/rutgers-db/DuplicateSearch_OPH. To construct the bin file of PAN11 for indexing in duplicate search, as described in our paper, you can use the example/tokenizePAN11.py script.
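Assuming the PAN11 data is already in place and the script takes no extra arguments, it can be run in the same way as the other scripts:
python example/tokenizePAN11.py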