```shell
cd pretrain/chinese_process/
python collect.py
```

```shell
cd pretrain/english_process/
python collect.py
```

Note that `redpajama_train.json` is obtained by running the following command:

```shell
cd pretrain
python download_from_hf.py
```

First of all, get into the `train` folder and run the combination script to merge the Chinese and English datasets into the `train.txt` file:

```shell
./combine_chinese_corpus.sh
```

Then, split `train.txt` into 8 subfiles (one for each GPU process to load):
```shell
# shuffle and split
./split.sh
```

Download the QASPER-v0.3 dataset from the path listed in ../README.md and run:
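The combine-and-split step above (performed by `combine_chinese_corpus.sh` and `split.sh`) can be sketched in Python; the input file names (`chinese.txt`, `english.txt`) and the round-robin partitioning are assumptions for illustration, not the scripts' exact behavior:

```python
import random

def combine_and_split(inputs, n_parts=8, seed=0):
    """Merge text corpora, shuffle the lines, and write n_parts subfiles.

    A minimal sketch: file names and partitioning scheme are assumed,
    not taken from the repository's scripts.
    """
    lines = []
    for path in inputs:
        with open(path, encoding="utf-8") as f:
            lines.extend(f.readlines())
    # Shuffle so every subfile sees a similar mix of both languages.
    random.Random(seed).shuffle(lines)
    # Write one chunk per GPU process, round-robin over the lines.
    for i in range(n_parts):
        with open(f"train_part_{i}.txt", "w", encoding="utf-8") as f:
            f.writelines(lines[i::n_parts])
    return len(lines)
```

Each of the 8 resulting subfiles is then loaded by one GPU process during pretraining.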
```shell
python collect.py
```

Simply download the scientific emotional dialogue dataset and put it under the `./data/sft/emotional` folder.
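The QASPER collection step above flattens each paper's question–answer pairs into instruction examples; a minimal sketch, assuming a simplified record shape (the real QASPER v0.3 JSON and the repository's `collect.py` differ in detail):

```python
def qasper_to_instructions(qasper):
    """Flatten QASPER-style records into instruction examples.

    Assumes the simplified shape {paper_id: {"title": str, "qas":
    [{"question": str, "answers": [str, ...]}]}}; field names are
    assumptions, not the dataset's exact schema.
    """
    examples = []
    for paper in qasper.values():
        for qa in paper.get("qas", []):
            for answer in qa.get("answers", []):
                examples.append({
                    "instruction": qa["question"],
                    "input": paper.get("title", ""),
                    "output": answer,
                })
    return examples

# Tiny illustrative record (not real QASPER data).
demo = {
    "p1": {
        "title": "A Paper",
        "qas": [{"question": "What dataset is used?",
                 "answers": ["SQuAD"]}],
    }
}
print(qasper_to_instructions(demo))
```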
Download the Dolly corpus by running:

```shell
cd data/sft/dolly
python download_from_hf.py
```

Download the SciMRC dataset from the link listed in ../README and process it by running the following command:
```shell
python collect.py
```

Combine the Dolly, SciMRC, and QASPER instruction datasets for supervised fine-tuning (without the emotional dialogue dataset):

```shell
python combine.py
```
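The merge performed by `combine.py` can be sketched as follows; the file names and the `{"instruction", "input", "output"}` record shape are assumptions for illustration, not necessarily the repository's exact schema:

```python
import json
import random

def combine(paths, out_path, seed=42):
    """Merge several instruction-format JSON files into one SFT file.

    A sketch under assumed inputs: each path holds a JSON list of
    instruction records; sources are shuffled together so training
    batches mix all datasets.
    """
    merged = []
    for path in paths:
        with open(path, encoding="utf-8") as f:
            merged.extend(json.load(f))
    random.Random(seed).shuffle(merged)  # interleave the sources
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(merged, f, ensure_ascii=False, indent=2)
    return len(merged)

# e.g. combine(["dolly.json", "scimrc.json", "qasper.json"],
#              "sft_train.json")  # file names are hypothetical
```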