```shell
cd pretrain/chinese_process/
python collect.py
```

```shell
cd pretrain/english_process/
python collect.py
```

Note that `redpajama_train.json` is obtained by running the following command:

```shell
cd pretrain
python download_from_hf.py
```

First of all, get into the `train` folder and run the combination script to merge the Chinese and English datasets into the `train.txt` file:

```shell
./combine_chinese_corpus.sh
```

Then, split `train.txt` into 8 subfiles (one for each GPU process to load):
```shell
# shuffle and split
./split.sh
```

Download the QASPER-v0.3 dataset from the path listed in ../README.md and run:
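The combine-and-split step above (performed by `combine_chinese_corpus.sh` and `split.sh`) can be sketched in Python; the input file names (`chinese.txt`, `english.txt`) and the round-robin partitioning are assumptions for illustration, not the scripts' exact behavior:

```python
import random

def combine_and_split(inputs, n_parts=8, seed=0):
    """Merge text corpora, shuffle the lines, and write n_parts subfiles.

    A minimal sketch: file names and partitioning scheme are assumed,
    not taken from the repository's scripts.
    """
    lines = []
    for path in inputs:
        with open(path, encoding="utf-8") as f:
            lines.extend(f.readlines())
    # Shuffle so every subfile sees a similar mix of both languages.
    random.Random(seed).shuffle(lines)
    # Write one chunk per GPU process, round-robin over the lines.
    for i in range(n_parts):
        with open(f"train_part_{i}.txt", "w", encoding="utf-8") as f:
            f.writelines(lines[i::n_parts])
    return len(lines)
```

Each of the 8 resulting subfiles is then loaded by one GPU process during pretraining.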
```shell
python collect.py
```

Simply download the scientific emotional dialogue dataset and put it under the `./data/sft/emotional` folder.
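The QASPER collection step above flattens each paper's question–answer pairs into instruction examples; a minimal sketch, assuming a simplified record shape (the real QASPER v0.3 JSON and the repository's `collect.py` differ in detail):

```python
def qasper_to_instructions(qasper):
    """Flatten QASPER-style records into instruction examples.

    Assumes the simplified shape {paper_id: {"title": str, "qas":
    [{"question": str, "answers": [str, ...]}]}}; field names are
    assumptions, not the dataset's exact schema.
    """
    examples = []
    for paper in qasper.values():
        for qa in paper.get("qas", []):
            for answer in qa.get("answers", []):
                examples.append({
                    "instruction": qa["question"],
                    "input": paper.get("title", ""),
                    "output": answer,
                })
    return examples

# Tiny illustrative record (not real QASPER data).
demo = {
    "p1": {
        "title": "A Paper",
        "qas": [{"question": "What dataset is used?",
                 "answers": ["SQuAD"]}],
    }
}
print(qasper_to_instructions(demo))
```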
Download the Dolly corpus by running:

```shell
cd data/sft/dolly
python download_from_hf.py
```

Download the SciMRC dataset from the link listed in ../README and process it by running the following command:
```shell
python collect.py
```

Combine the Dolly, SciMRC, and QASPER instruction datasets for supervised fine-tuning (without the emotional dialogue dataset):

```shell
python combine.py
```
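The merge performed by `combine.py` can be sketched as follows; the file names and the `{"instruction", "input", "output"}` record shape are assumptions for illustration, not necessarily the repository's exact schema:

```python
import json
import random

def combine(paths, out_path, seed=42):
    """Merge several instruction-format JSON files into one SFT file.

    A sketch under assumed inputs: each path holds a JSON list of
    instruction records; sources are shuffled together so training
    batches mix all datasets.
    """
    merged = []
    for path in paths:
        with open(path, encoding="utf-8") as f:
            merged.extend(json.load(f))
    random.Random(seed).shuffle(merged)  # interleave the sources
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(merged, f, ensure_ascii=False, indent=2)
    return len(merged)

# e.g. combine(["dolly.json", "scimrc.json", "qasper.json"],
#              "sft_train.json")  # file names are hypothetical
```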