Pipeline developed to clean datasets used for training machine translation (MT) systems and large language models (LLMs).
It is necessary to install git-lfs before cloning the repository:
sudo apt-get install git-lfs
docker build -t proxectonos/nos:pipeline .
docker run --mount src=path/to/folder,target=/aliasfolderfordocker/,type=bind proxectonos/nos:pipeline <command> (tokenizer, detokenizer, etc.)
pip install -r requirements.txt
chmod +x entrypoint.sh
./entrypoint.sh <command> (see below)
By default, the pipeline expects a .jsonl file as input. You can transform your .txt file into .jsonl format with the following command:
./entrypoint.sh formatter -p $path_to_file --delimiter $regex_to_divide_txt -o $output_file_path
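Conceptually, the formatter splits the raw text on the given regex delimiter and writes one JSON object per resulting document. The following is a minimal Python sketch of that idea; the function name and the `"text"` field are illustrative assumptions, not the pipeline's actual implementation.

```python
import json
import re

def txt_to_jsonl(text: str, delimiter: str) -> list[str]:
    # Split the raw text on the regex delimiter and drop empty chunks.
    docs = [d.strip() for d in re.split(delimiter, text) if d.strip()]
    # Emit one JSON object per document, one per line (JSONL).
    return [json.dumps({"text": d}, ensure_ascii=False) for d in docs]

# Example: split two documents separated by the pattern #|||#
lines = txt_to_jsonl("doc one#|||#doc two", r"#\|\|\|#")
```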
Executing the command ./entrypoint.sh standard_pipeline $path_input_file runs the following steps in order:
- encoding
- deduplication
- pyplexity (perplexity filter)
- quelingua (filter by lang)
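The stages above can be pictured as successive passes over a stream of JSONL records. The sketch below is a simplified stand-in (the stage bodies are placeholders, not the real encoding, pyplexity, or quelingua modules) showing how such a chain composes:

```python
import json

def fix_encoding(rec):
    # Placeholder for the encoding stage, which repairs broken encodings;
    # here we only normalize non-breaking spaces as an illustration.
    rec["text"] = rec["text"].replace("\u00a0", " ")
    return rec

def deduplicate(records):
    # Placeholder for the deduplication stage: drop exact duplicates.
    seen = set()
    for rec in records:
        if rec["text"] not in seen:
            seen.add(rec["text"])
            yield rec

def run_pipeline(jsonl_lines):
    records = (json.loads(line) for line in jsonl_lines)
    records = (fix_encoding(r) for r in records)
    records = deduplicate(records)
    # The perplexity (pyplexity) and language (quelingua) filters
    # would be applied here as further passes.
    return [json.dumps(r, ensure_ascii=False) for r in records]
```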
sh entrypoint.sh --help
sh entrypoint.sh formatter --path --output --technique --delimiter
Transforms a .txt input file into a .jsonl file. The --delimiter can be any regex pattern, preceded by $, e.g. $'#|||#', where #|||# is the pattern used to divide the text.

sh entrypoint.sh tokenizer --path --output
Tokenizes a Latin-script text. This tokenizer was developed mainly for Galician.

sh entrypoint.sh detokenizer --path --output
Detokenizes a text previously parsed with the tokenizer.

sh entrypoint.sh filter_lang --path --output --filter_results_by_lang
Line-by-line identification of the language a document is written in. If --filter_results_by_lang is provided, the output file will only contain text in the specified language. Languages are given as two-letter tags, e.g. gl for Galician, es for Spanish.

sh entrypoint.sh recoglang --path
Reads an input text file and returns the language it is written in.

sh entrypoint.sh encoder --path --output
Fixes encoding issues in files.

sh entrypoint.sh jaccard --path --output
Deduplicates files based on their Jaccard similarity.

sh entrypoint.sh pyplexity
Calculates the perplexity of the input. This script implements PyPlexity.
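The Jaccard-based deduplication works by comparing documents as token sets: two documents count as near-duplicates when the Jaccard index of their token sets exceeds a threshold. A minimal Python sketch of the technique, with an illustrative threshold rather than the pipeline's actual default:

```python
def jaccard(a: str, b: str) -> float:
    # Jaccard index of the two documents' whitespace-token sets.
    sa, sb = set(a.split()), set(b.split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def dedup_jaccard(docs, threshold=0.5):
    # Keep a document only if it is not too similar to any kept one.
    kept = []
    for doc in docs:
        if all(jaccard(doc, k) < threshold for k in kept):
            kept.append(doc)
    return kept
```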
Please cite the following paper if you use the modules of this NLP toolkit to clean a corpus:
- Iria de-Dios-Flores, Silvia Paniagua Suárez, Cristina Carbajal Pérez, Daniel Bardanca Outeiriño, Marcos Garcia, and Pablo Gamallo. 2024. CorpusNÓS: A massive Galician corpus for training large language models. In Proceedings of the 16th International Conference on Computational Processing of Portuguese - Vol. 1, pages 593–599, Santiago de Compostela, Galicia/Spain. Association for Computational Linguistics.