Pipeline developed to clean datasets used for training machine translation (MT) systems and large language models (LLMs).
It is necessary to install git-lfs before cloning the repository:
sudo apt-get install git-lfs
docker build -t proxectonos/nos:pipeline .
docker run --mount src=path/to/folder,target=/aliasfolderfordocker/,type=bind proxectonos/nos:pipeline <command> (tokenizer, detokenizer, etc.)
pip install -r requirements.txt
chmod +x entrypoint.sh
./entrypoint.sh <command> (see below)
By default, the pipeline expects a .jsonl file as input. You can transform your .txt file into .jsonl format with the following command:
./entrypoint.sh formatter -p $path_to_file --delimiter $regex_to_divide_txt -o $output_file_path
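Conceptually, the formatter splits the raw text on the given regex delimiter and writes one JSON object per resulting document. The following is a minimal Python sketch of that idea; the function name and the `"text"` field are illustrative assumptions, not the pipeline's actual implementation.

```python
import json
import re

def txt_to_jsonl(text: str, delimiter: str) -> list[str]:
    # Split the raw text on the regex delimiter and drop empty chunks.
    docs = [d.strip() for d in re.split(delimiter, text) if d.strip()]
    # Emit one JSON object per document, one per line (JSONL).
    return [json.dumps({"text": d}, ensure_ascii=False) for d in docs]

# Example: split two documents separated by the pattern #|||#
lines = txt_to_jsonl("doc one#|||#doc two", r"#\|\|\|#")
```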
Executing the command ./entrypoint.sh standard_pipeline $path_input_file runs the following steps in order:
- encoding
- deduplication
- pyplexity (perplexity filter)
- quelingua (filter by lang)
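The stages above can be pictured as successive passes over a stream of JSONL records. The sketch below is a simplified stand-in (the stage bodies are placeholders, not the real encoding, pyplexity, or quelingua modules) showing how such a chain composes:

```python
import json

def fix_encoding(rec):
    # Placeholder for the encoding stage, which repairs broken encodings;
    # here we only normalize non-breaking spaces as an illustration.
    rec["text"] = rec["text"].replace("\u00a0", " ")
    return rec

def deduplicate(records):
    # Placeholder for the deduplication stage: drop exact duplicates.
    seen = set()
    for rec in records:
        if rec["text"] not in seen:
            seen.add(rec["text"])
            yield rec

def run_pipeline(jsonl_lines):
    records = (json.loads(line) for line in jsonl_lines)
    records = (fix_encoding(r) for r in records)
    records = deduplicate(records)
    # The perplexity (pyplexity) and language (quelingua) filters
    # would be applied here as further passes.
    return [json.dumps(r, ensure_ascii=False) for r in records]
```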
sh entrypoint.sh --help
sh entrypoint.sh formatter --path --output --technique --delimiter
Transforms a .txt input file into a .jsonl file. The --delimiter can be any regex pattern, preceded by $, e.g. $'#|||#', where #|||# is the pattern used to divide the text.

sh entrypoint.sh tokenizer --path --output
Tokenizes a Latin-script text. This tokenizer was developed mainly for Galician.

sh entrypoint.sh detokenizer --path --output
Detokenizes a text previously parsed with the tokenizer.

sh entrypoint.sh filter_lang --path --output --filter_results_by_lang
Line-by-line identification of the language a document is written in. If --filter_results_by_lang is provided, the output file will only contain text in the specified language. Languages are given as two-letter tags, e.g. gl for Galician, es for Spanish.

sh entrypoint.sh recoglang --path
Reads an input text file and returns the language it is written in.

sh entrypoint.sh encoder --path --output
Fixes encoding issues in files.

sh entrypoint.sh jaccard --path --output
Deduplicates files based on their Jaccard similarity.

sh entrypoint.sh pyplexity
Calculates the perplexity of the input. This script implements PyPlexity.
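The Jaccard-based deduplication works by comparing documents as token sets: two documents count as near-duplicates when the Jaccard index of their token sets exceeds a threshold. A minimal Python sketch of the technique, with an illustrative threshold rather than the pipeline's actual default:

```python
def jaccard(a: str, b: str) -> float:
    # Jaccard index of the two documents' whitespace-token sets.
    sa, sb = set(a.split()), set(b.split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def dedup_jaccard(docs, threshold=0.5):
    # Keep a document only if it is not too similar to any kept one.
    kept = []
    for doc in docs:
        if all(jaccard(doc, k) < threshold for k in kept):
            kept.append(doc)
    return kept
```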
Please cite the following paper if you use the modules of this NLP toolkit to clean a corpus:
- Iria de-Dios-Flores, Silvia Paniagua Suárez, Cristina Carbajal Pérez, Daniel Bardanca Outeiriño, Marcos Garcia, and Pablo Gamallo. 2024. CorpusNÓS: A massive Galician corpus for training large language models. In Proceedings of the 16th International Conference on Computational Processing of Portuguese - Vol. 1, pages 593–599, Santiago de Compostela, Galicia/Spain. Association for Computational Linguistics.