Note: dictionary data in this repo is a read-only mirror (translated to open formats for data interchange) of the official Unitex repository, where active development is ongoing.
This repository holds the Unitex primary sources for the Brazilian Portuguese (pt-BR) vocabulary and its morphological definitions, in an open data (FrictionlessData) interchange format.
Controlled primary sources:
- pt-BR Alphabet: Alphabet.csv and Alphabet_sort.csv.
- pt-BR DELAS: DELA for Simple words, "Dicionário de Palavras Simples para o Português Brasileiro". ~67500 canonical words and their inflection rules. DELAS.csv.
- pt-BR DELACF: DELA for Compound Forms, "Dicionário de Palavras Compostas Flexionadas para o Português Brasileiro". ~4000 compound words and their morphological classification. DELACF.csv.
- pt-BR Inflections: all *.fst2 (finite state transducer v2) files, the compiled format for inflection graphs (see chapter 14.3 of the Unitex Manual). Each file contains only the basic representation of the graph's transitions, so it is not affected by graph-layout editing and changes only when the topology or classification is modified. A JSON format is under construction; see the dumps folder.
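As a rough illustration of what "canonical words and their inflection rules" means in practice: in the DELAS text format described in the Unitex manual, an entry is a line of the form `lemma,code+trait+...`, where the first code names the inflection graph. The sketch below parses such a line. The function name `parse_delas_entry` and the sample entry are hypothetical, and the column layout of DELAS.csv in this repo may differ from the raw Unitex line format, so treat this as an assumption rather than the repo's actual schema.

```python
def parse_delas_entry(line: str) -> dict:
    """Split a DELAS-style line 'lemma,code+trait+...' into its parts.

    Simplification: escaped commas (written as a backslash followed by a
    comma) inside the lemma are not handled in this sketch.
    """
    lemma, sep, codes = line.strip().partition(",")
    if not sep or not lemma or not codes:
        raise ValueError(f"not a DELAS-style entry: {line!r}")
    # The first code is the name of the inflection graph; the rest are
    # additional grammatical or semantic traits.
    inflection_code, *traits = codes.split("+")
    return {
        "lemma": lemma,
        "inflection_code": inflection_code,
        "traits": traits,
    }


# Hypothetical entry, for illustration only (not taken from DELAS.csv).
entry = parse_delas_entry("casa,N001+Conc")
```

Here `entry["inflection_code"]` would point at the inflection graph that the compiled .fst2 files represent.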
Main sources:
- Unitex-GramLab-3.1-usermanual-en - UNITEX 3.1 USER MANUAL. See:
  - Chapter 3.1, "The DELA dictionaries" (DELAF, DELAS, DELACF);
  - Chapter 3.5, "Automatic inflection";
  - Chapter 13.22, "Grf2Fst2".
- Novo dicionario de formas flexionadas do UNITEX-PB, Avaliação da flexão verbal (2015).
- Date ranges: https://en.wikipedia.org/wiki/Reforms_of_Portuguese_orthography#Timeline_of_spelling_reforms
The spreadsheets are available for download here as data/*.csv.
Any other file must be validated by software (see SQL back-end).
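For a quick look at the CSV files after downloading them, a minimal sketch along these lines may help. It assumes a local checkout with a `data` folder and UTF-8 encoded files. It makes no assumption about column names and only reports whatever header each file declares. The helper name `summarize_csv` is hypothetical.

```python
import csv
from pathlib import Path


def summarize_csv(path) -> dict:
    """Return the file name, header columns, and data-row count of a CSV file."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        rows = sum(1 for _ in reader)  # count data rows, header excluded
        columns = list(reader.fieldnames or [])
    return {"file": Path(path).name, "columns": columns, "rows": rows}


if __name__ == "__main__":
    # Assumed location of the downloaded files, relative to the working directory.
    for csv_path in sorted(Path("data").glob("*.csv")):
        print(summarize_csv(csv_path))
```

This only inspects the files; it is not a substitute for the software validation mentioned above.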