This repository hosts the data engineering pipelines, SQL schemas, and migration scripts developed to construct the Digital Database of the Kanoê Language (ISO 639-3: kxo).
The primary goal of this project was to rescue linguistic data from legacy formats (unstructured CSVs, deprecated SQL Server dumps, and PDF reports) and migrate it into a robust, normalized, cloud-native relational architecture.
⚠️ Looking for the data? The curated dataset, final CSV files, and the official DOI for citation are hosted on our Hugging Face repository:
This database exposes a public RESTful API hosted on Supabase. You can access the data programmatically without downloading CSV files.
Base API URL:
https://mtzqswessyrrbulntblx.supabase.co
Note: Use standard PostgREST patterns to query tables (e.g., /rest/v1/view_dicionario_completo).
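For example, two illustrative PostgREST queries against that view are sketched below. The filter column (`termo_kanoe`) and the search term are assumptions about the view's shape, and Supabase normally also expects the project's public `apikey` header on REST requests:

```text
# First 10 rows of the full dictionary view
GET /rest/v1/view_dicionario_completo?select=*&limit=10

# Case-insensitive pattern match on the Kanoê term (assumed column name)
GET /rest/v1/view_dicionario_completo?termo_kanoe=ilike.*aba*
```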
The database was designed using PostgreSQL (via Supabase) employing a Hybrid SQL/NoSQL approach to handle the complexity and sparsity typical of linguistic fieldwork data.
- **Normalized Core (SQL):**
  - We adopted a strict relational model for the core entities: `Lemma` (Headword), `Sense` (Meaning), and `Form` (Pronunciation). This ensures referential integrity: you cannot have a translation without a word, nor an example sentence without a bibliographic source.
- **Flexible Metadata (NoSQL/JSONB):**
  - To handle irregular data (e.g., specific cultural notes that only appear in 5% of entries), we utilized PostgreSQL's `JSONB` columns. This allowed us to store semi-structured data without creating dozens of sparse columns (see the schema sketch after this list).
- **Full-Text Search:**
  - Implemented native PL/pgSQL functions to allow accent-insensitive and case-insensitive searching across Kanoê terms, Portuguese translations, and cultural notes simultaneously (see the example function after this list).
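A minimal sketch of what this hybrid design looks like in DDL. It is illustrative only: the real definitions live in `01_schema.sql`, the column lists are abridged, and the name of the headword foreign key (`fk_id_palavra`) is an assumption here.

```sql
-- Core headword table: strict relational columns plus a JSONB "catch-all"
-- for sparse, irregular metadata (simplified sketch, not the production DDL).
CREATE TABLE palavra (
    id_palavra        bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    termo_kanoe       text NOT NULL,
    classe_gramatical text,
    tags_nosql        jsonb DEFAULT '{}'::jsonb  -- sparse attributes, e.g. {"registro": "ritual"}
);

-- Every sense must point at an existing headword: no orphan translations.
CREATE TABLE significado (
    id_sentido         bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    fk_id_palavra      bigint NOT NULL REFERENCES palavra (id_palavra),
    traducao_primaria  text NOT NULL,
    nota_cultural      text,
    fk_id_bibliografia bigint  -- linked to bibliografia in the full schema
);
```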
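And a sketch of the accent- and case-insensitive search idea, assuming the standard `unaccent` extension. The real function shipped in `04_functions.sql` is `buscar_kanoe`; its name, signature, and return shape may differ from this example:

```sql
-- Accent- and case-insensitive lookup across terms, translations, and notes
-- (illustrative sketch of the technique, not the production buscar_kanoe).
CREATE EXTENSION IF NOT EXISTS unaccent;

CREATE OR REPLACE FUNCTION busca_exemplo(p_termo text)
RETURNS SETOF palavra
LANGUAGE plpgsql STABLE AS $$
BEGIN
    RETURN QUERY
    SELECT DISTINCT p.*
    FROM palavra p
    LEFT JOIN significado s ON s.fk_id_palavra = p.id_palavra
    WHERE unaccent(p.termo_kanoe)       ILIKE unaccent('%' || p_termo || '%')
       OR unaccent(s.traducao_primaria) ILIKE unaccent('%' || p_termo || '%')
       OR unaccent(s.nota_cultural)     ILIKE unaccent('%' || p_termo || '%');
END;
$$;
```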
Migrating legacy linguistic data is rarely straightforward. Here are the key takeaways from this engineering journey:
Attempting to insert 1,500+ raw rows directly into a normalized schema is error-prone.

- **Solution:** We implemented an ETL (Extract, Transform, Load) process using a temporary staging table (`import_csv_unificado`).
- **Benefit:** This allowed us to dump raw CSV data first, clean it via SQL, and then distribute it to the final tables (`palavra`, `significado`, `frase`) using scripts that check for duplicates and handle upserts (a simplified sketch follows below).
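A simplified sketch of that staging-to-final step. The staging column names and the unique constraint on `termo_kanoe` are assumptions for illustration; the actual logic is in `02_etl_processing.sql`.

```sql
-- Promote cleaned headwords from the staging table into palavra,
-- silently skipping rows that already exist (duplicate check via upsert).
INSERT INTO palavra (termo_kanoe, classe_gramatical)
SELECT DISTINCT
       trim(stg.termo_kanoe),
       nullif(trim(stg.classe_gramatical), '')
FROM import_csv_unificado AS stg
WHERE stg.termo_kanoe IS NOT NULL
ON CONFLICT (termo_kanoe) DO NOTHING;  -- assumes UNIQUE (termo_kanoe) on palavra
```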
We initially considered a pure NoSQL document database (like MongoDB) because so many fields in the original dataset were empty.

- **Realization:** Linguistic data relies heavily on citations. If a source is deleted, all associated examples must be handled safely.
- **Conclusion:** A relational database with Foreign Keys (`ON DELETE CASCADE`) was safer for data integrity, while `JSONB` columns provided the necessary flexibility for sparse attributes (see the cascade sketch below).
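For illustration, the cascading citation link boils down to something like this; the constraint name here is made up, and column names follow the ER diagram further down (the exact DDL is in `01_schema.sql`):

```sql
-- If a bibliographic source is deleted, its dependent example sentences
-- are removed with it instead of dangling (sketch of the cascade rule).
ALTER TABLE frase
    ADD CONSTRAINT frase_fonte_fk
    FOREIGN KEY (fk_id_bibliografia)
    REFERENCES bibliografia (id_fonte)
    ON DELETE CASCADE;
```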
Managing duplicate bibliographic entries (e.g., "Cartilha Vol II" vs "Cartilha Vol. 2") was a major challenge.
- **Fix:** We established a strict unique constraint on the filename/reference column and used normalization scripts to merge duplicate sources before linking them to the lexical entries (see the constraint sketch below).
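In practice the guard is a single constraint of roughly this shape (sketch; the constraint name is invented, the column follows the ER diagram, and the real DDL is in `01_schema.sql`):

```sql
-- One row per physical source: duplicate filenames/references are rejected
-- at the database level before they can be linked to lexical entries.
ALTER TABLE bibliografia
    ADD CONSTRAINT bibliografia_arquivo_ref_unico UNIQUE (arquivo_ref);
```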
The scripts in the `/sql` folder are organized as follows:

- `01_schema.sql`: DDL to create the tables (`palavra`, `significado`, `bibliografia`, ...).
- `02_etl_processing.sql`: The logic used to clean and distribute raw CSV data into the schema.
- `03_views_and_security.sql`: Creates the user-friendly `view_dicionario_completo` and sets up Row Level Security (RLS).
- `04_functions.sql`: Custom search functions (`buscar_kanoe`).
Visual representation of the relationships between tables in the Kanoê-DB:
```mermaid
erDiagram
    PALAVRA ||--|{ SIGNIFICADO : "has senses"
    PALAVRA ||--|{ PRONUNCIA : "has forms"
    PALAVRA ||--o{ FRASE : "exemplified by"
    BIBLIOGRAFIA ||--o{ SIGNIFICADO : "source of"
    BIBLIOGRAFIA ||--o{ FRASE : "source of"

    PALAVRA {
        bigint id_palavra PK
        text termo_kanoe
        text classe_gramatical
        jsonb tags_nosql
    }
    SIGNIFICADO {
        bigint id_sentido PK
        text traducao_primaria
        text nota_cultural
        bigint fk_id_bibliografia FK
    }
    FRASE {
        bigint id_frase PK
        text texto_kanoe
        text traducao_pt
        bigint fk_id_palavra FK
        bigint fk_id_bibliografia FK
    }
    PRONUNCIA {
        bigint id_forma PK
        text ipa
        text grafia
    }
    BIBLIOGRAFIA {
        bigint id_fonte PK
        text titulo
        text tipo
        text arquivo_ref
    }
```
If you wish to run a local instance of this database:
- Clone this repository.
- Set up a PostgreSQL instance (Docker or Supabase).
- Run the scripts in the `/sql` folder in numerical order.
- Import the CSV data (available on Hugging Face) into the staging table (see the psql sketch below).
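For reference, a minimal `psql` session for the last two steps might look like the sketch below; the CSV file name is a placeholder for whichever export you downloaded from Hugging Face:

```sql
-- Run the scripts in numerical order (paths relative to the repo root).
\i sql/01_schema.sql
\i sql/02_etl_processing.sql
\i sql/03_views_and_security.sql
\i sql/04_functions.sql

-- Load the raw CSV export into the staging table (placeholder file name).
\copy import_csv_unificado FROM 'kanoe_dados.csv' WITH (FORMAT csv, HEADER true)
```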
- [Gabrielly Gomes]: Data Engineering, SQL Modeling, Curatorship, and Standardization.
- [Iago Aragão]: Data Collection, Transcription, and Quality Assurance.
DOI: 10.57967/hf/7042
The SQL code in this repository is available under the MIT License. The linguistic data content is available under Creative Commons Attribution 4.0 (CC-BY 4.0).