
I Speak Kanoê

Kanoê Language Database: Engineering & Migration Scripts


🏛️ About the Project

This repository hosts the data engineering pipelines, SQL schemas, and migration scripts developed to construct the Digital Database of the Kanoê Language (ISO 639-3: kxo).

The primary goal of this project was to rescue and modernize linguistic data from legacy formats (unstructured CSVs, deprecated SQL Server dumps, and PDF reports) into a robust, normalized, and cloud-native relational architecture.

⚠️ Looking for the data? The curated dataset, final CSV files, and the official DOI for citation are hosted on our Hugging Face repository:

👉 Access the Dataset on Hugging Face


📡 API Access

This database exposes a public RESTful API hosted on Supabase. You can access the data programmatically without downloading CSV files.

Base API URL:

https://mtzqswessyrrbulntblx.supabase.co

Note: Use standard PostgREST patterns to query tables (e.g., /rest/v1/view_dicionario_completo).
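
A minimal example request, assuming the view is exposed under the default /rest/v1/ prefix and that the project's public anon key is sent in the apikey header:

https://mtzqswessyrrbulntblx.supabase.co/rest/v1/view_dicionario_completo?select=*&limit=10

select and limit are standard PostgREST query parameters; column filters such as termo_kanoe=ilike.*aba* follow the same syntax, assuming the view exposes that column.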

⚙️ Technical Architecture

The database was designed using PostgreSQL (via Supabase) employing a Hybrid SQL/NoSQL approach to handle the complexity and sparsity typical of linguistic fieldwork data.

Key Engineering Decisions

  1. Normalized Core (SQL):

    • We adopted a strict relational model for the core entities: Lemma (Headword), Sense (Meaning), and Form (Pronunciation). This ensures referential integrity—you cannot have a translation without a word, nor an example sentence without a bibliographic source.
  2. Flexible Metadata (NoSQL/JSONB):

    • To handle irregular data (e.g., specific cultural notes that only appear in 5% of entries), we utilized PostgreSQL's JSONB columns. This allowed us to store semi-structured data without creating dozens of sparse columns.
  3. Full-Text Search:

    • We implemented native PL/pgSQL functions to allow accent-insensitive and case-insensitive searching across Kanoê terms, Portuguese translations, and cultural notes simultaneously (a combined sketch of these three decisions follows below).
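
Taken together, these decisions look roughly like the DDL below. This is a minimal sketch, not the full 01_schema.sql: the table and column names follow the ER diagram further down, the fk_id_palavra column on significado is implied by the diagram's relationships, and the search function is a simplified stand-in for the real buscar_kanoe in 04_functions.sql.

    CREATE EXTENSION IF NOT EXISTS unaccent;

    CREATE TABLE bibliografia (
        id_fonte    bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        titulo      text NOT NULL,
        tipo        text,
        arquivo_ref text
    );

    CREATE TABLE palavra (
        id_palavra        bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        termo_kanoe       text NOT NULL,
        classe_gramatical text,
        tags_nosql        jsonb NOT NULL DEFAULT '{}'::jsonb  -- decision 2: sparse, semi-structured metadata
    );

    CREATE TABLE significado (
        id_sentido         bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        fk_id_palavra      bigint NOT NULL REFERENCES palavra (id_palavra) ON DELETE CASCADE,   -- decision 1: no sense without a headword
        traducao_primaria  text NOT NULL,
        nota_cultural      text,
        fk_id_bibliografia bigint REFERENCES bibliografia (id_fonte) ON DELETE CASCADE          -- decision 1: citations stay consistent
    );

    -- Decision 3: accent- and case-insensitive lookup across terms, translations and notes
    -- (a simplified stand-in for the real buscar_kanoe).
    CREATE OR REPLACE FUNCTION buscar_kanoe(termo text)
    RETURNS SETOF palavra
    LANGUAGE sql STABLE AS $$
        SELECT DISTINCT p.*
        FROM palavra p
        LEFT JOIN significado s ON s.fk_id_palavra = p.id_palavra
        WHERE unaccent(lower(p.termo_kanoe)) LIKE '%' || unaccent(lower(termo)) || '%'
           OR unaccent(lower(coalesce(s.traducao_primaria, ''))) LIKE '%' || unaccent(lower(termo)) || '%'
           OR unaccent(lower(coalesce(s.nota_cultural, ''))) LIKE '%' || unaccent(lower(termo)) || '%';
    $$;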

🧠 Lessons Learned & Challenges

Migrating legacy linguistic data is rarely straightforward. Here are the key takeaways from this engineering journey:

1. The "Staging Table" Strategy

Attempting to insert 1,500+ raw rows directly into a normalized schema is error-prone.

  • Solution: We implemented an ETL (Extract, Transform, Load) process using a temporary Staging Table (import_csv_unificado).
  • Benefit: This allowed us to dump raw CSV data first, clean it via SQL, and then distribute it to the final tables (palavra, significado, frase) using scripts that check for duplicates and handle upserts (see the sketch below).
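
A simplified sketch of that second step, assuming the staging table mirrors the final column names (the real cleaning logic lives in 02_etl_processing.sql):

    -- Distribute cleaned headwords from the staging table to the normalized core.
    -- The staging column names here are assumptions.
    INSERT INTO palavra (termo_kanoe, classe_gramatical)
    SELECT DISTINCT trim(i.termo_kanoe),
           nullif(trim(i.classe_gramatical), '')
    FROM import_csv_unificado i
    WHERE nullif(trim(i.termo_kanoe), '') IS NOT NULL
    ON CONFLICT DO NOTHING;  -- skips rows that hit an existing unique constraint (e.g. on the headword)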

2. Integrity vs. Flexibility

We initially considered a pure NoSQL document database (like MongoDB) because so many fields in the original dataset were empty.

  • Realization: Linguistic data relies heavily on citations. If a source is deleted, all associated examples must be handled safely.
  • Conclusion: A relational database with Foreign Keys ON DELETE CASCADE was safer for data integrity, while JSONB columns provided the necessary flexibility for sparse attributes.
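
In practice that means a single statement keeps citations consistent (the id below is illustrative):

    -- Removing one source also removes the senses and example sentences that cite it,
    -- so no orphaned citations are left behind.
    DELETE FROM bibliografia WHERE id_fonte = 42;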

3. Source of Truth

Managing duplicate bibliographic entries (e.g., "Cartilha Vol II" vs "Cartilha Vol. 2") was a major challenge.

  • Fix: We established a strict unique constraint on the filename/reference column and used normalization scripts to merge duplicate sources before linking them to the lexical entries.
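
A simplified sketch of that merge, with illustrative ids and a hypothetical constraint name (the actual normalization scripts are more involved):

    -- Keep source 10 as canonical, repoint references from the duplicate (27), then remove it.
    UPDATE significado SET fk_id_bibliografia = 10 WHERE fk_id_bibliografia = 27;
    UPDATE frase       SET fk_id_bibliografia = 10 WHERE fk_id_bibliografia = 27;
    DELETE FROM bibliografia WHERE id_fonte = 27;

    -- Prevent the problem from reappearing.
    ALTER TABLE bibliografia ADD CONSTRAINT uq_bibliografia_arquivo_ref UNIQUE (arquivo_ref);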

🛠️ Repository Structure

    • 01_schema.sql: DDL to create the tables (palavra, significado, bibliografia...).
    • 02_etl_processing.sql: The logic used to clean and distribute raw CSV data into the schema.
    • 03_views_and_security.sql: Creates the user-friendly view_dicionario_completo and sets up Row Level Security (RLS) (sketched after this list).
    • 04_functions.sql: Custom search functions (buscar_kanoe).
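
As an illustration of what the security script configures, here is a minimal sketch of a public read-only policy and a simplified version of the dictionary view; the policy name is hypothetical and the real view joins more tables:

    -- Anonymous clients may read, but not modify, the lexical tables.
    ALTER TABLE palavra ENABLE ROW LEVEL SECURITY;
    CREATE POLICY leitura_publica ON palavra FOR SELECT USING (true);

    -- A simplified version of the public dictionary view.
    CREATE VIEW view_dicionario_completo AS
    SELECT p.termo_kanoe, p.classe_gramatical, s.traducao_primaria, s.nota_cultural
    FROM palavra p
    LEFT JOIN significado s ON s.fk_id_palavra = p.id_palavra;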

🗺️ Database Schema (ER Diagram)

Visual representation of the relationships between tables in the Kanoê-DB:

erDiagram
    PALAVRA ||--|{ SIGNIFICADO : "has senses"
    PALAVRA ||--|{ PRONUNCIA : "has forms"
    PALAVRA ||--o{ FRASE : "exemplified by"
    
    BIBLIOGRAFIA ||--o{ SIGNIFICADO : "source of"
    BIBLIOGRAFIA ||--o{ FRASE : "source of"

    PALAVRA {
        bigint id_palavra PK
        text termo_kanoe
        text classe_gramatical
        jsonb tags_nosql
    }

    SIGNIFICADO {
        bigint id_sentido PK
        text traducao_primaria
        text nota_cultural
        bigint fk_id_bibliografia FK
    }

    FRASE {
        bigint id_frase PK
        text texto_kanoe
        text traducao_pt
        bigint fk_id_palavra FK
        bigint fk_id_bibliografia FK
    }

    PRONUNCIA {
        bigint id_forma PK
        text ipa
        text grafia
    }

    BIBLIOGRAFIA {
        bigint id_fonte PK
        text titulo
        text tipo
        text arquivo_ref
    }

🚀 How to Replicate

If you wish to run a local instance of this database:

  1. Clone this repository.
  2. Set up a PostgreSQL instance (Docker or Supabase).
  3. Run the scripts in the /sql folder in numerical order.
  4. Import the CSV data (available on Hugging Face) into the staging table.
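
For steps 3 and 4, a minimal psql sketch (the paths and CSV filename are placeholders; if 02_etl_processing.sql reads from the staging table, re-run it after the import):

    \i sql/01_schema.sql
    \i sql/02_etl_processing.sql
    \i sql/03_views_and_security.sql
    \i sql/04_functions.sql
    \copy import_csv_unificado FROM 'kanoe_unificado.csv' WITH (FORMAT csv, HEADER true)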

👥 Credits

  • Gabrielly Gomes: Data Engineering, SQL Modeling, Curatorship, and Standardization.
  • Iago Aragão: Data Collection, Transcription, and Quality Assurance.

Dataset DOI: 10.57967/hf/7042


📄 License

The SQL code in this repository is available under the MIT License. The linguistic data content is available under Creative Commons Attribution 4.0 (CC-BY 4.0).
