# WiktionaryViz

An interactive linguistic visual analytics tool for exploring language and word evolution over large-scale Wiktionary (Wiktextract) data.
## Disclaimer

Status: Early 0.x development. Public APIs and data formats may change without notice.

This is an academic research / exploratory tool. Visualization accuracy depends on source data quality.
## Table of Contents

- WiktionaryViz
- Disclaimer
- Table of Contents
- 1. Overview
- 2. Architecture & Data Flow
- 3. Prerequisites
- 4. Tech Stack
- 5. Data Source & Indexing
- 6. Quick Start
- 7. Local Development
- 8. Environment Variables
- 9. Docker Usage
- 10. API Reference
- 11. Frontend Build & Deployment
- 12. CI / Release Automation
- 13. Troubleshooting
- 14. Contributing
- 15. License & Attribution
- 16. Security / Responsible Use
## 1. Overview

WiktionaryViz lets you explore lexical evolution: ancestry timelines, descendant trees, geographic distributions, and phonetic drift (feature-based IPA alignment). It streams records from a large Wiktextract JSONL dump (20 GB+ uncompressed) via byte-offset indexing for near-O(1) random access.
Design principles:
- Minimal preprocessing (just an index + small derived stats files)
- Progressive disclosure visualizations (timeline, radial, network, map)
- Reproducible containerized backend + static frontend deployable to GitHub Pages
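The phonetic-drift features build on PanPhon's articulatory feature vectors. A minimal sketch of a feature-based IPA comparison (illustrative only; the backend's actual alignment and scoring may differ):

```python
# sketch: PanPhon feature vectors + feature edit distance (pip install panphon)
import panphon
import panphon.distance

ft = panphon.FeatureTable()
# per-segment articulatory feature vectors for an IPA string
vectors = ft.word_to_vector_list("pater", numeric=True)
print(len(vectors), "segments")

dist = panphon.distance.Distance()
# feature-weighted edit distance between two IPA forms (illustrative pair)
print(dist.feature_edit_distance("pater", "padre"))
```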
## 2. Architecture & Data Flow

1. Acquire the dataset (`wiktionary_data.jsonl[.gz]`) from Kaikki.org.
2. Build the offset index + stats (`wiktionary_index.json`, `most_*`, etc.).
3. FastAPI loads the index at startup; endpoints mmap the JSONL and seek directly.
4. A services layer supplies ancestry + descendant traversal + phonetic alignment.
5. The frontend fetches REST endpoints; data is transformed via D3/Leaflet into visuals.
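A minimal sketch of step 3's lookup path (file names from this README; the `word_langcode` key format and module layout in `backend/` are assumptions):

```python
# sketch: index-backed lookup; key format "word_langcode" is assumed
import json
import mmap
from typing import Optional

from fastapi import FastAPI, HTTPException

app = FastAPI()
INDEX: dict = {}
DATA: Optional[mmap.mmap] = None

@app.on_event("startup")
def load_index() -> None:
    global DATA
    with open("data/wiktionary_index.json", encoding="utf-8") as f:
        INDEX.update(json.load(f))
    fh = open("data/wiktionary_data.jsonl", "rb")
    DATA = mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ)

@app.get("/word-data")
def word_data(word: str, lang_code: str):
    offsets = INDEX.get(f"{word}_{lang_code}")
    if not offsets or DATA is None:
        raise HTTPException(status_code=404, detail="word/language not indexed")
    DATA.seek(offsets[0])  # jump straight to the recorded byte offset
    return json.loads(DATA.readline())
```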
Key directories:

- `backend/` – FastAPI app, index builder, services
- `src/components/` – Visualization & page components
- `src/hooks/` – Data fetching + caching hooks
- `src/types/` – TypeScript domain models
- `src/utils/` – API base, export & mapping utilities
## 3. Prerequisites

- Node.js 18+ (20.x recommended)
- Python 3.11 recommended (>=3.9 minimum for local non-Docker)
- Docker + Docker Compose (for container workflow)
- ~40GB free disk (compressed + uncompressed + indices)
## 4. Tech Stack

| Layer | Technologies |
|---|---|
| Frontend | React 19, TypeScript, Vite, D3, Leaflet, Tailwind |
| Backend | FastAPI, Uvicorn, mmap random access, PanPhon |
| Packaging | Docker (multi-arch), GitHub Container Registry |
| CI/CD | GitHub Actions (deploy, release-please, Docker build) |
| Data | Wiktextract JSONL dump (Kaikki.org) |
## 5. Data Source & Indexing

Default dataset URL (override with `WIKTIONARY_DATA_URL`):

https://kaikki.org/dictionary/raw-wiktextract-data.jsonl.gz

Build the index (after the dataset is present):

```bash
cd backend
python build_index.py
```

Artifacts (in `backend/data/`):

- `wiktionary_data.jsonl` – Raw dump (auto-downloaded if not skipped)
- `wiktionary_index.json` – `{word_lang_code: [byte_offset, ...]}` mapping
- `longest_words.json`, `most_translations.json`, `most_descendants.json` – Derived stats

Retrieval strategy: mmap + seek to the recorded offset, read one JSON line, parse on demand.
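For reference, a minimal sketch of a one-pass offset-index build (assuming the Wiktextract `word` and `lang_code` fields and the key format above; the real `build_index.py` may differ):

```python
# sketch: build {word_lang_code: [byte_offset, ...]} in one scan of the JSONL
import json
from collections import defaultdict

def build_offset_index(jsonl_path: str, index_path: str) -> None:
    index = defaultdict(list)
    with open(jsonl_path, "rb") as f:
        offset = f.tell()  # byte position of the line about to be read
        for raw in iter(f.readline, b""):
            try:
                entry = json.loads(raw)
                index[f"{entry['word']}_{entry['lang_code']}"].append(offset)
            except (json.JSONDecodeError, KeyError):
                pass  # skip malformed or keyless lines
            offset = f.tell()
    with open(index_path, "w", encoding="utf-8") as out:
        json.dump(index, out)
```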
## 6. Quick Start

Local (frontend + backend):

```bash
git clone https://github.com/vialab/WiktionaryViz.git
cd WiktionaryViz
npm install
cd backend && python -m venv .venv && source .venv/bin/activate && pip install -r requirements.txt && cd ..
npm run backend   # starts FastAPI (downloads dataset if needed)
# separate terminal
npm run dev       # starts Vite dev server
```

Docker:

```bash
docker compose up --build -d
```

Backend at http://localhost:8000.
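Once the backend is up, a quick smoke test (the root endpoint returns a health message; see the API table below):

```python
# smoke test: GET / should return the backend's health message
from urllib.request import urlopen

with urlopen("http://localhost:8000/") as resp:
    print(resp.status, resp.read().decode("utf-8"))
```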
Or run the image directly (without Compose):

```bash
docker run -p 8000:8000 \
  -e WIKTIONARY_DATA_URL=https://kaikki.org/dictionary/raw-wiktextract-data.jsonl.gz \
  -e ALLOWED_ORIGINS=* \
  -e OPENAI_API_KEY=<your-openai-key> \
  ghcr.io/vialab/wiktionaryviz-backend:latest
```

Production build preview:

```bash
npm run build
npm run preview
```

## 7. Local Development

Core scripts (`package.json`):
- `npm run dev` – Frontend dev server
- `npm run backend` – FastAPI (reload) via Uvicorn
- `npm run dev:full` – Concurrent backend + frontend
- `npm run build` – Type check + production bundle
- `npm run build:api` – Build with `VITE_API_BACKEND` injected
- `npm run lint` / `format` / `format:check` – Lint & formatting
- `npm run backend:up` / `backend:down` – Compose helpers
- `npm run deploy` – Deploy static site to GitHub Pages

Recommended: Node 20.x, Python 3.11.
## 8. Environment Variables

| Variable | Scope | Required (Prod) | Default | Description |
|---|---|---|---|---|
| `VITE_API_BACKEND` | Frontend build | Yes | (none) | Absolute backend API base baked into bundle |
| `ALLOWED_ORIGINS` | Backend | No | `*` | Comma-separated list for CORS |
| `PORT` | Backend | No | `8000` | Uvicorn port |
| `OPENAI_API_KEY` | Backend | If AI features | (none) | For IPA estimation (fallback) |
| `WIKTIONARY_DATA_URL` | Backend | No | Kaikki URL | Dataset source |
| `SKIP_DOWNLOAD` | Backend | No | `0` | Set `1` to skip auto download |
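For illustration, one way a backend might consume these variables (a sketch; the actual backend may read them differently):

```python
# illustrative env handling mirroring the table above
import os

ALLOWED_ORIGINS = os.getenv("ALLOWED_ORIGINS", "*").split(",")
PORT = int(os.getenv("PORT", "8000"))
DATA_URL = os.getenv(
    "WIKTIONARY_DATA_URL",
    "https://kaikki.org/dictionary/raw-wiktextract-data.jsonl.gz",
)
SKIP_DOWNLOAD = os.getenv("SKIP_DOWNLOAD", "0") == "1"
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")  # optional: only for AI IPA fallback
```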
## 9. Docker Usage

Image: `ghcr.io/vialab/wiktionaryviz-backend:latest`
Features:
- Non-root runtime
- Layer-cached dependency install
- Streaming download + unzip on first run
- Reusable volume to avoid re-download
Compose volume example:

```yaml
volumes:
  wiktionary-data:

services:
  backend:
    image: ghcr.io/vialab/wiktionaryviz-backend:latest
    volumes:
      - wiktionary-data:/app/data
```

## 10. API Reference

Base (dev): http://localhost:8000 – Interactive docs at `/docs`.
| Method | Path | Status | Notes |
|---|---|---|---|
| GET | `/` | stable | Health message |
| GET | `/word-data` | implemented | Returns single lexical entry (raw JSON) |
| GET | `/available-languages` | implemented | Lists languages containing a given surface form |
| GET | `/random-interesting-word` | implemented | Random entry from stats category |
| GET | `/ancestry-chain` | partial | Builds chain; IPA estimation fallback; drift scores computed per link |
| GET | `/phonetic-drift-detailed` | partial | Returns alignment + feature changes (no compact score yet) |
| GET | `/descendant-tree` | partial | Builds tree from given word; traversal heuristics evolving |
| GET | `/descendant-tree-from-root` | partial | Treats provided word/lang as root |
| (future) | `/phonetic-drift` | planned | Compact numeric distance only |
| (future) | `/compare` | planned | Word vs word ancestry + drift |
| (future) | `/ai/*` | planned | Exploration suggestions |
| (future) | `/kwic` | planned | KWIC concordance lines |
Error handling: 404 for missing indexed key, 500 for unexpected exceptions.
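A minimal client sketch against `/word-data` (query parameter names are assumptions; check the interactive `/docs` for the authoritative schema):

```python
# sketch: fetch one raw entry; parameter names assumed, verify via /docs
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE = "http://localhost:8000"

def get_word_data(word: str, lang_code: str) -> dict:
    query = urlencode({"word": word, "lang_code": lang_code})
    with urlopen(f"{BASE}/word-data?{query}") as resp:
        return json.loads(resp.read())

entry = get_word_data("water", "en")
print(entry.get("word"), entry.get("lang_code"))
```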
## 11. Frontend Build & Deployment

GitHub Pages deploy (via workflow): builds with the `VITE_API_BACKEND` secret → publishes `dist/` to `gh-pages` (base path set in `vite.config.ts`).

Local production build:

```bash
VITE_API_BACKEND=https://your-backend.example.org npm run build
```

Serve the `dist/` output (any static host).
## 12. CI / Release Automation

Workflows:

- `release-please.yml` – Conventional Commits → automated versioning & CHANGELOG
- `frontend-deploy.yml` – Static site build + Pages publish
- `backend-docker.yml` – Multi-arch Docker build & GHCR push

Release model: 0.x (minor releases may break). Conventional commit scopes are recommended for clarity.
## 13. Troubleshooting

| Symptom | Cause | Resolution |
|---|---|---|
| 404 for valid word | Index missing or not loaded | Ensure index + stats built (python build_index.py) & restart backend |
| Slow first start | Large dataset download/unpack | Persist data volume / pre-seed dump |
| CORS errors | Origin mismatch | Set ALLOWED_ORIGINS or deploy behind proxy |
| Empty visualizations | Endpoint stubs / partial implementations | Guard frontend fetches / contribute implementations |
| Memory pressure | Large mmap JSONL | Increase container memory / split dataset |
| Mixed content blocked | HTTPS site → HTTP API | Serve API over HTTPS or use proxy |
## 14. Contributing

- Create a focused feature/bug branch (`feat/…`, `fix/…`).
- Use Conventional Commits (e.g., `feat(descendants): add depth limiting`). You can use `oco` to help generate commit messages that follow this standard.
- Run lint + format before pushing: `npm run lint && npm run format:check`.
- Document new or changed endpoints in the README API table.
- Avoid committing large data dumps (use the download flow / volumes).
- Do not manually edit `CHANGELOG.md` (managed by release-please).

Suggested commit scopes: `frontend`, `backend`, `descendants`, `phonology`, `timeline`, `geospatial`, `build`, `ci`, `docs`.
## 15. License & Attribution

- MIT License (see `LICENSE`).
- Data derived from Wiktionary via Wiktextract / Kaikki.org (CC BY-SA 3.0 & GFDL terms apply to the original content).
- Phonological feature data: PanPhon.

Please cite Wiktionary, Wiktextract, and PanPhon in academic outputs.
## 16. Security / Responsible Use

- Do not rely on AI-estimated IPA for authoritative linguistic claims.
- Validate external input before exposing new endpoints publicly.
- Respect Wiktionary licensing when redistributing derived datasets.