# WiktionaryViz

An interactive linguistic visual analytics tool for exploring language and word evolution over large-scale Wiktionary (Wiktextract) data.
## Disclaimer

Status: Early 0.x development. Public APIs and data formats may change without notice.

This is an academic research / exploratory tool. Visualization accuracy depends on source data quality.
## Table of Contents

- WiktionaryViz
- Disclaimer
- Table of Contents
- 1. Overview
- 2. Architecture & Data Flow
- 3. Prerequisites
- 4. Tech Stack
- 5. Data Source & Indexing
- 6. Quick Start
- 7. Local Development
- 8. Environment Variables
- 9. Docker Usage
- 10. API Reference
- 11. Frontend Build & Deployment
- 12. CI / Release Automation
- 13. Troubleshooting
- 14. Contributing
- 15. License & Attribution
- 16. Security / Responsible Use
## 1. Overview

WiktionaryViz lets you explore lexical evolution: ancestry timelines, descendant trees, geographic distributions, and phonetic drift (feature-based IPA alignment). It streams records from a large Wiktextract JSONL dump (20 GB+ uncompressed) via byte-offset indexing for near-O(1) random access.
Design principles:
- Minimal preprocessing (just an index + small derived stats files)
- Progressive disclosure visualizations (timeline, radial, network, map)
- Reproducible containerized backend + static frontend deployable to GitHub Pages
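The phonetic-drift features build on PanPhon's articulatory feature vectors. A minimal sketch of a feature-based IPA comparison (illustrative only; the backend's actual alignment and scoring may differ):

```python
# sketch: PanPhon feature vectors + feature edit distance (pip install panphon)
import panphon
import panphon.distance

ft = panphon.FeatureTable()
# per-segment articulatory feature vectors for an IPA string
vectors = ft.word_to_vector_list("pater", numeric=True)
print(len(vectors), "segments")

dist = panphon.distance.Distance()
# feature-weighted edit distance between two IPA forms (illustrative pair)
print(dist.feature_edit_distance("pater", "padre"))
```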
## 2. Architecture & Data Flow

1. Acquire the dataset (`wiktionary_data.jsonl[.gz]`) from Kaikki.org.
2. Build the offset index + stats (`wiktionary_index.json`, `most_*`, etc.).
3. FastAPI loads the index at startup; endpoints mmap the JSONL and seek directly.
4. A services layer supplies ancestry + descendant traversal + phonetic alignment.
5. The frontend fetches REST endpoints; data is transformed via D3/Leaflet into visuals.
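A minimal sketch of step 3's lookup path (file names from this README; the `word_langcode` key format and module layout in `backend/` are assumptions):

```python
# sketch: index-backed lookup; key format "word_langcode" is assumed
import json
import mmap
from typing import Optional

from fastapi import FastAPI, HTTPException

app = FastAPI()
INDEX: dict = {}
DATA: Optional[mmap.mmap] = None

@app.on_event("startup")
def load_index() -> None:
    global DATA
    with open("data/wiktionary_index.json", encoding="utf-8") as f:
        INDEX.update(json.load(f))
    fh = open("data/wiktionary_data.jsonl", "rb")
    DATA = mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ)

@app.get("/word-data")
def word_data(word: str, lang_code: str):
    offsets = INDEX.get(f"{word}_{lang_code}")
    if not offsets or DATA is None:
        raise HTTPException(status_code=404, detail="word/language not indexed")
    DATA.seek(offsets[0])  # jump straight to the recorded byte offset
    return json.loads(DATA.readline())
```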
Key directories:

- `backend/` – FastAPI app, index builder, services
- `src/components/` – Visualization & page components
- `src/hooks/` – Data fetching + caching hooks
- `src/types/` – TypeScript domain models
- `src/utils/` – API base, export & mapping utilities
## 3. Prerequisites

- Node.js 18+ (20.x recommended)
- Python 3.11 recommended (>=3.9 minimum for local non-Docker)
- Docker + Docker Compose (for container workflow)
- ~40GB free disk (compressed + uncompressed + indices)
## 4. Tech Stack

| Layer | Technologies |
|---|---|
| Frontend | React 19, TypeScript, Vite, D3, Leaflet, Tailwind |
| Backend | FastAPI, Uvicorn, mmap random access, PanPhon |
| Packaging | Docker (multi-arch), GitHub Container Registry |
| CI/CD | GitHub Actions (deploy, release-please, Docker build) |
| Data | Wiktextract JSONL dump (Kaikki.org) |
## 5. Data Source & Indexing

Default dataset URL (override with `WIKTIONARY_DATA_URL`):

https://kaikki.org/dictionary/raw-wiktextract-data.jsonl.gz

Build the index (after the dataset is present):

```bash
cd backend
python build_index.py
```

Artifacts (in `backend/data/`):

- `wiktionary_data.jsonl` – Raw dump (auto-downloaded if not skipped)
- `wiktionary_index.json` – `{word_lang_code: [byte_offset, ...]}` mapping
- `longest_words.json`, `most_translations.json`, `most_descendants.json` – Derived stats

Retrieval strategy: mmap + seek to the recorded offset, read one JSON line, parse on demand.
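For reference, a minimal sketch of a one-pass offset-index build (assuming the Wiktextract `word` and `lang_code` fields and the key format above; the real `build_index.py` may differ):

```python
# sketch: build {word_lang_code: [byte_offset, ...]} in one scan of the JSONL
import json
from collections import defaultdict

def build_offset_index(jsonl_path: str, index_path: str) -> None:
    index = defaultdict(list)
    with open(jsonl_path, "rb") as f:
        offset = f.tell()  # byte position of the line about to be read
        for raw in iter(f.readline, b""):
            try:
                entry = json.loads(raw)
                index[f"{entry['word']}_{entry['lang_code']}"].append(offset)
            except (json.JSONDecodeError, KeyError):
                pass  # skip malformed or keyless lines
            offset = f.tell()
    with open(index_path, "w", encoding="utf-8") as out:
        json.dump(index, out)
```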
## 6. Quick Start

Local (frontend + backend):

```bash
git clone https://github.com/vialab/WiktionaryViz.git
cd WiktionaryViz
npm install
cd backend && python -m venv .venv && source .venv/bin/activate && pip install -r requirements.txt && cd ..
npm run backend   # starts FastAPI (downloads dataset if needed)
# separate terminal
npm run dev       # starts Vite dev server
```

Docker:

```bash
docker compose up --build -d
```

Backend at http://localhost:8000.
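Once the backend is up, a quick smoke test (the root endpoint returns a health message; see the API table below):

```python
# smoke test: GET / should return the backend's health message
from urllib.request import urlopen

with urlopen("http://localhost:8000/") as resp:
    print(resp.status, resp.read().decode("utf-8"))
```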
Or run the image directly (without Compose):

```bash
docker run -p 8000:8000 \
  -e WIKTIONARY_DATA_URL=https://kaikki.org/dictionary/raw-wiktextract-data.jsonl.gz \
  -e ALLOWED_ORIGINS=* \
  -e OPENAI_API_KEY=<your-openai-key> \
  ghcr.io/vialab/wiktionaryviz-backend:latest
```

Production build preview:

```bash
npm run build
npm run preview
```

## 7. Local Development

Core scripts (`package.json`):
- `npm run dev` – Frontend dev server
- `npm run backend` – FastAPI (reload) via Uvicorn
- `npm run dev:full` – Concurrent backend + frontend
- `npm run build` – Type check + production bundle
- `npm run build:api` – Build with `VITE_API_BACKEND` injected
- `npm run lint` / `format` / `format:check` – Lint & formatting
- `npm run backend:up` / `backend:down` – Compose helpers
- `npm run deploy` – Deploy static site to GitHub Pages

Recommended: Node 20.x, Python 3.11.
## 8. Environment Variables

| Variable | Scope | Required (Prod) | Default | Description |
|---|---|---|---|---|
| `VITE_API_BACKEND` | Frontend build | Yes | (none) | Absolute backend API base baked into bundle |
| `ALLOWED_ORIGINS` | Backend | No | `*` | Comma-separated list for CORS |
| `PORT` | Backend | No | `8000` | Uvicorn port |
| `OPENAI_API_KEY` | Backend | If AI features | (none) | For IPA estimation (fallback) |
| `WIKTIONARY_DATA_URL` | Backend | No | Kaikki URL | Dataset source |
| `SKIP_DOWNLOAD` | Backend | No | `0` | Set `1` to skip auto download |
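For illustration, one way a backend might consume these variables (a sketch; the actual backend may read them differently):

```python
# illustrative env handling mirroring the table above
import os

ALLOWED_ORIGINS = os.getenv("ALLOWED_ORIGINS", "*").split(",")
PORT = int(os.getenv("PORT", "8000"))
DATA_URL = os.getenv(
    "WIKTIONARY_DATA_URL",
    "https://kaikki.org/dictionary/raw-wiktextract-data.jsonl.gz",
)
SKIP_DOWNLOAD = os.getenv("SKIP_DOWNLOAD", "0") == "1"
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")  # optional: only for AI IPA fallback
```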
## 9. Docker Usage

Image: `ghcr.io/vialab/wiktionaryviz-backend:latest`
Features:
- Non-root runtime
- Layer-cached dependency install
- Streaming download + unzip on first run
- Reusable volume to avoid re-download
Compose volume example:

```yaml
volumes:
  wiktionary-data:

services:
  backend:
    image: ghcr.io/vialab/wiktionaryviz-backend:latest
    volumes:
      - wiktionary-data:/app/data
```

## 10. API Reference

Base (dev): http://localhost:8000 – Interactive docs at `/docs`.
| Method | Path | Status | Notes |
|---|---|---|---|
| GET | `/` | stable | Health message |
| GET | `/word-data` | implemented | Returns single lexical entry (raw JSON) |
| GET | `/available-languages` | implemented | Lists languages containing a given surface form |
| GET | `/random-interesting-word` | implemented | Random entry from stats category |
| GET | `/ancestry-chain` | partial | Builds chain; IPA estimation fallback; drift scores computed per link |
| GET | `/phonetic-drift-detailed` | partial | Returns alignment + feature changes (no compact score yet) |
| GET | `/descendant-tree` | partial | Builds tree from given word; traversal heuristics evolving |
| GET | `/descendant-tree-from-root` | partial | Treats provided word/lang as root |
| (future) | `/phonetic-drift` | planned | Compact numeric distance only |
| (future) | `/compare` | planned | Word vs word ancestry + drift |
| (future) | `/ai/*` | planned | Exploration suggestions |
| (future) | `/kwic` | planned | KWIC concordance lines |
Error handling: 404 for missing indexed key, 500 for unexpected exceptions.
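A minimal client sketch against `/word-data` (query parameter names are assumptions; check the interactive `/docs` for the authoritative schema):

```python
# sketch: fetch one raw entry; parameter names assumed, verify via /docs
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE = "http://localhost:8000"

def get_word_data(word: str, lang_code: str) -> dict:
    query = urlencode({"word": word, "lang_code": lang_code})
    with urlopen(f"{BASE}/word-data?{query}") as resp:
        return json.loads(resp.read())

entry = get_word_data("water", "en")
print(entry.get("word"), entry.get("lang_code"))
```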
## 11. Frontend Build & Deployment

GitHub Pages deploy (via workflow): builds with the `VITE_API_BACKEND` secret → publishes `dist/` to `gh-pages` (base path set in `vite.config.ts`).

Local production build:

```bash
VITE_API_BACKEND=https://your-backend.example.org npm run build
```

Serve the `dist/` output (any static host).
## 12. CI / Release Automation

Workflows:

- `release-please.yml` – Conventional Commits → automated versioning & CHANGELOG
- `frontend-deploy.yml` – Static site build + Pages publish
- `backend-docker.yml` – Multi-arch Docker build & GHCR push

Release model: 0.x (minor releases may break). Conventional commit scopes are recommended for clarity.
## 13. Troubleshooting

| Symptom | Cause | Resolution |
|---|---|---|
| 404 for valid word | Index missing or not loaded | Ensure index + stats built (python build_index.py) & restart backend |
| Slow first start | Large dataset download/unpack | Persist data volume / pre-seed dump |
| CORS errors | Origin mismatch | Set ALLOWED_ORIGINS or deploy behind proxy |
| Empty visualizations | Endpoint stubs / partial implementations | Guard frontend fetches / contribute implementations |
| Memory pressure | Large mmap JSONL | Increase container memory / split dataset |
| Mixed content blocked | HTTPS site → HTTP API | Serve API over HTTPS or use proxy |
## 14. Contributing

- Create a focused feature/bug branch (`feat/…`, `fix/…`).
- Use Conventional Commits (e.g., `feat(descendants): add depth limiting`). You can use `oco` to help generate commit messages that follow this standard.
- Run lint + format before pushing: `npm run lint && npm run format:check`.
- Document new or changed endpoints in the README API table.
- Avoid committing large data dumps (use the download flow / volumes).
- Do not manually edit `CHANGELOG.md` (managed by release-please).

Suggested commit scopes: `frontend`, `backend`, `descendants`, `phonology`, `timeline`, `geospatial`, `build`, `ci`, `docs`.
## 15. License & Attribution

- MIT License (see `LICENSE`).
- Data derived from Wiktionary via Wiktextract / Kaikki.org (CC BY-SA 3.0 & GFDL terms apply to the original content).
- Phonological feature data: PanPhon.

Please cite Wiktionary, Wiktextract, and PanPhon in academic outputs.
## 16. Security / Responsible Use

- Do not rely on AI-estimated IPA for authoritative linguistic claims.
- Validate external input before exposing new endpoints publicly.
- Respect Wiktionary licensing when redistributing derived datasets.