IPABrew is a web application for reconstructing proto-language forms from cognate words across related languages. It combines neural ML models with traditional linguistic algorithms to infer ancestral word forms, visualize language family trees, and estimate divergence dates.
This project was built in about 6 hours for a hackathon and was substantially AI-assisted.
Prerequisites: Python 3.11+ (3.12 recommended), pip
git clone https://github.com/LabGuy94/IPABrew.git && cd IPABrew
pip install -r requirements.txt
cd backend && python run.py
Open http://localhost:8080.
Flask backend serves a single-page vanilla JS frontend. No build step.
| Path | Role |
|---|---|
| `backend/app/__init__.py` | Flask app factory, loads DPD model at startup |
| `backend/app/routes.py` | All API endpoints (Blueprint mounted at /api) |
| `backend/app/reconstruction.py` | Core reconstruction logic (both ML and algorithmic paths) |
| `backend/app/services/dpd_service.py` | DPD model loading and inference singleton |
| `backend/app/glottochronology.py` | Divergence dating (Swadesh formula + NED mapping) |
| `backend/app/ipa_utils.py` | IPA feature distance calculations (panphon wrapper) |
| `backend/app/dpd/` | Vendored DPD model code (from cmu-llab/dpd) |
| `backend/app/templates/index.html` | Single-page app HTML |
| `backend/app/static/js/app.js` | All frontend logic |
| `backend/app/static/css/style.css` | All styles |
| `backend/data/romance_ipa.tsv` | Meloni Romance dataset (5 languages + Latin, IPA) |
| `model/` | Trained model checkpoint and data (see model/README.md) |
Bidirectional Transformer (daughter-to-proto + proto-to-daughter) from Lu, Xie & Mortensen (2024), based on cmu-llab/dpd.
- Trained on combined SIGTYP 2022 + WikiHan datasets
- Tokenizes IPA input, encodes daughter forms with language embeddings, decodes proto-form
- Default method when `model/checkpoints/epoch34.ckpt` is present
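A minimal sketch of how the default could be resolved, assuming the method names "ml" and "algorithm" accepted by /api/reconstruct_tree and a boolean availability flag like the one /api/model/status reports. `choose_method` is a hypothetical helper for illustration, not a function from the codebase.

```python
def choose_method(requested, model_available):
    """Return "ml" or "algorithm" for a reconstruction request.

    requested: "ml", "algorithm", or None when the caller did not choose.
    model_available: True when the DPD checkpoint was loaded.
    """
    if requested not in (None, "ml", "algorithm"):
        raise ValueError(f"unknown method: {requested!r}")
    # An explicit request for the algorithmic path is always honoured.
    if requested == "algorithm":
        return "algorithm"
    # No preference, or an explicit "ml": use the model only if it is loaded.
    return "ml" if model_available else "algorithm"


print(choose_method(None, model_available=True))   # ml
print(choose_method("ml", model_available=False))  # algorithm
```

Whether the real backend silently falls back or returns an error when "ml" is requested without a checkpoint is not stated here; the silent fallback above is an assumption.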
Multiple sequence alignment (SCA algorithm) + majority-vote reconstruction.
- Uses LingPy's `Multiple.prog_align()` for phonologically-informed alignment
- Votes on each alignment column to determine proto-segment
- Fallback when ML model is unavailable, or selectable by the user
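The column-voting step can be illustrated without LingPy. The sketch below assumes the alignment is already available as equal-length rows of segments with "-" marking gaps. The tie-breaking rule (first segment seen wins) and the handling of gap majorities (the column is dropped) are assumptions, and the rows are a made-up toy alignment, not dataset entries.

```python
from collections import Counter

GAP = "-"  # assumed gap symbol in the aligned rows


def majority_vote(aligned_rows):
    """Pick the most frequent symbol in each alignment column."""
    if not aligned_rows:
        return []
    width = len(aligned_rows[0])
    if any(len(row) != width for row in aligned_rows):
        raise ValueError("all aligned rows must have the same length")

    proto = []
    for column in zip(*aligned_rows):
        # most_common keeps first-seen order among ties.
        winner, _count = Counter(column).most_common(1)[0]
        # A column where the gap wins contributes no proto-segment.
        if winner != GAP:
            proto.append(winner)
    return proto


rows = [
    ["p", "a", "t", "e", "r"],
    ["p", "a", "d", "e", "-"],
    ["f", "a", "t", "e", "r"],
]
print("".join(majority_vote(rows)))  # pater
```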
Swadesh retention rate formula:
t = ln(c) / (2 * ln(r)) * 1000
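where c is the proportion of cognates retained (0 < c ≤ 1), r = 0.86 is the retention rate per millennium, and t is the time since divergence in years (the factor of 1000 converts millennia to years).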
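As a worked example, the formula can be evaluated directly. This is a sketch of the arithmetic only; `divergence_years` is a hypothetical name and is not taken from glottochronology.py.

```python
import math

RETENTION_RATE = 0.86  # r, as given above


def divergence_years(c, r=RETENTION_RATE):
    """Years since divergence: t = ln(c) / (2 * ln(r)) * 1000."""
    if not 0 < c <= 1:
        raise ValueError("c must be in the interval (0, 1]")
    return math.log(c) / (2 * math.log(r)) * 1000


# 60% shared cognates:
print(round(divergence_years(0.6)))  # 1693
```

A quick sanity check: if each of two languages independently keeps 86% of its vocabulary over one millennium, the shared proportion is 0.86² ≈ 0.74, and the formula returns 1,000 years.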
NED-to-years mapping, for estimating divergence from the phonological distance (NED, normalized edit distance) of an individual word pair:
| NED Range | Estimated Age | Category |
|---|---|---|
| 0.0 -- 0.1 | 0 -- 500 years | Dialects |
| 0.1 -- 0.3 | 500 -- 1,500 years | Romance languages |
| 0.3 -- 0.5 | 1,500 -- 3,000 years | Germanic family |
| 0.5 -- 0.7 | 3,000 -- 5,000 years | IE subfamilies |
| 0.7+ | 5,000+ years | Deep/uncertain |
Calibration dates: Romance ~1,750 yrs, Proto-Germanic ~2,500 yrs, Balto-Slavic ~3,500 yrs, PIE ~6,000 yrs.
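The NED table can be read as a simple band lookup. The sketch below encodes the rows as they appear in the table. How the real code treats values that sit exactly on a boundary (0.1, 0.3, ...) is not stated, so assigning them to the older band is an assumption, and `ned_band` is a hypothetical name.

```python
# (upper NED bound, (min years, max years), category), from the table above.
NED_BANDS = [
    (0.1, (0, 500), "Dialects"),
    (0.3, (500, 1500), "Romance languages"),
    (0.5, (1500, 3000), "Germanic family"),
    (0.7, (3000, 5000), "IE subfamilies"),
]


def ned_band(ned):
    """Return ((min_years, max_years), category) for a NED value in [0, 1]."""
    if not 0 <= ned <= 1:
        raise ValueError("NED must be between 0 and 1")
    for upper, years, category in NED_BANDS:
        # Assumption: a value equal to a boundary falls in the older band.
        if ned < upper:
            return years, category
    # 0.7 and above: no upper age bound is given.
    return (5000, None), "Deep/uncertain"


print(ned_band(0.2))  # ((500, 1500), 'Romance languages')
```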
- Tree editor: build custom language family trees with drag-and-drop
- D3.js v7 visualization of reconstructed proto-form trees
- IPA keyboard for phonetic input
- Demo datasets for 6+ language families
- Single-page app, no framework, no build step
Bundled Meloni Romance dataset (backend/data/romance_ipa.tsv): cognate sets across Romanian, French, Italian, Spanish, Portuguese with Latin proto-forms in IPA transcription.
All endpoints are under /api.
| Method | Path | Description |
|---|---|---|
| GET | `/api/health` | Health check |
| GET | `/api/dataset/sample` | Get sample entries from Romance dataset. Query params: count (default 20), offset (default 0) |
| GET | `/api/dataset/search` | Search dataset by IPA substring. Query params: q (required), limit (default 20) |
| POST | `/api/reconstruct` | Reconstruct proto-form. Body: {"words": [...], "languages": [...]} or {"index": N} for dataset entry |
| POST | `/api/reconstruct_tree` | Reconstruct full tree bottom-up. Body: {"tree": {...}, "method": "ml"} (method is "ml" or "algorithm") |
| GET | `/api/model/status` | Check if DPD ML model is loaded. Returns {"available": true/false} |
| POST | `/api/align` | Align IPA words. Body: {"words": [...]} |
| POST | `/api/ipa/distance` | Compute phonological distance between two words. Body: {"word1": "...", "word2": "..."} |
| POST | `/api/ipa/features` | Get articulatory feature vectors for a word. Body: {"word": "..."} |
| POST | `/api/date` | Estimate divergence date. Body: {"cognate_pct": 0.6} or {"ned": 0.3} |
| GET | `/api/date/curve` | Get retention rate curve data |
| GET | `/api/date/calibration` | Get calibration dates for known language splits |
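A minimal client sketch for the POST endpoints, using only the standard library and assuming the dev server is running on http://localhost:8080. The request bodies follow the table above. The response schemas are not documented here, so the example prints whatever JSON comes back. `build_post`, `call`, and `main` are hypothetical helper names, and the two example words are arbitrary.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # dev server address from the setup steps


def build_post(path, payload):
    """Build a JSON POST request for an /api endpoint (does not send it)."""
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call(request):
    """Send the request and decode the JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def main():
    # Requires a running server; response shapes are printed, not assumed.
    print(call(build_post("/api/date", {"cognate_pct": 0.6})))
    print(call(build_post("/api/ipa/distance",
                          {"word1": "pater", "word2": "padre"})))
```

Calling `main()` with the server running sends both requests. Keeping request construction separate from sending makes the payloads easy to inspect or test offline.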
IPABrew/
├── backend/
│ ├── app/
│ │ ├── dpd/ # Vendored DPD model code
│ │ │ ├── lib/ # Data loading, utilities
│ │ │ └── models/ # Neural model definitions
│ │ ├── services/
│ │ │ └── dpd_service.py # Model singleton
│ │ ├── static/
│ │ │ ├── css/style.css
│ │ │ └── js/app.js
│ │ ├── templates/
│ │ │ └── index.html
│ │ ├── __init__.py # App factory
│ │ ├── glottochronology.py
│ │ ├── ipa_utils.py
│ │ ├── reconstruction.py
│ │ └── routes.py
│ ├── data/
│ │ └── romance_ipa.tsv
│ ├── tests/ # pytest test suite
│ └── run.py # Dev server entrypoint
├── model/
│ ├── checkpoints/
│ │ ├── epoch34.ckpt # 77MB trained model
│ │ └── model_config.yaml # Training configuration
│ ├── data/
│ │ └── combined/ # Training/eval data (pickle)
│ └── notebooks/
│ └── train_dpd.ipynb # Google Colab training notebook
├── requirements.txt # All Python dependencies
└── README.md
- Dev server: `cd backend && python run.py` (Flask debug mode on port 8080)
- The DPD model loads eagerly at startup. First boot takes a few seconds.
- To add demo language families, add data files and update the frontend's family selector in `app.js`.
- Model checkpoint and training details: see `model/README.md`.
- DPD BiReconstructor: Lu, Xie & Mortensen (2024). "Semisupervised Neural Proto-Language Reconstruction." ACL 2024. cmu-llab/dpd
- LingPy: List & Forkel. Sequence comparison in computational historical linguistics. lingpy.org
- panphon: Mortensen, Littell, Bharadwaj, Goyal, Dyer & Levin (2016). "PanPhon: A Resource for Mapping IPA Segments to Articulatory Feature Vectors." COLING 2016. github.com/dmort27/panphon
- Meloni Romance dataset: Meloni et al. (2021). "Ab Antiquo: Neural Proto-language Reconstruction." NAACL 2021.
