Cross-language entity resolution with pluggable entity types, multi-strategy ensemble scoring, and explainable results.
A production-grade entity resolution service that determines whether two entity names refer to the same real-world thing -- even when they're written in different languages, scripts, or formats.
Example: Is 株式会社ソニー the same as Sony Corporation? The engine says yes, with a confidence score, strategy breakdown, and step-by-step explanation of how it got there.
| Capability | Details |
|---|---|
| Pluggable entity types | Company, person, product -- add new types with a single config file |
| Cross-language | Japanese ↔ English via transliteration, NFKC normalization, phonetic encoding |
| 4 scoring strategies | Jaro-Winkler, Levenshtein, Token Sort, Phonetic -- combined via weighted ensemble |
| Trigram blocking | Sub-linear candidate retrieval from 100k+ entities |
| Explainable | Every match includes a full processing trace: language detection → normalization → scoring |
| Batch processing | Async job queue with concurrency control and progress tracking |
| Production-ready | Structured logging, CORS, security headers, health checks, Docker multi-stage build |
               ┌───────────────────────────────────┐
               │          Query: "ソニー"          │
               └─────────────────┬─────────────────┘
                                 │
               ┌─────────────────▼─────────────────┐
               │        Language Detection         │
               │      (Unicode + langdetect)       │
               └─────────────────┬─────────────────┘
                                 │
          ┌──────────────────────┼──────────────────────┐
          ▼                      ▼                      ▼
┌───────────────────┐  ┌───────────────────┐  ┌───────────────────┐
│   Normalization   │  │  Transliteration  │  │ Phonetic Encoding │
│   (NFKC, suffix   │  │    (pykakasi →    │  │  (Soundex-style   │
│    stripping)     │  │      romaji)      │  │  key generation)  │
└─────────┬─────────┘  └─────────┬─────────┘  └─────────┬─────────┘
          └──────────────────────┼──────────────────────┘
                                 │
               ┌─────────────────▼─────────────────┐
               │        Candidate Blocking         │
               │ (trigram overlap + phonetic key)  │
               └─────────────────┬─────────────────┘
                                 │
               ┌─────────────────▼─────────────────┐
               │         Ensemble Scoring          │
               │  ┌────────┬────────┬──────────┐   │
               │  │Jaro-   │Leven-  │Token     │   │
               │  │Winkler │shtein  │Sort      │   │
               │  │(0.30)  │(0.25)  │(0.25)    │   │
               │  └────────┴────────┴──────────┘   │
               │  ┌────────────────────────────┐   │
               │  │Phonetic Match (0.20)       │   │
               │  └────────────────────────────┘   │
               └─────────────────┬─────────────────┘
                                 │
               ┌─────────────────▼─────────────────┐
               │     Ranked Results + Explain      │
               └───────────────────────────────────┘
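The Language Detection stage in the diagram combines Unicode inspection with langdetect. As a rough illustration of the Unicode half only, the sketch below classifies a name by the code-point ranges of its characters. The function name `detect_script` and its return labels are hypothetical, and the engine's actual `language.py` may work differently.

```python
def detect_script(text: str) -> str:
    """Classify text by script using Unicode code-point ranges (illustrative only)."""
    has_kana = has_han = has_latin = False
    for ch in text:
        cp = ord(ch)
        # Hiragana and Katakana blocks, plus halfwidth Katakana.
        if 0x3040 <= cp <= 0x30FF or 0xFF66 <= cp <= 0xFF9F:
            has_kana = True
        # CJK Unified Ideographs (kanji, also used by Chinese).
        elif 0x4E00 <= cp <= 0x9FFF:
            has_han = True
        elif ch.isascii() and ch.isalpha():
            has_latin = True
    if has_kana:
        return "ja"      # kana is a strong signal for Japanese
    if has_han:
        return "han"     # ideographs alone are ambiguous between languages
    if has_latin:
        return "latin"
    return "unknown"


print(detect_script("株式会社ソニー"))    # ja
print(detect_script("Sony Corporation"))  # latin
```

Ideograph-only names are ambiguous between Japanese and Chinese, which is presumably where a statistical detector such as langdetect helps.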
# Clone
git clone https://github.com/pradhankukiran/entity-resolution-engine.git
cd entity-resolution-engine
# Install
make install
# Seed the database with Japanese corporate registry data
make seed
# Start the dev server
make dev
Open http://localhost:8000 for the UI, or http://localhost:8000/docs for the interactive API docs.
Alternatively, run it with Docker:
docker compose up --build
All entity types are accessible under /v1/{entity_type}/:
POST /v1/{entity_type}/search Search entities by name
POST /v1/{entity_type}/match Compare two names directly
POST /v1/{entity_type}/batch Submit batch search job
GET /v1/{entity_type}/batch/{job_id} Poll batch job status
GET /health Liveness check
GET /stats Database statistics
Backward-compatible routes (/search, /match, /batch) default to the company entity type.
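The routes above can also be called from Python. The following is a minimal client sketch using only the standard library. The helper names `build_request` and `post_json` are hypothetical, the base URL assumes the local dev server, and the payload fields are taken from the curl examples in this README.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: local dev server from `make dev`


def build_request(entity_type: str, action: str, payload: dict) -> urllib.request.Request:
    """Build a POST request for /v1/{entity_type}/{action} with a JSON body."""
    url = f"{BASE_URL}/v1/{entity_type}/{action}"
    # ensure_ascii=False keeps Japanese text readable; the body is sent as UTF-8.
    body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_json(req: urllib.request.Request, timeout: float = 10.0) -> dict:
    """Send the request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


search_req = build_request("company", "search", {"query": "ソニー", "limit": 5})
print(search_req.full_url)
```

With the server running, `post_json(search_req)` would return the parsed response body.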
Search:
curl -s -X POST http://localhost:8000/v1/company/search \
  -H "Content-Type: application/json" \
  -d '{"query": "ソニー", "limit": 5}'
Response:
{
  "query": "ソニー",
  "entity_type": "company",
  "detected_language": "ja",
  "matches": [
    {
      "rank": 1,
      "entity_name": "ソニーグループ株式会社",
      "score": 0.87,
      "strategy_scores": [...]
    }
  ]
}
Compare two names:
curl -s -X POST http://localhost:8000/v1/company/match \
  -H "Content-Type: application/json" \
  -d '{"name_a": "Sony Corporation", "name_b": "ソニー株式会社"}'
Response:
{
  "name_a": "Sony Corporation",
  "name_b": "ソニー株式会社",
  "final_score": 0.72,
  "strategy_scores": [
    {"strategy_name": "jaro_winkler", "score": 0.68},
    {"strategy_name": "levenshtein", "score": 0.55},
    {"strategy_name": "token_sort", "score": 0.65},
    {"strategy_name": "phonetic", "score": 1.0}
  ]
}
Adding support for a new entity type (person, product, etc.) requires one config file and a registry entry, with no changes to the pipeline, scoring, or API layers.
1. Define the config:
# src/entity_resolution/entity_types/person.py
from entity_resolution.entity_types.config import EntityTypeConfig, FieldDef


def person_candidate_forms(row):
    """Collect the comparable name forms for one candidate row."""
    forms = {"name_normalized": row.get("name_normalized") or ""}
    # Include the phonetic key only when the row has one.
    if row.get("phonetic_key"):
        forms["phonetic_key"] = row["phonetic_key"]
    return forms


PERSON_CONFIG = EntityTypeConfig(
    type_name="person",
    display_name="Person",
    table_name="persons",
    ngram_table_name="person_ngrams",
    id_column="person_id",
    db_fields=[
        FieldDef("full_name", "TEXT", nullable=False),
        FieldDef("name_normalized", "TEXT", nullable=False),
        FieldDef("given_name", "TEXT"),
        FieldDef("family_name", "TEXT"),
        FieldDef("phonetic_key", "TEXT", indexed=True),
    ],
    display_name_field="full_name",
    ngram_source_fields=["name_normalized"],
    text_form_pairs=[("normalized", "name_normalized")],
    phonetic_form_pairs=[("phonetic", "phonetic_key")],
    candidate_form_extractor=person_candidate_forms,
)
2. Register it in EntityTypeRegistry.default().
3. Done. The engine auto-creates tables and exposes /v1/person/search, /v1/person/match, etc.
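To give a feel for how tables could be created from a config like the one above, here is a small sketch that turns field definitions into SQLite DDL. It is not the engine's `query_builder.py`: `FieldSpec` is a hypothetical stand-in for `FieldDef`, `build_ddl` is an invented helper, and the `INTEGER PRIMARY KEY` id column is an assumption.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    """Hypothetical stand-in for FieldDef: name, SQL type, nullability, index flag."""
    name: str
    sql_type: str
    nullable: bool = True
    indexed: bool = False


def build_ddl(table: str, id_column: str, fields: list[FieldSpec]) -> list[str]:
    """Return CREATE TABLE / CREATE INDEX statements for the given fields.

    Identifiers come from trusted config, never from user input, so they are
    interpolated directly into the SQL text.
    """
    columns = [f"{id_column} INTEGER PRIMARY KEY"]  # assumption: integer id
    for field in fields:
        not_null = "" if field.nullable else " NOT NULL"
        columns.append(f"{field.name} {field.sql_type}{not_null}")
    statements = [f"CREATE TABLE IF NOT EXISTS {table} ({', '.join(columns)})"]
    for field in fields:
        if field.indexed:
            statements.append(
                f"CREATE INDEX IF NOT EXISTS idx_{table}_{field.name} "
                f"ON {table} ({field.name})"
            )
    return statements


person_fields = [
    FieldSpec("full_name", "TEXT", nullable=False),
    FieldSpec("name_normalized", "TEXT", nullable=False),
    FieldSpec("given_name", "TEXT"),
    FieldSpec("family_name", "TEXT"),
    FieldSpec("phonetic_key", "TEXT", indexed=True),
]

conn = sqlite3.connect(":memory:")
for statement in build_ddl("persons", "person_id", person_fields):
    conn.execute(statement)
```

Driving the schema from the config is what lets a new entity type avoid hand-written SQL.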
| Strategy | Weight | What it measures |
|---|---|---|
| Jaro-Winkler | 0.30 | Character-level similarity with prefix bonus |
| Levenshtein | 0.25 | Normalized edit distance |
| Token Sort | 0.25 | Similarity after alphabetical token reordering |
| Phonetic | 0.20 | Soundex-style phonetic key comparison |
The ensemble scorer tries every compatible (query form, candidate form) pair per strategy and takes the best score. The final score is a weighted average across all strategies.
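The combination rule described above can be sketched in a few lines. The weights come from the table, while the per-pair scores are made-up inputs. Skipping a strategy that has no compatible form pair and renormalizing the remaining weights is an assumption, not confirmed engine behavior.

```python
WEIGHTS = {
    "jaro_winkler": 0.30,
    "levenshtein": 0.25,
    "token_sort": 0.25,
    "phonetic": 0.20,
}


def ensemble_score(pair_scores: dict[str, list[float]]) -> float:
    """Combine strategies: best form-pair score per strategy, then weighted average.

    pair_scores maps a strategy name to the scores of every compatible
    (query form, candidate form) pair for that strategy.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for name, weight in WEIGHTS.items():
        scores = pair_scores.get(name)
        if not scores:
            continue  # assumption: strategies with no compatible pair are skipped
        weighted_sum += weight * max(scores)  # best pair wins for this strategy
        total_weight += weight
    return weighted_sum / total_weight if total_weight else 0.0


# Hypothetical scores: two form pairs for some strategies, one for others.
example = {
    "jaro_winkler": [0.62, 0.90],
    "levenshtein": [0.80],
    "token_sort": [0.70, 0.85],
    "phonetic": [1.0],
}
print(round(ensemble_score(example), 4))  # 0.8825
```

Taking the maximum per strategy means one well-matching form (for example the romaji form) is enough for that strategy to contribute fully.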
To avoid scoring a query against every stored entity (a full scan per query, and O(n^2) for all-pairs comparison), the engine uses a two-phase blocking strategy:
- Trigram overlap -- character trigrams from the query are matched against a prebuilt ngram index
- Phonetic key -- exact and prefix matching on Soundex-style keys
This narrows 100k+ entities to ~200 candidates before scoring.
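The trigram phase can be illustrated with an in-memory inverted index. This is only a sketch: the engine keeps its ngram index in SQLite, and the padding scheme, the `min_overlap` threshold, and the function names here are assumptions.

```python
from collections import Counter, defaultdict


def trigrams(text: str) -> set[str]:
    """Character trigrams of the lowercased text, padded so short names still yield grams."""
    padded = f"  {text.lower()} "  # padding scheme is an assumption
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def build_index(names: dict[int, str]) -> dict[str, set[int]]:
    """Inverted index: trigram -> ids of the entities whose name contains it."""
    index: dict[str, set[int]] = defaultdict(set)
    for entity_id, name in names.items():
        for gram in trigrams(name):
            index[gram].add(entity_id)
    return index


def candidates(query: str, index: dict[str, set[int]],
               limit: int = 200, min_overlap: int = 2) -> list[int]:
    """Ids sharing at least min_overlap trigrams with the query, best overlap first."""
    overlap: Counter[int] = Counter()
    for gram in trigrams(query):
        for entity_id in index.get(gram, ()):
            overlap[entity_id] += 1
    return [eid for eid, count in overlap.most_common(limit) if count >= min_overlap]


names = {1: "sony", 2: "sonic", 3: "toyota", 4: "sanyo"}
index = build_index(names)
print(candidates("sony", index))  # [1, 2]
```

Only entities that share trigrams with the query are ever touched, so the expensive scorers run on a short list rather than the full table.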
For Japanese queries:
株式会社ソニー → [strip suffix] → ソニー → [transliterate] → sonii → [phonetic] → S500xx
For English queries:
Sony Corporation → [strip suffix] → sony → [normalize] → sony → [phonetic] → S500xx
Both produce comparable phonetic keys (S500xx), enabling cross-language matching.
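The two paths above can be approximated with the standard library. The sketch below applies NFKC normalization, strips a few legal-form tokens, and computes a classic four-character Soundex code. The legal-form list is a hypothetical sample, transliteration is skipped (the romaji `sonii` is passed in directly instead of calling pykakasi), and classic Soundex yields `S500` rather than the engine's `S500xx` key format.

```python
import re
import unicodedata

LATIN_LEGAL_FORMS = {"corporation", "corp", "inc", "ltd", "co"}  # hypothetical sample
SOUNDEX_CODES = {
    **dict.fromkeys("bfpv", "1"),
    **dict.fromkeys("cgjkqsxz", "2"),
    **dict.fromkeys("dt", "3"),
    "l": "4",
    **dict.fromkeys("mn", "5"),
    "r": "6",
}


def normalize(name: str) -> str:
    """NFKC-fold, lowercase, and drop legal-form designators and punctuation."""
    text = unicodedata.normalize("NFKC", name).lower()
    text = text.replace("株式会社", " ")  # Japanese designator, prefix or suffix
    tokens = re.sub(r"[^\w\s]", " ", text).split()
    return " ".join(t for t in tokens if t not in LATIN_LEGAL_FORMS)


def soundex(word: str) -> str:
    """Classic Soundex: first letter plus three digits, zero-padded."""
    letters = [c for c in word.lower() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    key = letters[0].upper()
    prev = SOUNDEX_CODES.get(letters[0], "")
    for c in letters[1:]:
        code = SOUNDEX_CODES.get(c, "")
        if code and code != prev:
            key += code
        if c not in "hw":  # h and w do not separate repeated codes
            prev = code
    return (key + "000")[:4]


print(normalize("株式会社ソニー"))             # ソニー
print(soundex(normalize("Sony Corporation")))  # S500
print(soundex("sonii"))                        # S500
```

Because Soundex ignores vowels after the first letter, `sony` and `sonii` collapse to the same key, which is what makes the phonetic comparison work across scripts.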
| Component | Technology |
|---|---|
| API | FastAPI, Pydantic v2, Uvicorn |
| Database | SQLite (async via aiosqlite) |
| Matching | RapidFuzz, custom phonetic encoder |
| Japanese NLP | pykakasi (transliteration), langdetect |
| Logging | structlog (JSON) |
| Linting | Ruff, mypy |
| Testing | pytest, pytest-asyncio |
| Container | Docker (multi-stage, non-root) |
make install # Install with dev dependencies
make dev # Start dev server with hot reload
make test # Run test suite (138 tests)
make lint # Lint with ruff
make format # Auto-format with ruff
make typecheck # Type check with mypy
make check # All of the above
make coverage # Tests with HTML coverage report
make seed # Seed DB with NTA corporate registry
make clean # Remove build artifacts
src/entity_resolution/
  entity_types/          # Pluggable entity type definitions
    config.py            # EntityTypeConfig, EntityRecord, Registry
    company.py           # Company entity config (JP corporate registry)
  api/
    routers/
      entity.py          # Generic /v1/{entity_type}/ routes
      search.py          # Backward-compat /search
      match.py           # Backward-compat /match
      batch.py           # Backward-compat /batch
      health.py          # /health, /stats
    schemas.py           # Pydantic request/response models
    middleware.py        # Logging + security headers
  matching/
    base.py              # MatchStrategy ABC
    ensemble.py          # Weighted multi-strategy scorer
    registry.py          # Strategy registry
    jaro_winkler.py      # Jaro-Winkler strategy
    levenshtein.py       # Levenshtein strategy
    token_sort.py        # Token sort strategy
    phonetic_match.py    # Phonetic key strategy
  normalization/
    normalizer.py        # NFKC, suffix stripping, whitespace
    transliterator.py    # Japanese → romaji (pykakasi)
    phonetic.py          # Soundex-style phonetic encoder
    language.py          # Language/script detection
  pipeline/
    pipeline.py          # Main orchestrator
    blocker.py           # Trigram + phonetic candidate blocking
    explainer.py         # Step-by-step explanation builder
  batch/
    manager.py           # Async batch job queue
  db/
    database.py          # Async SQLite wrapper
    query_builder.py     # Dynamic SQL from entity config
    models.py            # Legacy Company model
    queries.py           # Legacy hardcoded SQL
  core/
    config.py            # Pydantic settings
    dependencies.py      # FastAPI DI singletons
    logging.py           # structlog setup