Domain similarity and typosquatting detection service for phishing analysis.
- Typosquatting search - Generate and match domain variations
- Similarity search - Levenshtein, Jaro-Winkler algorithms
- Keyword search - Find domains containing specific keywords
- Homograph detection - Unicode/IDN attack detection
- Optimized queries - Length-based pre-filtering for performance
- Python 3.11+
- MongoDB 5.0+ (shared database with zone-collector)
pip install -r requirements.txt

Create a .env file:
MONGODB_URL=mongodb://user:pass@localhost:27017/
DATABASE_NAME=icann_tlds_db

uvicorn app.main:app --reload --port 8003

docker build -t similarity-engine .
docker run -p 8003:8000 --env-file .env similarity-engine

| Endpoint | Method | Description |
|---|---|---|
| /health | GET | Health check |
| /algorithms | GET | Available algorithms |

| Endpoint | Method | Description |
|---|---|---|
| /search/typosquatting | POST | Typosquatting variations |
| /search/similarity | POST | String similarity search |
| /search/keyword | POST | Keyword inclusion search |

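As a rough illustration of calling the search endpoints, the sketch below builds a JSON POST request with Python's standard library. The base URL `http://localhost:8003` assumes the service was started locally on port 8003 as in the uvicorn command; `build_search_request` and `send` are hypothetical helper names for this example, not part of the service.

```python
import json
import urllib.request

# Assumption: the service runs locally on the port used by the uvicorn command.
BASE_URL = "http://localhost:8003"


def build_search_request(path: str, payload: dict, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for a search endpoint."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 10.0) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Build a keyword search request; call send(request) once the service is up.
request = build_search_request(
    "/search/keyword",
    {"keyword": "google", "days_back": 7, "tlds": ["com", "net"], "limit": 500},
)
print(request.get_method(), request.full_url)
```

Building and sending are kept separate so the payload can be inspected or logged before any network call is made.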
Search for typosquatting variations of a brand.
Request:

{
  "brand_name": "google.com",
  "days_back": 7,
  "algorithms": ["omission", "homoglyph"],
  "tlds": ["com", "net"]
}

Response:

{
  "brand": "google.com",
  "brand_extracted": "google",
  "total_variations": 150,
  "matched_variations": 5,
  "variations": [
    {
      "variation": "gogle",
      "matched": true,
      "matches": [{ "fqdn": "gogle.com", "first_seen": "..." }]
    }
  ]
}

Search for similar domains using string similarity algorithms.
Request:

{
  "brand_name": "google.com",
  "days_back": 7,
  "levenshtein_threshold": 0.70,
  "jaro_winkler_threshold": 0.75,
  "homograph_enabled": true,
  "tlds": ["com", "net"]
}

Response:

{
  "brand_extracted": "google",
  "domains_scanned": 1500,
  "results": {
    "levenshtein": [{ "domain": "gogle", "similarity": 0.83 }],
    "jaro_winkler": [{ "domain": "googel", "similarity": 0.94 }],
    "homograph": [{ "domain": "gооgle", "risk_level": "critical" }]
  }
}

Similarity search uses length-based pre-filtering:
brand: "google" (6 chars)
length_tolerance: 3
→ Only scans domains with 3-9 characters
→ Reduces computation by 90%+
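The pre-filter above can be sketched in a few lines. This is a minimal illustration, not the service's actual query code: the helper names and the `length` field in the MongoDB filter are assumptions made for the example.

```python
def length_bounds(brand: str, tolerance: int = 3) -> tuple[int, int]:
    """Return the inclusive (min, max) domain lengths worth scanning."""
    low = max(1, len(brand) - tolerance)
    high = len(brand) + tolerance
    return low, high


def prefilter(candidates: list[str], brand: str, tolerance: int = 3) -> list[str]:
    """Keep only candidates whose length falls inside the tolerance window."""
    low, high = length_bounds(brand, tolerance)
    return [name for name in candidates if low <= len(name) <= high]


def mongo_length_filter(brand: str, tolerance: int = 3, field: str = "length") -> dict:
    """Build a MongoDB range filter; the stored `length` field is hypothetical."""
    low, high = length_bounds(brand, tolerance)
    return {field: {"$gte": low, "$lte": high}}


print(length_bounds("google"))  # (3, 9)
print(prefilter(["go", "gogle", "goooogle", "google-login"], "google"))
print(mongo_length_filter("google"))
```

Pushing the length check into the database query, rather than filtering in Python, means the expensive similarity functions only ever see candidates that could plausibly pass the threshold.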
Find domains CONTAINING a specific keyword.
Request:

{
  "keyword": "google",
  "days_back": 7,
  "tlds": ["com", "net"],
  "limit": 500
}

Response:

{
  "keyword": "google",
  "total_matches": 25,
  "matches": [
    { "domain": "mygoogle", "fqdn": "mygoogle.com" },
    { "domain": "google-login", "fqdn": "google-login.net" }
  ]
}

| Algorithm | Example |
|---|---|
| omission | google → gogle |
| repetition | google → gooogle |
| replacement | google → goagle |
| homoglyph | google → g00gle |
| addition | google → googlee |
| vowel_swap | google → guugle |
| numeral_swap | one → 1 |

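To make the table concrete, here is a minimal sketch of three of the listed algorithms. It is an illustration only: the function names are hypothetical, and the exact rules used by the service (for example, whether `addition` appends only at the end of the label) are assumptions inferred from the examples in the table.

```python
import string


def omission(label: str) -> set[str]:
    """Drop one character at each position: google -> gogle."""
    return {label[:i] + label[i + 1:] for i in range(len(label))}


def repetition(label: str) -> set[str]:
    """Double one character at each position: google -> gooogle."""
    return {label[:i] + label[i] + label[i:] for i in range(len(label))}


def addition(label: str, alphabet: str = string.ascii_lowercase) -> set[str]:
    """Append one extra character (assumed: end of label only): google -> googlee."""
    return {label + ch for ch in alphabet}


def generate(label: str) -> dict[str, set[str]]:
    """Run each generator and discard any variation equal to the original label."""
    generators = {"omission": omission, "repetition": repetition, "addition": addition}
    return {name: fn(label) - {label} for name, fn in generators.items()}


variations = generate("google")
for name, values in variations.items():
    print(name, len(values))
```

Returning sets removes duplicates automatically, since several positions can produce the same string (doubling either `o` in `google` yields `gooogle`).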
Levenshtein: edit-distance similarity. Default threshold: 0.70
"google" vs "gogle" → 0.83 (1 deletion)
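The 0.83 score is consistent with normalizing the edit distance by the length of the longer string (1 - 1/6). The sketch below assumes that normalization; the service's own implementation may differ, and the function names are chosen for this example.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, keeping one row at a time."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string as the row to save memory
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest


print(round(levenshtein_similarity("google", "gogle"), 2))  # 0.83
```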
Jaro-Winkler: prefix-weighted similarity. Default threshold: 0.75
"google" vs "googel" → 0.94 (strong prefix)
Homograph: Unicode look-alike detection:
"google" vs "gооgle" (Cyrillic о) → Detected
Risk levels: critical, high, medium, low
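One simple way to detect such look-alikes is to map confusable characters back to their Latin counterparts and compare the result with the brand. The sketch below assumes that approach and uses a deliberately small, illustrative confusables table; it is not the service's detection logic, and it does not attempt to assign the risk levels listed above.

```python
import unicodedata

# Illustrative subset of Cyrillic letters that resemble Latin ones (not exhaustive).
CONFUSABLES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u0445": "x",  # CYRILLIC SMALL LETTER HA
}


def skeleton(label: str) -> str:
    """Replace known confusable characters with their Latin look-alikes."""
    return "".join(CONFUSABLES.get(ch, ch) for ch in label)


def non_ascii_chars(label: str) -> list[tuple[str, str]]:
    """List each non-ASCII character with its Unicode name, for reporting."""
    return [(ch, unicodedata.name(ch, "UNKNOWN")) for ch in label if ord(ch) > 127]


def is_homograph(brand: str, candidate: str) -> bool:
    """True if candidate differs from brand but looks identical after mapping."""
    return (
        candidate != brand
        and bool(non_ascii_chars(candidate))
        and skeleton(candidate) == brand
    )


suspect = "g\u043e\u043egle"  # "google" with two Cyrillic o characters
print(is_homograph("google", suspect))
print(non_ascii_chars(suspect))
```

A production detector would need a much larger mapping, such as the Unicode confusables data, and would also have to handle Punycode-encoded (`xn--`) labels.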
similarity-engine/
├── app/
│   ├── main.py
│   ├── config.py
│   ├── api/routes.py             # Endpoints + request models
│   ├── database/mongodb.py       # Optimized queries
│   └── services/
│       ├── typosquatting.py      # Variation generator
│       └── string_similarity.py  # Similarity algorithms
└── requirements.txt

| Operation | Time |
|---|---|
| Typosquatting | ~100ms |
| Similarity (with filter) | ~500ms |
| Keyword search | ~50ms |
