Feat/semantica triplet dedup v2 336 by ZohaibHassan16 · Pull Request #340 · Hawksight-AI/semantica

ZohaibHassan16 · 2026-02-21T13:38:36Z

Description

This PR completes the Deduplication v2 Epic by introducing an opt-in semantic relationship deduplication mode (semantica_v2). This moves triplet matching away from naive string comparisons toward canonicalization and weighted semantic scoring, all backed by a highly optimized fast-path hash check.

Type of Change

✅ New feature (non-breaking change which adds functionality)
✅ Performance improvement
✅ Code refactoring

Related Issues

Resolves [Sub-Issue 3] Semantic Relationship + Triplet Dedup v2 #336
Completes Epic [EPIC] Deduplication v2: Higher Accuracy, Lower Latency, Backward Compatible #333

Changes Made

1. Canonicalization Engine

Implemented predicate_synonym_map and literal_normalization_enabled support to align semantic equivalents (e.g., works_for == employed_by).
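For readers who want a concrete picture, here is a minimal sketch of what canonicalization along these lines could look like. It is not the code from this PR: the synonym entries, the helper names (canonical_predicate, normalize_literal, canonicalize) and the choice to normalize the subject as well are all assumptions made for illustration. Only predicate_synonym_map and literal_normalization_enabled are names taken from the description above.

```python
import re

# Hypothetical synonym table: maps predicate variants to one canonical form.
PREDICATE_SYNONYM_MAP = {
    "employed_by": "works_for",
    "works_at": "works_for",
}


def canonical_predicate(predicate, predicate_synonym_map=PREDICATE_SYNONYM_MAP):
    """Lowercase, unify separators to underscores, then resolve synonyms."""
    key = re.sub(r"[\s\-]+", "_", predicate.strip().lower())
    return predicate_synonym_map.get(key, key)


def normalize_literal(value, enabled=True):
    """Collapse whitespace and lowercase when literal normalization is on."""
    if not enabled:
        return value
    return " ".join(str(value).split()).lower()


def canonicalize(triplet, literal_normalization_enabled=True):
    """Return a hashable canonical (subject, predicate, object) tuple."""
    subject, predicate, obj = triplet
    return (
        normalize_literal(subject, literal_normalization_enabled),
        canonical_predicate(predicate),
        normalize_literal(obj, literal_normalization_enabled),
    )


canonical = canonicalize(("Alice", "Employed By", " ACME  Corp "))
```

Under these assumptions, ("Alice", "Employed By", " ACME  Corp ") and ("alice", "works_for", "acme corp") end up with the same canonical tuple.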

2. Fast Path Hash Matching

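Updated detect_relationship_duplicates to pre-compute a canonical signature for each relationship and compare signatures via hash lookup, which is $O(1)$ on average per relationship. This lets the system skip the more expensive string-similarity computation whenever an exact canonical match exists.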
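The fast path can be illustrated with a small sketch, assuming each relationship reduces to a hashable canonical signature. The function find_exact_canonical_duplicates and the toy simple_canonicalize helper are hypothetical names for this example, not the body of detect_relationship_duplicates.

```python
def find_exact_canonical_duplicates(triplets, canonicalize):
    """Return (first_index, duplicate_index) pairs sharing a canonical signature."""
    seen = {}        # canonical signature -> index of its first occurrence
    duplicates = []
    for index, triplet in enumerate(triplets):
        signature = canonicalize(triplet)  # must be hashable, e.g. a tuple
        first = seen.get(signature)        # average O(1) dict lookup
        if first is None:
            seen[signature] = index
        else:
            duplicates.append((first, index))
    return duplicates


def simple_canonicalize(triplet):
    """Toy canonicalizer: lowercase everything and resolve one synonym."""
    synonyms = {"employed_by": "works_for"}
    subject, predicate, obj = (part.strip().lower() for part in triplet)
    return (subject, synonyms.get(predicate, predicate), obj)


triplets = [
    ("Alice", "works_for", "Acme"),
    ("alice", "employed_by", "ACME"),
    ("Bob", "works_for", "Acme"),
]
pairs = find_exact_canonical_duplicates(triplets, simple_canonicalize)
```

Each relationship costs one hash lookup, so exact canonical matches are found in a single pass instead of by pairwise string comparison. Only relationships without an exact match would need to reach a fuzzy fallback.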

3. Weighted Semantic Path

Developed a weighted confidence composition (60% predicate / 40% object) in _relationships_are_duplicates as a fallback for fuzzy matches, including explainable semantic_match_score metadata.
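A rough sketch of the 60/40 composition, for illustration only. The similarity measure (difflib.SequenceMatcher), the 0.85 threshold, the dict-shaped relationships, and the function names are assumptions. The description does not say how _relationships_are_duplicates computes per-field similarity or what threshold it applies, and subject matching is left out here.

```python
from difflib import SequenceMatcher

PREDICATE_WEIGHT = 0.6  # weights taken from the PR description
OBJECT_WEIGHT = 0.4


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1] (assumed measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def semantic_match_score(rel_a, rel_b):
    """Weighted composition: 60% predicate similarity, 40% object similarity."""
    predicate_score = similarity(rel_a["predicate"], rel_b["predicate"])
    object_score = similarity(rel_a["object"], rel_b["object"])
    return PREDICATE_WEIGHT * predicate_score + OBJECT_WEIGHT * object_score


def relationships_are_duplicates(rel_a, rel_b, threshold=0.85):
    """Return (is_duplicate, metadata); the threshold is a hypothetical default."""
    score = semantic_match_score(rel_a, rel_b)
    return score >= threshold, {"semantic_match_score": round(score, 4)}


same = {"predicate": "works_for", "object": "Acme Corp"}
other = {"predicate": "xyz", "object": "123"}
unrelated = {"predicate": "abc", "object": "456"}

is_dup, meta = relationships_are_duplicates(same, dict(same))
not_dup, low_meta = relationships_are_duplicates(other, unrelated)
```

Returning the score alongside the decision is what makes the match explainable: a caller can see how close a pair came to the threshold.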

4. Merge Alignment & API

Updated _merge_relationships to group by canonical keys rather than raw strings.
Exposed dedup_triplets in methods.py as a first-class entry point.
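To show why grouping by canonical key matters for merging, here is a small sketch. The helpers group_by_canonical_key and canonical_key are hypothetical and do not reflect the actual _merge_relationships signature.

```python
from collections import defaultdict


def group_by_canonical_key(relationships, canonical_key):
    """Bucket relationships whose canonical keys are equal."""
    groups = defaultdict(list)
    for relationship in relationships:
        groups[canonical_key(relationship)].append(relationship)
    return dict(groups)


def canonical_key(relationship):
    """Toy key: lowercase fields and resolve one hypothetical synonym."""
    synonyms = {"employed_by": "works_for"}
    subject, predicate, obj = (part.strip().lower() for part in relationship)
    return (subject, synonyms.get(predicate, predicate), obj)


relationships = [
    ("Alice", "works_for", "Acme"),
    ("Alice", "employed_by", "acme"),
    ("Bob", "works_for", "Acme"),
]
groups = group_by_canonical_key(relationships, canonical_key)
```

Grouping on raw strings would put the two Alice relationships in separate groups. With a canonical key they share one group and can be merged together.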

Benchmark Results

The canonical hash path cuts mean runtime by roughly 83% relative to legacy mode, alongside the accuracy improvements described above:

| Mode | Mean Time (ms) | Iterations | Result |
|---|---|---|---|
| Legacy | ~525.5 | 50 | -- |
| Semantic V2 | ~85.6 | 50 | ~83% faster |
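As a quick check on how the "faster" figure relates to a speedup factor, the arithmetic can be reproduced from the approximate means in the table. The helper names are illustrative only.

```python
def percent_faster(baseline_ms, new_ms):
    """Relative reduction in runtime, as a percentage of the baseline."""
    return (baseline_ms - new_ms) / baseline_ms * 100


def speedup_factor(baseline_ms, new_ms):
    """How many times faster the new path is than the baseline."""
    return baseline_ms / new_ms


legacy_ms, semantic_v2_ms = 525.5, 85.6  # approximate means from the table
reduction = percent_faster(legacy_ms, semantic_v2_ms)
factor = speedup_factor(legacy_ms, semantic_v2_ms)
print(f"{reduction:.1f}% faster, {factor:.2f}x speedup")
```

With these inputs, the reduction comes out between 83% and 84% and the factor is a little over 6x. The two numbers describe the same measurement in different units.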

Testing & Quality Assurance

✅ Benchmarked: Added test_relationship_dedup_speed to track synonym resolution and hash-path speed.
✅ Build: python -m build successful.
✅ Backward Compatible: Defaults to legacy mode to preserve existing SPO matching behavior.

Additional Notes

This is the third PR in the Deduplication v2 sequence. Please note:

This branch (feat/semantica-triplet-dedup-v2-336) is branched from feat/prefilter-logic-v2-335.
It carries forward all logic from the previous two sub-issues to ensure the full integration of the Epic works as expected.
I'll handle any rebasing needed once the previous PRs are merged to keep the main history clean!


KaifAhmad1

PR Review: Feat/Semantica Triplet Dedup v2 (#340)

Status: ✅ APPROVED

Performance

6.98x speedup: Semantic V2 (~83ms) vs Legacy (~579ms)
O(1) hash matching: Fast-path optimization
All benchmarks passed: 13/13 tests

Key Features

✅ Semantic deduplication: Predicate synonyms + literal normalization
✅ Weighted scoring: 60% predicate + 40% object composition
✅ Explainable AI: semantic_match_score metadata
✅ API: dedup_triplets() function
✅ Backward compatible: Opt-in semantic_v2 mode

Critical Fix

Infinite recursion: Fixed self-calling in dedup_triplets()
Solution: Added name check in registry lookup
Status: ✅ Fixed and verified
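For context, a minimal sketch of how a registry lookup can recurse into the calling function and how a name check avoids it. The registry layout, the register helper, and _default_dedup_triplets are assumptions for illustration. The actual code in methods.py may differ.

```python
_METHOD_REGISTRY = {}  # hypothetical registry: method name -> callable


def register(name, func):
    _METHOD_REGISTRY[name] = func


def _default_dedup_triplets(triplets, **options):
    """Built-in fallback: drop exact repeats, keeping first occurrences."""
    seen, unique = set(), []
    for triplet in triplets:
        if triplet not in seen:
            seen.add(triplet)
            unique.append(triplet)
    return unique


def dedup_triplets(triplets, method="default", **options):
    custom = _METHOD_REGISTRY.get(method)
    # Name check: if the registry entry resolves to this very function,
    # calling it would recurse forever, so fall through to the built-in path.
    if custom is not None and getattr(custom, "__name__", None) != dedup_triplets.__name__:
        return custom(triplets, **options)
    return _default_dedup_triplets(triplets, **options)


# Simulate the problematic case: the public entry point registered under
# the same name it looks up.
register("default", dedup_triplets)
result = dedup_triplets([("a", "b", "c"), ("a", "b", "c"), ("x", "y", "z")])
```

Without the name check, the lookup would hand dedup_triplets back to itself and the call would end in a RecursionError.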

Testing

✅ Functionality: Semantic dedup working (3/3 duplicates)
✅ Compatibility: Legacy mode preserved
✅ No recursion: Function completes successfully
✅ Integration: Merge strategy working

Files Modified

duplicate_detector.py: +203 lines
methods.py: +35 lines (includes fix)
merge_strategy.py: +35 lines
test_deduplication.py: +57 lines

Impact: 7x performance improvement, semantic intelligence, full backward compatibility.

@ZohaibHassan16

…n v2 features

- Added comprehensive changelog entry for Semantic Relationship Deduplication v2
- Documented 6.98x performance improvement and key features
- Included contributor credits (@ZohaibHassan16) and fix credits (@KaifAhmad1)
- Listed all technical implementations and benchmarks
- Noted critical infinite recursion bug fix

ZohaibHassan16 requested a review from KaifAhmad1 February 21, 2026 18:13

feat(dedup): implement semantic relationship and triplet dedup v2 (Ha…

91ba521

…wksight-AI#336)

ZohaibHassan16 force-pushed the feat/semantica-triplet-dedup-v2-336 branch from ca88cc3 to 91ba521 on February 22, 2026 06:58

fix: prevent infinite recursion in dedup_triplets function

a1b85e0

- Add name check to prevent function from calling itself recursively
- Fixes crash when using semantic deduplication mode
- Maintains all existing functionality while preventing stack overflow

KaifAhmad1 approved these changes Feb 25, 2026


KaifAhmad1 and others added 2 commits February 25, 2026 16:52

Merge branch 'main' into feat/semantica-triplet-dedup-v2-336

095ba13

KaifAhmad1 merged commit fcaebe9 into Hawksight-AI:main Feb 25, 2026
3 checks passed

KaifAhmad1 merged 4 commits into Hawksight-AI:main from ZohaibHassan16:feat/semantica-triplet-dedup-v2-336

2 participants
