Feat/semantica triplet dedup v2 336 #340

Merged
KaifAhmad1 merged 4 commits into Hawksight-AI:main from ZohaibHassan16:feat/semantica-triplet-dedup-v2-336
Feb 25, 2026
ZohaibHassan16 commented Feb 21, 2026

Description

This PR completes the Deduplication v2 Epic by introducing an opt-in semantic relationship deduplication mode (semantica_v2). Triplet matching moves from raw string comparison to canonicalization plus weighted semantic scoring, with a fast-path hash check that handles exact canonical matches first.

Type of Change

  • New feature (non-breaking change which adds functionality)
  • Performance improvement
  • Code refactoring

Related Issues


Changes Made

1. Canonicalization Engine

Implemented predicate_synonym_map and literal_normalization_enabled support to align semantic equivalents (e.g., works_for == employed_by).
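
A minimal sketch of what this canonicalization could look like. Only the names predicate_synonym_map and literal_normalization_enabled come from this PR; the helper functions, the map contents, and the normalization rules (casefolding, whitespace collapsing) are assumptions made for illustration, not the code in duplicate_detector.py.

```python
import re

# Hypothetical synonym map: each alias points at one canonical predicate.
PREDICATE_SYNONYM_MAP = {
    "employed_by": "works_for",
    "works_at": "works_for",
}


def canonical_predicate(predicate, predicate_synonym_map=PREDICATE_SYNONYM_MAP):
    """Lowercase, unify separators to underscores, then resolve synonyms."""
    key = re.sub(r"[\s\-]+", "_", predicate.strip().lower())
    return predicate_synonym_map.get(key, key)


def normalize_literal(value, enabled=True):
    """Casefold and collapse whitespace when normalization is enabled."""
    if not enabled:
        return value
    return " ".join(str(value).split()).casefold()


def canonicalize(triplet, literal_normalization_enabled=True):
    """Return a (subject, predicate, object) tuple in canonical form."""
    subject, predicate, obj = triplet
    return (
        normalize_literal(subject, literal_normalization_enabled),
        canonical_predicate(predicate),
        normalize_literal(obj, literal_normalization_enabled),
    )
```

Under these assumptions, `("Alice", "Employed By", " Acme  Corp ")` and `("alice", "works_for", "acme corp")` canonicalize to the same tuple.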

2. Fast Path Hash Matching

Updated detect_relationship_duplicates to pre-compute canonical signatures and compare them with hash lookups, which take $O(1)$ time on average. When an exact canonical match exists, the system skips the more expensive string-similarity computation.
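
The fast path can be sketched roughly as follows. This is an illustration under stated assumptions, not the body of detect_relationship_duplicates: the function names, the tuple-based signature, and the return shape are all hypothetical.

```python
def canonical_signature(triplet, synonyms):
    """Build a hashable signature from a (subject, predicate, object) triplet."""
    subject, predicate, obj = (part.strip().lower() for part in triplet)
    return (subject, synonyms.get(predicate, predicate), obj)


def split_exact_duplicates(triplets, synonyms):
    """Partition triplets into first-seen uniques and exact canonical duplicates."""
    seen = {}  # signature -> index of the first triplet carrying it
    unique, duplicates = [], []
    for index, triplet in enumerate(triplets):
        signature = canonical_signature(triplet, synonyms)
        if signature in seen:  # dict lookup, O(1) on average
            duplicates.append((seen[signature], index))
        else:
            seen[signature] = index
            unique.append(triplet)
    return unique, duplicates
```

Only the triplets left in `unique` would then need the slower fuzzy comparison, which is where the time saving comes from.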

3. Weighted Semantic Path

Developed a weighted confidence composition (60% predicate / 40% object) in _relationships_are_duplicates as a fallback for fuzzy matches, including explainable semantic_match_score metadata.
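
As a rough illustration of the 60/40 composition, the sketch below scores two relationships with the standard library's difflib. The similarity measure, the subject-equality requirement, and the 0.85 threshold are assumptions chosen for the example; the PR does not state which string metric or threshold _relationships_are_duplicates uses.

```python
from difflib import SequenceMatcher

PREDICATE_WEIGHT = 0.6  # from the PR description
OBJECT_WEIGHT = 0.4     # from the PR description


def similarity(a, b):
    """Case-insensitive similarity ratio in [0, 1] (assumed metric)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def semantic_match_score(rel_a, rel_b):
    """Weighted score over the predicate and object of two (s, p, o) tuples."""
    _, pred_a, obj_a = rel_a
    _, pred_b, obj_b = rel_b
    return (
        PREDICATE_WEIGHT * similarity(pred_a, pred_b)
        + OBJECT_WEIGHT * similarity(obj_a, obj_b)
    )


def are_duplicates(rel_a, rel_b, threshold=0.85):
    """Return (is_duplicate, score); subjects must match (an assumption)."""
    if rel_a[0].lower() != rel_b[0].lower():
        return False, 0.0
    score = semantic_match_score(rel_a, rel_b)
    return score >= threshold, score
```

Returning the score alongside the decision mirrors the idea of attaching semantic_match_score as metadata, so a caller can see why two relationships were merged.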

4. Merge Alignment & API

  • Updated _merge_relationships to group by canonical keys rather than raw strings.
  • Exposed dedup_triplets in methods.py as a first-class entry point.
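
Grouping by canonical key instead of raw strings might look like the sketch below. The dict-based relationship shape, the "confidence" field, and the keep-the-highest-confidence merge rule are assumptions for illustration; the actual _merge_relationships logic in merge_strategy.py may differ.

```python
from collections import defaultdict


def group_by_canonical_key(relationships, synonyms):
    """Bucket relationship dicts by their canonical (s, p, o) key."""
    groups = defaultdict(list)
    for rel in relationships:
        predicate = rel["predicate"].strip().lower()
        key = (
            rel["subject"].strip().lower(),
            synonyms.get(predicate, predicate),
            rel["object"].strip().lower(),
        )
        groups[key].append(rel)
    return groups


def merge_group(group):
    """Keep the highest-confidence relationship and record the group size."""
    best = max(group, key=lambda rel: rel.get("confidence", 0.0))
    merged = dict(best)  # copy so the input is left untouched
    merged["merged_count"] = len(group)
    return merged
```

With raw-string grouping, `works_for` and `employed_by` would land in separate buckets and never be merged; with canonical keys they share one.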

Benchmark Results

The canonical hash path cuts mean runtime roughly sixfold in this benchmark, alongside the accuracy improvements from canonicalization:

| Mode        | Mean Time (ms) | Iterations | Result      |
|-------------|----------------|------------|-------------|
| Legacy      | ~525.5         | 50         | --          |
| Semantic V2 | ~85.6          | 50         | ~83% Faster |
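
For readers comparing the percentage above with speedup factors quoted elsewhere in this thread, the two figures can be derived from the same pair of mean times. The snippet uses only the numbers reported in this benchmark; the variable names are illustrative.

```python
legacy_ms = 525.5       # mean time reported for Legacy
semantic_v2_ms = 85.6   # mean time reported for Semantic V2

speedup = legacy_ms / semantic_v2_ms             # how many times faster
time_reduction = 1 - semantic_v2_ms / legacy_ms  # fraction of time saved

print(f"{speedup:.2f}x speedup, {time_reduction:.1%} less time")
```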

Testing & Quality Assurance

  • Benchmarked: Added test_relationship_dedup_speed to track synonym resolution and hash-path speed.
  • Build: python -m build successful.
  • Backward Compatible: Defaults to legacy mode to preserve existing SPO matching behavior.

Additional Notes

This is the third PR in the Deduplication v2 sequence. Please note:

  • This branch (feat/semantica-triplet-dedup-v2-336) is branched from feat/prefilter-logic-v2-335.
  • It carries forward all logic from the previous two sub-issues to ensure the full integration of the Epic works as expected.
  • I'll handle any rebasing needed once the previous PRs are merged to keep the main history clean!

ZohaibHassan16 force-pushed the feat/semantica-triplet-dedup-v2-336 branch from ca88cc3 to 91ba521 on February 22, 2026 06:58
- Add name check to prevent function from calling itself recursively
- Fixes crash when using semantic deduplication mode
- Maintains all existing functionality while preventing stack overflow
KaifAhmad1 left a comment

PR Review: Feat/Semantica Triplet Dedup v2 (#340)

Status: ✅ APPROVED

Performance

  • 6.98x speedup: Semantic V2 (~83ms) vs Legacy (~579ms)
  • O(1) hash matching: Fast-path optimization
  • All benchmarks passed: 13/13 tests

Key Features

  • Semantic deduplication: Predicate synonyms + literal normalization
  • Weighted scoring: 60% predicate + 40% object composition
  • Explainable AI: semantic_match_score metadata
  • API: dedup_triplets() function
  • Backward compatible: Opt-in semantic_v2 mode

Critical Fix

  • Infinite recursion: Fixed self-calling in dedup_triplets()
  • Solution: Added name check in registry lookup
  • Status: ✅ Fixed and verified
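
The kind of bug and guard described here can be sketched as follows. Everything in this example is hypothetical apart from the function name dedup_triplets: the registry structure, the signature, and the default implementation are assumptions, not the code in methods.py.

```python
_METHOD_REGISTRY = {}  # hypothetical registry: (task, name) -> callable


def register(task, name, func):
    _METHOD_REGISTRY[(task, name)] = func


def _default_dedup_triplets(triplets, **options):
    """Order-preserving exact deduplication, used as the built-in fallback."""
    seen, unique = set(), []
    for triplet in triplets:
        if triplet not in seen:
            seen.add(triplet)
            unique.append(triplet)
    return unique


def dedup_triplets(triplets, method="default", **options):
    custom = _METHOD_REGISTRY.get(("triplets", method))
    # Name check: if the registry hands back this very function, calling it
    # would recurse forever, so fall through to the built-in implementation.
    if custom is not None and getattr(custom, "__name__", "") != "dedup_triplets":
        return custom(triplets, **options)
    return _default_dedup_triplets(triplets, **options)


# Registering the public entry point under its own name is the situation
# that would cause unbounded recursion without the guard above.
register("triplets", "default", dedup_triplets)
```

Without the `__name__` comparison, the call in this setup would look itself up, call itself, and repeat until Python raises RecursionError.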

Testing

  • Functionality: Semantic dedup working (3/3 duplicates)
  • Compatibility: Legacy mode preserved
  • No recursion: Function completes successfully
  • Integration: Merge strategy working

Files Modified

  • duplicate_detector.py: +203 lines
  • methods.py: +35 lines (includes fix)
  • merge_strategy.py: +35 lines
  • test_deduplication.py: +57 lines

Impact: 7x performance improvement, semantic intelligence, full backward compatibility.

KaifAhmad1 and others added 2 commits February 25, 2026 16:52
…n v2 features

- Added comprehensive changelog entry for Semantic Relationship Deduplication v2
- Documented 6.98x performance improvement and key features
- Included contributor credits (@ZohaibHassan16) and fix credits (@KaifAhmad1)
- Listed all technical implementations and benchmarks
- Noted critical infinite recursion bug fix
KaifAhmad1 merged commit fcaebe9 into Hawksight-AI:main on Feb 25, 2026
3 checks passed