fix: add stopword filtering and overlap ratio to roleMatch (#157) #194

Closed
SAY-5 wants to merge 1 commit into santifer:main from SAY-5:fix/dedup-roleMatch-stopwords

@SAY-5 SAY-5 commented Apr 11, 2026

Adds stopword filtering and an overlap ratio threshold to roleMatch to reduce false positives. Previously, common words like "senior" or "engineer" could inflate match scores between unrelated roles.

…tifer#157)

roleMatch() was treating location words (tokyo, japan) and generic
seniority words (lead, manager, director) as role-identifying, causing
false-positive dedup matches that silently destroyed distinct entries.

Add ROLE_STOPWORDS set and require overlap to be both ≥2 content words
AND ≥60% of the shorter title's content words. Verified against all 8
test cases from issue santifer#157.
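
The rule described in this commit message can be sketched roughly as follows. This is a Python illustration, not the PR's actual code: `role_match`, `content_words`, and the two threshold constants are hypothetical names, the stopword set contains only the example words named above (the full `ROLE_STOPWORDS` list is not shown in this thread), and "shorter title" is assumed to mean the title with fewer content words.

```python
import re

# Hypothetical stopword list: only the example words named in the commit
# message. The real ROLE_STOPWORDS contents are not shown in this thread.
ROLE_STOPWORDS = frozenset({"tokyo", "japan", "lead", "manager", "director"})

MIN_OVERLAP_WORDS = 2    # at least 2 shared content words
MIN_OVERLAP_RATIO = 0.6  # and at least 60% of the shorter title's content words


def content_words(title: str) -> set[str]:
    """Lowercase the title, split it into alphanumeric tokens, drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", title.lower())
    return {t for t in tokens if t not in ROLE_STOPWORDS}


def role_match(a: str, b: str) -> bool:
    """Return True only when both overlap conditions hold."""
    words_a = content_words(a)
    words_b = content_words(b)
    shorter = min(len(words_a), len(words_b))
    if shorter == 0:
        # A title made up only of stopwords cannot identify a role.
        return False
    overlap = len(words_a & words_b)
    return overlap >= MIN_OVERLAP_WORDS and overlap / shorter >= MIN_OVERLAP_RATIO


# Shares "backend" and "engineer"; the location word is ignored.
print(role_match("Senior Backend Engineer", "Backend Engineer, Tokyo"))
# Shares only stopwords ("manager", "tokyo"), so no content overlap remains.
print(role_match("Marketing Manager Tokyo", "Engineering Manager Tokyo"))
```

Under these assumptions, requiring both conditions means that two long titles sharing two generic words still fail on the ratio, and two short titles sharing a single word still fail on the absolute count.
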
@SAY-5 SAY-5 closed this Apr 12, 2026