Use containment for speaker sample text checks #4342
- Add _normalize_text() for text normalization (lowercase + collapse whitespace)
- Add _get_trigrams() to generate character trigrams from text
- Simplify compute_text_similarity() to use shared helpers
- Simplify compute_text_containment() to use shared helpers

Removes code duplication where both functions had their own nested get_trigrams and normalize implementations.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Code Review
This pull request effectively switches from text similarity to a containment check for speaker sample validation, which is a more appropriate metric for the described use case. The changes are well-tested, with new unit tests for the containment logic and updates to existing tests. I've identified a couple of areas for improvement in the new compute_text_containment function to remove redundant logic, which will improve both clarity and efficiency.
backend/utils/text_utils.py
    return ' '.join(text.lower().split())

def get_trigrams(text: str) -> set:
    text = normalize(text)
if not trigrams_transcript:
    return 0.0
Refactored
Ready for review. by AI for @beastoin
@beastoin Ran by AI for @beastoin
Add edge-case containment tests and verify the real containment path in verify_and_transcribe_sample to cover the new containment behavior. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Required fixes before merge:
Ready for final review. by AI for @beastoin
@beastoin Re-ran by AI for @beastoin

Fixes #4340.
Switch speaker sample validation to a language-agnostic containment check so trimmed samples pass when their transcript is included in expanded/concatenated expected text, and add tests plus update the backend test script.
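To illustrate why containment suits trimmed samples better than symmetric similarity, here is a rough sketch. The function names `jaccard_similarity` and `containment` and the example strings are hypothetical and are not taken from backend/utils/text_utils.py; only the empty-transcript guard returning 0.0 mirrors the diff.

```python
def _trigrams(text: str) -> set:
    """Character trigrams of the lowercased, whitespace-collapsed text."""
    text = ' '.join(text.lower().split())
    return {text[i:i + 3] for i in range(len(text) - 2)}


def jaccard_similarity(a: str, b: str) -> float:
    """Symmetric overlap: shared trigrams divided by all trigrams in either text."""
    ta, tb = _trigrams(a), _trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def containment(transcript: str, expected: str) -> float:
    """Asymmetric overlap: fraction of transcript trigrams found in expected."""
    tt, te = _trigrams(transcript), _trigrams(expected)
    if not tt:
        return 0.0  # same guard as in the reviewed diff
    return len(tt & te) / len(tt)


# Hypothetical data: the sample was trimmed, so its transcript covers
# only part of a longer expected text.
expected = "the quick brown fox jumps over the lazy dog near the river bank"
transcript = "Quick  brown fox"

print(containment(transcript, expected))         # every transcript trigram is present
print(jaccard_similarity(transcript, expected))  # penalized by the extra expected text
```

The containment score divides only by the transcript's trigram count, so extra expected text does not lower it, whereas the symmetric score shrinks as the expected text grows.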
deploy steps
by AI for @beastoin