Skip to content

Conversation

@beastoin
Copy link
Collaborator

@beastoin beastoin commented Jan 22, 2026

Fixes #4340.

Switch speaker sample validation to a language-agnostic containment check so trimmed samples pass when their transcript is included in expanded/concatenated expected text, and add tests plus update the backend test script.

deploy steps


by AI for @beastoin

beastoin and others added 2 commits January 22, 2026 18:30
- Add _normalize_text() for text normalization (lowercase + collapse whitespace)
- Add _get_trigrams() to generate character trigrams from text
- Simplify compute_text_similarity() to use shared helpers
- Simplify compute_text_containment() to use shared helpers

Removes code duplication where both functions had their own
nested get_trigrams and normalize implementations.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request effectively switches from text similarity to a containment check for speaker sample validation, which is a more appropriate metric for the described use case. The changes are well-tested, with new unit tests for the containment logic and updates to existing tests. I've identified a couple of areas for improvement in the new compute_text_containment function to remove redundant logic, which will improve both clarity and efficiency.

return ' '.join(text.lower().split())

def get_trigrams(text: str) -> set:
text = normalize(text)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The text passed to this get_trigrams function has already been normalized on lines 54-55. This second normalization is redundant and can be removed for efficiency.

Comment on lines 65 to 66
if not trigrams_transcript:
return 0.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

This check is unreachable. Given the preceding checks on lines 57 and 59, transcript will always have a length of at least 3 when get_trigrams is called. For such strings, get_trigrams will always return a non-empty set. This block can be safely removed.

@beastoin
Copy link
Collaborator Author

beastoin commented Jan 22, 2026

Refactored text_utils.py to remove code duplication:

  • Extracted _normalize_text() and _get_trigrams() as shared helpers
  • Both public functions now use these shared helpers

Ready for review.


by AI for @beastoin

@beastoin
Copy link
Collaborator Author

@beastoin Ran backend/test.sh locally and it passed; the containment shift matches #4340 and should stop trimmed samples from being dropped while keeping existing checks intact. Only minor thought is whether set-based containment over-accepts highly repetitive text (e.g., “aaaaaa”), but if that’s acceptable, looks good. Can you sanity-check with a real trimmed sample before merge?


by AI for @beastoin

Add edge-case containment tests and verify the real containment path in
verify_and_transcribe_sample to cover the new containment behavior.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@beastoin
Copy link
Collaborator Author

Required fixes before merge:

  • Add containment edge-case coverage in compute_text_containment.
  • Exercise real containment path in verify_and_transcribe_sample.

Ready for final review.


by AI for @beastoin

@beastoin
Copy link
Collaborator Author

@beastoin Re-ran backend/test.sh and re-reviewed; no changes since last pass, containment still looks right for #4340 while preserving existing checks, and only minor consideration is the set-based containment edge on highly repetitive text (probably fine). Can you go ahead and merge when ready?


by AI for @beastoin

@beastoin beastoin merged commit 7abc21c into main Jan 22, 2026
1 check passed
@beastoin beastoin deleted the fix/speaker-sample-containment branch January 22, 2026 12:07
@beastoin
Copy link
Collaborator Author

Screenshot 2026-01-22 at 20 01 20

thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Speaker sample text check drops trimmed samples

2 participants