add speaker sample quality verification before storage #4291

Merged

beastoin merged 19 commits into main from e8w2h_speaker_identification

Jan 22, 2026

Collaborator

beastoin commented Jan 19, 2026 (edited)

Fixes #4253.

Adds transcript capture/storage and People settings UI for speech samples (from #4322), and enforces stricter verification before saving samples (min 5 words, ≥70% single-speaker dominance via diarization, and ≥60% trigram Jaccard similarity) to avoid low‑quality or mixed‑speaker data.
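For readers skimming the thread, the three thresholds above can be sketched as a single gate. This is a minimal illustration, not the code in this PR: the `Word` type, its `speaker` field, the function names, and the reason strings are all hypothetical, and the similarity score is assumed to be computed elsewhere and passed in.

```python
from collections import Counter
from dataclasses import dataclass

# Thresholds taken from the PR description.
MIN_WORDS = 5
MIN_DOMINANCE = 0.70   # share of words from the most frequent speaker
MIN_SIMILARITY = 0.60  # trigram Jaccard similarity


@dataclass
class Word:
    text: str
    speaker: int  # diarization label (hypothetical field name)


def dominant_speaker_ratio(words):
    """Fraction of words attributed to the most frequent diarized speaker."""
    if not words:
        return 0.0
    counts = Counter(w.speaker for w in words)
    return counts.most_common(1)[0][1] / len(words)


def passes_quality_checks(words, similarity):
    """Return (ok, reason) for a candidate sample; checks run cheapest first."""
    if len(words) < MIN_WORDS:
        return False, "too_few_words"
    if dominant_speaker_ratio(words) < MIN_DOMINANCE:
        return False, "mixed_speakers"
    if similarity < MIN_SIMILARITY:
        return False, "low_similarity"
    return True, "ok"
```

Both ratio thresholds are treated as inclusive here, matching the "≥" in the description.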

deploy steps

deploy backend(s) https://github.com/BasedHardware/omi/actions/runs/21244368759 https://github.com/BasedHardware/omi/actions/runs/21244371941
deploy mobile app https://codemagic.io/app/66c95e6ec76853c447b8bcbb/build/6971f70668776391b4be43cb

This PR was drafted by AI on behalf of @beastoin

beastoin changed the title ~~extract text_similarity to utils/text_utils.py for testability~~ add speaker sample quality verification before storage

gemini-code-assist bot reviewed

Contributor

gemini-code-assist bot left a comment

Code Review

This pull request successfully refactors the compute_text_similarity function into a separate text_utils.py module, which resolves import dependency issues and improves testability. The changes include comprehensive new unit tests for the similarity function and its integration into the speaker sample quality verification process. My review focuses on improving logging practices and resource management in the new and modified functions. I've suggested replacing print statements with the standard logging module for better observability in production and ensuring BytesIO resources are properly handled using a with statement.
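A rough sketch of the two suggestions (logging instead of `print`, and a `with` block around `BytesIO`), using only the standard library. The helper name and the header check are hypothetical and only serve to give the buffer something to do; they are not taken from `pre_recorded.py`.

```python
import io
import logging

logger = logging.getLogger(__name__)


def read_audio_header(audio_bytes: bytes) -> bytes:
    """Hypothetical helper: return the first 4 bytes of an in-memory audio buffer."""
    # The context manager closes the buffer on exit, even if read() raises.
    with io.BytesIO(audio_bytes) as buffer:
        header = buffer.read(4)
    if header != b"RIFF":
        # Lazy %-formatting: the message is only built if the record is emitted.
        logger.warning("unexpected audio header: %r", header)
    else:
        logger.info("read %d bytes of audio", len(audio_bytes))
    return header
```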

backend/utils/speaker_identification.py (resolved)

backend/utils/stt/pre_recorded.py (resolved)

backend/utils/stt/pre_recorded.py (resolved)

backend/utils/stt/pre_recorded.py (resolved)

beastoin marked this pull request as draft

January 19, 2026 04:28

beastoin mentioned this pull request

feat: display speech sample transcripts in People settings #4322

Merged

Collaborator Author

beastoin commented Jan 22, 2026 (edited)

Required fixes before merge:

Prevent migration from deleting samples on transient Deepgram failures by distinguishing “transcription failed” vs “low-quality” (touchpoints: backend/utils/speaker_sample_migration.py:75-93, backend/utils/speaker_sample.py:22-67, backend/utils/stt/pre_recorded.py:106-177).
Ensure transcript arrays stay aligned with samples when adding a transcript to existing v1 data (backend/database/users.py:103-127).

Let me know when these are fixed and I’ll re-review.

This comment was drafted by AI on behalf of @beastoin

Collaborator Author

beastoin commented Jan 22, 2026 (edited)

Done. Addressed both issues:

Transient failure handling: Deepgram raises RuntimeError on API failure; migration skips instead of deletes
Transcript array alignment: pads with None for existing v1 samples

Tests pass. Ready for re-review.

This comment was drafted by AI on behalf of @beastoin

Collaborator Author

beastoin commented Jan 22, 2026

Required fixes before merge:

Migration can delete some samples, then hit a transient failure later and return early without updating Firestore, leaving deleted blob paths in speech_samples. Defer deletions until after you know there are no transient failures, or record deletions and still update the document in the retry path (touchpoints: backend/utils/speaker_sample_migration.py:76-105).
Padding transcripts with None will break the client parse/UI (Dart List<String>.from and isNotEmpty assume non-null strings). Either pad with empty strings or make the client tolerate nulls (touchpoints: backend/database/users.py:122-131, app/lib/backend/schema/person.dart:55-58, app/lib/pages/settings/people.dart:307-313).

Please fix these and I’ll re-review.

This comment was drafted by AI on behalf of @beastoin

Collaborator Author

beastoin commented Jan 22, 2026

Done. Fixed both issues:

Deferred blob deletions until after confirming no transient failures
Changed transcript padding from None to empty strings for Dart compatibility
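
The deferred-deletion flow could look roughly like the sketch below. It assumes the whole pass is abandoned on the first transient failure (a `RuntimeError`, per the earlier comment) and retried later; the function name and the three callbacks are hypothetical stand-ins for the real verification, Firestore, and storage helpers.

```python
def migrate_samples(sample_paths, verify, update_document, delete_blob):
    """Sketch of a v1 -> v2 pass that defers blob deletions.

    verify(path) returns (is_valid, transcript) and raises RuntimeError
    on a transient transcription failure.
    Returns True if the migration completed, False if it should be retried.
    """
    kept, transcripts, pending_deletes = [], [], []
    for path in sample_paths:
        try:
            is_valid, transcript = verify(path)
        except RuntimeError:
            # Nothing has been deleted yet, so the stored paths stay valid.
            return False
        if is_valid:
            kept.append(path)
            transcripts.append(transcript)
        else:
            pending_deletes.append(path)

    # Persist the surviving samples first, then remove the rejected blobs.
    update_document(kept, transcripts)
    for path in pending_deletes:
        delete_blob(path)
    return True
```

Updating the document before deleting means a crash between the two steps leaves unreferenced blobs rather than dangling paths.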

Tests pass. Ready for re-review.

This comment was drafted by AI on behalf of @beastoin

Collaborator Author

beastoin commented Jan 22, 2026

@beastoin Re-reviewed the latest changes — the deferred deletion flow and empty-string padding address the issues cleanly, and I don’t see any new regressions. Please go ahead and merge when ready.

This comment was drafted by AI on behalf of @beastoin

Collaborator Author

beastoin commented Jan 22, 2026

wait, let me test it first.

Collaborator Author

beastoin commented Jan 22, 2026

mobile app ui(s)

Screenshot 2026-01-22 at 16 19 11

adding new person

Screenshot 2026-01-22 at 16 18 17

migration(auto) on the old person

Screenshot 2026-01-22 at 16 18 08

looks good.

Collaborator Author

beastoin commented Jan 22, 2026 (edited)

Sorry for missing the tests earlier. Added 21 unit tests to make future maintenance easier:

test_speaker_sample.py: 15 tests for verification logic and boundary cases
test_speaker_sample_migration.py: 5 tests for migration and deferred deletions
test_users_add_sample_transaction.py: 1 test for transcript padding

All 70 tests pass. Ready for re-review.

Drafted by AI for @beastoin

Collaborator Author

beastoin commented Jan 22, 2026 (edited)

@beastoin Re‑reviewed after the new tests; coverage looks good. Please merge when ready.

By AI for @beastoin

beastoin and others added 17 commits

January 22, 2026 16:57


          extract text_similarity to utils/text_utils.py for testability

4726f65

Move compute_text_similarity to a standalone module without database
dependencies so unit tests can import the real function instead of
duplicating it locally.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
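
For context, a dependency-free trigram Jaccard similarity can be written in a few lines. This sketch assumes character trigrams over lowercased, whitespace-normalized text; the actual `compute_text_similarity` may tokenize differently, and the names below are illustrative only.

```python
def char_trigrams(text):
    """Set of 3-character windows over lowercased, whitespace-collapsed text."""
    normalized = " ".join(text.lower().split())
    return {normalized[i:i + 3] for i in range(len(normalized) - 2)}


def trigram_jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of the two trigram sets."""
    ta, tb = char_trigrams(a), char_trigrams(b)
    if not ta or not tb:
        # Inputs shorter than three characters have no trigrams.
        return 0.0
    return len(ta & tb) / len(ta | tb)
```

Because the function has no database imports, a unit test can import and call it directly.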


          docs: add PRD and progress checklist for speech sample transcripts

eb6fc13

Add technical implementation plan (PRD.MD) and progress tracking
checklist (progress.txt) for the speech sample transcripts feature.

This feature will display transcripts of speech samples in the
Settings > People page, leveraging the existing Deepgram transcription
from speaker sample verification.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>


          feat: add centralized speaker sample migration utility

a43d4d4

Create backend/utils/speaker_sample_migration.py with:
- verify_and_transcribe_sample(): Transcribe audio and verify quality
- migrate_person_samples_v1_to_v2(): Migrate samples from v1 to v2 format
- download_sample_audio(): Download speech sample from GCS
- delete_sample_from_storage(): Delete speech sample from GCS

This centralizes the verification logic from speaker_identification.py
for reuse in lazy migration. Part of speech sample transcripts feature.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>


          docs: mark task 1 complete in progress checklist

cd6a770

Created backend/utils/speaker_sample_migration.py with all four required
functions: verify_and_transcribe_sample, migrate_person_samples_v1_to_v2,
download_sample_audio, and delete_sample_from_storage.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>


          feat: add speech_sample_transcripts and speech_samples_version to Person model

9f4d3c2

Add new fields to support storing transcripts alongside speech samples:
- speech_sample_transcripts: Optional[List[str]] for parallel transcript array
- speech_samples_version: int defaulting to 1 for migration tracking

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>


          feat: add database functions for speech sample transcripts

bae97ee

Update add_person_speech_sample() to accept transcript parameter and
store it in parallel array. Update remove_person_speech_sample() to
remove by index to keep samples and transcripts arrays in sync.

Add new functions for migration support:
- set_person_speech_sample_transcript()
- update_person_speech_samples_after_migration()
- clear_person_speaker_embedding()
- update_person_speech_samples_version()

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
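
Removing by index from two parallel arrays might be sketched as follows. The function name is hypothetical, and plain lists stand in for the Firestore document fields; legacy data where the transcript list is shorter than the sample list is tolerated.

```python
def remove_sample_at(samples, transcripts, index):
    """Return new (samples, transcripts) lists with position `index` dropped from both."""
    if not 0 <= index < len(samples):
        raise IndexError(f"sample index {index} out of range")
    new_samples = samples[:index] + samples[index + 1:]
    # Slicing is safe even if transcripts is shorter than samples.
    new_transcripts = transcripts[:index] + transcripts[index + 1:]
    return new_samples, new_transcripts
```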


          feat: refactor speaker identification to use centralized verification

983aa89

Replace inline _verify_sample_quality function with the centralized
verify_and_transcribe_sample from speaker_sample_migration module.
Now passes transcript to add_person_speech_sample() to store transcripts
alongside speech samples.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>


          feat: add lazy migration to people API endpoints

32dc783

Make get_all_people() and get_single_person() async to support lazy
migration of v1 speech samples to v2 with transcripts when fetching
people data.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>


          feat: add JSON parsing for speech sample transcripts in Flutter Person model

380d4dc

Update fromJson() and toJson() methods to properly parse/serialize
speech_sample_transcripts and speech_samples_version fields that were
already defined but not being serialized.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>


          feat: display speech sample transcripts in People settings UI

750786b

Show transcript text in italic below each speech sample in the Settings > People page.
Handles null/missing transcripts gracefully by only displaying when available.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>


          test: verify backend tests pass (49 tests)

0d800ae

All existing backend unit tests pass:
- 22 transcript_segment tests
- 27 text_similarity tests

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>


          test: verify Flutter tests (pre-existing environment issue)

3d1b4aa

Flutter tests fail due to missing provider/path_provider dependencies
in the test environment. Verified same failures occur on main branch,
confirming this is not caused by PR changes.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>


          fix: address PR review - move migration to extraction, add locks, use transaction

3f3f772

Per beastoin's review on PR #4322:
- Move v1→v2 migration from GET endpoints to speaker extraction flow
- Add in-process asyncio lock per uid/person_id to prevent double migration
- Use Firestore transaction in add_person_speech_sample for atomic array updates
- Remove unused google.cloud.storage import
- Delete PRD.MD and progress.txt files

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
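
An in-process lock keyed by `(uid, person_id)` can be sketched with a `defaultdict` of `asyncio.Lock`. The names below are hypothetical, and the `migrated` set is a stand-in for checking the stored samples version; as the commit notes, this only guards a single process.

```python
import asyncio
from collections import defaultdict

# One lock per (uid, person_id), created lazily on first use.
_migration_locks = defaultdict(asyncio.Lock)


async def migrate_once(uid, person_id, migrated, do_migration):
    """Run do_migration() at most once per key; return True if this call ran it."""
    key = (uid, person_id)
    async with _migration_locks[key]:
        # Re-check under the lock: another task may have finished while we waited.
        if key in migrated:
            return False
        await do_migration()
        migrated.add(key)
        return True
```

Two concurrent calls for the same person serialize on the lock, and the second one sees the completed state and returns without migrating again.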


          fix: run migration before sample count check in extract_speaker_samples

Move migration to run before the early return guard so v1 users at the
sample limit still get migrated. Migration may drop invalid samples,
freeing up space for new ones.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>


          refactor: split speaker_sample_migration into two modules

7f18879

Per reviewer feedback, extract verification + GCS helpers into
speaker_sample.py for cleaner reuse:
- speaker_sample.py: verify_and_transcribe_sample, download_sample_audio,
  delete_sample_from_storage
- speaker_sample_migration.py: migration logic + locking

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>


          refactor: move GCS helpers to storage.py, add speech-profile wrappers

eb57491

Per reviewer feedback:
- Add generic helpers: download_blob_bytes, delete_blob
- Add speech-profile wrappers: download_speech_profile_bytes, delete_speech_profile_blob
- Update speaker_sample.py to use only the wrappers (no direct bucket/client usage)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>


          fix: prevent migration from deleting samples on transient failures

09830b4

Per reviewer feedback:
- Deepgram now raises RuntimeError on transcription failure instead of
  returning empty list, allowing callers to distinguish API failures
  from low-quality samples
- Migration skips samples with transient failures (keeps them as v1)
  instead of deleting them
- Transcript array is padded with None when adding transcript to
  existing v1 data to maintain alignment with samples array

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

beastoin and others added 2 commits

January 22, 2026 17:00


          fix: defer blob deletions and use empty strings for transcript padding

fdc372d

Per reviewer feedback:
- Defer blob deletions until after confirming no transient failures,
  preventing orphaned paths in Firestore on early return
- Use empty strings instead of None for transcript padding to avoid
  breaking Dart's List<String>.from and isNotEmpty checks

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
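
The padding rule can be illustrated with plain lists. This is a sketch only: the function name is hypothetical, and the real change runs inside a Firestore transaction rather than on in-memory lists.

```python
def append_sample_with_transcript(samples, transcripts, new_sample, new_transcript):
    """Append a sample and its transcript, keeping the two lists index-aligned."""
    existing = list(transcripts or [])
    # Older v1 samples have no transcript: fill the gap with empty strings
    # (not None) so a client expecting a list of strings can parse it.
    missing = len(samples) - len(existing)
    padded = existing + [""] * max(missing, 0)
    return samples + [new_sample], padded + [new_transcript]
```

For example, adding a third sample to a person with two v1 samples and no transcripts yields transcripts `["", "", "<new transcript>"]`.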


          test: add unit tests for speaker sample verification and migration

066c672

- test_speaker_sample.py: 15 tests covering verification logic,
  boundary cases, transient failures, and edge cases
- test_speaker_sample_migration.py: 5 tests for v1→v2 migration,
  transient failure handling, and deferred deletions
- test_users_add_sample_transaction.py: 1 test for transcript
  array padding with empty strings
- test.sh: add ENCRYPTION_SECRET and new test commands

Total: 70 tests passing

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

beastoin force-pushed the e8w2h_speaker_identification branch from c97d2a8 to 066c672 Compare

January 22, 2026 10:00

beastoin marked this pull request as ready for review

January 22, 2026 10:01

beastoin merged commit 19fa8c8 into main

1 check passed

beastoin deleted the e8w2h_speaker_identification branch

January 22, 2026 10:08

