Speech2IPA Noise Robustness #12

As pointed out in #11, transcription is very sensitive to background noise. We have run some preliminary experiments:

1) [Using webrtcvad to filter out non-speech segments](https://github.com/KoelLabs/ML/blob/main/browser_tests/stream_to_python/server.py): does not handle speech that overlaps with other speech or sounds
2) [Training models on various kinds of augmented noisy speech](https://huggingface.co/collections/KoelLabs/xlsr-north-american-english-speech-to-ipa-683a45b980cc67601388453a): does not generalize well to unseen types of noise
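
For reference, the noisy-speech augmentation in experiment 2 can be illustrated by mixing a noise clip into clean speech at a chosen signal-to-noise ratio. The sketch below is only an illustration under assumptions, not the pipeline used to train the linked models: `mix_at_snr` and `rms` are hypothetical helper names, audio is represented as plain lists of float samples, and the noise clip is looped if it is shorter than the utterance.

```python
import math


def rms(samples):
    """Root-mean-square amplitude of a sequence of float samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech`, scaled so the mixture has the requested SNR in dB.

    Hypothetical helper for illustration; both inputs are non-empty lists of floats.
    """
    noise = list(noise)
    if len(noise) < len(speech):
        # Loop the noise clip so it covers the whole utterance.
        noise = noise * math.ceil(len(speech) / len(noise))
    noise = noise[:len(speech)]

    speech_rms, noise_rms = rms(speech), rms(noise)
    if speech_rms == 0 or noise_rms == 0:
        # Nothing meaningful to scale against; return the speech unchanged.
        return list(speech)

    # SNR_dB = 20 * log10(speech_rms / (gain * noise_rms)), solved for gain.
    gain = speech_rms / (noise_rms * 10 ** (snr_db / 20))
    return [s + gain * n for s, n in zip(speech, noise)]
```

Sweeping `snr_db` over a range of values and drawing noise clips from types held out of training is one way to measure the generalization gap described above.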

<img width="536" height="219" alt="Image" src="https://github.com/user-attachments/assets/465e540a-e572-4b6b-8776-cc58a524af19" />

We need a low-latency approach to noise removal that generalizes well to different types of noise, including types not seen during training. Some ideas that different PRs can explore:

1) Enable [noise suppression via Web API](https://developer.mozilla.org/en-US/docs/Web/API/MediaTrackConstraints/noiseSuppression) on supported devices/browsers
2) Evaluate various open-source noise suppression models to see which we can run on-device in the browser and which would need to be hosted on a server with a GPU
3) Look into more advanced noise suppression based on [binaural audio and/or speaker-specific embeddings](https://github.com/vb000/LookOnceToHear)
4) Look into more noise-robust Speech2IPA architectures, training objectives/regularization, and data augmentation approaches
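
For idea 2, one possible way to compare candidates against the low-latency requirement is to time them frame by frame and report a real-time factor (processing time divided by audio duration; values below 1 mean the model keeps up with live audio). This is a minimal sketch under assumptions: `benchmark_streaming` is a hypothetical helper, `denoise_frame` stands in for whatever callable wraps a given model, and the 16 kHz sample rate and 20 ms frame size are placeholder defaults rather than project settings.

```python
import time


def benchmark_streaming(denoise_frame, samples, sample_rate=16000, frame_ms=20):
    """Time a per-frame denoiser over `samples` and summarize its latency.

    `denoise_frame` is a hypothetical callable that takes one frame (a list of
    floats) and returns the processed frame. Trailing samples that do not fill
    a whole frame are ignored.
    """
    frame_len = sample_rate * frame_ms // 1000
    latencies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        t0 = time.perf_counter()
        denoise_frame(frame)
        latencies.append(time.perf_counter() - t0)

    if not latencies:
        raise ValueError("input is shorter than one frame")

    audio_seconds = len(latencies) * frame_len / sample_rate
    return {
        "frames": len(latencies),
        # Below 1.0 means processing is faster than real time on this machine.
        "real_time_factor": sum(latencies) / audio_seconds,
        "worst_frame_ms": max(latencies) * 1000,
    }


# Usage with a pass-through "model" on one second of silence.
stats = benchmark_streaming(lambda frame: frame, [0.0] * 16000)
print(stats["frames"], stats["real_time_factor"] < 1.0)
```

The worst-case frame time matters as much as the average for streaming use, since a single slow frame stalls the transcription downstream.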
