feat: add Pyannote speaker diarization support #523
mhenrichsen wants to merge 2 commits into speaches-ai:master
Conversation
Amazing, please integrate this, devs :)
I tested this PR today and it worked well for me. The issues I had:
@sebastianelsner please give it another go
Integrate pyannote/speaker-diarization-3.1 as a toggleable feature for the /v1/audio/transcriptions endpoint, similar to existing VAD functionality.

Features:
- Add pyannote-audio>=3.1.1 dependency with speaker-diarization-3.1 model
- Create PyannoteModelManager following existing model manager patterns
- Add PyannoteConfig with device and TTL settings
- Extend TranscriptionSegment and TranscriptionWord with speaker fields
- Add diarization parameter to transcription endpoints (default: false)
- Implement speaker-transcription alignment for accurate results
- Add comprehensive test coverage for enabled/disabled scenarios
- Maintain full backward compatibility

Usage:

POST /v1/audio/transcriptions
Content-Type: multipart/form-data
diarization=true

Response includes speaker labels (e.g., "SPEAKER_00", "SPEAKER_01") in segment and word objects when diarization is enabled.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
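The "speaker-transcription alignment" bullet above can be sketched as assigning each transcription segment the speaker whose diarization turn overlaps it most in time. Everything below (the `overlap` and `assign_speakers` names, the data shapes) is a hypothetical illustration of that idea, not code from the PR:

```python
def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length (seconds) of the intersection of two time intervals; 0 if disjoint."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def assign_speakers(segments: list[dict], turns: list[tuple]) -> list[dict]:
    """Label each transcription segment with the speaker whose diarization
    turn covers the largest share of it.

    segments: [{"text": ..., "start": ..., "end": ...}, ...]
    turns:    [(start, end, "SPEAKER_00"), ...]  # diarization output
    """
    for seg in segments:
        best, best_ov = None, 0.0
        for start, end, speaker in turns:
            ov = overlap(seg["start"], seg["end"], start, end)
            if ov > best_ov:
                best, best_ov = speaker, ov
        seg["speaker"] = best  # stays None if no turn overlaps the segment
    return segments

segments = [{"text": "Hello.", "start": 0.0, "end": 2.5},
            {"text": "Hi there.", "start": 3.0, "end": 5.5}]
turns = [(0.0, 2.8, "SPEAKER_00"), (2.9, 6.0, "SPEAKER_01")]
print(assign_speakers(segments, turns))
```

Word-level alignment (for `TranscriptionWord.speaker`) would follow the same max-overlap rule per word.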
- Fix torchaudio UserWarnings by passing numpy array directly to pyannote instead of converting to WAV buffer, avoiding deprecated file loading path
- Update uv.lock with pyannote-audio properly included
- Add documentation for HF_TOKEN setup and EULA acceptance requirements for gated Pyannote models (segmentation-3.0 and speaker-diarization-3.1)
- Document CUDA crash workaround (PYANNOTE__INFERENCE_DEVICE=cpu) when running pyannote alongside GPU-based transcription

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
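The first fix above (feeding pyannote an in-memory array rather than a WAV round trip) relies on pyannote pipelines accepting a `{"waveform": ..., "sample_rate": ...}` mapping. A minimal sketch of shaping audio for that path; `to_pyannote_input` is a hypothetical helper, and the actual pipeline call (which needs torch, pyannote-audio, and an HF_TOKEN) is left as a comment:

```python
import numpy as np

def to_pyannote_input(audio: np.ndarray, sample_rate: int = 16000) -> dict:
    """Shape mono float32 PCM into the (channel, time) layout that
    pyannote's in-memory input expects, skipping the WAV-buffer detour."""
    if audio.ndim == 1:
        audio = audio[np.newaxis, :]  # (time,) -> (1, time)
    waveform = audio.astype(np.float32)
    # The real pipeline wants a torch.Tensor; conversion is one extra call:
    #   diarization = pipeline({"waveform": torch.from_numpy(waveform),
    #                           "sample_rate": sample_rate})
    return {"waveform": waveform, "sample_rate": sample_rate}

payload = to_pyannote_input(np.zeros(16000, dtype=np.float32))
print(payload["waveform"].shape)  # (1, 16000)
```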
Force-pushed from 5855334 to f58bc4f
The code changes did not help. Concerning: literally none of this AI-generated text applies here. The impact is huge; it's like 5x slower. This can't be run in separate containers, a RAM increase does not help, and nothing runs concurrently.
This feature will be great for helping science teams take their meeting notes.
How about realtime integration?
@mhenrichsen do you have time to look at this? It will be key for laboratory researchers who want meeting notes quickly.
It seems that the latest version is compatible.
🎯 Overview
This PR integrates Pyannote speaker diarization as a toggleable feature for the
`/v1/audio/transcriptions` endpoint, similar to the existing VAD functionality.

✨ Features Added

- Speaker diarization via the `pyannote/speaker-diarization-3.1` model
- `diarization=true/false` parameter (default: false)
- Speaker fields on `TranscriptionSegment` and `TranscriptionWord`
- `PyannoteConfig` with device and TTL settings

🧪 Testing
Successfully tested with a real audio file.
💡 Usage Example
```bash
curl -X POST "http://localhost:8000/v1/audio/transcriptions" \
-H "Content-Type: multipart/form-data" \
-F "file=@audio.wav" \
-F "model=guillaumekln/faster-whisper-tiny" \
-F "diarization=true" \
-F "response_format=verbose_json"
```
Response includes speaker information:
```json
{
"segments": [
{
"text": "Hello, this is speaker one.",
"start": 0.0,
"end": 2.5,
"speaker": "SPEAKER_00"
},
{
"text": "And this is speaker two responding.",
"start": 3.0,
"end": 5.5,
"speaker": "SPEAKER_01"
}
]
}
```
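Consumers of the `verbose_json` response can turn these segments into a speaker-attributed transcript; a small illustrative snippet (not part of the PR) using the example payload above:

```python
# Example verbose_json response body, mirroring the JSON shown above.
response = {
    "segments": [
        {"text": "Hello, this is speaker one.", "start": 0.0, "end": 2.5,
         "speaker": "SPEAKER_00"},
        {"text": "And this is speaker two responding.", "start": 3.0, "end": 5.5,
         "speaker": "SPEAKER_01"},
    ]
}

# Render one "SPEAKER_XX: text" line per segment.
lines = [f'{seg["speaker"]}: {seg["text"]}' for seg in response["segments"]]
print("\n".join(lines))
```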
📁 Implementation Details
- Adds `pyannote-audio>=3.1.1` as a dependency
- `PyannoteModelManager` following existing patterns
- `PyannoteConfig` and `_unstable_diarization` setting
- Wired into `/v1/audio/transcriptions` and `/v1/audio/translations`

🔧 Configuration
Set via environment variables:
- `_UNSTABLE_DIARIZATION=true` - Enable diarization by default
- `PYANNOTE__INFERENCE_DEVICE=cuda` - Use GPU for diarization
- `PYANNOTE__TTL=600` - Model cache time (seconds)

The feature is ready for review and integration! 🚀
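The `PYANNOTE__*` names suggest nested settings with `__` as the delimiter (pydantic-settings style). A hand-rolled sketch of how those variables could map onto the `PyannoteConfig` mentioned above; the field names and the `from_env` helper are assumptions for illustration, not the PR's actual implementation:

```python
import os
from dataclasses import dataclass

@dataclass
class PyannoteConfig:
    # Hypothetical fields mirroring the documented environment variables.
    inference_device: str = "cuda"  # PYANNOTE__INFERENCE_DEVICE
    ttl: int = 600                  # PYANNOTE__TTL, model cache time in seconds

    @classmethod
    def from_env(cls) -> "PyannoteConfig":
        # Mimics a "__"-delimited nested-settings lookup by hand.
        return cls(
            inference_device=os.environ.get(
                "PYANNOTE__INFERENCE_DEVICE", cls.inference_device),
            ttl=int(os.environ.get("PYANNOTE__TTL", cls.ttl)),
        )

# The CUDA-crash workaround documented above: force diarization onto CPU.
os.environ["PYANNOTE__INFERENCE_DEVICE"] = "cpu"
cfg = PyannoteConfig.from_env()
print(cfg.inference_device, cfg.ttl)  # cpu 600
```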