feat: add Pyannote speaker diarization support by mhenrichsen · Pull Request #523 · speaches-ai/speaches

mhenrichsen · 2025-09-20T20:51:33Z

🎯 Overview

This PR integrates Pyannote speaker diarization as a toggleable feature for the /v1/audio/transcriptions endpoint, similar to the existing VAD functionality.

✨ Features Added

🎯 Speaker Diarization: Integrates pyannote/speaker-diarization-3.1 model
🔧 Toggleable: Add diarization=true/false parameter (default: false)
🏗️ Model Management: Full PyannoteModelManager with lifecycle management
📊 API Extension: Speaker fields added to TranscriptionSegment and TranscriptionWord
⚙️ Configuration: PyannoteConfig with device and TTL settings
✅ Backward Compatible: Existing API calls work unchanged

🧪 Testing

Successfully tested with real audio file:

✅ Basic transcription: Works correctly
✅ Diarization enabled: Identifies 2 speakers (SPEAKER_00, SPEAKER_01)
✅ Speaker assignment: Correctly maps speakers to transcript segments
✅ API compatibility: No breaking changes to existing functionality

💡 Usage Example

```bash
curl -X POST "http://localhost:8000/v1/audio/transcriptions" \
-H "Content-Type: multipart/form-data" \
-F "file=@audio.wav" \
-F "model=guillaumekln/faster-whisper-tiny" \
-F "diarization=true" \
-F "response_format=verbose_json"
```

Response includes speaker information:
```json
{
"segments": [
{
"text": "Hello, this is speaker one.",
"start": 0.0,
"end": 2.5,
"speaker": "SPEAKER_00"
},
{
"text": "And this is speaker two responding.",
"start": 3.0,
"end": 5.5,
"speaker": "SPEAKER_01"
}
]
}
```

📁 Implementation Details

Dependencies: Added pyannote-audio>=3.1.1
Model Manager: Created PyannoteModelManager following existing patterns
Configuration: Added PyannoteConfig and _unstable_diarization setting
API Types: Extended with speaker fields maintaining backward compatibility
Endpoints: Updated /v1/audio/transcriptions and /v1/audio/translations
Tests: Comprehensive test coverage for all scenarios

🔧 Configuration

Set via environment variables:

_UNSTABLE_DIARIZATION=true - Enable diarization by default
PYANNOTE__INFERENCE_DEVICE=cuda - Use GPU for diarization
PYANNOTE__TTL=600 - Model cache time (seconds)

The feature is ready for review and integration! 🚀

dotmobo · 2025-09-30T08:25:32Z

amazing, please integrate this devs :)

sebastianelsner · 2025-10-02T11:25:36Z

I tested this PR today and it worked well for me. The issues I had:

lots and lots of UserWarnings, probably one for each chunk of audio to be diaritized:
torchaudio/_backend/utils.py:213: UserWarning: In 2.9, this function's implementation will be changed to use torchaudio.load_with_torchcodec` under the hood. Some parameters like ``normalize``, ``format``, ``buffer_size``, and ``backend`` will be ignored. We recommend that you port your code to rely directly on TorchCodec's decoder instead: https://docs.pytorch.org/torchcodec/stable/generated/torchcodec.decoders.AudioDecoder.html#torchcodec.decoders.AudioDecoder.
The uv.lock file needs updating with pyannote-audio uv add pyannote-audio otherwise the docker containers do not build
Add a section to the documentation, that the user needs to setup the HF_TOKEN and accept the EULA for the gated models pyannote/segmentation-3.0 and pyannote/speaker-diarization-3.1
When I ran pyannote with cuda, the transcription with the curl command given aboce would relieably crash the container. When running with PYANNOTE__INFERENCE_DEVICE=cpu, so that pyannote ran with cpu and transcription with gpu, the container did not crash.

mhenrichsen · 2025-10-02T11:53:15Z

@sebastianelsner please give it another go

Integrate pyannote/speaker-diarization-3.1 as a toggleable feature for the /v1/audio/transcriptions endpoint, similar to existing VAD functionality. Features: - Add pyannote-audio>=3.1.1 dependency with speaker-diarization-3.1 model - Create PyannoteModelManager following existing model manager patterns - Add PyannoteConfig with device and TTL settings - Extend TranscriptionSegment and TranscriptionWord with speaker fields - Add diarization parameter to transcription endpoints (default: false) - Implement speaker-transcription alignment for accurate results - Add comprehensive test coverage for enabled/disabled scenarios - Maintain full backward compatibility Usage: POST /v1/audio/transcriptions Content-Type: multipart/form-data diarization=true Response includes speaker labels (e.g., "SPEAKER_00", "SPEAKER_01") in segment and word objects when diarization is enabled. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>

- Fix torchaudio UserWarnings by passing numpy array directly to pyannote instead of converting to WAV buffer, avoiding deprecated file loading path - Update uv.lock with pyannote-audio properly included - Add documentation for HF_TOKEN setup and EULA acceptance requirements for gated Pyannote models (segmentation-3.0 and speaker-diarization-3.1) - Document CUDA crash workaround (PYANNOTE__INFERENCE_DEVICE=cpu) when running pyannote alongside GPU-based transcription 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

sebastianelsner · 2025-10-02T13:01:14Z

The code changes did not help.

Concerning:

The performance impact on diarization is typically minimal compared to the stability benefit. If you need maximum performance for both models, consider:
- Running them in separate containers with dedicated GPU memory allocations
- Increasing GPU memory if available
- Running operations sequentially instead of concurrently

Literally none of this AI generated text applies here. The impact is huge, its like 5x slower. this cant be run in separate containers, ram increase does not help and nothing runs concurrently.

flefevre · 2025-10-10T04:37:19Z

This feature will be great to help scientists team to make their meetings notes.
Hope you will integrate it

grungkers · 2025-10-14T15:57:09Z

How about realtime integration?

flefevre · 2025-11-21T08:57:57Z

@mhenrichsen have you time to look at this ? it will be key for laboratory researchers that want to meeting note very quickly
how can we help

flefevre · 2026-01-01T23:34:14Z

It seems that last version is compatible.
Could you check with speaker diarization #582 Thanks and happy new year

mhenrichsen and others added 2 commits October 2, 2025 13:54

mhenrichsen force-pushed the feature/pyannote-speaker-diarization branch from 5855334 to f58bc4f Compare October 2, 2025 11:55

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add Pyannote speaker diarization support#523

feat: add Pyannote speaker diarization support#523
mhenrichsen wants to merge 2 commits intospeaches-ai:masterfrom
mhenrichsen:feature/pyannote-speaker-diarization

mhenrichsen commented Sep 20, 2025

Uh oh!

dotmobo commented Sep 30, 2025

Uh oh!

sebastianelsner commented Oct 2, 2025

Uh oh!

mhenrichsen commented Oct 2, 2025

Uh oh!

sebastianelsner commented Oct 2, 2025

Uh oh!

flefevre commented Oct 10, 2025

Uh oh!

grungkers commented Oct 14, 2025

Uh oh!

flefevre commented Nov 21, 2025

Uh oh!

flefevre commented Jan 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

mhenrichsen commented Sep 20, 2025

🎯 Overview

✨ Features Added

🧪 Testing

💡 Usage Example

📁 Implementation Details

🔧 Configuration

Uh oh!

dotmobo commented Sep 30, 2025

Uh oh!

sebastianelsner commented Oct 2, 2025

Uh oh!

mhenrichsen commented Oct 2, 2025

Uh oh!

sebastianelsner commented Oct 2, 2025

Uh oh!

flefevre commented Oct 10, 2025

Uh oh!

grungkers commented Oct 14, 2025

Uh oh!

flefevre commented Nov 21, 2025

Uh oh!

flefevre commented Jan 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants