Skip to content

feat: add Pyannote speaker diarization support#523

Open
mhenrichsen wants to merge 2 commits intospeaches-ai:masterfrom
mhenrichsen:feature/pyannote-speaker-diarization
Open

feat: add Pyannote speaker diarization support#523
mhenrichsen wants to merge 2 commits intospeaches-ai:masterfrom
mhenrichsen:feature/pyannote-speaker-diarization

Conversation

@mhenrichsen
Copy link
Copy Markdown

🎯 Overview

This PR integrates Pyannote speaker diarization as a toggleable feature for the /v1/audio/transcriptions endpoint, similar to the existing VAD functionality.

✨ Features Added

  • 🎯 Speaker Diarization: Integrates pyannote/speaker-diarization-3.1 model
  • 🔧 Toggleable: Add diarization=true/false parameter (default: false)
  • 🏗️ Model Management: Full PyannoteModelManager with lifecycle management
  • 📊 API Extension: Speaker fields added to TranscriptionSegment and TranscriptionWord
  • ⚙️ Configuration: PyannoteConfig with device and TTL settings
  • ✅ Backward Compatible: Existing API calls work unchanged

🧪 Testing

Successfully tested with real audio file:

  • ✅ Basic transcription: Works correctly
  • ✅ Diarization enabled: Identifies 2 speakers (SPEAKER_00, SPEAKER_01)
  • ✅ Speaker assignment: Correctly maps speakers to transcript segments
  • ✅ API compatibility: No breaking changes to existing functionality

💡 Usage Example

```bash
curl -X POST "http://localhost:8000/v1/audio/transcriptions" \
-H "Content-Type: multipart/form-data" \
-F "file=@audio.wav" \
-F "model=guillaumekln/faster-whisper-tiny" \
-F "diarization=true" \
-F "response_format=verbose_json"
```

Response includes speaker information:
```json
{
"segments": [
{
"text": "Hello, this is speaker one.",
"start": 0.0,
"end": 2.5,
"speaker": "SPEAKER_00"
},
{
"text": "And this is speaker two responding.",
"start": 3.0,
"end": 5.5,
"speaker": "SPEAKER_01"
}
]
}
```

📁 Implementation Details

  • Dependencies: Added pyannote-audio>=3.1.1
  • Model Manager: Created PyannoteModelManager following existing patterns
  • Configuration: Added PyannoteConfig and _unstable_diarization setting
  • API Types: Extended with speaker fields maintaining backward compatibility
  • Endpoints: Updated /v1/audio/transcriptions and /v1/audio/translations
  • Tests: Comprehensive test coverage for all scenarios

🔧 Configuration

Set via environment variables:

  • _UNSTABLE_DIARIZATION=true - Enable diarization by default
  • PYANNOTE__INFERENCE_DEVICE=cuda - Use GPU for diarization
  • PYANNOTE__TTL=600 - Model cache time (seconds)

The feature is ready for review and integration! 🚀

@dotmobo
Copy link
Copy Markdown

dotmobo commented Sep 30, 2025

amazing, please integrate this devs :)

@sebastianelsner
Copy link
Copy Markdown

I tested this PR today and it worked well for me. The issues I had:

  1. lots and lots of UserWarnings, probably one for each chunk of audio to be diaritized:
    torchaudio/_backend/utils.py:213: UserWarning: In 2.9, this function's implementation will be changed to use torchaudio.load_with_torchcodec` under the hood. Some parameters like ``normalize``, ``format``, ``buffer_size``, and ``backend`` will be ignored. We recommend that you port your code to rely directly on TorchCodec's decoder instead: https://docs.pytorch.org/torchcodec/stable/generated/torchcodec.decoders.AudioDecoder.html#torchcodec.decoders.AudioDecoder.
  2. The uv.lock file needs updating with pyannote-audio uv add pyannote-audio otherwise the docker containers do not build
  3. Add a section to the documentation, that the user needs to setup the HF_TOKEN and accept the EULA for the gated models pyannote/segmentation-3.0 and pyannote/speaker-diarization-3.1
  4. When I ran pyannote with cuda, the transcription with the curl command given aboce would relieably crash the container. When running with PYANNOTE__INFERENCE_DEVICE=cpu, so that pyannote ran with cpu and transcription with gpu, the container did not crash.

@mhenrichsen
Copy link
Copy Markdown
Author

@sebastianelsner please give it another go

mhenrichsen and others added 2 commits October 2, 2025 13:54
Integrate pyannote/speaker-diarization-3.1 as a toggleable feature for
the /v1/audio/transcriptions endpoint, similar to existing VAD functionality.

Features:
- Add pyannote-audio>=3.1.1 dependency with speaker-diarization-3.1 model
- Create PyannoteModelManager following existing model manager patterns
- Add PyannoteConfig with device and TTL settings
- Extend TranscriptionSegment and TranscriptionWord with speaker fields
- Add diarization parameter to transcription endpoints (default: false)
- Implement speaker-transcription alignment for accurate results
- Add comprehensive test coverage for enabled/disabled scenarios
- Maintain full backward compatibility

Usage:
POST /v1/audio/transcriptions
Content-Type: multipart/form-data
diarization=true

Response includes speaker labels (e.g., "SPEAKER_00", "SPEAKER_01")
in segment and word objects when diarization is enabled.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Fix torchaudio UserWarnings by passing numpy array directly to pyannote
  instead of converting to WAV buffer, avoiding deprecated file loading path
- Update uv.lock with pyannote-audio properly included
- Add documentation for HF_TOKEN setup and EULA acceptance requirements
  for gated Pyannote models (segmentation-3.0 and speaker-diarization-3.1)
- Document CUDA crash workaround (PYANNOTE__INFERENCE_DEVICE=cpu) when
  running pyannote alongside GPU-based transcription

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@mhenrichsen mhenrichsen force-pushed the feature/pyannote-speaker-diarization branch from 5855334 to f58bc4f Compare October 2, 2025 11:55
@sebastianelsner
Copy link
Copy Markdown

The code changes did not help.

Concerning:

The performance impact on diarization is typically minimal compared to the stability benefit. If you need maximum performance for both models, consider:
- Running them in separate containers with dedicated GPU memory allocations
- Increasing GPU memory if available
- Running operations sequentially instead of concurrently

Literally none of this AI generated text applies here. The impact is huge, its like 5x slower. this cant be run in separate containers, ram increase does not help and nothing runs concurrently.

@flefevre
Copy link
Copy Markdown

This feature will be great to help scientists team to make their meetings notes.
Hope you will integrate it

@grungkers
Copy link
Copy Markdown

How about realtime integration?

@flefevre
Copy link
Copy Markdown

@mhenrichsen have you time to look at this ? it will be key for laboratory researchers that want to meeting note very quickly
how can we help

@flefevre
Copy link
Copy Markdown

flefevre commented Jan 1, 2026

It seems that last version is compatible.
Could you check with speaker diarization #582 Thanks and happy new year

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants