A turnkey solution for losslessly batch-extracting multiple target speakers from mixed audio, driven by reference (enrollment) recordings. Supports batch processing, GPU acceleration, and high-fidelity lossless export. Powered by NVIDIA TitaNet and Silero VAD.
- Multi-Target Extraction: Extract multiple speakers from mixed audio simultaneously
- Batch Processing: Process entire audio datasets in one run
- GPU Acceleration: Batch inference with CUDA support
- High Accuracy: NVIDIA TitaNet-Large for speaker embeddings
- Fast VAD: Silero VAD with ONNX acceleration
- Lossless Export: Preserves original sample rate and audio channels
```bash
# Clone the repository
git clone https://github.com/alexpsz/Multi-Target-Speaker-Extraction.git
cd Multi-Target-Speaker-Extraction

# Install dependencies
pip install -r requirements.txt
```

Note: For GPU support, install PyTorch with CUDA first: https://pytorch.org/get-started/locally/
```
enrollment_audio/
├── SpeakerA/
│   ├── sample1.wav
│   └── sample2.wav
└── SpeakerB/
    └── sample1.wav
```
- Create a folder for each speaker
- Add clean audio samples (`.wav` format recommended)
- More samples = better accuracy (3-10 samples recommended)
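The enrollment layout above can be sanity-checked with a short script before running the tool. This is an illustrative sketch, not part of the tool itself: the `check_enrollment` helper is a hypothetical name, and it assumes only the folder-per-speaker convention and the 3-10-sample recommendation described above.

```python
from pathlib import Path

def check_enrollment(root: str = "enrollment_audio") -> dict:
    """Map each speaker folder to its .wav sample count, warning on thin enrollment."""
    report = {}
    for speaker_dir in sorted(Path(root).iterdir()):
        if not speaker_dir.is_dir():
            continue  # only folders define speakers in this layout
        wavs = list(speaker_dir.glob("*.wav"))
        report[speaker_dir.name] = len(wavs)
        if len(wavs) < 3:
            print(f"warning: {speaker_dir.name} has {len(wavs)} sample(s); "
                  "3-10 clean samples are recommended")
    return report
```

For the tree shown above, this would report `{"SpeakerA": 2, "SpeakerB": 1}` and flag both speakers as under-enrolled.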
Place audio files to process in `input_audio/`:

```
input_audio/
├── audio1.wav
└── audio2.wav
```
Windows:

```bash
run_windows.bat
# Or
python run.py
```

Linux / macOS (untested):

```bash
# 1. Install dependencies
pip install -r requirements.txt

# 2. Run the application
python run.py
```

Results are written to `output/`:

```
output/
├── SpeakerA/
│   └── segments/
│       └── 0.85_audio1_seg_0_1.23_4.56.wav
├── SpeakerB/
│   └── segments/
├── metadata/
│   └── audio1.json
└── summary.json
```
- Filename format: `{similarity}_{source}_{segment_info}.wav`
- Sort by filename (descending) to order segments from highest to lowest similarity score
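Because the similarity score is the leading field of each filename, segments can be ranked with a few lines of Python. A minimal sketch, assuming only the `{similarity}_{source}_{segment_info}.wav` naming shown above (the `segments_by_similarity` helper is illustrative, not part of the tool):

```python
from pathlib import Path

def segments_by_similarity(speaker_dir: str):
    """Return (similarity, filename) pairs for one speaker, highest score first."""
    pairs = []
    for wav in Path(speaker_dir).glob("segments/*.wav"):
        score = float(wav.name.split("_", 1)[0])  # leading field is the score
        pairs.append((score, wav.name))
    return sorted(pairs, reverse=True)
```

For example, `0.85_audio1_seg_0_1.23_4.56.wav` parses to a score of 0.85.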
Edit `config.yaml` to customize:

```yaml
verification:
  similarity_threshold: 0.70   # 0.65-0.80 recommended

performance:
  batch_size: 32               # Increase for faster processing
  prefetch_workers: 2          # Audio prefetch threads

speaker_management:
  skip_speakers: []            # Speakers to skip
  include_only: []             # Process only these speakers
```

| Parameter | Default | Description |
|---|---|---|
| `similarity_threshold` | 0.70 | Cosine similarity threshold (0.65-0.80) |
| `min_duration` | 0.5 s | Minimum segment duration |
| `merge_gap` | 0.3 s | Maximum gap when merging adjacent segments |
| `batch_size` | 32 | Batch size for GPU inference |
| `prefetch_workers` | 2 | Audio prefetch thread count |
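A file like the excerpt above can be read with PyYAML and merged over the documented defaults. This is a minimal sketch under stated assumptions, not the tool's actual loader: the section and key names follow the `config.yaml` excerpt, the default values follow the table, and `load_config` is a hypothetical helper.

```python
import yaml  # PyYAML

# Defaults taken from the parameter table above.
DEFAULTS = {
    "verification": {"similarity_threshold": 0.70},
    "performance": {"batch_size": 32, "prefetch_workers": 2},
    "speaker_management": {"skip_speakers": [], "include_only": []},
}

def load_config(path: str = "config.yaml") -> dict:
    """Merge user config.yaml over the documented defaults (one level deep)."""
    with open(path, "r", encoding="utf-8") as f:
        user = yaml.safe_load(f) or {}
    merged = {section: dict(values) for section, values in DEFAULTS.items()}
    for section, values in user.items():
        merged.setdefault(section, {}).update(values or {})
    return merged
```

With this scheme, a `config.yaml` that sets only `similarity_threshold` still gets `batch_size: 32` and the other defaults.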
| Model | Purpose | Device |
|---|---|---|
| NVIDIA TitaNet-Large | Speaker embedding extraction | GPU |
| Silero VAD (ONNX) | Voice activity detection | CPU |
1. Extract reference embeddings → Compute average speaker vectors (L2 normalized)
2. VAD detection → Locate speech segments in input audio
3. Speaker identification → Extract segment embeddings, compute cosine similarity
4. Filtering → Keep segments above threshold
5. Merging → Merge adjacent segments from same speaker
6. Export → Save segments with original quality
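The math behind steps 1 and 3-5 can be sketched in a few pure-Python helpers. These are illustrative implementations of the ideas named above (L2-normalized mean embeddings, cosine similarity, gap-based merging); the function names are hypothetical, the defaults mirror the `merge_gap` and `min_duration` table entries, and the order of duration filtering relative to merging is an assumption.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def average_embedding(samples):
    """Step 1: mean of the enrollment embeddings, then L2-normalize."""
    dim = len(samples[0])
    mean = [sum(s[i] for s in samples) / len(samples) for i in range(dim)]
    return l2_normalize(mean)

def cosine(a, b):
    """Step 3: cosine similarity; a plain dot product once both are unit vectors."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))

def merge_segments(segments, merge_gap=0.3, min_duration=0.5):
    """Step 5 (plus the duration filter): merge a speaker's sorted
    (start, end) segments whose gap is <= merge_gap, then drop short ones."""
    merged = []
    for start, end in segments:
        if merged and start - merged[-1][1] <= merge_gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return [(s, e) for s, e in merged if e - s >= min_duration]
```

For example, `merge_segments([(0.0, 1.0), (1.2, 2.0), (5.0, 5.3)])` joins the first two segments (gap 0.2 s) and drops the 0.3 s stray, leaving `[(0.0, 2.0)]`.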
```
Multi-Target-Speaker-Extraction/
├── run.py                     # Entry point
├── speaker_verification.py    # Core logic
├── speaker_state_manager.py   # Speaker filtering
├── config.yaml                # Configuration
├── requirements.txt           # Dependencies
├── run_windows.bat            # Windows launcher
├── LICENSE                    # MIT License
├── enrollment_audio/          # Reference audio
├── input_audio/               # Input files
└── output/                    # Results
```
- GPU Memory: Adjust `batch_size` based on your VRAM
  - 8 GB VRAM: `batch_size: 32`
  - 16 GB VRAM: `batch_size: 64-96`
- Speed: Enable prefetching with `prefetch_workers: 2-4`
- Accuracy: Use more reference samples per speaker
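The VRAM guidance above can be folded into a small helper when scripting runs on different machines. The cut-offs for 8 GB and 16 GB mirror the tips above; the intermediate and fallback tiers are assumptions, and `suggest_batch_size` is a hypothetical name, not part of the tool.

```python
def suggest_batch_size(vram_gb: float) -> int:
    """Pick a batch_size for config.yaml from available VRAM.

    8 GB -> 32 and 16 GB -> 96 follow the documented tuning tips;
    the other tiers are rough assumptions.
    """
    if vram_gb >= 16:
        return 96
    if vram_gb >= 12:
        return 64   # assumed midpoint between the documented tiers
    if vram_gb >= 8:
        return 32
    return 16       # conservative fallback for small GPUs (assumption)
```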
Q: CUDA out of memory
A: Reduce `batch_size` in `config.yaml`

Q: No speakers detected
A: Lower `similarity_threshold` (try 0.65)

Q: Poor accuracy
A: Add more clean reference samples
This project is licensed under the MIT License - see the LICENSE file for details.
This project uses components under the following licenses:
- NVIDIA NeMo: Apache License 2.0
- Silero VAD: MIT License
- PyTorch: BSD-style License
- NVIDIA NeMo for the TitaNet speaker verification model
- Silero VAD for voice activity detection
- Sample audio files are from the LibriSpeech corpus (CC BY 4.0).
Multi-Target Speaker Extraction (MTSE) - batch-identifies and extracts the speech segments of multiple target speakers from mixed audio.

- Processes multiple speakers simultaneously
- GPU batched inference for speed
- High-accuracy NVIDIA TitaNet-Large model
- Fast speech detection with Silero VAD + ONNX
- Preserves original audio quality (sample rate, channels)
- Cross-platform (Windows/Linux/macOS)

Quick start:

1. Prepare reference audio: create one folder per speaker under `enrollment_audio/` and add clean speech samples
2. Place input audio: put the files to process into `input_audio/`
3. Run: double-click `run_windows.bat` or run `python run.py`
4. View results: segments are written to `output/`, named so they sort by similarity

Key parameters:

- `similarity_threshold`: similarity threshold, 0.65-0.80 recommended
- `batch_size`: batch size; the more VRAM available, the higher it can be set
- `prefetch_workers`: prefetch threads, 2-4 recommended

Troubleshooting:

- Out of VRAM: lower `batch_size`
- Missed speech: lower `similarity_threshold`
- False matches: add more reference samples or raise the threshold