Meeting Transcription with Speaker Diarization
AI-Dictation is a powerful tool that transcribes audio recordings while identifying different speakers in the conversation. The application leverages state-of-the-art speech recognition models to provide accurate transcriptions with speaker labels.
- Multi-speaker diarization (speaker identification)
- Automatic speech recognition in 100+ languages
- Translation capability for cross-language understanding
- Speaker identification with custom naming
- Export transcripts as text or SRT format
- Real-time memory usage monitoring
- GPU acceleration support
Note: A detailed architecture diagram will be added in a future update.
The application integrates several key components in a modular pipeline:
- Audio Processing: Handles various input formats and preprocessing via librosa/torchaudio
- Speaker Diarization: Uses PyAnnote's speaker-diarization-3.1 model to segment audio by speaker
- Speech Recognition: Routes audio segments through the selected ASR model (Whisper/Wav2Vec2/Seamless)
- Translation Layer: For multilingual content, connects to appropriate translation backend
- Streamlit UI: Provides the interactive frontend for all operations
The pipeline uses a segment-based approach rather than processing the entire audio at once, enabling efficient memory usage and precise speaker attribution. Temporary files are managed with unique identifiers to prevent collisions during parallel processing.
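A minimal sketch of that flow, assuming the pyannote pipeline named above, librosa for audio loading, and a caller-supplied `transcribe_segment` function standing in for whichever ASR backend is selected (this is an illustration, not the exact code in dictate.py):

```python
import uuid

import librosa
import soundfile as sf
from pyannote.audio import Pipeline

def diarize_and_transcribe(audio_path, hf_token, transcribe_segment):
    """Segment-based pipeline: diarize first, then transcribe each speaker turn."""
    # 1. Speaker diarization over the full recording
    diarizer = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1", use_auth_token=hf_token
    )
    diarization = diarizer(audio_path)

    # 2. Load the audio once, then cut out and transcribe each speaker turn
    audio, sr = librosa.load(audio_path, sr=16000)
    results = []
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        clip = audio[int(turn.start * sr):int(turn.end * sr)]
        # A unique temporary file per segment avoids collisions in parallel runs
        tmp_path = f"segment_{uuid.uuid4().hex}.wav"
        sf.write(tmp_path, clip, sr)
        results.append((speaker, turn.start, turn.end, transcribe_segment(tmp_path)))
    return results
```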
We've experimented with multiple speech recognition models, including Wav2Vec2 and Whisper, and found superior results with OpenAI's Whisper. Published research matches our experience, showing Whisper consistently outperforming Wav2Vec2 on most speech recognition tasks.
Due to these findings, Wav2Vec2 will be removed in an upcoming release to streamline the application and focus on the best performing models.
Note on Language Support: We are actively working on properly mapping the supported languages for speech-to-text and text-to-text translation in the code; this will be fixed soon. For the most common languages encountered in US business, speech-to-text and text-to-text translation already work well.
| Model | Parameters | Disk Size | RAM/VRAM Required | Relative Speed | Supported Languages |
|---|---|---|---|---|---|
| Whisper tiny | 39M | ~150MB | ~1GB | ~10x | 96+ |
| Whisper base | 74M | ~290MB | ~1GB | ~7x | 96+ |
| Whisper small | 244M | ~970MB | ~2GB | ~4x | 96+ |
| Whisper medium | 769M | ~3.1GB | ~5GB | ~2x | 96+ |
| Whisper large-v3 | 1.55B | ~6.2GB | ~10GB | 1x (baseline) | 96+ |
| Wav2Vec2 base | 95M | ~360MB | ~1GB | ~6x | English-focused |
| Wav2Vec2 large | 317M | ~1.2GB | ~3GB | ~3x | English-focused |
| Wav2Vec2 XLS-R 300M | 300M | ~1.2GB | ~3GB | ~3x | 128+ |
| Seamless M4T v2 large | 2.3B | ~4.8GB | ~8GB | ~2x | 100+ |
Processing performance (approximate, on an RTX 4090; TODO: update with real metrics on various hardware):
- Whisper large-v3: ~12x faster than real-time (1 minute audio processed in ~5 seconds)
- Seamless M4T: ~6x faster than real-time (1 minute audio processed in ~10 seconds)
- Speaker diarization: ...
To launch with the provided scripts:
- Windows: `run.bat`
- macOS/Linux: `./run.sh`

To set up and run manually:
- Create a virtual environment: `python -m venv venv`
- Activate the virtual environment:
  - Windows: `venv\Scripts\activate`
  - macOS/Linux: `source venv/bin/activate`
- Install requirements: `pip install -r requirements.txt`
- Launch the application: `streamlit run dictate.py`
The application offers several speech recognition models:
Whisper
- Recommended option for most use cases
- Excellent accuracy across many languages
- Size options from tiny to large-v3, offering speed vs. accuracy trade-offs
- Handles ambient noise, accents, and technical vocabulary well
- Model sizes:
- Tiny: Fastest option, less accurate but good for quick drafts
- Base: Good balance of speed and accuracy
- Small: Better quality with reasonable speed
- Medium: High quality transcription
- Large-v3: Best quality, but slowest and requires more GPU memory
Whisper's approach: A Transformer sequence-to-sequence model trained on various speech processing tasks
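As an illustration of how a size choice translates into code, here is a minimal sketch of loading a Whisper checkpoint via faster-whisper (the CTranslate2 runtime noted under Technical Details); the size, beam width, and file name below are illustrative, not the application's exact settings:

```python
import torch
from faster_whisper import WhisperModel

# Pick the precision to match the hardware (float16 on GPU, int8 on CPU)
device = "cuda" if torch.cuda.is_available() else "cpu"
compute_type = "float16" if device == "cuda" else "int8"

model = WhisperModel("medium", device=device, compute_type=compute_type)
segments, info = model.transcribe("segment_example.wav", beam_size=5)

print(f"Detected language: {info.language} (p={info.language_probability:.2f})")
for seg in segments:
    print(f"[{seg.start:.2f}s -> {seg.end:.2f}s] {seg.text}")
```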
Wav2Vec2
- Provides uppercase output without punctuation
- Options include English-specific and multilingual models
- Generally requires more post-processing
- Note: to be removed in a future release
Seamless M4T
- Specialized for multilingual environments and cross-language translation
- End-to-end speech-to-speech and speech-to-text translation
- Supports 100+ languages with a single model
- Better handling of code-switching (multiple languages in one conversation)
- Particularly useful for:
- International meetings with multiple languages
- Direct translation without intermediate steps
- Content requiring high-quality translation between languages
- Implementation varies by platform (see the sketch below):
  - Linux: uses the native `seamless_communication` package
  - Windows/macOS: uses the Hugging Face Transformers implementation
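For reference, a sketch of the Windows/macOS path through Hugging Face Transformers, assuming the facebook/seamless-m4t-v2-large checkpoint; the actual pre- and post-processing in dictate.py may differ:

```python
import librosa
from transformers import AutoProcessor, SeamlessM4Tv2ForSpeechToText

processor = AutoProcessor.from_pretrained("facebook/seamless-m4t-v2-large")
model = SeamlessM4Tv2ForSpeechToText.from_pretrained("facebook/seamless-m4t-v2-large")

# Seamless expects 16 kHz mono audio
audio, _ = librosa.load("segment_example.wav", sr=16000)
inputs = processor(audios=audio, sampling_rate=16000, return_tensors="pt")

# generate() returns token ids in the requested target language ("eng" = English text)
tokens = model.generate(**inputs, tgt_lang="eng")[0]
print(processor.decode(tokens.squeeze().tolist(), skip_special_tokens=True))
```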
| Feature | Whisper | Seamless M4T | Wav2Vec2 |
|---|---|---|---|
| Best for general transcription | ✅ | ❌ | ❌ |
| Best for multilingual content | ❌ | ✅ | ❌ |
| Best for direct translation | ❌ | ✅ | ❌ |
| Processing speed | Fast | Medium | Fast |
| Memory requirements | Moderate | High | Low |
| Punctuation & capitalization | ✅ | ✅ | ❌ |
| Language identification | ✅ | ✅ | ❌ |
| Handles technical vocabulary | ✅ | | |
- Whisper: Choose for most everyday transcription needs, especially when working in a single language or when you need the best balance of accuracy and speed
- Seamless M4T: Choose when working with multiple languages in the same recording or when direct translation is the primary goal
- Wav2Vec2: Legacy option; use it only if you specifically need uppercase-only output or have compatibility requirements
- Minimum segment duration: controls the shortest audio segment that will be transcribed (0.5-5.0 seconds)
- Lower values capture more speech but may introduce more errors
- Higher values focus on longer, more meaningful utterances
- Default (1.5s) works well for most conversations
- Use librosa: More reliable but slightly slower audio processing
- Clean temporary files: Automatically removes temporary files after processing
- GPU acceleration: enables faster processing on CUDA-compatible GPUs
- Automatically detects available hardware
- Shows real-time memory usage
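A rough sketch of how such hardware detection and memory readouts can be gathered, assuming torch and psutil are available; the Streamlit widgets in dictate.py may present this differently:

```python
import psutil
import torch

def hardware_status():
    """Return the selected device plus current RAM/VRAM usage in GB."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    status = {
        "device": device,
        "ram_used_gb": round(psutil.virtual_memory().used / 1e9, 1),
    }
    if device == "cuda":
        status["vram_used_gb"] = round(torch.cuda.memory_allocated() / 1e9, 1)
    return status

print(hardware_status())
```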
- Upload an audio file (WAV, MP3, FLAC, OGG)
- The system identifies different speakers
- Assign names to each speaker
- View the complete transcript
- Translate segments to your preferred language if needed
- Download the transcript as text or SRT format
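For the SRT export step, a minimal sketch of the expected layout; the (speaker, start, end, text) tuple shape mirrors the pipeline sketch earlier and is an assumption, not dictate.py's exact data structure:

```python
def to_srt(segments):
    """Render (speaker, start_sec, end_sec, text) tuples as an SRT document."""
    def stamp(seconds):
        ms = int(round(seconds * 1000))
        h, rem = divmod(ms, 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1000)
        return f"{h:02}:{m:02}:{s:02},{ms:03}"

    lines = []
    for i, (speaker, start, end, text) in enumerate(segments, start=1):
        lines += [str(i), f"{stamp(start)} --> {stamp(end)}", f"{speaker}: {text}", ""]
    return "\n".join(lines)

print(to_srt([("SPEAKER_00", 0.0, 2.5, "Good morning, everyone.")]))
```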
- GPU vs CPU: Using a GPU dramatically speeds up processing, especially for large files
- Model Size: Larger models provide better accuracy but require more memory and processing time
- File Length: Longer recordings require more processing time and resources
- Memory Usage: The performance meter helps monitor resource utilization
- Platform-specific considerations:
- Seamless M4T performs best on Linux with the native implementation
- Windows/macOS users may see better performance with Whisper for large files
- Uses PyAnnote's ECAPA-TDNN embedding model (192-dimensional speaker embeddings)
- Version 3.1 adds improved handling of overlapping speech
- Diarization clustering is performed using agglomerative hierarchical clustering (AHC)
- Post-processing includes a 0.5s minimum segment duration filter
- Whisper uses CTranslate2 optimized runtime via faster-whisper
- 8-bit quantization for CPU inference, 16-bit float for GPU inference
- Custom segment extraction optimized for speaker-diarized input
- Fallback mechanisms to handle transcription failures
- Context window of 30 seconds for long-form content
- Two-stage pipeline for unsupported language combinations
- Neural Machine Translation implemented with NLLB-200 distilled models (see the sketch below)
- Language detection verification to prevent unnecessary translations
- Batching system for efficient processing of multiple segments
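A sketch of the NLLB-200 text-translation step via the Transformers translation pipeline; the distilled checkpoint name and FLORES-200 language codes below are illustrative:

```python
from transformers import pipeline

translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",
    src_lang="deu_Latn",  # FLORES-200 code for German (example source)
    tgt_lang="eng_Latn",  # FLORES-200 code for English
)

result = translator("Guten Morgen, fangen wir mit dem Statusbericht an.")
print(result[0]["translation_text"])  # prints the English translation
```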
- Speaker Diarization: Performance degrades when speakers have similar voices or with significant background noise
- Overlapping Speech: Accuracy drops significantly when multiple people speak simultaneously
- Domain-specific Terminology: Technical, medical, or specialized vocabulary may be transcribed incorrectly
- Heavy Accents: Non-standard accents can reduce transcription accuracy, particularly with smaller models
- Very Long Files: Files exceeding 2 hours may require significant memory and processing time
- Low-quality Audio: Recordings with sampling rates below 16kHz or significant compression artifacts show reduced accuracy
While not included in the UI, advanced users can fine-tune the models for domain-specific vocabulary:
- For Whisper, see the OpenAI fine-tuning documentation
- For Seamless M4T, see Meta's fine-tuning guide
The implementation is based on the following research papers:
- Whisper: Robust Speech Recognition via Large-Scale Weak Supervision (Radford et al., 2022)
- Wav2Vec 2.0: A Framework for Self-Supervised Learning of Speech Representations (Baevski et al., 2020)
- Seamless M4T: Massively Multilingual & Multimodal Machine Translation (Barrault et al., 2023)
- PyAnnote Diarization: End-to-end speaker segmentation for overlap-aware resegmentation (Bredin et al., 2021)
- dictate.py: Main application file with all functionality
- Key components:
- Audio processing (lines 70-180)
- Model loading functions (lines 180-250)
- Transcription functions (lines 250-350)
- Translation implementations (lines 350-450)
- Streamlit UI components (lines 450+)
- Two-phase diarization: Speaker segmentation followed by speaker clustering
- Optimized segment transcription: Uses a sliding window approach for consistent results
- Language detection: Confidence-based language identification with fallback mechanisms
- Translation memory: Caches translations to avoid redundant processing
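As an illustration of the translation-memory idea, a small cache wrapper keyed on (text, source, target); the real cache in dictate.py may be structured differently:

```python
from functools import lru_cache

def make_cached_translator(translate_fn):
    """Wrap any (text, src, tgt) -> str translator so repeated segments hit the cache."""
    @lru_cache(maxsize=4096)
    def cached(text: str, src: str, tgt: str) -> str:
        return translate_fn(text, src, tgt)
    return cached

# Usage with a stand-in backend; identical segments are only translated once.
translator = make_cached_translator(lambda text, src, tgt: f"[{src}->{tgt}] {text}")
print(translator("Guten Morgen", "de", "en"))
print(translator("Guten Morgen", "de", "en"))  # served from the cache
```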
To add a new ASR model:
- Create a new model loading function following the pattern of existing loaders
- Implement a segment transcription function specific to the model
- Add UI elements to select the new model
- Update the pipeline to handle the new model type
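A hypothetical example of the loader/transcriber pair such an extension would add, here wrapping the Transformers ASR pipeline with a Distil-Whisper checkpoint; the function names and model choice are illustrative only, not the pattern used verbatim in dictate.py:

```python
from functools import lru_cache
from transformers import pipeline

@lru_cache(maxsize=1)
def load_distil_whisper(device: str = "cpu"):
    """Load and cache the new ASR model once per session."""
    return pipeline(
        "automatic-speech-recognition",
        model="distil-whisper/distil-large-v3",
        device=0 if device == "cuda" else -1,
    )

def transcribe_segment_distil_whisper(asr, segment_path: str) -> str:
    """Transcribe one diarized segment file and return plain text."""
    return asr(segment_path)["text"]
```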
- Developed by OpenAI and released in September 2022
- Trained on 680,000 hours of multilingual and multitask supervised data collected from the web
- Uses a "weakly supervised" approach that enables robust performance across many languages
- Distinguished by its ability to handle a variety of acoustic environments and accents
- The large-v3 model (released in 2023) achieves near human-level accuracy in English speech recognition
- Open-sourced with an MIT license, making it widely accessible for developers
- Developed by Meta AI (Facebook) and released in August 2023
- Stands for "Massively Multilingual & Multimodal Machine Translation"
- Part of Meta's "No Language Left Behind" initiative to create AI systems for all languages
- First unified model capable of speech-to-speech, speech-to-text, text-to-speech, and text-to-text translation
- Seamless M4T v2 (released in late 2023) significantly improved quality and expanded language support
- Notable for preserving speakers' voices and emotion in translated speech
- Released under an open license to encourage research and application development
- Developed by Facebook AI Research (now Meta AI)
- Initial Wav2Vec model released in 2019, with Wav2Vec 2.0 following in 2020
- Pioneered self-supervised learning for speech recognition
- Breakthrough approach that could be fine-tuned with very small amounts of labeled data
- Made significant advancements in low-resource speech recognition
- Enabled speech recognition systems for languages with limited training data
- Served as the foundation for many subsequent speech recognition models
- Released under the MIT license and available through the Hugging Face Transformers library
- Automatic speaker name identification using information from introductions in meetings
- Advanced diarization to identify overlapping speakers
- Voice cloning for speech-to-speech and text-to-speech translation
- Spoken transcript output for translations into many languages
- Real-time transcription mode for live meetings
- Integration with semantic analysis for topic extraction and summarization
Contributions are welcome! Please feel free to submit a Pull Request.
If you encounter any issues or have questions, please open an issue on GitHub.