Meeting Transcription with Speaker Diarization
AI-Dictation is a powerful tool that transcribes audio recordings while identifying different speakers in the conversation. The application leverages state-of-the-art speech recognition models to provide accurate transcriptions with speaker labels.
- Multi-speaker diarization (speaker identification)
- Automatic speech recognition in 100+ languages
- Translation capability for cross-language understanding
- Speaker identification with custom naming
- Export transcripts as text or SRT format
- Real-time memory usage monitoring
- GPU acceleration support
Note: A detailed architecture diagram will be added in a future update.
The application integrates several key components in a modular pipeline:
- Audio Processing: Handles various input formats and preprocessing via librosa/torchaudio
- Speaker Diarization: Uses PyAnnote's speaker-diarization-3.1 model to segment audio by speaker
- Speech Recognition: Routes audio segments through the selected ASR model (Whisper/Wav2Vec2/Seamless)
- Translation Layer: For multilingual content, connects to appropriate translation backend
- Streamlit UI: Provides the interactive frontend for all operations
The pipeline uses a segment-based approach rather than processing the entire audio at once, enabling efficient memory usage and precise speaker attribution. Temporary files are managed with unique identifiers to prevent collisions during parallel processing.
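A minimal sketch of that flow, assuming the pyannote pipeline named above, librosa for audio loading, and a caller-supplied `transcribe_segment` function standing in for whichever ASR backend is selected (this is an illustration, not the exact code in dictate.py):

```python
import uuid

import librosa
import soundfile as sf
from pyannote.audio import Pipeline

def diarize_and_transcribe(audio_path, hf_token, transcribe_segment):
    """Segment-based pipeline: diarize first, then transcribe each speaker turn."""
    # 1. Speaker diarization over the full recording
    diarizer = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1", use_auth_token=hf_token
    )
    diarization = diarizer(audio_path)

    # 2. Load the audio once, then cut out and transcribe each speaker turn
    audio, sr = librosa.load(audio_path, sr=16000)
    results = []
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        clip = audio[int(turn.start * sr):int(turn.end * sr)]
        # A unique temporary file per segment avoids collisions in parallel runs
        tmp_path = f"segment_{uuid.uuid4().hex}.wav"
        sf.write(tmp_path, clip, sr)
        results.append((speaker, turn.start, turn.end, transcribe_segment(tmp_path)))
    return results
```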
We've experimented with multiple speech recognition models, including Wav2Vec2 and Whisper, and found superior results with OpenAI's Whisper. Published research matches our experience, showing Whisper consistently outperforming Wav2Vec2 on most speech recognition tasks.
Due to these findings, Wav2Vec2 will be removed in an upcoming release to streamline the application and focus on the best performing models.
Note on Language Support: We are actively working on properly mapping the supported languages for speech-to-text and text-to-text translation in the code; this will be fixed soon. For the most common languages encountered in US business, speech-to-text and text-to-text translation already work well.
| Model | Parameters | Disk Size | RAM/VRAM Required | Relative Speed | Supported Languages |
|---|---|---|---|---|---|
| Whisper tiny | 39M | ~150MB | ~1GB | ~10x | 96+ |
| Whisper base | 74M | ~290MB | ~1GB | ~7x | 96+ |
| Whisper small | 244M | ~970MB | ~2GB | ~4x | 96+ |
| Whisper medium | 769M | ~3.1GB | ~5GB | ~2x | 96+ |
| Whisper large-v3 | 1.55B | ~6.2GB | ~10GB | 1x (baseline) | 96+ |
| Wav2Vec2 base | 95M | ~360MB | ~1GB | ~6x | English-focused |
| Wav2Vec2 large | 317M | ~1.2GB | ~3GB | ~3x | English-focused |
| Wav2Vec2 XLS-R 300M | 300M | ~1.2GB | ~3GB | ~3x | 128+ |
| Seamless M4T v2 large | 2.3B | ~4.8GB | ~8GB | ~2x | 100+ |
Processing performance (approximate, on an RTX 4090; TODO: update with real metrics on various hardware):
- Whisper large-v3: ~12x faster than real-time (1 minute audio processed in ~5 seconds)
- Seamless M4T: ~6x faster than real-time (1 minute audio processed in ~10 seconds)
- Speaker diarization: ...
To launch with the provided scripts:
- Windows: `run.bat`
- macOS/Linux: `./run.sh`

To set up and run manually:
- Create a virtual environment: `python -m venv venv`
- Activate the virtual environment:
  - Windows: `venv\Scripts\activate`
  - macOS/Linux: `source venv/bin/activate`
- Install requirements: `pip install -r requirements.txt`
- Launch the application: `streamlit run dictate.py`
The application offers several speech recognition models:
Whisper
- Recommended option for most use cases
- Excellent accuracy across many languages
- Size options from tiny to large-v3, offering speed vs. accuracy trade-offs
- Handles ambient noise, accents, and technical vocabulary well
- Model sizes:
- Tiny: Fastest option, less accurate but good for quick drafts
- Base: Good balance of speed and accuracy
- Small: Better quality with reasonable speed
- Medium: High quality transcription
- Large-v3: Best quality, but slowest and requires more GPU memory
Whisper's approach: A Transformer sequence-to-sequence model trained on various speech processing tasks
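As an illustration of how a size choice translates into code, here is a minimal sketch of loading a Whisper checkpoint via faster-whisper (the CTranslate2 runtime noted under Technical Details); the size, beam width, and file name below are illustrative, not the application's exact settings:

```python
import torch
from faster_whisper import WhisperModel

# Pick the precision to match the hardware (float16 on GPU, int8 on CPU)
device = "cuda" if torch.cuda.is_available() else "cpu"
compute_type = "float16" if device == "cuda" else "int8"

model = WhisperModel("medium", device=device, compute_type=compute_type)
segments, info = model.transcribe("segment_example.wav", beam_size=5)

print(f"Detected language: {info.language} (p={info.language_probability:.2f})")
for seg in segments:
    print(f"[{seg.start:.2f}s -> {seg.end:.2f}s] {seg.text}")
```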
Wav2Vec2
- Provides uppercase output without punctuation
- Options include English-specific and multilingual models
- Generally requires more post-processing
- Note: to be removed in a future release
Seamless M4T
- Specialized for multilingual environments and cross-language translation
- End-to-end speech-to-speech and speech-to-text translation
- Supports 100+ languages with a single model
- Better handling of code-switching (multiple languages in one conversation)
- Particularly useful for:
- International meetings with multiple languages
- Direct translation without intermediate steps
- Content requiring high-quality translation between languages
- Implementation varies by platform (see the sketch below):
  - Linux: uses the native `seamless_communication` package
  - Windows/macOS: uses the Hugging Face Transformers implementation
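For reference, a sketch of the Windows/macOS path through Hugging Face Transformers, assuming the facebook/seamless-m4t-v2-large checkpoint; the actual pre- and post-processing in dictate.py may differ:

```python
import librosa
from transformers import AutoProcessor, SeamlessM4Tv2ForSpeechToText

processor = AutoProcessor.from_pretrained("facebook/seamless-m4t-v2-large")
model = SeamlessM4Tv2ForSpeechToText.from_pretrained("facebook/seamless-m4t-v2-large")

# Seamless expects 16 kHz mono audio
audio, _ = librosa.load("segment_example.wav", sr=16000)
inputs = processor(audios=audio, sampling_rate=16000, return_tensors="pt")

# generate() returns token ids in the requested target language ("eng" = English text)
tokens = model.generate(**inputs, tgt_lang="eng")[0]
print(processor.decode(tokens.squeeze().tolist(), skip_special_tokens=True))
```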
| Feature | Whisper | Seamless M4T | Wav2Vec2 |
|---|---|---|---|
| Best for general transcription | ✅ | ❌ | ❌ |
| Best for multilingual content | ❌ | ✅ | ❌ |
| Best for direct translation | ❌ | ✅ | ❌ |
| Processing speed | Fast | Medium | Fast |
| Memory requirements | Moderate | High | Low |
| Punctuation & capitalization | ✅ | ✅ | ❌ |
| Language identification | ✅ | ✅ | ❌ |
| Handles technical vocabulary | ✅ | | |
- Whisper: Choose for most everyday transcription needs, especially when working in a single language or when you need the best balance of accuracy and speed
- Seamless M4T: Choose when working with multiple languages in the same recording or when direct translation is the primary goal
- Wav2Vec2: Legacy option; use it only if you specifically need uppercase-only output or have compatibility requirements
- Minimum segment duration: controls the shortest audio segment that will be transcribed (0.5-5.0 seconds)
- Lower values capture more speech but may introduce more errors
- Higher values focus on longer, more meaningful utterances
- Default (1.5s) works well for most conversations
- Use librosa: More reliable but slightly slower audio processing
- Clean temporary files: Automatically removes temporary files after processing
- GPU acceleration: enables faster processing on CUDA-compatible GPUs
- Automatically detects available hardware
- Shows real-time memory usage
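A rough sketch of how such hardware detection and memory readouts can be gathered, assuming torch and psutil are available; the Streamlit widgets in dictate.py may present this differently:

```python
import psutil
import torch

def hardware_status():
    """Return the selected device plus current RAM/VRAM usage in GB."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    status = {
        "device": device,
        "ram_used_gb": round(psutil.virtual_memory().used / 1e9, 1),
    }
    if device == "cuda":
        status["vram_used_gb"] = round(torch.cuda.memory_allocated() / 1e9, 1)
    return status

print(hardware_status())
```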
- Upload an audio file (WAV, MP3, FLAC, OGG)
- The system identifies different speakers
- Assign names to each speaker
- View the complete transcript
- Translate segments to your preferred language if needed
- Download the transcript as text or SRT format
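For the SRT export step, a minimal sketch of the expected layout; the (speaker, start, end, text) tuple shape mirrors the pipeline sketch earlier and is an assumption, not dictate.py's exact data structure:

```python
def to_srt(segments):
    """Render (speaker, start_sec, end_sec, text) tuples as an SRT document."""
    def stamp(seconds):
        ms = int(round(seconds * 1000))
        h, rem = divmod(ms, 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1000)
        return f"{h:02}:{m:02}:{s:02},{ms:03}"

    lines = []
    for i, (speaker, start, end, text) in enumerate(segments, start=1):
        lines += [str(i), f"{stamp(start)} --> {stamp(end)}", f"{speaker}: {text}", ""]
    return "\n".join(lines)

print(to_srt([("SPEAKER_00", 0.0, 2.5, "Good morning, everyone.")]))
```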
- GPU vs CPU: Using a GPU dramatically speeds up processing, especially for large files
- Model Size: Larger models provide better accuracy but require more memory and processing time
- File Length: Longer recordings require more processing time and resources
- Memory Usage: The performance meter helps monitor resource utilization
- Platform-specific considerations:
- Seamless M4T performs best on Linux with the native implementation
- Windows/macOS users may see better performance with Whisper for large files
- Uses PyAnnote's ECAPA-TDNN embedding model (192-dimensional speaker embeddings)
- Version 3.1 adds improved handling of overlapping speech
- Diarization clustering is performed using agglomerative hierarchical clustering (AHC)
- Post-processing includes a 0.5s minimum segment duration filter
- Whisper uses CTranslate2 optimized runtime via faster-whisper
- 8-bit quantization for CPU inference, 16-bit float for GPU inference
- Custom segment extraction optimized for speaker-diarized input
- Fallback mechanisms to handle transcription failures
- Context window of 30 seconds for long-form content
- Two-stage pipeline for unsupported language combinations
- Neural Machine Translation implemented with NLLB-200 distilled models (see the sketch below)
- Language detection verification to prevent unnecessary translations
- Batching system for efficient processing of multiple segments
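A sketch of the NLLB-200 text-translation step via the Transformers translation pipeline; the distilled checkpoint name and FLORES-200 language codes below are illustrative:

```python
from transformers import pipeline

translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",
    src_lang="deu_Latn",  # FLORES-200 code for German (example source)
    tgt_lang="eng_Latn",  # FLORES-200 code for English
)

result = translator("Guten Morgen, fangen wir mit dem Statusbericht an.")
print(result[0]["translation_text"])  # prints the English translation
```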
- Speaker Diarization: Performance degrades when speakers have similar voices or with significant background noise
- Overlapping Speech: Accuracy drops significantly when multiple people speak simultaneously
- Domain-specific Terminology: Technical, medical, or specialized vocabulary may be transcribed incorrectly
- Heavy Accents: Non-standard accents can reduce transcription accuracy, particularly with smaller models
- Very Long Files: Files exceeding 2 hours may require significant memory and processing time
- Low-quality Audio: Recordings with sampling rates below 16kHz or significant compression artifacts show reduced accuracy
While not included in the UI, advanced users can fine-tune the models for domain-specific vocabulary:
- For Whisper, see the OpenAI fine-tuning documentation
- For Seamless M4T, see Meta's fine-tuning guide
The implementation is based on the following research papers:
- Whisper: Robust Speech Recognition via Large-Scale Weak Supervision (Radford et al., 2022)
- Wav2Vec 2.0: A Framework for Self-Supervised Learning of Speech Representations (Baevski et al., 2020)
- Seamless M4T: Massively Multilingual & Multimodal Machine Translation (Barrault et al., 2023)
- PyAnnote Diarization: End-to-end speaker segmentation for overlap-aware resegmentation (Bredin et al., 2021)
- dictate.py: Main application file with all functionality
- Key components:
- Audio processing (lines 70-180)
- Model loading functions (lines 180-250)
- Transcription functions (lines 250-350)
- Translation implementations (lines 350-450)
- Streamlit UI components (lines 450+)
- Two-phase diarization: Speaker segmentation followed by speaker clustering
- Optimized segment transcription: Uses a sliding window approach for consistent results
- Language detection: Confidence-based language identification with fallback mechanisms
- Translation memory: Caches translations to avoid redundant processing
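As an illustration of the translation-memory idea, a small cache wrapper keyed on (text, source, target); the real cache in dictate.py may be structured differently:

```python
from functools import lru_cache

def make_cached_translator(translate_fn):
    """Wrap any (text, src, tgt) -> str translator so repeated segments hit the cache."""
    @lru_cache(maxsize=4096)
    def cached(text: str, src: str, tgt: str) -> str:
        return translate_fn(text, src, tgt)
    return cached

# Usage with a stand-in backend; identical segments are only translated once.
translator = make_cached_translator(lambda text, src, tgt: f"[{src}->{tgt}] {text}")
print(translator("Guten Morgen", "de", "en"))
print(translator("Guten Morgen", "de", "en"))  # served from the cache
```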
To add a new ASR model:
- Create a new model loading function following the pattern of existing loaders
- Implement a segment transcription function specific to the model
- Add UI elements to select the new model
- Update the pipeline to handle the new model type
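A hypothetical example of the loader/transcriber pair such an extension would add, here wrapping the Transformers ASR pipeline with a Distil-Whisper checkpoint; the function names and model choice are illustrative only, not the pattern used verbatim in dictate.py:

```python
from functools import lru_cache
from transformers import pipeline

@lru_cache(maxsize=1)
def load_distil_whisper(device: str = "cpu"):
    """Load and cache the new ASR model once per session."""
    return pipeline(
        "automatic-speech-recognition",
        model="distil-whisper/distil-large-v3",
        device=0 if device == "cuda" else -1,
    )

def transcribe_segment_distil_whisper(asr, segment_path: str) -> str:
    """Transcribe one diarized segment file and return plain text."""
    return asr(segment_path)["text"]
```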
- Developed by OpenAI and released in September 2022
- Trained on 680,000 hours of multilingual and multitask supervised data collected from the web
- Uses a "weakly supervised" approach that enables robust performance across many languages
- Distinguished by its ability to handle a variety of acoustic environments and accents
- The large-v3 model (released in 2023) achieves near human-level accuracy in English speech recognition
- Open-sourced with an MIT license, making it widely accessible for developers
- Developed by Meta AI (Facebook) and released in August 2023
- Stands for "Massively Multilingual & Multimodal Machine Translation"
- Part of Meta's "No Language Left Behind" initiative to create AI systems for all languages
- First unified model capable of speech-to-speech, speech-to-text, text-to-speech, and text-to-text translation
- Seamless M4T v2 (released in late 2023) significantly improved quality and expanded language support
- Notable for preserving speakers' voices and emotion in translated speech
- Released under an open license to encourage research and application development
- Developed by Facebook AI Research (now Meta AI)
- Initial Wav2Vec model released in 2019, with Wav2Vec 2.0 following in 2020
- Pioneered self-supervised learning for speech recognition
- Breakthrough approach that could be fine-tuned with very small amounts of labeled data
- Made significant advancements in low-resource speech recognition
- Enabled speech recognition systems for languages with limited training data
- Served as the foundation for many subsequent speech recognition models
- Released under the MIT license and available through the Hugging Face Transformers library
- Automatic speaker name identification using information from introductions in meetings
- Advanced diarization to identify overlapping speakers
- Voice cloning for speech-to-speech and text-to-speech translation
- Spoken transcript output for translations into many languages
- Real-time transcription mode for live meetings
- Integration with semantic analysis for topic extraction and summarization
Contributions are welcome! Please feel free to submit a Pull Request.
If you encounter any issues or have questions, please open an issue on GitHub.