🎙️ A speech transcription and speaker diarization tool using MLX Whisper and pyannote.audio, optimized for Apple Silicon Macs.
Features a user-friendly graphical interface with drag-and-drop support, real-time progress display, subtitle editing, AWS Bedrock summarization, and more.
- Drag and drop files or click to select
- Choose Whisper model size
- Set language and output format
- Enable/disable speaker diarization
- Real-time display of processing progress and logs
- Automatic speaker identification (SPEAKER_00, SPEAKER_01...)
- Built-in subtitle editor with double-click editing
- Audio player with subtitle click-to-jump
- Generate meeting summaries using AWS Bedrock Claude
- Support for multiple AWS regions
- Customizable prompts
- One-click structured summary generation
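Diarization emits generic labels (SPEAKER_00, SPEAKER_01, ...). A common post-processing step is to swap in real names once you know who is who; below is a minimal illustrative sketch (the `rename_speakers` helper is not part of this project):

```python
import re

def rename_speakers(srt_text: str, names: dict[str, str]) -> str:
    """Replace generic [SPEAKER_XX] tags with human-readable names."""
    def sub(match: re.Match) -> str:
        label = match.group(1)
        return f"[{names.get(label, label)}]"  # keep unknown labels as-is
    return re.sub(r"\[(SPEAKER_\d+)\]", sub, srt_text)

text = "[SPEAKER_00] Please Peter, come explain\n[SPEAKER_01] Okay"
print(rename_speakers(text, {"SPEAKER_00": "Alice", "SPEAKER_01": "Peter"}))
```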
- 🖥️ User-Friendly GUI - Drag-and-drop files, real-time progress, subtitle editing
- ⚡ Apple Silicon Optimized - 6x faster processing with MLX and MPS GPU acceleration
- 🎯 Speaker Diarization - Automatically identify and label different speakers
- 🌍 Multi-Language Support - Supports Chinese, English, Japanese, and more
- 📝 Subtitle Editing - Built-in subtitle editor for real-time modifications and saving
- 🎵 Audio Playback - Synchronized audio playback with subtitle click-to-jump
- 💾 Multiple Formats - Output in SRT or TXT format
- 🤖 AI Summarization - Integrated AWS Bedrock for automatic meeting summaries
```bash
# 1. Clone the project
git clone https://github.com/KenexAtWork/MultiSpeakerASRwithAppleSilicon.git
cd MultiSpeakerASRwithAppleSilicon

# 2. Run installation script
./install.sh

# 3. Edit .env file and add your Hugging Face Token
nano .env  # or use another editor

# 4. Launch GUI
./run_gui.sh
```

Alternatively, install manually:

```bash
# 1. Install system dependencies (if not already installed)
brew install ffmpeg

# 2. Install uv (Python package manager, 10-100x faster than pip)
curl -LsSf https://astral.sh/uv/install.sh | sh

# 3. Clone the project
git clone https://github.com/KenexAtWork/MultiSpeakerASRwithAppleSilicon.git
cd MultiSpeakerASRwithAppleSilicon

# 4. Create virtual environment and install packages
uv venv --python 3.10
source .venv/bin/activate
uv pip install -e ".[all]"  # Install all features (GUI + AWS)

# 5. Configure environment variables
cp .env.example .env
nano .env  # Add your HF_TOKEN
```

Speaker diarization requires a Hugging Face token:
- Go to https://huggingface.co/settings/tokens to create a token
- Accept the model usage terms:
- Add the token to your `.env` file: `HF_TOKEN=your_token_here`
Detailed Tutorial: How to Apply for Hugging Face Token
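At runtime the token has to get from `.env` into the diarization pipeline. A minimal stdlib-only sketch of parsing the file is shown below; the actual project may use a library such as python-dotenv instead, and `load_env` is an illustrative name:

```python
import os
from pathlib import Path

def load_env(path: str = ".env") -> dict[str, str]:
    """Parse simple KEY=VALUE lines, ignoring comments and blanks."""
    env: dict[str, str] = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip().strip('"').strip("'")
    return env

# The GUI token field or the OS environment can override the file:
# hf_token = os.environ.get("HF_TOKEN") or load_env().get("HF_TOKEN")
```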
- Whisper speech recognition model: ~1.5 GB
- Speaker Diarization model: ~200 MB
- Download time: 2-8 minutes (depending on network speed)
- The GUI may appear unresponsive during the download; this is normal
- Models are automatically cached, subsequent launches are fast (< 5 seconds)
- Select File - Drag and drop video/audio file, or click "Select File" button
- Configure Parameters - Choose Whisper model, language, output format
- Start Transcription - Click "Start Transcription" button
- View Results - Subtitles will appear below after transcription completes
- Edit Subtitles - Double-click subtitles to edit, click "Save SRT" after modifications
- Play Audio - Click subtitles to jump to corresponding timestamp
| Parameter | Description | Recommended Value |
|---|---|---|
| Whisper Model | Affects accuracy and speed | medium (default) or small |
| Language | Primary audio language | auto (auto-detect) or zh (Chinese) |
| Output Format | Subtitle file format | SRT (standard subtitle format) |
| Speaker Diarization | Whether to identify different speakers | Checked (default) |
| Region | AWS region (for summarization) | us-west-2 |
- Drag and Drop - Supports mp4, m4a, mov, avi, mkv, wav, mp3, and more
- Real-time Progress - Displays processing stage and progress percentage
- Log Display - Real-time display of processing steps and error messages
- Subtitle Editing - Double-click subtitles to edit content, supports multi-line text
- Audio Playback - Synchronized audio playback, click subtitles to jump to timestamp
- Open Folder - Quick access to output file location
- AI Summarization - Generate meeting summaries using AWS Bedrock Claude
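Click-to-jump works by converting a subtitle's SRT timestamp into a player position. A hedged sketch of that conversion is below (the helper names are illustrative; in PyQt6 the result would be handed to `QMediaPlayer.setPosition`, which takes milliseconds):

```python
def srt_time_to_ms(ts: str) -> int:
    """Convert an SRT timestamp 'HH:MM:SS,mmm' to milliseconds."""
    hms, millis = ts.split(",")
    h, m, s = (int(x) for x in hms.split(":"))
    return ((h * 60 + m) * 60 + s) * 1000 + int(millis)

def ms_to_srt_time(ms: int) -> str:
    """Inverse conversion, for writing edited timestamps back out."""
    s, millis = divmod(ms, 1000)
    m, s = divmod(s, 60)
    h, m = divmod(m, 60)
    return f"{h:02d}:{m:02d}:{s:02d},{millis:03d}"

print(srt_time_to_ms("00:00:02,500"))  # 2500
```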
- Automatically reads AWS configuration (uses the `default` profile)
- Support for multiple AWS region selection
- Customizable prompt templates
- One-click structured summary generation
To use AWS Bedrock summarization:
- AWS Account - Need an AWS account with Bedrock service enabled
- AWS CLI Configuration - Configure AWS credentials:
```bash
# Install AWS CLI (if not already installed)
brew install awscli

# Configure AWS credentials (using default profile)
aws configure
# Enter:
# AWS Access Key ID
# AWS Secret Access Key
# Default region name (e.g., us-west-2)
# Default output format (json)
```

- Bedrock Permissions - Ensure the IAM user has Bedrock access (`bedrock:InvokeModel`)
- Recommended model: `anthropic.claude-3-sonnet-20240229-v1:0`
- Region Selection - Select a region with Bedrock service in the GUI:
  - `us-east-1` (N. Virginia)
  - `us-west-2` (Oregon)
  - `ap-northeast-1` (Tokyo)
  - Other Bedrock-supported regions
Note: The program automatically reads the default profile from ~/.aws/credentials and ~/.aws/config.
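Under the hood, a Bedrock summarization call sends an Anthropic Messages-format request through boto3's `bedrock-runtime` client. The sketch below is an assumption about how such a call could look, not this project's exact code (`build_summary_request` and `summarize` are illustrative names):

```python
import json

MODEL_ID = "anthropic.claude-3-sonnet-20240229-v1:0"

def build_summary_request(transcript: str, prompt: str) -> str:
    """Build the Anthropic Messages API body that Bedrock expects."""
    return json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 2048,
        "messages": [
            {"role": "user", "content": f"{prompt}\n\n{transcript}"},
        ],
    })

def summarize(transcript: str, prompt: str, region: str = "us-west-2") -> str:
    import boto3  # reads the default profile from ~/.aws/credentials
    client = boto3.client("bedrock-runtime", region_name=region)
    response = client.invoke_model(
        modelId=MODEL_ID,
        body=build_summary_request(transcript, prompt),
    )
    return json.loads(response["body"].read())["content"][0]["text"]
```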
For detailed usage instructions, see GUI Usage Guide.
- macOS with Apple Silicon (M-series chips: M1, M2, M3, M4 or newer)
- Python 3.10+
- ffmpeg
- Hugging Face account (for speaker diarization)
- At least 8GB RAM (16GB recommended)
The project includes sample videos and pre-processed output results:
```bash
# Test with sample video
./run_gui.sh
# Then drag examples/sample-01.mp4 into the GUI

# Or test via command line
./scripts/asr.sh examples/sample-01.mp4

# View sample output
cat examples/sample-output/sample-01_transcription.srt
```

The examples directory contains:
- `sample-01.mp4` - ~1 minute multi-speaker conversation video
- `sample-output/` - Pre-processed output results and screenshots

For detailed instructions, see examples/README.md
MLX Whisper supports multiple model sizes, choose based on memory and accuracy needs:
| Model | Memory Usage | Accuracy | Speed | Use Case |
|---|---|---|---|---|
| `tiny` | ~1-2 GB | ⭐⭐ | Fastest | Quick testing, drafts |
| `base` | ~2-3 GB | ⭐⭐⭐ | Very fast | Simple conversations, memory-constrained |
| `small` | ~3-4 GB | ⭐⭐⭐⭐ | Fast | General meetings, recommended |
| `medium` | ~5-7 GB | ⭐⭐⭐⭐⭐ | Medium | Default, high-quality needs |
| `large` | ~8-10 GB | ⭐⭐⭐⭐⭐ | Slower | Highest accuracy needs |
Selection Recommendations:
- 8GB RAM Mac: `small` or `base`
- 16GB RAM Mac: `small` or `medium` (default)
- 32GB+ RAM Mac: can use `large`
- Mixed Chinese-English speech: at least `small`
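The recommendations above reduce to a simple lookup. Here is an illustrative sketch (`pick_model` is not a project function, and the thresholds simply restate the table):

```python
def pick_model(ram_gb: int, mixed_language: bool = False) -> str:
    """Suggest a Whisper model size from available RAM, per the table above."""
    if ram_gb >= 32:
        model = "large"
    elif ram_gb >= 16:
        model = "medium"
    elif ram_gb >= 8:
        model = "small"
    else:
        model = "base"
    # mixed Chinese-English audio benefits from at least `small`
    if mixed_language and model in ("tiny", "base"):
        model = "small"
    return model

print(pick_model(16))  # medium
```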
Supports multiple languages, including:
- `auto`: Auto-detect (recommended for mixed languages)
- `zh`: Chinese
- `en`: English
- `ja`: Japanese
- `es`: Spanish
- `fr`: French
- `de`: German
- `it`: Italian
- `pt`: Portuguese
- `ru`: Russian
- `ko`: Korean
Mixed Language Support:
- For Chinese-English mixed audio, select `auto` (auto-detect)
- If the audio is primarily one language, specify that language
- Whisper can handle mixed-language audio, but accuracy depends on the mixing ratio
Test results on M1 Pro (8-core CPU, 14-core GPU, 16GB RAM):
| Video Length | Processing Time | Speedup | Model |
|---|---|---|---|
| 90 seconds | ~45 seconds | 2.0x | medium |
| 3 minutes | ~55 seconds | 3.3x | medium |
| 14 minutes | ~4.5 minutes | 3.1x | medium |
| 31 minutes | ~10 minutes | 3.1x | medium |
Performance Optimization:
- GPU acceleration is ~6x faster than CPU
- Using
smallmodel can further improve speed (~1.5x) - Disabling speaker diarization saves ~30% time
Standard subtitle format, can be used directly in video players:
```
1
00:00:00,000 --> 00:00:02,500
[SPEAKER_00] Please Peter, come explain

2
00:00:02,500 --> 00:00:04,200
[SPEAKER_00] I'll hand it over to Peter

3
00:00:06,239 --> 00:00:06,639
[SPEAKER_01] Okay
```
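Entries in this shape can be parsed in a few lines. The project ships its own SRT parsing code (exercised by `test_srt_parser.py`); this standalone sketch is illustrative only:

```python
import re

# index, start --> end, then text up to the next blank line or end of file
ENTRY = re.compile(
    r"(\d+)\s*\n"
    r"(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})\s*\n"
    r"(.*?)(?=\n\s*\n|\Z)",
    re.S,
)

def parse_srt(text: str) -> list[dict]:
    """Split an SRT file into {index, start, end, text} entries."""
    return [
        {"index": int(i), "start": start, "end": end, "text": body.strip()}
        for i, start, end, body in ENTRY.findall(text)
    ]

sample = "1\n00:00:00,000 --> 00:00:02,500\n[SPEAKER_00] Please Peter, come explain\n"
print(parse_srt(sample)[0]["end"])  # 00:00:02,500
```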
Plain text format, suitable for reading and editing:
```
[SPEAKER_00] 00:00:00,000 --> 00:00:02,500
Please Peter, come explain

[SPEAKER_00] 00:00:02,500 --> 00:00:04,200
I'll hand it over to Peter

[SPEAKER_01] 00:00:06,239 --> 00:00:06,639
Okay
```
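The changelog mentions paragraph merging of consecutive same-speaker segments (handled by merge_srt.py). The standalone sketch below is an assumption about that approach, not the project's actual code; the 1.0-second gap threshold is hypothetical:

```python
def merge_same_speaker(segments: list[dict], max_gap: float = 1.0) -> list[dict]:
    """Merge consecutive segments that share a speaker and are close in time.

    Each segment is {"speaker": str, "start": float, "end": float, "text": str}.
    """
    merged: list[dict] = []
    for seg in segments:
        prev = merged[-1] if merged else None
        if (prev and prev["speaker"] == seg["speaker"]
                and seg["start"] - prev["end"] <= max_gap):
            prev["end"] = seg["end"]          # extend the running paragraph
            prev["text"] += " " + seg["text"]
        else:
            merged.append(dict(seg))          # start a new paragraph
    return merged

segs = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 2.5, "text": "Please Peter, come explain"},
    {"speaker": "SPEAKER_00", "start": 2.5, "end": 4.2, "text": "I'll hand it over to Peter"},
    {"speaker": "SPEAKER_01", "start": 6.24, "end": 6.64, "text": "Okay"},
]
print(len(merge_same_speaker(segs)))  # 2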
If you prefer using the command line or need batch processing, use the command line version:

```bash
# Basic usage (auto-generate output filename)
./scripts/asr.sh video.mp4

# Specify output filename and format
./scripts/asr.sh video.mp4 output.srt
./scripts/asr.sh video.mp4 output.txt --format txt

# Specify language and model
./scripts/asr.sh video.mp4 output.srt --language zh --model small

# Skip speaker diarization (faster)
./scripts/asr.sh video.mp4 output.srt --skip-diarization

# Use CPU (no GPU)
./scripts/asr.sh video.mp4 output.srt --no-gpu
```

Or call the Python script directly:

```bash
source .venv/bin/activate
python asr_multi_speaker_v5_fast.py \
    --input video.mp4 \
    --output output.srt \
    --language zh \
    --model medium \
    --hf-token YOUR_TOKEN
```

Batch processing:

```bash
# Process all videos in a folder
for video in videos/*.mp4; do
    ./scripts/asr.sh "$video"
done
```

Problem: GUI won't start

```bash
# Confirm PyQt6 is installed
source .venv/bin/activate
uv pip install PyQt6

# Check Python version (needs 3.10+)
python --version
```

Problem: Drag and drop not responding
- Confirm file format is supported (mp4, m4a, mov, avi, mkv, wav, mp3)
- Check if file path contains special characters
- Check log window for error messages
Problem: Audio won't play
- Confirm file path is correct
- Check for Chinese or special characters (now supported, but may need to reselect file)
- Check log window for error messages
Problem: Can't save edited subtitles
- Confirm write permissions
- Check if output path exists
- Try manually specifying output filename
Problem: ffmpeg not found

```bash
brew install ffmpeg
```

Problem: mlx module not found

Confirm you are using ARM64-native Python:

```bash
file $(which python)
# Should show: Mach-O 64-bit executable arm64
```

If it shows x86_64, recreate the environment:

```bash
uv venv --python 3.10
source .venv/bin/activate
uv pip install -e .
```

Problem: Speaker diarization fails
- Confirm HF_TOKEN is set (in .env file or GUI)
- Confirm model usage terms accepted:
- Try unchecking "Speaker Diarization" option
- Check network connection (first use requires model download)
Problem: Slow processing
- Confirm GPU acceleration is enabled (default)
- Try a smaller model (`small` or `base`)
- Unchecking "Speaker Diarization" saves ~30% of processing time
- Close other resource-intensive applications

Problem: Out of memory
- Use a smaller model: `small` (3-4 GB) or `base` (2-3 GB)
- Uncheck "Speaker Diarization"
- Close other applications to free memory
- Consider upgrading RAM (16GB recommended)

Problem: Inaccurate transcription
- Try a larger model (`medium` or `large`)
- Confirm the language setting is correct (or use `auto`)
- Check audio quality (background noise, volume)
- For Chinese-English mixed audio, use the `auto` language setting
Problem: First run is very slow
First run downloads models (~1.7 GB), takes 5-15 minutes. After model download completes, subsequent use is fast.
Problem: Tests fail

```bash
# Run test diagnostics
cd asr
./run_tests.sh --fast --verbose

# Check environment
source .venv/bin/activate
python -c "import mlx_whisper; print('MLX OK')"
python -c "import PyQt6; print('PyQt6 OK')"
```

Running the test suite:

```bash
# Quick test (~20 seconds)
./run_tests.sh --fast

# Full test (~60 seconds)
./run_tests.sh

# Run specific test only
./run_tests.sh --test pipeline
./run_tests.sh --test merge

# View test coverage
cat tests/TEST_COVERAGE.md
```

For detailed testing instructions, see tests/README.md
The project uses GitHub Actions for automated testing:
- Automatically runs on push to main/develop branches
- Test time ~16 seconds
- View test results: https://github.com/KenexAtWork/MultiSpeakerASRwithAppleSilicon/actions
```
asr/
├── gui/                          # GUI application
│   ├── main.py                   # GUI main program
│   ├── ui/                       # UI components
│   │   └── main_window.py        # Main window
│   └── core/                     # Core logic
│       ├── asr_worker.py         # ASR processing thread
│       └── summary_worker.py     # Summary generation thread
├── tests/                        # Automated tests
│   ├── test_pipeline_e2e.py      # Pipeline end-to-end test
│   ├── test_merge_srt.py         # Merge logic test
│   ├── test_gui_display.py       # GUI display test
│   ├── test_gui_processing.py    # GUI processing test
│   ├── test_gui_media_url.py     # GUI media player test
│   ├── test_srt_parser.py        # SRT parser test
│   ├── test_error_handling.py    # Error handling test
│   └── TEST_COVERAGE.md          # Test coverage documentation
├── scripts/                      # Script tools
│   ├── asr.sh                    # Command line convenience script
│   ├── asr_chunked.sh            # Chunked processing script
│   ├── benchmark.sh              # Performance test script
│   ├── ci_test.sh                # CI/CD test script
│   └── cleanup_for_git.sh        # Git cleanup script
├── examples/                     # Example files
│   ├── sample-01.mp4             # Sample video
│   └── sample-output/            # Sample output results
├── screenshots/                  # GUI screenshots
│   ├── 01-main-interface.png
│   ├── 02-transcription-result.png
│   └── 03-aws-summary.png
├── asr_multi_speaker_v5_fast.py  # Main program (command line version)
├── merge_srt.py                  # SRT merge module
├── install.sh                    # Auto installation script
├── run_gui.sh                    # GUI launch script
├── run_tests.sh                  # Test execution script
├── pyproject.toml                # Python project configuration
├── .env.example                  # Environment variable example
├── .gitignore                    # Git ignore file
├── README.md                     # This file
├── COMPARISON.md                 # Comparison with WhisperX
└── LICENSE                       # MIT License
```
Issues and Pull Requests are welcome!
- Fork the project
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request

Before submitting a PR, ensure:
- All tests pass (`./run_tests.sh`)
- New features have corresponding tests
- Code follows the project style
This project and WhisperX both use pyannote.audio for speaker diarization, but have the following key differences:
| Feature | This Project | WhisperX |
|---|---|---|
| Hardware Optimization | Apple Silicon (MPS) | NVIDIA GPU (CUDA) |
| User Interface | GUI + Command Line | Command Line |
| Timestamp Precision | ±0.1-0.5 seconds | ±0.01-0.05 seconds (forced alignment) |
| Cross-Platform | macOS only (Apple Silicon) | Linux, Windows, macOS |
| Processing Speed | Fast (M1 native) | Very fast (CUDA) |
| Features | ASR + Speaker Diarization + Subtitle Editing | ASR + Speaker Diarization + Translation + Batch |
| Installation | Simple | Medium |
Choose this project if you:
- ✅ Use an Apple Silicon Mac
- ✅ Want a user-friendly GUI
- ✅ Need subtitle editing features
- ✅ Need the fastest processing speed on a Mac

Choose WhisperX if you:
- ✅ Use an NVIDIA GPU
- ✅ Need millisecond-level timestamp precision
- ✅ Need translation and batch processing features
For detailed comparison, see COMPARISON.md
MIT License - See LICENSE file
- MLX Whisper - Apple Silicon optimized Whisper implementation
- pyannote.audio - Speaker diarization model
- OpenAI Whisper - Original Whisper model
- PyQt6 - GUI framework
- ✨ Added PyQt6 graphical interface
- ✨ Support for subtitle editing and audio playback
- ✨ Added paragraph merging (same speaker)
- ✨ Complete automated test suite (54% coverage)
- ✨ GitHub Actions CI/CD integration
- 🐛 Fixed Chinese filename audio playback issue
- 🐛 Fixed language=auto crash issue
- ✨ Added MPS GPU acceleration support
- ⚡ Increased CPU threads to 8
- 🚀 6x performance improvement
- ✨ Support for pyannote.audio 4.x API
- 🐛 Fixed PyTorch 2.6+ weights_only issue
- 🔧 Optimized for M1 Mac
- ✨ Use subprocess to isolate speaker diarization
- 🐛 Fixed segmentation fault issue
- ✨ Initial version
- 🎯 Support for multi-speaker diarization
Project Maintainer: @KenexAtWork
Issue Reporting: https://github.com/KenexAtWork/MultiSpeakerASRwithAppleSilicon/issues