feat: Implement native Apple Silicon transcription with MLX-Whisper or whisper.cpp #48

@davidamacey

Description

Summary

OpenTranscribe currently uses WhisperX for transcription, which doesn't fully utilize Apple Silicon's GPU capabilities (falls back to CPU on MPS devices). Implementing a native Apple Silicon solution using MLX-Whisper or whisper.cpp would provide significant performance improvements for macOS users with M1/M2/M3 chips.

Current State

  • WhisperX is configured to use CPU on Apple Silicon (MPS) devices due to compatibility issues
  • Performance is suboptimal compared to native implementations
  • Users with expensive Apple Silicon hardware aren't getting full value from their GPU

Proposed Solutions

Option 1: MLX-Whisper (Recommended)

Pros:

  • 30-40% faster than whisper.cpp on Apple Silicon
  • Native MLX framework designed specifically for Apple Silicon
  • Excellent GPU utilization on M1/M2/M3 chips
  • Python-based, easier integration with existing codebase
  • Active development and Apple support

Cons:

  • Only works on Apple Silicon (need to maintain WhisperX for other platforms)
  • Smaller community compared to whisper.cpp
  • May require significant refactoring of the transcription service

Option 2: Lightning-Whisper-MLX

Pros:

  • Claims a 10x speedup over whisper.cpp
  • Claims a 4x speedup over standard MLX-Whisper
  • Optimized specifically for Apple Silicon
  • Best-in-class performance for macOS

Cons:

  • Very new project, stability concerns
  • Limited documentation
  • May lack features compared to WhisperX

Option 3: whisper.cpp

Pros:

  • Cross-platform (works on Apple Silicon, CUDA, CPU)
  • Very mature and stable
  • 6-7x faster than vanilla Whisper on CPU
  • Good Apple Silicon support via Metal
  • Could potentially replace WhisperX entirely

Cons:

  • Requires C++ integration (more complex)
  • 30-40% slower than MLX-Whisper on Apple Silicon
  • Would need Python bindings or subprocess calls

Performance Benchmarks (2024)

Based on recent community benchmarks (the baselines differ, so the figures are not directly comparable):

  • MLX-Whisper: ~50% faster than vanilla Whisper on Apple Silicon
  • Lightning-Whisper-MLX: 10x faster than whisper.cpp (claimed, unverified)
  • whisper.cpp: 6-7x faster than vanilla Whisper on CPU, with good Metal support

Implementation Plan

Phase 1: Research & Prototype

  • Benchmark MLX-Whisper vs whisper.cpp on M1/M2/M3 hardware
  • Test feature parity with WhisperX (timestamps, speaker alignment)
  • Evaluate integration complexity for each option
  • Create proof-of-concept implementation

Phase 2: Architecture Design

  • Design abstraction layer for multiple transcription backends
  • Create platform detection logic (Apple Silicon vs CUDA vs CPU)
  • Plan migration strategy from WhisperX
  • Design configuration system for backend selection
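The platform detection logic above could look roughly like this sketch. The backend names (`mlx`, `whisperx`) and the function itself are hypothetical identifiers for this proposal, not existing code:

```python
import platform


def detect_backend() -> str:
    """Pick a transcription backend based on the host hardware.

    Hypothetical helper: Apple Silicon with MLX installed gets the
    native backend; everything else keeps the current WhisperX flow.
    """
    # Apple Silicon reports Darwin + arm64
    if platform.system() == "Darwin" and platform.machine() == "arm64":
        try:
            import mlx.core  # noqa: F401  # MLX only installs on Apple Silicon
            return "mlx"
        except ImportError:
            return "whisperx"  # MLX not available; fall back
    try:
        import torch
        if torch.cuda.is_available():
            return "whisperx"  # CUDA path stays on WhisperX
    except ImportError:
        pass
    return "whisperx"  # conservative CPU default
```

The same check can later gate the proposed `APPLE_SILICON_OPTIMIZATION` flag.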

Phase 3: Implementation

  • Implement chosen solution (likely MLX-Whisper)
  • Create fallback mechanism to WhisperX for non-Apple platforms
  • Update Docker configurations for Apple Silicon
  • Implement proper error handling and logging

Phase 4: Testing & Optimization

  • Comprehensive testing on M1, M2, M3 hardware
  • Performance benchmarking vs current WhisperX implementation
  • Memory usage optimization
  • Edge case testing (long files, multiple speakers)

Phase 5: Documentation & Deployment

  • Update documentation for Apple Silicon users
  • Create migration guide
  • Update setup scripts for macOS
  • Release with clear performance expectations

Technical Requirements

Core Features to Maintain

  • Word-level timestamps
  • Speaker diarization compatibility
  • Multiple language support
  • Batch processing capability
  • Progress callbacks for UI updates

New Requirements

  • Automatic backend selection based on hardware
  • Configurable backend via environment variables
  • Performance metrics logging
  • Graceful fallback on errors

Code Changes Required

Backend Changes

backend/app/tasks/transcription/
├── base_transcriber.py          # New: Abstract base class
├── whisperx_transcriber.py      # Refactored from whisperx_service.py
├── mlx_transcriber.py           # New: MLX-Whisper implementation
├── whisper_cpp_transcriber.py   # New: Optional whisper.cpp implementation
└── transcriber_factory.py       # New: Factory for backend selection

Configuration Updates

  • Add TRANSCRIPTION_BACKEND environment variable
  • Add APPLE_SILICON_OPTIMIZATION flag
  • Update hardware detection to identify MLX availability
  • Add backend-specific configuration options

Docker Updates

  • Create Apple Silicon specific Dockerfile variant
  • Update docker-compose with platform-specific service definitions
  • Ensure proper MLX installation in containers

Acceptance Criteria

  • 2x or better performance improvement on Apple Silicon vs current implementation
  • No regression in transcription quality
  • Seamless fallback to WhisperX on non-Apple hardware
  • All existing features continue to work
  • Clear documentation for users
  • Automated backend selection based on hardware

Performance Targets

  • M1 Pro: < 5 minutes for 1-hour audio (currently ~10 minutes)
  • M2/M3: < 4 minutes for 1-hour audio
  • Memory usage: < 8GB for large model
  • GPU utilization: > 80% during transcription

Related Issues

References

Labels

enhancement, performance, macos, apple-silicon, transcription, mlx

Priority

High - Significant performance improvement for growing Apple Silicon user base
