
Optimize Whisper transcription performance for large audio files #46

@krisoye

Description


Observation

During testing with an 18MB audio file (~19 minutes of audio), we observed:

  • Transcription started at ~60 frames/sec
  • Degraded to ~19 frames/sec at 75% completion
  • Total processing time: ~2 hours for 19-minute audio
  • Model: Whisper large-v3 (1.5B parameters)

Performance Data

  • Model download: 2.88GB (one-time)
  • Audio file: 18MB m4a
  • Total frames: 286,596
  • Speed degradation: 60 fps → 19 fps
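As a sanity check on these numbers, a quick back-of-envelope (assuming the reported frame count is the decoder's own progress counter):

```python
# Purely arithmetic check on the reported figures; no Whisper involved.
frames = 286_596

minutes_at_60fps = frames / 60 / 60   # ≈ 79.6 min if speed never degraded
minutes_at_19fps = frames / 19 / 60   # ≈ 251.4 min at the degraded rate

print(minutes_at_60fps, minutes_at_19fps)
# The observed ~120 min sits between the two, consistent with a gradual
# slowdown rather than a uniformly slow run.
```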

Potential Optimizations

1. Model Selection

  • Current: large-v3 (best accuracy, slowest)
  • Consider: medium (769M params, good accuracy, faster)
  • Option: Make model configurable per-request
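A minimal sketch of per-request model selection. The names here (`WHISPER_MODEL`, `resolve_model`) are illustrative assumptions, not the service's actual config surface:

```python
import os
from typing import Optional

# Hypothetical per-request model selection: fall back to an
# env-configured default when the request does not specify a model.
ALLOWED_MODELS = {"tiny", "base", "small", "medium", "large-v3"}

def resolve_model(requested: Optional[str]) -> str:
    """Pick a Whisper model size for one transcription request."""
    model = requested or os.environ.get("WHISPER_MODEL", "large-v3")
    if model not in ALLOWED_MODELS:
        raise ValueError(f"unknown model size: {model!r}")
    return model

# A request handler would then do something like:
#   model = whisper.load_model(resolve_model(request_model))
```

Validating against an allow-list matters here because model names flow into a download path; an unknown size should fail fast rather than trigger a 2.88GB fetch of the wrong checkpoint.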

2. Hardware Acceleration

  • Check if CUDA/GPU acceleration is available
  • Verify WHISPER_DEVICE=auto is selecting optimal device
  • Consider batch processing optimizations
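To check what `WHISPER_DEVICE=auto` would actually resolve to, something like this could be run in the service's environment (assumes the service uses PyTorch under the hood, as openai-whisper does):

```python
def detect_device() -> str:
    """Report the best available inference device, falling back to CPU."""
    try:
        import torch
    except ImportError:
        return "cpu"  # no torch installed: CPU-only environment
    if torch.cuda.is_available():
        return "cuda"
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"  # Apple Silicon GPU
    return "cpu"

print(detect_device())
```

If this prints `cpu` on a machine that has a GPU, the 60 → 19 fps numbers are being produced entirely on CPU and device selection is the first thing to fix.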

3. Memory Management

  • Investigate memory usage patterns
  • Check for memory leaks causing slowdown
  • Profile CPU/memory usage during transcription
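One low-cost way to test the leak hypothesis is to sample the Python heap between iterations of the transcription loop. A sketch using only the standard library (`step` is a stand-in for one slice of real work; this sees Python-side allocations only, not native/GPU buffers):

```python
import tracemalloc

def sample_heap(step, samples: int = 5):
    """Run `step` repeatedly, recording current traced heap bytes after each."""
    tracemalloc.start()
    readings = []
    for _ in range(samples):
        step()
        readings.append(tracemalloc.get_traced_memory()[0])  # current bytes
    tracemalloc.stop()
    return readings

# Demo: a step that retains data every call shows monotonic growth,
# which is the signature to look for in the real transcription loop.
retained = []
readings = sample_heap(lambda: retained.append(bytearray(1_000_000)))
print(readings)
```

Monotonically increasing readings across a long run would point at retained state (e.g. an ever-growing decode history) as the cause of the 60 → 19 fps slide.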

4. Chunking Strategy

  • Evaluate whether audio chunking could improve performance
  • Consider parallel processing of chunks
  • Balance between accuracy and speed
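The chunking idea can be sketched as fixed windows with a small overlap (to avoid cutting words at boundaries), transcribed concurrently. `transcribe_chunk` is hypothetical; threads only help if the heavy lifting releases the GIL (GPU/native code), otherwise a process pool would be needed:

```python
from concurrent.futures import ThreadPoolExecutor

def chunk_offsets(duration_s: float, window_s: float = 30.0,
                  overlap_s: float = 1.0):
    """Split [0, duration) into overlapping (start, end) windows."""
    offsets, start = [], 0.0
    while start < duration_s:
        offsets.append((start, min(start + window_s, duration_s)))
        start += window_s - overlap_s
    return offsets

def transcribe_chunk(span):
    start, end = span
    # Placeholder: a real worker would run the model on audio[start:end].
    return f"[{start:.0f}-{end:.0f}s]"

def transcribe_parallel(duration_s: float, workers: int = 4) -> str:
    spans = chunk_offsets(duration_s)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pieces = pool.map(transcribe_chunk, spans)  # preserves chunk order
    return " ".join(pieces)

print(transcribe_parallel(95.0))
```

The accuracy trade-off noted above shows up at the seams: chunking loses Whisper's cross-window context, which the overlap only partially recovers.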

5. Caching

  • Cache frequently used model weights
  • Implement audio preprocessing cache
  • Store intermediate results for retries
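A preprocessing cache could be keyed by a content hash of the upload, so a retry of the same file skips the expensive decode/resample step. Everything below is a hypothetical sketch (`CACHE_DIR` is a throwaway temp dir; a real service would use a persistent path with an eviction policy):

```python
import hashlib
import tempfile
from pathlib import Path

CACHE_DIR = Path(tempfile.mkdtemp()) / "whisper-cache"  # demo location only

def cache_key(audio_path: Path) -> str:
    """Content hash of the upload, so identical retries share one entry."""
    return hashlib.sha256(audio_path.read_bytes()).hexdigest()

def cached_preprocess(audio_path: Path, preprocess):
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    entry = CACHE_DIR / cache_key(audio_path)
    if entry.exists():
        return entry.read_bytes()      # cache hit: skip decode/resample
    data = preprocess(audio_path)      # cache miss: do the expensive work
    entry.write_bytes(data)
    return data

# Demo: the second call hits the cache and never re-runs preprocessing.
calls = []
def fake_preprocess(path):
    calls.append(path)
    return b"resampled-pcm"

src = Path(tempfile.mkdtemp()) / "clip.m4a"
src.write_bytes(b"fake m4a bytes")
first = cached_preprocess(src, fake_preprocess)
second = cached_preprocess(src, fake_preprocess)
```

Note that model weights are already effectively cached after the one-time 2.88GB download; the win here is on the audio side and on retries after a failed run.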

Investigation Needed

  1. Profile transcription to identify the bottleneck
  2. Monitor resource usage during processing
  3. Test with different model sizes
  4. Check GPU availability and utilization
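For step 1, a single transcription call can be wrapped in `cProfile` to tell whether time goes to the model forward pass, audio decoding, or Python-side bookkeeping. `sorted` stands in below for the real transcribe call:

```python
import cProfile
import io
import pstats

def profile_call(fn, *args, top: int = 10) -> str:
    """Run fn(*args) under cProfile and return the top cumulative-time entries."""
    prof = cProfile.Profile()
    prof.runcall(fn, *args)
    out = io.StringIO()
    pstats.Stats(prof, stream=out).sort_stats("cumulative").print_stats(top)
    return out.getvalue()

# Stand-in workload; in the service this would be the transcribe() call.
report = profile_call(sorted, list(range(100_000))[::-1])
print(report)
```

If the top entries are all inside the model's forward pass, the fix is model size or hardware; if they are in decoding or feature extraction, chunking and caching become more attractive.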

Success Criteria

  • Identify root cause of performance degradation
  • Document trade-offs between model sizes
  • Provide configuration options for performance tuning
  • Target: <30 minutes for 19-minute audio with the medium model
