Whisper Base Model Failing on Rpi5 and Hailo-8L #369

@JankBR

System Configuration

Hardware

  • Platform: Raspberry Pi 5 Model B Rev 1.1
  • Architecture: aarch64 (ARM64)
  • Hailo Device: Hailo-8L
  • Device ID: 0001:01:00.0

Software Versions

  • OS: Debian GNU/Linux 12 (bookworm)
  • HailoRT: 4.20.0-1
  • Firmware Version: 4.20.0 (release, app, extended context switch buffer)
  • Python: 3.11
  • Repository Commit: 1dad6e4 (2025-12-10) - "add h8l to paddleocr, obb readme fix (#368)"

Installed Packages

hailort: 4.20.0-1
hailo-all: 4.20.0
hailofw: 4.20.0-1
python3-hailort: 4.20.0-1

Python Dependencies (in virtual environment)

transformers: 4.50.1
sounddevice: 0.5.1
torch: 2.6.0
scipy: 1.9.3
numpy: 1.24.2

Problem Description

The Whisper base model produces empty or inconsistent transcriptions on Hailo-8L hardware, even though the system appears to be correctly configured and operational.

Testing Performed

1. Installation & Setup

  • ✅ Ran python3 setup.py successfully
  • ✅ Downloaded all required HEF files for Hailo-8L using download_resources.py
  • ✅ Re-downloaded fresh HEF files to rule out corruption
  • ✅ All dependencies installed correctly in virtual environment

2. Hardware Verification

  • ✅ Device detected: hailortcli scan shows device 0001:01:00.0
  • ✅ Firmware identified correctly
  • ✅ HailoRT service running properly
  • ✅ Restarted HailoRT service - no improvement

3. Audio Recording Tests

  • ✅ Audio recording functional (various quality microphones tested)
  • ✅ Audio levels verified (ranging from 0.28 to 0.99 max level)
  • ✅ Audio preprocessing working (VAD detecting speech correctly)
  • ✅ Mel spectrogram generation successful

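For anyone who wants to repeat the level check offline on a saved recording, here is a minimal sketch. It assumes mono float samples in the range [-1, 1] and a 16 kHz sample rate; the rate is an assumption, since it is not stated anywhere above. `audio_stats` is a hypothetical helper name and is not part of the example application.

```python
import numpy as np


def audio_stats(samples: np.ndarray, sample_rate: int = 16000) -> dict:
    """Return duration, peak level and RMS level of a mono float signal."""
    samples = np.asarray(samples, dtype=np.float64)
    if samples.size == 0:
        return {"duration_s": 0.0, "peak": 0.0, "rms": 0.0}
    return {
        "duration_s": samples.size / sample_rate,  # sample_rate is assumed, not measured
        "peak": float(np.max(np.abs(samples))),
        "rms": float(np.sqrt(np.mean(samples ** 2))),
    }


if __name__ == "__main__":
    # Synthetic stand-in for a recording: 5 s of a 440 Hz tone at 0.5 amplitude.
    t = np.arange(5 * 16000) / 16000
    tone = 0.5 * np.sin(2 * np.pi * 440 * t)
    print(audio_stats(tone))
```

If the recordings really are 16 kHz, the 78674 samples in the log below would correspond to roughly 4.9 s, which is consistent with `--duration 5`.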
4. Model Testing

Base Model (5-second encoder)

Command: python3 -m app.app_hailo_whisper --hw-arch hailo8l --variant base --duration 5

Results:

  • Inconsistent performance - occasionally produces partial transcriptions, mostly empty
  • Example successful transcription (1 out of 10+ attempts): "testing 123"
  • Example partial transcription: "is a 5 2" (from "This is a 5 second recording")
  • Majority of attempts: Empty string '' returned from decoder

Sample Output:

Audio loaded: 78674 samples, max level: 0.9793
After preprocessing: start_time=1.2, audio length: 78674 samples
Chunk offset: 1.00s
Raw transcription: ' is a 5 2'
Cleaned transcription: 'is a 5 2.'

Then subsequent recordings:

Audio loaded: 78535 samples, max level: 0.3499
After preprocessing: start_time=1.0, audio length: 78535 samples
Chunk offset: 0.50s
Raw transcription: ''
Cleaned transcription: '.'

Tiny Model (10-second encoder)

Command: python3 -m app.app_hailo_whisper --hw-arch hailo8l --variant tiny --duration 10

Results:

  • Garbled output with unicode characters and random tokens
  • Example: '%。...............,......'
  • Example: '..... other alert hurt�... other........�..'
  • Example: '%,,, ", to [,,,, " w,," [ st, -- "告诉 ',,, a, w'

5. Configuration Variations Tested

  • ✅ VAD enabled (default)
  • ✅ VAD disabled (--no-vad flag)
  • ✅ Different chunk offsets (0.2s, 0.5s buffer before detected speech)
  • ✅ Different audio durations (5s, 10s)
  • ✅ Reuse audio mode (--reuse-audio)
  • ✅ Fresh recordings with various microphone qualities

Observed Behavior

What Works

  1. ✅ Hardware detection and initialization
  2. ✅ Audio recording and loading
  3. ✅ Voice Activity Detection (VAD)
  4. ✅ Audio preprocessing and gain adjustment
  5. ✅ Mel spectrogram generation
  6. ✅ Encoder processing (no errors)
  7. ✅ Pipeline completes without crashes

What Fails

  1. ❌ Decoder output is empty or corrupted in >90% of attempts
  2. ❌ Base model produces mostly empty strings
  3. ❌ Tiny model produces garbled unicode and random tokens
  4. ❌ Extremely inconsistent - the same audio produces different results across runs

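To put a number on the inconsistency, the same audio could be pushed through the pipeline several times and the distinct outputs counted. This is a rough sketch only: `tally_outputs` is a hypothetical helper, and `transcribe` stands for whatever callable wraps the encoder/decoder in your setup. It is not a function from the example repository.

```python
from collections import Counter
from typing import Any, Callable


def tally_outputs(transcribe: Callable[[Any], str], audio: Any, runs: int = 10) -> Counter:
    """Call transcribe(audio) `runs` times and count each distinct result."""
    return Counter(transcribe(audio) for _ in range(runs))


if __name__ == "__main__":
    import itertools

    # Stand-in transcriber that mimics the flaky behaviour described above.
    fake_results = itertools.cycle(["", "", "is a 5 2", ""])
    counts = tally_outputs(lambda _audio: next(fake_results), audio=None, runs=8)
    for text, n in counts.most_common():
        print(f"{n:2d}x {text!r}")
```

A deterministic pipeline should give a single entry in the counter for a fixed input; more than one entry would confirm that the variation comes from inference rather than from the recording.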
Evidence

Successful Transcription (happened once)

Raw transcription: ' testing 123'
Cleaned transcription: 'testing 123.'

Typical Failed Output (common)

Raw transcription: ''
Cleaned transcription: '.'

Garbled Output (tiny model)

Raw transcription: '..... other alert hurt�... other........�..'
Raw transcription: '%,,, ", to [,,,, " w,," [ st, -- "告诉 ',,, a, w'

Reproduction Steps

  1. Set up a Raspberry Pi 5 with a Hailo-8L (AI Kit)
  2. Install HailoRT 4.20.0
  3. Clone the latest Hailo-Application-Code-Examples repository
  4. Run python3 setup.py in the speech_recognition directory
  5. Run: python3 -m app.app_hailo_whisper --hw-arch hailo8l --variant base --duration 5
  6. Record clear English speech
  7. Observe empty or garbled transcription output

Any ideas?
Should I try HailoRT 4.21?

Tested with 2 mics and different volumes.
The generated .wav file is good and the voice is clear.

Thanks in advance.
