# Whisper Base Model Failing on Rpi5 and Hailo-8L #369

## System Configuration

### Hardware
- **Platform**: Raspberry Pi 5 Model B Rev 1.1
- **Architecture**: aarch64 (ARM64)
- **Hailo Device**: Hailo-8L
- **Device ID**: 0001:01:00.0

### Software Versions
- **OS**: Debian GNU/Linux 12 (bookworm)
- **HailoRT**: 4.20.0-1
- **Firmware Version**: 4.20.0 (release, app, extended context switch buffer)
- **Python**: 3.11
- **Repository Commit**: 1dad6e4 (2025-12-10) - "add h8l to paddleocr, obb readme fix (#368)"

### Installed Packages
```
hailort: 4.20.0-1
hailo-all: 4.20.0
hailofw: 4.20.0-1
python3-hailort: 4.20.0-1
```

### Python Dependencies (in virtual environment)
```
transformers: 4.50.1
sounddevice: 0.5.1
torch: 2.6.0
scipy: 1.9.3
numpy: 1.24.2
```

## Problem Description

The **Whisper base model produces empty or inconsistent transcriptions** on Hailo-8L hardware, while the system appears to be correctly configured and operational.

## Testing Performed

### 1. Installation & Setup
- ✅ Ran `python3 setup.py` successfully
- ✅ Downloaded all required HEF files for Hailo-8L using `download_resources.py`
- ✅ Re-downloaded fresh HEF files to rule out corruption
- ✅ All dependencies installed correctly in virtual environment

### 2. Hardware Verification
- ✅ Device detected: `hailortcli scan` shows device 0001:01:00.0
- ✅ Firmware identified correctly
- ✅ HailoRT service running properly
- ✅ Restarted HailoRT service - no improvement

### 3. Audio Recording Tests
- ✅ Audio recording functional (tested with microphones of varying quality)
- ✅ Audio levels verified (ranging from 0.28 to 0.99 max level)
- ✅ Audio preprocessing working (VAD detecting speech correctly)
- ✅ Mel spectrogram generation successful

### 4. Model Testing

#### Base Model (5-second encoder)
**Command**: `python3 -m app.app_hailo_whisper --hw-arch hailo8l --variant base --duration 5`

**Results**: 
- **Inconsistent performance** - occasionally produces partial transcriptions, mostly empty
- Example successful transcription (1 out of 10+ attempts): `"testing 123"`
- Example partial transcription: `"is a 5 2"` (from "This is a 5 second recording")
- **Majority of attempts**: Empty string `''` returned from decoder

**Sample Output**:
```
Audio loaded: 78674 samples, max level: 0.9793
After preprocessing: start_time=1.2, audio length: 78674 samples
Chunk offset: 1.00s
Raw transcription: ' is a 5 2'
Cleaned transcription: 'is a 5 2.'
```

Subsequent recordings then return an empty string:
```
Audio loaded: 78535 samples, max level: 0.3499
After preprocessing: start_time=1.0, audio length: 78535 samples
Chunk offset: 0.50s
Raw transcription: ''
Cleaned transcription: '.'
```
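Both recordings in the logs above contain roughly 78,600 samples. To confirm that the full 5-second window reaches the encoder, the sample count can be converted to a duration. This is a minimal sketch, assuming 16 kHz mono input (Whisper's standard rate; the logs do not state it) and float samples in [-1, 1]. `audio_stats` is a hypothetical helper, not part of the repository.

```python
import numpy as np

SAMPLE_RATE_HZ = 16_000  # assumption: Whisper's standard input rate, not stated in the logs


def audio_stats(samples, sample_rate=SAMPLE_RATE_HZ):
    """Return duration, peak, and RMS level for a mono float signal."""
    samples = np.asarray(samples, dtype=np.float32)
    if samples.size == 0:
        return {"duration_s": 0.0, "peak": 0.0, "rms": 0.0}
    return {
        "duration_s": samples.size / sample_rate,
        "peak": float(np.max(np.abs(samples))),
        "rms": float(np.sqrt(np.mean(samples ** 2))),
    }


# 78674 is the sample count from the first log excerpt; the zeros are a placeholder signal.
stats = audio_stats(np.zeros(78674))
print(f"{stats['duration_s']:.2f} s")  # prints 4.92 s
```

If the duration were far from the value passed to `--duration`, the sample-rate assumption or the recording path would be worth checking before suspecting the HEF files.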

#### Tiny Model (10-second encoder)
**Command**: `python3 -m app.app_hailo_whisper --hw-arch hailo8l --variant tiny --duration 10`

**Results**:
- **Garbled output** with unicode characters and random tokens
- Example: `'%。...............,......'`
- Example: `'..... other alert hurt�... other........�..'`
- Example: `'%,,, ", to [,,,, " w,," [ st, -- "告诉 ',,, a, w'`

### 5. Configuration Variations Tested
- ✅ VAD enabled (default)
- ✅ VAD disabled (`--no-vad` flag)
- ✅ Different chunk offsets (0.2s, 0.5s buffer before detected speech)
- ✅ Different audio durations (5s, 10s)
- ✅ Reuse audio mode (`--reuse-audio`)
- ✅ Fresh recordings with microphones of varying quality
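The variations above can be enumerated so that each combination is run and logged the same way. This is a rough sketch that uses only the flags quoted in this report. It assumes `--no-vad` and `--reuse-audio` can be combined freely, which has not been verified, and `build_commands` is a hypothetical helper.

```python
import itertools
import shlex

BASE_CMD = ["python3", "-m", "app.app_hailo_whisper", "--hw-arch", "hailo8l"]
VARIANTS = {"base": 5, "tiny": 10}  # variant -> duration used in this report
OPTIONAL_FLAGS = ["--no-vad", "--reuse-audio"]  # assumption: these can be combined


def build_commands():
    """Return one argument list per (variant, optional-flag subset) combination."""
    commands = []
    flag_choices = itertools.product([False, True], repeat=len(OPTIONAL_FLAGS))
    for (variant, duration), enabled in itertools.product(VARIANTS.items(), flag_choices):
        cmd = BASE_CMD + ["--variant", variant, "--duration", str(duration)]
        cmd += [flag for flag, on in zip(OPTIONAL_FLAGS, enabled) if on]
        commands.append(cmd)
    return commands


for cmd in build_commands():
    print(shlex.join(cmd))
```

This yields eight command lines (two variants times four flag subsets), which makes it easier to report which exact invocation produced which output.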

## Observed Behavior

### What Works
1. ✅ Hardware detection and initialization
2. ✅ Audio recording and loading
3. ✅ Voice Activity Detection (VAD)
4. ✅ Audio preprocessing and gain adjustment
5. ✅ Mel spectrogram generation
6. ✅ Encoder processing (no errors)
7. ✅ Pipeline completes without crashes

### What Fails
1. ❌ **Decoder output is empty or corrupted in >90% of attempts**
2. ❌ Base model produces mostly empty strings
3. ❌ Tiny model produces garbled unicode and random tokens
4. ❌ Extremely inconsistent - same audio produces different results
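To put a number on the failure rate rather than estimate it, each raw transcription can be sorted into a coarse category. This is a minimal sketch: the 0.7 threshold and the category names are arbitrary assumptions, and `classify` and `summarize` are hypothetical helpers. "Plausible" only means the text looks like English words, not that it matches what was spoken.

```python
from collections import Counter

REPLACEMENT_CHAR = "\ufffd"  # the U+FFFD character seen in the tiny-model output
MIN_WORDISH_RATIO = 0.7      # assumption: arbitrary cut-off, tune as needed


def classify(raw):
    """Label one raw transcription as 'empty', 'garbled', or 'plausible'."""
    text = raw.strip()
    if not text:
        return "empty"
    if REPLACEMENT_CHAR in text:
        return "garbled"
    # Share of characters that are ASCII letters, digits, or spaces.
    wordish = sum(c.isascii() and (c.isalnum() or c == " ") for c in text)
    return "plausible" if wordish / len(text) >= MIN_WORDISH_RATIO else "garbled"


def summarize(raw_outputs):
    """Return per-category counts and the share of non-plausible outputs."""
    counts = Counter(classify(r) for r in raw_outputs)
    total = sum(counts.values())
    failure_rate = 1 - counts["plausible"] / total if total else 0.0
    return counts, failure_rate


# Illustrative input built from outputs quoted in this report, not a real test run.
samples = [" testing 123", " is a 5 2", "", "", "%。...............,......"]
counts, rate = summarize(samples)
print(dict(counts), f"failure rate: {rate:.0%}")
```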

## Evidence

### Successful Transcription (happened once)
```
Raw transcription: ' testing 123'
Cleaned transcription: 'testing 123.'
```

### Typical Failed Output (common)
```
Raw transcription: ''
Cleaned transcription: '.'
```

### Garbled Output (tiny model)
```
Raw transcription: '..... other alert hurt�... other........�..'
Raw transcription: '%,,, ", to [,,,, " w,," [ st, -- "告诉 ',,, a, w'
```

## Reproduction Steps

1. Set up a Raspberry Pi 5 with Hailo-8L (AI Kit)
2. Install HailoRT 4.20.0
3. Clone the latest Hailo-Application-Code-Examples repository
4. Run `python3 setup.py` in the `speech_recognition` directory
5. Run: `python3 -m app.app_hailo_whisper --hw-arch hailo8l --variant base --duration 5`
6. Record clear English speech
7. Observe empty or garbled transcription output
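When repeating these steps many times, the console output can be saved to a file and parsed afterwards, so that each recording's level and transcription end up side by side. This is a small sketch that assumes the log lines keep exactly the format shown in the excerpts in this report; `parse_runs` is a hypothetical helper.

```python
import re

# Patterns follow the log lines quoted in this report; other formats are not handled.
AUDIO_RE = re.compile(r"^Audio loaded: (\d+) samples, max level: ([\d.]+)$")
RAW_RE = re.compile(r"^Raw transcription: '(.*)'$")


def parse_runs(log_text):
    """Group log lines into one record per 'Audio loaded' line."""
    runs = []
    current = None
    for line in log_text.splitlines():
        line = line.strip()
        audio = AUDIO_RE.match(line)
        if audio:
            current = {
                "samples": int(audio.group(1)),
                "max_level": float(audio.group(2)),
                "raw": None,  # stays None if no transcription line follows
            }
            runs.append(current)
            continue
        raw = RAW_RE.match(line)
        if raw and current is not None:
            current["raw"] = raw.group(1)
    return runs


log = """Audio loaded: 78674 samples, max level: 0.9793
Raw transcription: ' is a 5 2'
Audio loaded: 78535 samples, max level: 0.3499
Raw transcription: ''
"""
for run in parse_runs(log):
    print(run)
```

A table built this way would show whether empty results correlate with low input levels, as the two excerpts (0.9793 versus 0.3499) might suggest, or occur regardless of level.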

Any ideas? Should I try HailoRT 4.21?

Tested with two microphones at different volumes. The generated .wav file is good and the voice is clear.

Thanks in advance.
