Skip to content

feat: add Qwen3-ASR engine with timestamp support and unit tests#16

Open
phonk2682 wants to merge 4 commits intomainfrom
feature/add_qwen3_asr_engine
Open

feat: add Qwen3-ASR engine with timestamp support and unit tests#16
phonk2682 wants to merge 4 commits intomainfrom
feature/add_qwen3_asr_engine

Conversation

@phonk2682
Copy link
Copy Markdown
Collaborator

Description

Integrates the Qwen3-ASR engine into VoiceCore, supporting high-accuracy multilingual speech recognition (52 languages). This engine supports both batch and streaming transcription modes and includes word-level timestamp support via the Qwen3-ForcedAligner model.

Key changes:

  • Added the qwen3asr engine implementation in app/engines/stt/qwen3asr/.
  • Fixed an AttributeError when extracting timestamps from ForcedAlignItem (the qwen-asr library updated these from dictionaries to dataclasses).
  • Updated engines.yaml with a sample configuration for Qwen3-ASR.
  • Added a comprehensive unit test suite for the new engine, including timestamp verification for streaming.

Type of Change

  • Bug fix (Fixed attribute access for ForcedAlignItem)
  • New feature (Added word-level timestamp support for Qwen3)
  • New Engine (Qwen3-ASR STT provider)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Refactor (Cleaned up debug print statements)
  • Documentation update
  • Performance improvement

Checklist

  • I have read the CONTRIBUTING guide
  • My code follows the project's code style (make format)
  • Linting passes (make lint)
  • Tests pass (make test)
  • Documentation updated (Updated docstrings and engine implementation)
  • No sensitive information (API keys, secrets) included

Related Issues

Closes #

Testing & Verification

Automated Tests

  • Unit tests added/updated: tests/unit/engines/stt/qwen3asr/test_qwen3asr_engine.py (42 tests passed)
  • All existing tests pass: A total of 396 tests passed successfully (make test).

Manual Verification

Verified on a GPU-enabled environment:

  1. Model loaded successfully using the vLLM backend.
  2. Transcription returned accurate results (the AttributeError is resolved).
  3. Word-level timestamps are correctly extracted into Segment objects.

API Endpoints Tested

  • Batch endpoint (POST /api/v1/stt/transcribe)
  • SSE streaming (POST /api/v1/stt/transcribe/stream)

Engine-Specific Tests

  • Engine type: STT
  • Provider: Qwen3-ASR
  • Model: Qwen3-ASR-1.7B + Qwen3-ForcedAligner-0.6B
  • Device tested: cuda

Security Impact

  • No security implications
  • Security impact (please describe below)

Copilot AI review requested due to automatic review settings February 1, 2026 02:38
@phonk2682 phonk2682 requested review from minhsaco99 and removed request for Copilot February 1, 2026 02:41
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove this file


# Process with Qwen3-ASR
processing_start = time.time()
results = self._model.transcribe(
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this model not supported stream ?

Copilot AI review requested due to automatic review settings February 12, 2026 10:09
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new STT provider to VoiceCore by integrating a Qwen3-ASR engine (batch + streaming) with optional word-level timestamp extraction, plus configuration and a dedicated unit test suite.

Changes:

  • Introduces Qwen3ASREngine and Qwen3ASRConfig under app/engines/stt/qwen3asr/.
  • Adds a qwen3-asr uv dependency group and a sample engines.yaml entry.
  • Adds unit tests covering config validation, lifecycle, batch transcription, streaming, and metrics.

Reviewed changes

Copilot reviewed 6 out of 8 changed files in this pull request and generated 11 comments.

Show a summary per file
File Description
app/engines/stt/qwen3asr/engine.py New Qwen3-ASR STT engine implementation (batch + streaming, metrics, optional timestamps).
app/engines/stt/qwen3asr/config.py New Pydantic config model for Qwen3-ASR/vLLM and streaming parameters.
app/engines/stt/qwen3asr/__init__.py Exports the new engine and config.
engines.yaml Adds a sample Qwen3-ASR engine configuration entry.
pyproject.toml Adds a qwen3-asr dependency group for qwen-asr[vllm].
tests/unit/engines/stt/qwen3asr/test_qwen3asr_engine.py New unit tests for Qwen3-ASR engine behavior and timestamps.
tests/unit/engines/stt/qwen3asr/__init__.py Test package marker for the new unit test directory.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

),
time_to_first_token_ms=time_to_first_token_ms,
total_stream_duration_ms=total_duration_ms,
total_chunks=int(len(audio_array) / chunk_size_samples) + 1,
Copy link

Copilot AI Feb 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

total_chunks is computed as int(len(audio_array) / chunk_size_samples) + 1, which is off by one when len(audio_array) is an exact multiple of chunk_size_samples. Track the actual number of yielded STTChunks (or use a ceiling division) to avoid incorrect metrics.

Suggested change
total_chunks=int(len(audio_array) / chunk_size_samples) + 1,
total_chunks=(len(audio_array) + chunk_size_samples - 1) // chunk_size_samples,

Copilot uses AI. Check for mistakes.
Comment on lines +294 to +299
yield STTResponse(
text=final_text,
language=detected_language,
segments=None, # Streaming doesn't return full segments list usually
performance_metrics=metrics,
)
Copy link

Copilot AI Feb 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The final streaming STTResponse hard-codes segments=None, but the PR description and the new unit tests expect word-level timestamps in streaming mode. Either populate segments when forced_aligner/timestamps are enabled, or update the tests/docs to reflect that streaming does not return segments.

Copilot uses AI. Check for mistakes.
from unittest.mock import MagicMock, patch

import numpy as np
import pytest
Copy link

Copilot AI Feb 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These unit tests patch qwen_asr.Qwen3ASRModel, which will raise ModuleNotFoundError if the optional qwen-asr dependency group isn’t installed (e.g., after uv sync --dev). Add a module-level pytest.importorskip("qwen_asr", ...) (or similar guarding) so the overall test suite can run without optional deps.

Suggested change
import pytest
import pytest
pytest.importorskip("qwen_asr", reason="qwen-asr package is not installed")

Copilot uses AI. Check for mistakes.
Comment on lines +711 to +751
async def test_transcribe_stream_returns_segments_with_timestamps(
self, qwen3asr_config, mock_qwen3asr_model
):
"""Should extract and return segments in streaming mode"""
from app.engines.stt.qwen3asr.engine import Qwen3ASREngine

engine = Qwen3ASREngine(qwen3asr_config)

mock_result = MagicMock()
mock_result.text = "Hello world"
mock_result.language = "English"

# Mock qwen-asr ForcedAlignItem mocks
item1 = MagicMock()
item1.text = "Hello"
item1.start_time = 0.0
item1.end_time = 0.5

item2 = MagicMock()
item2.text = "world"
item2.start_time = 0.5
item2.end_time = 1.0

mock_result.time_stamps = [item1, item2]
mock_qwen3asr_model.transcribe.return_value = [mock_result]

with patch.object(engine, "_audio_processor") as mock_processor:
mock_processor.to_numpy.return_value = (np.array([0.1, 0.2]), 16000)
mock_processor.get_duration_ms.return_value = 1000.0

chunks = []
async for item in engine.transcribe_stream(np.array([0.1, 0.2])):
chunks.append(item)

final_response = chunks[-1]
assert final_response.segments is not None
assert len(final_response.segments) == 2
assert final_response.segments[0].text == "Hello"
assert final_response.segments[0].start == 0.0
assert final_response.segments[0].end == 0.5

Copy link

Copilot AI Feb 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Qwen3ASREngine.transcribe_stream() never calls self._model.transcribe() and the implementation currently yields a final STTResponse with segments=None. This test sets up mock_qwen3asr_model.transcribe.return_value and then asserts final_response.segments contains timestamps, which will not be true with the current streaming implementation. Update the test to reflect actual streaming behavior, or extend the engine to extract/return segments in streaming mode.

Copilot uses AI. Check for mistakes.
)
except ImportError as e:
raise EngineNotReadyError(
"qwen-asr package not installed. Run: pip install qwen-asr[vllm]"
Copy link

Copilot AI Feb 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ImportError message suggests pip install qwen-asr[vllm], but this repo’s install flow uses uv dependency groups (see engines.yaml / pyproject.toml). Consider updating the message to also mention uv sync --group qwen3-asr so users follow the project’s standard setup path.

Suggested change
"qwen-asr package not installed. Run: pip install qwen-asr[vllm]"
"qwen-asr package not installed. Install via: uv sync --group qwen3-asr (recommended) or pip install qwen-asr[vllm]"

Copilot uses AI. Check for mistakes.
Comment on lines +70 to +72
async def transcribe(
self, audio_data: AudioInput, language: str | None = None, **kwargs
) -> STTResponse:
Copy link

Copilot AI Feb 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

**kwargs is accepted but never used. If callers pass engine-specific params via engine_params, they’ll be silently ignored. Consider forwarding supported kwargs into self._model.transcribe(...) (and/or streaming calls) or remove **kwargs to avoid a misleading API.

Copilot uses AI. Check for mistakes.
Comment on lines +248 to +251
# Capture first token time
if first_token_time is None:
first_token_time = time.time()

Copy link

Copilot AI Feb 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In streaming mode, first_token_time is set on the first processed chunk regardless of whether any text was produced. This makes time_to_first_token_ms inaccurate when initial chunks yield empty text; align with the Whisper engine behavior by setting this timestamp only when state.text first becomes non-empty (or changes from empty).

Copilot uses AI. Check for mistakes.
Comment on lines +256 to +262
chunk_latency_ms = (time.time() - processing_start) * 1000
yield STTChunk(
text=current_text,
timestamp=float(pos) / sample_rate,
confidence=None,
chunk_latency_ms=chunk_latency_ms,
)
Copy link

Copilot AI Feb 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

chunk_latency_ms is computed from processing_start, so it grows cumulatively over the stream. The STTChunk.chunk_latency_ms field is described as per-chunk latency; measure just the time spent processing the current chunk (similar to the Whisper streaming implementation).

Copilot uses AI. Check for mistakes.
Comment on lines +16 to +18
qwen3-asr:
enabled: true
engine_class: "app.engines.stt.qwen3asr.engine.Qwen3ASREngine"
Copy link

Copilot AI Feb 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This config enables qwen3-asr by default. Since qwen-asr is in an optional dependency group, startup will log an initialization error (and the engine won’t register) unless users install the extra group. Consider setting enabled: false by default and leaving this as a sample config users can opt into.

Copilot uses AI. Check for mistakes.
Comment on lines +44 to +46
gt=0,
description="Maximum batch size for inference. -1 for unlimited.",
)
Copy link

Copilot AI Feb 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

max_inference_batch_size is documented as supporting -1 for unlimited, but the field is validated with gt=0, which rejects -1. Either allow -1 in validation (e.g., ge=-1 with a custom validator) or adjust the description/tests to match the enforced constraint.

Suggested change
gt=0,
description="Maximum batch size for inference. -1 for unlimited.",
)
ge=-1,
description="Maximum batch size for inference. -1 for unlimited.",
)
@field_validator("max_inference_batch_size")
@classmethod
def validate_max_inference_batch_size(cls, v: int) -> int:
"""
Ensure max_inference_batch_size is either -1 (for unlimited) or a positive integer.
"""
if v == -1 or v > 0:
return v
raise ValueError(
"max_inference_batch_size must be -1 (for unlimited) or a positive integer"
)

Copilot uses AI. Check for mistakes.
@phonk2682 phonk2682 requested a review from minhsaco99 February 13, 2026 09:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants