feat(plugins): add Parakeet TDT speech-to-text plugin #281
Add a new native plugin for fast English speech recognition using NVIDIA's Parakeet TDT (Token-and-Duration Transducer) 0.6B model via sherpa-onnx. Parakeet TDT is approximately 10x faster than Whisper on consumer hardware with competitive accuracy (#1 on HuggingFace ASR leaderboard).

Plugin implementation:
- Offline transducer recognizer (encoder/decoder/joiner) via sherpa-onnx C API
- Silero VAD v6 for streaming speech segmentation
- Recognizer caching keyed on (model_dir, num_threads, execution_provider)
- Configurable VAD threshold, silence duration, and max segment length
- 16kHz mono f32 audio input, transcription output

Justfile additions:
- build-plugin-native-parakeet: build the plugin
- download-parakeet-models: download INT8 quantized model (~660MB)
- setup-parakeet: full setup (sherpa-onnx + models + VAD)
- Added parakeet to copy-plugins-native loop

Includes sample oneshot pipeline (parakeet-stt.yml) and plugin.yml manifest.

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>
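The recognizer caching mentioned in the commit message could look roughly like the sketch below. It is only an illustration of keying a shared cache on (model_dir, num_threads, execution_provider): the RecognizerKey and Recognizer types and the get_or_create function are hypothetical stand-ins, not the plugin's actual code, and the field types are assumptions.

```rust
use std::collections::HashMap;
use std::sync::{Arc, LazyLock, Mutex};

/// Cache key: the three settings that decide which recognizer gets built.
/// (Field types are assumptions made for this sketch.)
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct RecognizerKey {
    pub model_dir: String,
    pub num_threads: i32,
    pub execution_provider: String,
}

/// Hypothetical stand-in for the real sherpa-onnx recognizer handle.
#[derive(Debug)]
pub struct Recognizer {
    pub key: RecognizerKey,
}

/// Process-wide cache shared by every plugin instance.
static RECOGNIZERS: LazyLock<Mutex<HashMap<RecognizerKey, Arc<Recognizer>>>> =
    LazyLock::new(|| Mutex::new(HashMap::new()));

/// Returns the cached recognizer for `key`, building it on first use.
pub fn get_or_create(key: RecognizerKey) -> Arc<Recognizer> {
    let mut cache = RECOGNIZERS.lock().expect("recognizer cache poisoned");
    Arc::clone(
        cache
            .entry(key.clone())
            // The expensive model load would happen here, once per distinct key.
            .or_insert_with(|| Arc::new(Recognizer { key })),
    )
}
```

Two calls with the same key return the same Arc, so several pipeline nodes pointing at one model directory would share a single loaded model; changing any of the three settings yields a separate entry.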
🚩 download-models target doesn't include download-parakeet-models
The download-models target at justfile line 822 doesn't include download-parakeet-models. Looking at the pattern, NLLB and pocket-tts are also excluded (NLLB for licensing, pocket-tts for gating). Parakeet models are CC-BY-4.0 (not restrictive), so unlike NLLB they could be included. However, at ~660MB they are significantly larger than most other models. This may be an intentional omission to keep default download size manageable, or it may be an oversight. Worth confirming with the author.
Good catch. At ~660MB the Parakeet model is substantially larger than most other models, so I've kept it out of the default download-models target (similar to pocket-tts) and added it as an optional suggestion in the output message. Fixed in 9f03df7.
- Add build-plugin-native-parakeet to build-plugins-native target
- Fix plugin.yml repo_id to match actual HuggingFace source repos (csukuangfj/sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8 for model, streamkit/sensevoice-models for silero-vad)
- Regenerate marketplace/official-plugins.json with parakeet entry
- Add download-parakeet-models as optional in download-models output (skipped by default due to ~660MB size, similar to pocket-tts)

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>
- Add plugin docs page (plugin-native-parakeet.md) with parameters, example pipeline, and JSON schema
- Update plugin index to include parakeet (10 → 11 official plugins)
- Fix model download: individual files from HuggingFace instead of non-existent tar.bz2 archive
- Add per-file sha256 checksums via file_checksums field (matching ModelSpec struct) for integrity verification
- Fix expected_size_bytes to actual total (661190513)
- Regenerate marketplace/official-plugins.json

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>
🚩 PORTABILITY_REVIEW.md referenced in AGENTS.md but deleted from repo
The AGENTS.md plugin checklist requires updating marketplace/PORTABILITY_REVIEW.md, but this file was deleted in a prior commit (9548c98) and doesn't exist on main. Since it's a stale reference in AGENTS.md rather than a file this PR should have modified, this isn't flagged as a bug against this PR, but the AGENTS.md instruction should be updated or the file recreated.
Acknowledged — as noted, marketplace/PORTABILITY_REVIEW.md doesn't exist in the repo yet and no other plugin has created it either. This is aspirational documentation from the AGENTS.md checklist. Not actionable for this PR.
The content-type backward walk in run_oneshot_pipeline walks backwards through the pipeline graph to find a node that declares a content_type. When no node in the chain returns a content_type (e.g. STT pipelines ending in json_serialize), the walk reaches streamkit::http_input which is a synthetic node not in the registry, causing a 500 error.

Skip synthetic oneshot nodes (http_input/http_output) in the backward walk since they are handled separately by the engine and are not registered in the node registry.

This fixes all STT-style oneshot pipelines (parakeet-stt, sensevoice-stt, speech_to_text, etc.) that use json_serialize → http_output.

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>
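To make the failure mode concrete, here is a minimal sketch of a backward walk that stops at synthetic nodes instead of looking them up. It assumes a simplified pipeline where every node has at most one upstream node; the Node struct, the find_content_type function, and the shape of the registry map are all hypothetical and do not mirror the engine's real types.

```rust
use std::collections::HashMap;

/// Kinds of the synthetic oneshot nodes named in the commit message.
const SYNTHETIC_KINDS: [&str; 2] = ["streamkit::http_input", "streamkit::http_output"];

/// Hypothetical, simplified pipeline node: its kind and its single upstream node.
pub struct Node {
    pub kind: String,
    pub upstream: Option<String>,
}

/// Walks upstream from `start` and returns the first declared content type.
///
/// `registry` maps a node kind to its declared content type, if any.
/// Returns `Ok(None)` when the walk ends without finding one, and `Err` only
/// for a non-synthetic kind that is missing from the registry.
pub fn find_content_type(
    nodes: &HashMap<String, Node>,
    registry: &HashMap<String, Option<String>>,
    start: &str,
) -> Result<Option<String>, String> {
    let mut current = Some(start.to_string());
    // Bounding the loop by the node count guards against cycles.
    for _ in 0..nodes.len() {
        let Some(name) = current.take() else { break };
        let Some(node) = nodes.get(&name) else { break };
        // Synthetic nodes are not registered, so end the walk before the lookup.
        if SYNTHETIC_KINDS.contains(&node.kind.as_str()) {
            break;
        }
        match registry.get(&node.kind) {
            Some(Some(content_type)) => return Ok(Some(content_type.clone())),
            Some(None) => current = node.upstream.clone(),
            None => return Err(format!("unknown node kind: {}", node.kind)),
        }
    }
    Ok(None)
}
```

In this sketch a pipeline whose nodes declare no content type yields Ok(None), which a caller could map to a default, while a genuinely unknown kind still surfaces as an error.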
Signed-off-by: StreamKit Devin <devin@streamkit.dev> Co-Authored-By: Claudio Costa <cstcld91@gmail.com>
Signed-off-by: StreamKit Devin <devin@streamkit.dev> Co-Authored-By: Claudio Costa <cstcld91@gmail.com>
/// Field order must match sherpa-onnx v1.12.17 c-api.h exactly.
#[repr(C)]
pub struct SherpaOnnxOfflineModelConfig {
    pub transducer: SherpaOnnxOfflineTransducerModelConfig,
    pub paraformer: SherpaOnnxOfflineParaformerModelConfig,
    pub nemo_ctc: SherpaOnnxOfflineNemoEncDecCtcModelConfig,
    pub whisper: SherpaOnnxOfflineWhisperModelConfig,
    pub tdnn: SherpaOnnxOfflineTdnnModelConfig,
    pub tokens: *const c_char,
    pub num_threads: c_int,
    pub debug: c_int,
    pub provider: *const c_char,
    pub model_type: *const c_char,
    pub modeling_unit: *const c_char,
    pub bpe_vocab: *const c_char,
    pub telespeech_ctc: *const c_char,
    pub sense_voice: SherpaOnnxOfflineSenseVoiceModelConfig,
    pub moonshine: SherpaOnnxOfflineMoonshineModelConfig,
    pub fire_red_asr: SherpaOnnxOfflineFireRedAsrModelConfig,
    pub dolphin: SherpaOnnxOfflineDolphinModelConfig,
    pub zipformer_ctc: SherpaOnnxOfflineZipformerCtcModelConfig,
    pub canary: SherpaOnnxOfflineCanaryModelConfig,
    pub wenet_ctc: SherpaOnnxOfflineWenetCtcModelConfig,
    pub omnilingual: SherpaOnnxOfflineOmnilingualAsrCtcModelConfig,
}
🚩 FFI struct layout must exactly match sherpa-onnx c-api.h for the pinned version
The FFI bindings in ffi.rs declare #[repr(C)] structs whose field order must match sherpa-onnx v1.12.17's c-api.h exactly. Any mismatch (added/removed/reordered fields in a newer sherpa-onnx version) would cause silent memory corruption at the ABI boundary. The code pins to a specific version in comments but the actual sherpa-onnx version is determined by whatever libsherpa-onnx-c-api.so is installed via just install-sherpa-onnx. If that installation script is updated to a newer sherpa-onnx version, these bindings may need updating. This matches the approach in other sherpa-onnx-based plugins in the repo.
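One way to reduce the risk of silent layout drift is a compile-time check on field offsets. The sketch below uses a deliberately tiny, hypothetical DemoModelConfig rather than the real sherpa-onnx struct, and it recomputes the expected offsets from C layout rules; in a real build the expected values would have to come from the pinned c-api.h (for example via a generated binding or a small C program), which is an assumption this sketch does not implement.

```rust
use std::ffi::{c_char, c_int};
use std::mem::{align_of, offset_of, size_of};

/// Hypothetical miniature config used only for illustration.
#[repr(C)]
pub struct DemoModelConfig {
    pub tokens: *const c_char,
    pub num_threads: c_int,
    pub debug: c_int,
    pub provider: *const c_char,
}

/// Rounds `offset` up to the next multiple of `align` (a power of two).
const fn align_up(offset: usize, align: usize) -> usize {
    (offset + align - 1) & !(align - 1)
}

const PTR_SIZE: usize = size_of::<*const c_char>();
const PTR_ALIGN: usize = align_of::<*const c_char>();
const INT_SIZE: usize = size_of::<c_int>();
const INT_ALIGN: usize = align_of::<c_int>();

const NUM_THREADS_OFFSET: usize = align_up(PTR_SIZE, INT_ALIGN);

/// Offsets a C compiler would assign to the same fields in the same order.
pub const EXPECTED_OFFSETS: [usize; 4] = [
    0,
    NUM_THREADS_OFFSET,
    NUM_THREADS_OFFSET + INT_SIZE,
    align_up(NUM_THREADS_OFFSET + 2 * INT_SIZE, PTR_ALIGN),
];

/// Offsets the Rust compiler actually assigned.
pub const ACTUAL_OFFSETS: [usize; 4] = [
    offset_of!(DemoModelConfig, tokens),
    offset_of!(DemoModelConfig, num_threads),
    offset_of!(DemoModelConfig, debug),
    offset_of!(DemoModelConfig, provider),
];

// The build fails here if a field is reordered or the struct size changes.
const _: () = {
    let mut i = 0;
    while i < 4 {
        assert!(ACTUAL_OFFSETS[i] == EXPECTED_OFFSETS[i]);
        i += 1;
    }
    assert!(size_of::<DemoModelConfig>() == align_up(EXPECTED_OFFSETS[3] + PTR_SIZE, PTR_ALIGN));
};
```

A check like this only catches drift on the Rust side of the declaration; it cannot detect that the installed libsherpa-onnx-c-api.so was built from a different header, so pinning the installed version remains necessary.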
Acknowledged — this is the same approach used by the sensevoice plugin. The struct layout has been verified against sherpa-onnx v1.12.17's c-api.h and tested end-to-end (recognizer creation + transcription both succeed). The version coupling to just install-sherpa-onnx is noted in the PR review checklist for the human reviewer.
Update the parakeet plugin.yml to point to the controlled streamkit/parakeet-models HuggingFace repo instead of the external csukuangfj repo. Regenerate marketplace metadata accordingly.

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>
Point the justfile download target and README references to streamkit/parakeet-models instead of the external csukuangfj repo. Original export attribution preserved in README.

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>
};

// Calculate silence threshold in frames (each frame is 32ms)
let silence_threshold_frames = (config.min_silence_duration_ms / 32) as usize;
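For context on how a frame-based silence threshold is typically consumed, the sketch below closes a speech segment after a run of silent frames or once a maximum length is reached. The Segmenter type and its methods are hypothetical and the per-frame is_speech flag stands in for the Silero VAD decision; only the 32 ms frame length and the 16 kHz sample rate come from the PR, which together give 512 samples per frame.

```rust
/// Samples in one 32 ms frame of 16 kHz audio: 16_000 * 32 / 1000 = 512.
pub const FRAME_SAMPLES: usize = 512;

/// Hypothetical segmenter: buffers speech frames and closes a segment after
/// enough consecutive silent frames or when the segment reaches its max length.
pub struct Segmenter {
    silence_threshold_frames: usize,
    max_segment_frames: usize,
    silent_run: usize,
    speech_frames: usize,
    segment: Vec<f32>,
}

impl Segmenter {
    pub fn new(silence_threshold_frames: usize, max_segment_frames: usize) -> Self {
        Self {
            silence_threshold_frames,
            max_segment_frames,
            silent_run: 0,
            speech_frames: 0,
            segment: Vec::new(),
        }
    }

    /// Feeds one frame plus the VAD decision for it.
    /// Returns the finished segment when one closes, otherwise `None`.
    pub fn push(&mut self, frame: &[f32], is_speech: bool) -> Option<Vec<f32>> {
        if is_speech {
            self.silent_run = 0;
            self.speech_frames += 1;
            self.segment.extend_from_slice(frame);
            if self.speech_frames >= self.max_segment_frames {
                return Some(self.flush());
            }
        } else if self.speech_frames > 0 {
            // Silence only counts once a segment has started.
            self.silent_run += 1;
            if self.silent_run >= self.silence_threshold_frames {
                return Some(self.flush());
            }
        }
        None
    }

    /// Hands out the buffered audio and resets all counters.
    fn flush(&mut self) -> Vec<f32> {
        self.silent_run = 0;
        self.speech_frames = 0;
        std::mem::take(&mut self.segment)
    }
}
```

Because the threshold is compared against a whole number of frames, any rounding applied when converting milliseconds to frames directly changes how soon a segment is handed to the recognizer.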
🟡 Integer division truncates silence threshold, making it shorter than configured
The silence threshold in frames is computed via config.min_silence_duration_ms / 32, which uses integer (floor) division. For the default min_silence_duration_ms = 700, the result is 700 / 32 = 21 frames × 32ms = 672ms — 28ms shorter than configured. For the minimum allowed value of 100ms, the result is 100 / 32 = 3 frames = 96ms. The truncation error can be up to 31ms, causing the plugin to trigger transcription slightly earlier than the user specified. Ceiling division should be used to ensure the actual silence duration is at least the configured value.
- let silence_threshold_frames = (config.min_silence_duration_ms / 32) as usize;
+ let silence_threshold_frames = ((config.min_silence_duration_ms + 31) / 32) as usize;
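The difference between the two roundings can be checked with a small standalone sketch. The function names are hypothetical, and the u64 parameter type is an assumption since the PR does not show the field's declared type; div_ceil is the standard library's ceiling division for unsigned integers and is equivalent to the (x + 31) / 32 form suggested above, barring overflow.

```rust
/// Duration of one VAD frame in milliseconds, as stated in the code comment.
const FRAME_MS: u64 = 32;

/// Floor division: the effective silence can be up to 31 ms shorter than configured.
pub fn silence_frames_floor(min_silence_duration_ms: u64) -> usize {
    (min_silence_duration_ms / FRAME_MS) as usize
}

/// Ceiling division: the effective silence is never shorter than configured.
pub fn silence_frames_ceil(min_silence_duration_ms: u64) -> usize {
    min_silence_duration_ms.div_ceil(FRAME_MS) as usize
}
```

For the default of 700 ms, the floor variant gives 21 frames (672 ms) and the ceiling variant gives 22 frames (704 ms); for exact multiples of 32 ms the two agree.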
- Point silero-vad repo_id to streamkit/parakeet-models instead of streamkit/sensevoice-models to avoid cross-plugin dependency
- Remove unused cc build-dependency
- Remove unused once_cell dependency (code uses std::sync::LazyLock)
- Fix misleading update_params comment that claimed VAD params could be updated at runtime
- Remove const from set_threshold (f32::clamp is not const-stable)

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>
Summary
Adds a new Parakeet TDT native plugin for fast English speech-to-text using NVIDIA's Parakeet TDT 0.6B model via sherpa-onnx. Approximately 10x faster than Whisper on consumer CPU hardware with competitive accuracy.
What's included
- plugins/native/parakeet/ (FFI bindings, VAD-based segmentation, recognizer caching)
- plugin.yml with model checksums pointing to the streamkit/parakeet-models HF repo
- Justfile targets: build-plugin-native-parakeet, download-parakeet-models, setup-parakeet, upload-parakeet-plugin
- Sample pipeline: samples/pipelines/oneshot/parakeet-stt.yml

Engine bugfix (separate concern, same PR)
Fixed a pre-existing bug in crates/engine/src/oneshot.rs where the content-type backward walk crashed when reaching synthetic nodes (streamkit::http_input / streamkit::http_output), which aren't in the registry. This affected ALL STT-style oneshot pipelines (parakeet, sensevoice, whisper). The fix skips synthetic nodes in the walk.

End-to-end tested
- sample.ogg: correct transcription
- speech_2m.opus: 30 natural VAD segments with accurate text

Review & Testing Checklist for Human
- ffi.rs matches sherpa-onnx v1.12.17 C headers (transducer encoder/decoder/joiner field offsets)
- just setup-parakeet && just skit, then test with curl -X POST -H "Content-Type: audio/ogg" --data-binary @sample.ogg http://localhost:4545/api/v1/oneshot/parakeet-stt
- streamkit/parakeet-models HF repo: just download-parakeet-models
- oneshot.rs: the synthetic node skip is intentionally conservative (breaks on first synthetic node)
- download-parakeet-models is intentionally excluded from the download-models umbrella target (models are ~660MB)

Notes
- The Skit / Lint CI failure is a pre-existing cargo deny wasmtime vulnerability, unrelated to this PR
- Models: streamkit/parakeet-models on HuggingFace (CC-BY-4.0 licensed)
- vad.rs is duplicated from sensevoice; it could be extracted to a shared crate in a follow-up

Link to Devin session: https://staging.itsdev.in/sessions/6d151ddf78024d91ade096e7b8f4d9ce
Requested by: @streamer45