Commit 304c0b9

staging-devin-ai-integration[bot], streamkit-devin, and streamer45 authored
feat(plugins): add Parakeet TDT speech-to-text plugin (#281)
* feat(plugins): add Parakeet TDT speech-to-text plugin

  Add a new native plugin for fast English speech recognition using NVIDIA's Parakeet TDT (Token-and-Duration Transducer) 0.6B model via sherpa-onnx. Parakeet TDT is approximately 10x faster than Whisper on consumer hardware with competitive accuracy (#1 on HuggingFace ASR leaderboard).

  Plugin implementation:
  - Offline transducer recognizer (encoder/decoder/joiner) via sherpa-onnx C API
  - Silero VAD v6 for streaming speech segmentation
  - Recognizer caching keyed on (model_dir, num_threads, execution_provider)
  - Configurable VAD threshold, silence duration, and max segment length
  - 16kHz mono f32 audio input, transcription output

  Justfile additions:
  - build-plugin-native-parakeet: build the plugin
  - download-parakeet-models: download INT8 quantized model (~660MB)
  - setup-parakeet: full setup (sherpa-onnx + models + VAD)
  - Added parakeet to copy-plugins-native loop

  Includes sample oneshot pipeline (parakeet-stt.yml) and plugin.yml manifest.

  Signed-off-by: StreamKit Devin <devin@streamkit.dev>
  Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(plugins): address review feedback for parakeet plugin

  - Add build-plugin-native-parakeet to build-plugins-native target
  - Fix plugin.yml repo_id to match actual HuggingFace source repos (csukuangfj/sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8 for model, streamkit/sensevoice-models for silero-vad)
  - Regenerate marketplace/official-plugins.json with parakeet entry
  - Add download-parakeet-models as optional in download-models output (skipped by default due to ~660MB size, similar to pocket-tts)

  Signed-off-by: StreamKit Devin <devin@streamkit.dev>
  Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* docs(plugins): add parakeet docs page, fix model checksums and download

  - Add plugin docs page (plugin-native-parakeet.md) with parameters, example pipeline, and JSON schema
  - Update plugin index to include parakeet (10 → 11 official plugins)
  - Fix model download: individual files from HuggingFace instead of non-existent tar.bz2 archive
  - Add per-file sha256 checksums via file_checksums field (matching ModelSpec struct) for integrity verification
  - Fix expected_size_bytes to actual total (661190513)
  - Regenerate marketplace/official-plugins.json

  Signed-off-by: StreamKit Devin <devin@streamkit.dev>
  Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(engine): skip synthetic nodes in oneshot content-type backward walk

  The content-type backward walk in run_oneshot_pipeline walks backwards through the pipeline graph to find a node that declares a content_type. When no node in the chain returns a content_type (e.g. STT pipelines ending in json_serialize), the walk reaches streamkit::http_input which is a synthetic node not in the registry, causing a 500 error.

  Skip synthetic oneshot nodes (http_input/http_output) in the backward walk since they are handled separately by the engine and are not registered in the node registry.

  This fixes all STT-style oneshot pipelines (parakeet-stt, sensevoice-stt, speech_to_text, etc.) that use json_serialize → http_output.

  Signed-off-by: StreamKit Devin <devin@streamkit.dev>
  Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* style(engine): format oneshot backward walk

  Signed-off-by: StreamKit Devin <devin@streamkit.dev>
  Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* docs(plugins): add README for parakeet plugin

  Signed-off-by: StreamKit Devin <devin@streamkit.dev>
  Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(plugins): update parakeet model repo_id to streamkit/parakeet-models

  Update the parakeet plugin.yml to point to the controlled streamkit/parakeet-models HuggingFace repo instead of the external csukuangfj repo. Regenerate marketplace metadata accordingly.

  Signed-off-by: StreamKit Devin <devin@streamkit.dev>
  Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(plugins): update parakeet model download URL to streamkit HF space

  Point the justfile download target and README references to streamkit/parakeet-models instead of the external csukuangfj repo. Original export attribution preserved in README.

  Signed-off-by: StreamKit Devin <devin@streamkit.dev>
  Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(plugins): address parakeet review feedback

  - Point silero-vad repo_id to streamkit/parakeet-models instead of streamkit/sensevoice-models to avoid cross-plugin dependency
  - Remove unused cc build-dependency
  - Remove unused once_cell dependency (code uses std::sync::LazyLock)
  - Fix misleading update_params comment that claimed VAD params could be updated at runtime
  - Remove const from set_threshold (f32::clamp is not const-stable)

  Signed-off-by: StreamKit Devin <devin@streamkit.dev>
  Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

---------

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-authored-by: StreamKit Devin <devin@streamkit.dev>
Co-authored-by: Claudio Costa <cstcld91@gmail.com>
1 parent 9f2742e commit 304c0b9
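
The commit message mentions recognizer caching keyed on (model_dir, num_threads, execution_provider) and notes that the code uses std::sync::LazyLock. The plugin source is not shown on this page, so the following is only a minimal sketch of how such a cache could look. `RecognizerKey`, `Recognizer`, `RECOGNIZERS`, and `get_or_create` are hypothetical names, and `Recognizer` is a stand-in for the real sherpa-onnx handle.

```rust
use std::collections::HashMap;
use std::sync::{Arc, LazyLock, Mutex};

/// Cache key taken from the commit message:
/// (model_dir, num_threads, execution_provider).
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct RecognizerKey {
    pub model_dir: String,
    pub num_threads: u32,
    pub execution_provider: String,
}

/// Hypothetical stand-in for the real recognizer handle.
#[derive(Debug)]
pub struct Recognizer {
    pub key: RecognizerKey,
}

/// Process-wide cache, initialised lazily on first access.
pub static RECOGNIZERS: LazyLock<Mutex<HashMap<RecognizerKey, Arc<Recognizer>>>> =
    LazyLock::new(|| Mutex::new(HashMap::new()));

/// Return the cached recognizer for `key`, building one only on a cache miss.
pub fn get_or_create(key: RecognizerKey) -> Arc<Recognizer> {
    let mut cache = RECOGNIZERS.lock().expect("recognizer cache poisoned");
    cache
        .entry(key.clone())
        .or_insert_with(|| Arc::new(Recognizer { key }))
        .clone()
}
```

Under these assumptions, two nodes configured with the same model directory, thread count, and execution provider share one recognizer, while changing any of the three fields produces a separate instance.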

16 files changed

+3012
-3
lines changed

crates/engine/src/oneshot.rs

Lines changed: 5 additions & 0 deletions
@@ -384,6 +384,11 @@ impl Engine {
         for _ in 0..max_steps {
             steps += 1;
             if let Some(def) = definition.nodes.get(cursor) {
+                // Skip synthetic oneshot nodes — they are not in the
+                // registry and are handled separately by the engine.
+                if def.kind == "streamkit::http_input" || def.kind == "streamkit::http_output" {
+                    break;
+                }
                 let temp = registry.create_node(&def.kind, def.params.as_ref())?;
                 if let Some(ct) = temp.content_type() {
                     found = Some(ct);
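
The hunk above only shows a few lines of the backward walk, so here is a self-contained sketch of the idea as the commit message describes it: walk upstream until some node declares a content type, and stop at a synthetic node before asking the registry about it. `NodeDef`, `find_content_type`, the `upstream` field, and the map-based registry are all hypothetical simplifications, not the engine's actual types.

```rust
use std::collections::HashMap;

/// Simplified node definition (hypothetical; the engine's type has more fields).
pub struct NodeDef {
    pub kind: String,
    /// Id of the upstream node, if any. The real engine derives this from the graph.
    pub upstream: Option<String>,
}

/// Synthetic oneshot nodes that are not registered in the node registry.
pub fn is_synthetic(kind: &str) -> bool {
    kind == "streamkit::http_input" || kind == "streamkit::http_output"
}

/// Walk upstream from `start` until a node declares a content type.
/// `registry` maps a node kind to its optional content type; looking up an
/// unknown kind is an error, which mirrors the failure described in the
/// commit message when the walk reached `streamkit::http_input`.
pub fn find_content_type(
    nodes: &HashMap<String, NodeDef>,
    registry: &HashMap<String, Option<String>>,
    start: &str,
    max_steps: usize,
) -> Result<Option<String>, String> {
    let mut cursor = start.to_string();
    for _ in 0..max_steps {
        let Some(def) = nodes.get(&cursor) else { break };
        if is_synthetic(&def.kind) {
            break; // stop before the registry lookup
        }
        let content_type = registry
            .get(&def.kind)
            .ok_or_else(|| format!("unknown node kind: {}", def.kind))?;
        if let Some(ct) = content_type {
            return Ok(Some(ct.clone()));
        }
        match &def.upstream {
            Some(next) => cursor = next.clone(),
            None => break,
        }
    }
    Ok(None)
}
```

In this sketch a chain such as json_serialize → parakeet → http_input returns `Ok(None)` instead of an error, because the walk stops at the synthetic input node rather than looking it up.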

docs/src/content/docs/reference/plugins/index.md

Lines changed: 2 additions & 1 deletion
@@ -14,12 +14,13 @@ curl http://localhost:4545/api/v1/plugins
 curl http://localhost:4545/api/v1/schema/nodes | jq '.[] | select(.kind | startswith("plugin::"))'
 ```
 
-## Official plugins (10)
+## Official plugins (11)
 
 - [`plugin::native::helsinki`](./plugin-native-helsinki/) (original kind: `helsinki`)
 - [`plugin::native::kokoro`](./plugin-native-kokoro/) (original kind: `kokoro`)
 - [`plugin::native::matcha`](./plugin-native-matcha/) (original kind: `matcha`)
 - [`plugin::native::nllb`](./plugin-native-nllb/) (original kind: `nllb`)
+- [`plugin::native::parakeet`](./plugin-native-parakeet/) (original kind: `parakeet`)
 - [`plugin::native::piper`](./plugin-native-piper/) (original kind: `piper`)
 - [`plugin::native::pocket-tts`](./plugin-native-pocket-tts/) (original kind: `pocket-tts`)
 - [`plugin::native::sensevoice`](./plugin-native-sensevoice/) (original kind: `sensevoice`)
Lines changed: 144 additions & 0 deletions
@@ -0,0 +1,144 @@
+---
+# SPDX-FileCopyrightText: © 2025 StreamKit Contributors
+# SPDX-License-Identifier: MPL-2.0
+title: "plugin::native::parakeet"
+description: "Fast speech-to-text transcription using NVIDIA Parakeet TDT, a transducer-based ASR model. Approximately 10x faster than Whisper on consumer hardware with competitive accuracy. Uses sherpa-onnx for inference. Requires 16kHz mono audio input."
+---
+
+`kind`: `plugin::native::parakeet` (original kind: `parakeet`)
+
+Fast speech-to-text transcription using NVIDIA Parakeet TDT, a transducer-based ASR model. Approximately 10x faster than Whisper on consumer hardware with competitive accuracy. Uses sherpa-onnx for inference. Requires 16kHz mono audio input.
+
+Source: `target/plugins/release/libparakeet.so`
+
+## Categories
+- `ml`
+- `speech`
+- `transcription`
+
+## Pins
+### Inputs
+- `in` accepts `RawAudio(AudioFormat { sample_rate: 16000, channels: 1, sample_format: F32 })` (one)
+
+### Outputs
+- `out` produces `Transcription` (broadcast)
+
+## Parameters
+| Name | Type | Required | Default | Description |
+| --- | --- | --- | --- | --- |
+| `execution_provider` | `string enum[cpu, cuda, tensorrt]` | no | `cpu` | Execution provider (cpu, cuda, tensorrt) |
+| `max_segment_duration_secs` | `number` | no | `30.0` | Maximum segment duration before forced transcription (seconds)<br />min: `5`<br />max: `120` |
+| `min_silence_duration_ms` | `integer` | no | `700` | Minimum silence duration before transcription (milliseconds)<br />min: `100`<br />max: `5000` |
+| `model_dir` | `string` | no | `models/sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8` | Path to Parakeet TDT model directory (contains encoder.int8.onnx, decoder.int8.onnx, joiner.int8.onnx, tokens.txt). IMPORTANT: Input audio must be 16kHz mono f32. |
+| `num_threads` | `integer` | no | `4` | Number of threads for inference<br />min: `1`<br />max: `16` |
+| `use_vad` | `boolean` | no | `true` | Enable VAD-based segmentation |
+| `vad_model_path` | `string` | no | `models/silero_vad.onnx` | Path to Silero VAD ONNX model file |
+| `vad_threshold` | `number` | no | `0.5` | VAD speech probability threshold (0.0-1.0)<br />min: `0`<br />max: `1` |
+
+## Example Pipeline
+
+```yaml
+#
+# skit:input_asset_tags=speech
+
+name: Speech-to-Text (Parakeet TDT)
+description: Fast English speech transcription using NVIDIA Parakeet TDT (~10x faster than Whisper on CPU)
+mode: oneshot
+steps:
+  - kind: streamkit::http_input
+
+  - kind: containers::ogg::demuxer
+
+  - kind: audio::opus::decoder
+
+  - kind: audio::resampler
+    params:
+      chunk_frames: 960
+      output_frame_size: 960
+      target_sample_rate: 16000
+
+  - kind: plugin::native::parakeet
+    params:
+      model_dir: models/sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8
+      num_threads: 4
+      use_vad: true
+      vad_model_path: models/silero_vad.onnx
+      vad_threshold: 0.5
+      min_silence_duration_ms: 700
+
+  - kind: core::json_serialize
+    params:
+      pretty: false
+      newline_delimited: true
+
+  - kind: streamkit::http_output
+    params:
+      content_type: application/json
+```
+
+
+<details>
+<summary>Raw JSON Schema</summary>
+
+```json
+{
+  "properties": {
+    "execution_provider": {
+      "default": "cpu",
+      "description": "Execution provider (cpu, cuda, tensorrt)",
+      "enum": [
+        "cpu",
+        "cuda",
+        "tensorrt"
+      ],
+      "type": "string"
+    },
+    "max_segment_duration_secs": {
+      "default": 30.0,
+      "description": "Maximum segment duration before forced transcription (seconds)",
+      "maximum": 120.0,
+      "minimum": 5.0,
+      "type": "number"
+    },
+    "min_silence_duration_ms": {
+      "default": 700,
+      "description": "Minimum silence duration before transcription (milliseconds)",
+      "maximum": 5000,
+      "minimum": 100,
+      "type": "integer"
+    },
+    "model_dir": {
+      "default": "models/sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8",
+      "description": "Path to Parakeet TDT model directory (contains encoder.int8.onnx, decoder.int8.onnx, joiner.int8.onnx, tokens.txt). IMPORTANT: Input audio must be 16kHz mono f32.",
+      "type": "string"
+    },
+    "num_threads": {
+      "default": 4,
+      "description": "Number of threads for inference",
+      "maximum": 16,
+      "minimum": 1,
+      "type": "integer"
+    },
+    "use_vad": {
+      "default": true,
+      "description": "Enable VAD-based segmentation",
+      "type": "boolean"
+    },
+    "vad_model_path": {
+      "default": "models/silero_vad.onnx",
+      "description": "Path to Silero VAD ONNX model file",
+      "type": "string"
+    },
+    "vad_threshold": {
+      "default": 0.5,
+      "description": "VAD speech probability threshold (0.0-1.0)",
+      "maximum": 1.0,
+      "minimum": 0.0,
+      "type": "number"
+    }
+  },
+  "type": "object"
+}
+```
+
+</details>

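The parameter table in the docs page lists a default and a min/max range for each numeric option, and the commit message mentions that `set_threshold` uses `f32::clamp`. As a rough sketch of how those ranges could be enforced, the following clamps every numeric field to its documented bounds. `ParakeetParams`, `normalized`, `max_segment_samples`, and `SAMPLE_RATE_HZ` are hypothetical names, and whether the plugin clamps or rejects out-of-range values for fields other than the threshold is not shown in this commit.

```rust
/// Input sample rate required by the plugin (16 kHz mono f32).
pub const SAMPLE_RATE_HZ: u32 = 16_000;

/// Numeric parameters with the defaults from the docs table (hypothetical struct).
#[derive(Debug, Clone, PartialEq)]
pub struct ParakeetParams {
    pub num_threads: u32,
    pub vad_threshold: f32,
    pub min_silence_duration_ms: u32,
    pub max_segment_duration_secs: f32,
}

impl Default for ParakeetParams {
    fn default() -> Self {
        Self {
            num_threads: 4,
            vad_threshold: 0.5,
            min_silence_duration_ms: 700,
            max_segment_duration_secs: 30.0,
        }
    }
}

impl ParakeetParams {
    /// Clamp every field to the min/max listed in the docs table.
    /// Assumption: clamping rather than rejecting is this sketch's choice.
    pub fn normalized(self) -> Self {
        Self {
            num_threads: self.num_threads.clamp(1, 16),
            vad_threshold: self.vad_threshold.clamp(0.0, 1.0),
            min_silence_duration_ms: self.min_silence_duration_ms.clamp(100, 5000),
            max_segment_duration_secs: self.max_segment_duration_secs.clamp(5.0, 120.0),
        }
    }

    /// Number of 16 kHz samples buffered before a segment is force-transcribed.
    pub fn max_segment_samples(&self) -> usize {
        (self.max_segment_duration_secs * SAMPLE_RATE_HZ as f32) as usize
    }
}
```

With the default 30-second limit, a segment is cut after 30 × 16000 samples of continuous speech, which bounds the amount of audio handed to the recognizer in one call.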
justfile

Lines changed: 38 additions & 2 deletions
@@ -723,6 +723,39 @@ upload-sensevoice-plugin: build-plugin-native-sensevoice
     @curl -X POST -F "plugin=@{{plugins_target_dir}}/release/libsensevoice.so" \
         http://127.0.0.1:4545/api/v1/plugins
 
+# Build native Parakeet TDT STT plugin
+[working-directory: 'plugins/native/parakeet']
+build-plugin-native-parakeet:
+    @echo "Building native Parakeet TDT STT plugin..."
+    @CARGO_TARGET_DIR={{plugins_target_dir}} cargo build --release
+
+# Upload Parakeet plugin to running server
+[working-directory: 'plugins/native/parakeet']
+upload-parakeet-plugin: build-plugin-native-parakeet
+    @echo "Uploading Parakeet plugin to server..."
+    @curl -X POST -F "plugin=@{{plugins_target_dir}}/release/libparakeet.so" \
+        http://127.0.0.1:4545/api/v1/plugins
+
+# Download Parakeet TDT models
+download-parakeet-models:
+    @echo "Downloading Parakeet TDT models (~631MB)..."
+    @mkdir -p models/sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8
+    @HF_BASE="https://huggingface.co/streamkit/parakeet-models/resolve/main" && \
+    MODEL_DIR="models/sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8" && \
+    for f in encoder.int8.onnx decoder.int8.onnx joiner.int8.onnx tokens.txt; do \
+        if [ -f "$MODEL_DIR/$f" ]; then \
+            echo "✓ $f already exists"; \
+        else \
+            echo "Downloading $f..." && \
+            curl -L -o "$MODEL_DIR/$f" "$HF_BASE/$f" || exit 1; \
+        fi; \
+    done && \
+    echo "✓ Parakeet TDT models ready at $MODEL_DIR (English)"
+
+# Setup Parakeet (install dependencies + download models)
+setup-parakeet: install-sherpa-onnx download-parakeet-models download-silero-vad
+    @echo "✓ Parakeet TDT STT setup complete!"
+
 # Download pre-converted NLLB models from Hugging Face
 download-nllb-models:
     @echo "Downloading pre-converted NLLB-200 models from Hugging Face..."
@@ -792,6 +825,9 @@ download-models: download-whisper-models download-silero-vad download-kokoro-mod
     @echo "Optional: To download Pocket TTS models (gated; requires HF_TOKEN):"
     @echo "  just download-pocket-tts-models"
     @echo ""
+    @echo "Optional: To download Parakeet TDT models (~660MB, CC-BY-4.0):"
+    @echo "  just download-parakeet-models"
+    @echo ""
     @du -sh models/
 
 # Setup VAD (install dependencies + download models)
@@ -979,7 +1015,7 @@ install-plugin name: (build-plugin-native name)
     fi
 
 # Build all native plugin examples
-build-plugins-native: build-plugin-native-gain build-plugin-native-whisper build-plugin-native-kokoro build-plugin-native-piper build-plugin-native-matcha build-plugin-native-pocket-tts build-plugin-native-sensevoice build-plugin-native-nllb build-plugin-native-vad build-plugin-native-helsinki build-plugin-native-supertonic build-plugin-native-slint build-plugin-native-aac-encoder
+build-plugins-native: build-plugin-native-gain build-plugin-native-whisper build-plugin-native-kokoro build-plugin-native-piper build-plugin-native-matcha build-plugin-native-pocket-tts build-plugin-native-sensevoice build-plugin-native-nllb build-plugin-native-vad build-plugin-native-helsinki build-plugin-native-supertonic build-plugin-native-slint build-plugin-native-aac-encoder build-plugin-native-parakeet
 
 ## Combined
 
@@ -1042,7 +1078,7 @@ copy-plugins-native:
 
     # Official native plugins (shared target dir).
     # For most plugins the lib stem matches the plugin id.
-    for name in whisper kokoro piper matcha vad sensevoice nllb helsinki supertonic slint; do
+    for name in whisper kokoro piper matcha vad sensevoice nllb helsinki supertonic slint parakeet; do
         copy_plugin "$name" "$name" "$PLUGINS_TARGET"
     done
 
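
The `download-parakeet-models` recipe skips any model file that already exists and fetches the rest one by one. The skip-if-present half of that loop can be sketched with the standard library alone; the download itself is left out because it would need an HTTP client. `MODEL_FILES` repeats the four file names from the recipe, while `missing_model_files` is a hypothetical helper name.

```rust
use std::path::Path;

/// The four files the justfile recipe fetches into the model directory.
pub const MODEL_FILES: [&str; 4] = [
    "encoder.int8.onnx",
    "decoder.int8.onnx",
    "joiner.int8.onnx",
    "tokens.txt",
];

/// Return the model files not yet present in `model_dir`, in recipe order.
/// Like the `[ -f ... ]` test in the recipe, this only checks existence,
/// not size or checksum.
pub fn missing_model_files(model_dir: &Path) -> Vec<&'static str> {
    MODEL_FILES
        .iter()
        .copied()
        .filter(|name| !model_dir.join(name).is_file())
        .collect()
}
```

An existence check means a partially downloaded file would be treated as complete on the next run; comparing against the per-file sha256 values in `file_checksums` would catch that, at the cost of hashing roughly 660 MB.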

marketplace/official-plugins.json

Lines changed: 53 additions & 0 deletions
@@ -143,6 +143,59 @@
         }
       ]
     },
+    {
+      "id": "parakeet",
+      "name": "Parakeet TDT",
+      "version": "0.1.0",
+      "node_kind": "parakeet",
+      "kind": "native",
+      "entrypoint": "libparakeet.so",
+      "artifact": "target/plugins/release/libparakeet.so",
+      "description": "Fast speech-to-text using NVIDIA Parakeet TDT via sherpa-onnx",
+      "license": "MPL-2.0",
+      "homepage": "https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2",
+      "models": [
+        {
+          "id": "parakeet-tdt-0.6b-v2-int8",
+          "name": "Parakeet TDT 0.6B v2 (English, INT8)",
+          "default": true,
+          "source": "huggingface",
+          "repo_id": "streamkit/parakeet-models",
+          "revision": "main",
+          "files": [
+            "encoder.int8.onnx",
+            "decoder.int8.onnx",
+            "joiner.int8.onnx",
+            "tokens.txt"
+          ],
+          "expected_size_bytes": 661190513,
+          "license": "CC-BY-4.0",
+          "license_url": "https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2",
+          "file_checksums": {
+            "encoder.int8.onnx": "a32b12d17bbbc309d0686fbbcc2987b5e9b8333a7da83fa6b089f0a2acd651ab",
+            "decoder.int8.onnx": "b6bb64963457237b900e496ee9994b59294526439fbcc1fecf705b31a15c6b4e",
+            "joiner.int8.onnx": "7946164367946e7f9f29a122407c3252b680dbae9a51343eb2488d057c3c43d2",
+            "tokens.txt": "ec182b70dd42113aff6c5372c75cac58c952443eb22322f57bbd7f53977d497d"
+          }
+        },
+        {
+          "id": "silero-vad",
+          "name": "Silero VAD (v6.2)",
+          "default": true,
+          "source": "huggingface",
+          "repo_id": "streamkit/parakeet-models",
+          "revision": "main",
+          "files": [
+            "silero_vad.onnx"
+          ],
+          "expected_size_bytes": 2327524,
+          "license": "MIT",
+          "license_url": "https://github.com/snakers4/silero-vad/blob/master/LICENSE",
+          "sha256": "1a153a22f4509e292a94e67d6f9b85e8deb25b4988682b7e174c65279d8788e3"
+        }
+      ],
+      "repo": "https://github.com/streamer45/streamkit"
+    },
     {
       "id": "piper",
       "name": "Piper",

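The commit quotes the Parakeet model size as both ~631MB (the `download-parakeet-models` echo) and ~660MB (the commit message and the `download-models` hint), while `expected_size_bytes` is 661190513. One possible reading, which is an inference and not something the commit states, is that the two figures are the same byte count expressed in binary and decimal units. The arithmetic can be checked with a short sketch; `to_mb` and `to_mib` are hypothetical helper names.

```rust
/// `expected_size_bytes` for the Parakeet model entry in official-plugins.json.
pub const PARAKEET_MODEL_BYTES: u64 = 661_190_513;

/// Size in decimal megabytes (1 MB = 1,000,000 bytes).
pub fn to_mb(bytes: u64) -> f64 {
    bytes as f64 / 1_000_000.0
}

/// Size in binary mebibytes (1 MiB = 1,048,576 bytes).
pub fn to_mib(bytes: u64) -> f64 {
    bytes as f64 / 1_048_576.0
}
```

Here `to_mb(PARAKEET_MODEL_BYTES)` is about 661.2 and `to_mib(PARAKEET_MODEL_BYTES)` is about 630.6, which would account for both quoted figures if the recipe's "MB" is read as MiB.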