Commit 9fbc4b0

Merge commit with 2 parents: 824ecd1 + ef2c192

File tree

6 files changed: +15 / -11 lines

README.md

Lines changed: 4 additions & 3 deletions
@@ -12,7 +12,7 @@ SPDX-License-Identifier: MPL-2.0
   <br>
 </h1>
 <h4 align="center">Build and run real-time media pipelines on your own infrastructure</h4>
-<p align="center"><em>Speech-to-text, voice agents, live audio processing — composable, observable, self-hosted.</em></p>
+<p align="center"><em>Speech-to-text, voice agents, live audio/video processing — composable, observable, self-hosted.</em></p>
 <p align="center">
   <a href="https://streamkit.dev"><img src="https://img.shields.io/badge/docs-streamkit.dev-blue?style=flat-square" alt="Documentation"></a>
   <a href="https://demo.streamkit.dev"><img src="https://img.shields.io/badge/demo-try%20it%20live-brightgreen?style=flat-square" alt="Live Demo"></a>
@@ -24,7 +24,7 @@ SPDX-License-Identifier: MPL-2.0
 <p align="center">
   <img src="docs/public/screenshots/monitor_view.png" alt="StreamKit web UI (Monitor View): visual pipeline editor" width="800">
   <br>
-  <em>Pipeline monitor showing real-time audio processing with node metrics</em>
+  <em>Pipeline monitor showing real-time media processing with node metrics</em>
 </p>

 **StreamKit** is a self-hostable media processing server (written in Rust). You run a single binary (`skit`), then compose pipelines as a node graph (DAG) made from built-in nodes, plugins, and scriptable logic — via a web UI, YAML, or API.
@@ -67,6 +67,7 @@ If you try it and something feels off, please open an issue (or a small PR). For
 - **Speech pipelines** — Build a transcription service: ingest audio via MoQ, run Whisper STT, stream transcription updates to clients.
 - **Real-time translation** — Bilingual streams with live subtitles using NLLB or Helsinki translation models.
 - **Voice agents** — TTS-powered bots that respond to audio input with Kokoro, Piper, or Matcha.
+- **Video compositing** — Combine camera feeds with overlays and PiP layouts using the built-in compositor, encoded with VP9 for real-time transport.
 - **Audio processing** — Mixing, gain control, format conversion, and custom routing.
 - **Batch processing** — High-throughput file conversion or offline transcription using the Oneshot HTTP API.
 - **Your idea** — Add your own node or plugin and compose it into a pipeline
@@ -79,7 +80,7 @@ If you try it and something feels off, please open an issue (or a small PR). For
 - **Dynamic**: long-running sessions you can inspect and reconfigure while they run
 - **Transport**: real-time media over MoQ/WebTransport (QUIC) plus a WebSocket control plane for UI and automation (WebSocket transport nodes are on the roadmap; in the near term, non-media streams may also ride MoQ)
 - **Plugins**: native (C ABI, in-process) and WASM (Component Model).
-- **Media focus**: audio-first today (Opus, WAV, OGG, FLAC, MP3). Video support is on the [roadmap](ROADMAP.md).
+- **Media focus**: audio (Opus, WAV, OGG, FLAC, MP3) and basic video (VP9 encode/decode, compositing, WebM muxing). Video capabilities are expanding — see the [roadmap](ROADMAP.md).

 ## Quickstart (Docker)

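The README diff above states that pipelines are composed as a node graph (DAG) via a web UI, YAML, or API. As a reader aid, here is a minimal hypothetical YAML sketch of such a graph for the live-transcription use case; the node types (`moq_ingest`, `whisper_stt`, `websocket_out`) and field names are illustrative assumptions, not StreamKit's actual schema:

```yaml
# Hypothetical pipeline sketch — node types and field names are
# illustrative assumptions, not StreamKit's actual YAML schema.
pipeline:
  name: live-transcription
  nodes:
    - id: ingest
      type: moq_ingest      # audio in over MoQ/WebTransport (QUIC)
    - id: stt
      type: whisper_stt     # speech-to-text
    - id: out
      type: websocket_out   # stream transcript updates to clients
  edges:
    - ingest -> stt
    - stt -> out
```

The point of the DAG model is that the same three-node shape covers the other listed use cases by swapping node types (e.g. a translation or TTS node in the middle).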
ROADMAP.md

Lines changed: 3 additions & 3 deletions
@@ -60,10 +60,10 @@ These are in place today and will be iterated on (not “added from scratch”):

 ### Dynamic Video over MoQ (VP9 MVP) (P0)

-- **Video packet types** — First-class video packets alongside audio, with explicit timing requirements
-- **VP9 baseline** — Real-time VP9 encode/decode path suitable for browser clients; **AV1 optional later**
+- ~~**Video packet types** — First-class video packets alongside audio, with explicit timing requirements~~
+- ~~**VP9 baseline** — Real-time VP9 encode/decode path suitable for browser clients; **AV1 optional later**~~
 - **MoQ/Hang-first interop** — Start by interoperating cleanly with `@moq/hang`, then generalize to “MoQ in general”
-- **Compositor MVP (main + PiP)** — Two live video inputs → one composed output, plus simple overlays (watermark/text/images)
+- ~~**Compositor MVP (main + PiP)** — Two live video inputs → one composed output, plus simple overlays (watermark/text/images)~~
 - **Golden-path demo** — A canonical “screen share + webcam → PiP → watchers” dynamic pipeline sample

 ### Reliability & Developer Experience

crates/nodes/README.md

Lines changed: 1 addition & 1 deletion
@@ -10,7 +10,7 @@ Built-in processing nodes for StreamKit pipelines.

 ## What Lives Here

-- Built-in node implementations (e.g. `core::*`, `audio::*`, `containers::*`, `transport::*`)
+- Built-in node implementations (e.g. `core::*`, `audio::*`, `video::*`, `containers::*`, `transport::*`)
 - Node parameter schemas (used by the UI for validation and editor controls)
 - Node-level tests and fixtures
docs/src/content/docs/architecture/overview.md

Lines changed: 1 addition & 1 deletion
@@ -19,7 +19,7 @@ StreamKit has three major pieces:

 ## Extensibility

-- **Built-in nodes** (core, audio, containers, transport).
+- **Built-in nodes** (core, audio, video, containers, transport).
 - **Plugins**: native (in-process C ABI) and WASM (sandboxed Component Model).
 - **Script node**: sandboxed JavaScript (QuickJS) for lightweight integration and text processing.
docs/src/content/docs/guides/performance.md

Lines changed: 3 additions & 1 deletion
@@ -122,7 +122,7 @@ moq_peer_channel_capacity = 100 # (MoQ builds) MoQ transport internal queues (p
 | `demuxer_buffer_size` | 65536 | OGG demuxer duplex buffer (bytes) |
 | `moq_peer_channel_capacity` | 100 | (MoQ builds) MoQ peer internal channels (packets) |

-**Warning**: Only modify these if you understand the latency/throughput implications. The defaults are tuned for typical real-time audio processing workloads.
+**Warning**: Only modify these if you understand the latency/throughput implications. The defaults are tuned for typical real-time audio/video processing workloads.

 ### When to Adjust

@@ -148,6 +148,8 @@ The core audio frame pool is preallocated with fixed defaults and cannot be conf

 These are optimized for common audio frame sizes (10-80ms at 48kHz) and should not need adjustment.

+A separate video frame pool (`VideoFramePool`) manages reusable byte buffers for raw video frames, reducing per-frame allocation overhead in video pipelines.
+
 ## Complete Example

 ```toml

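The performance-guide addition above describes `VideoFramePool` as managing reusable byte buffers for raw video frames. The mechanism it is naming can be sketched in a few lines of Rust; this is an illustrative sketch of the buffer-recycling idea, not StreamKit's actual `VideoFramePool` API:

```rust
// Illustrative sketch of a video frame pool — NOT StreamKit's actual
// `VideoFramePool` API. The idea: hand out preallocated, reusable byte
// buffers so steady-state video processing avoids per-frame allocation.
pub struct FramePool {
    frame_size: usize,
    free: Vec<Vec<u8>>,
}

impl FramePool {
    /// Preallocate `capacity` buffers of `frame_size` bytes each.
    pub fn new(frame_size: usize, capacity: usize) -> Self {
        let free = (0..capacity).map(|_| vec![0u8; frame_size]).collect();
        FramePool { frame_size, free }
    }

    /// Take a buffer from the pool; allocate only if the pool is exhausted.
    pub fn acquire(&mut self) -> Vec<u8> {
        self.free
            .pop()
            .unwrap_or_else(|| vec![0u8; self.frame_size])
    }

    /// Return a buffer so a later frame can reuse it.
    pub fn release(&mut self, buf: Vec<u8>) {
        debug_assert_eq!(buf.len(), self.frame_size);
        self.free.push(buf);
    }

    /// Buffers currently available without allocating.
    pub fn available(&self) -> usize {
        self.free.len()
    }
}
```

Once the pool is warm, an acquire/release loop touches only preallocated memory, which is the same rationale the guide gives for the fixed audio frame pool.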
docs/src/content/docs/index.mdx

Lines changed: 3 additions & 2 deletions
@@ -5,7 +5,7 @@ title: StreamKit
 description: Open-source real-time media processing engine
 template: splash
 hero:
-  tagline: Build and run real-time media pipelines on your own infrastructure. Speech-to-text, voice agents, live audio processing — composable, observable, self-hosted.
+  tagline: Build and run real-time media pipelines on your own infrastructure. Speech-to-text, voice agents, live audio/video processing — composable, observable, self-hosted.
 actions:
   - text: Get Started
     link: /getting-started/quick-start/
@@ -54,14 +54,15 @@ import { Card, CardGrid } from '@astrojs/starlight/components';

 ## Who is this for?

-StreamKit is built for developers who need to process real-time media — whether you're building voice features for an app, prototyping an AI audio pipeline, or self-hosting alternatives to cloud speech APIs.
+StreamKit is built for developers who need to process real-time media — whether you're building voice features, prototyping an AI audio/video pipeline, or self-hosting alternatives to cloud speech APIs.

 ## What you can build

 - **Live transcription** — Ingest audio via MoQ, run Whisper or SenseVoice STT, stream transcription updates to clients
 - **Voice agents** — TTS-powered bots using Kokoro, Piper, or Matcha that respond to audio input
 - **Real-time translation** — Bilingual streams with live subtitles using NLLB or Helsinki models
 - **Audio processing** — Mixing, gain control, format conversion, encoding/decoding pipelines
+- **Video compositing** — Combine live video inputs with text/image overlays using the built-in compositor (PiP, z-ordering, crop/zoom), encoded via VP9 for real-time transport
 - **Content analysis** — VAD for speech detection, keyword spotting, or custom safety filters

 ## What it is

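The compositing bullet added above mentions picture-in-picture layout. The geometry behind a PiP inset is simple enough to sketch; the following is illustrative math only (assuming a bottom-right anchor and an inset that fits inside the output), not StreamKit's compositor API:

```rust
// Illustrative PiP geometry sketch — not StreamKit's compositor API.
#[derive(Debug, PartialEq)]
pub struct Rect {
    pub x: u32,
    pub y: u32,
    pub w: u32,
    pub h: u32,
}

/// Place a picture-in-picture inset in the bottom-right corner.
/// `scale` is the fraction of output width the inset occupies; the
/// source aspect ratio is preserved. Assumes the scaled inset plus
/// margin fits inside the output frame.
pub fn pip_rect(out_w: u32, out_h: u32, src_w: u32, src_h: u32, scale: f32, margin: u32) -> Rect {
    let w = (out_w as f32 * scale) as u32;
    let h = w * src_h / src_w; // preserve source aspect ratio
    Rect {
        x: out_w - w - margin,
        y: out_h - h - margin,
        w,
        h,
    }
}
```

For a 1280x720 output and a 16:9 webcam at quarter width with a 16 px margin, this yields a 320x180 inset anchored near the bottom-right corner; z-ordering then just means drawing the inset after the main feed.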