Author: Igor Khozhanov
Contact: khozhanov@gmail.com
Copyright: © 2026 Igor Khozhanov. All Rights Reserved.
Processing 1440p Video Stream @ ~118 FPS on RTX 3060 Ti.
The previous phases, Phase 3 (Integration) and Phase 4 (Functional Inference), are complete. The pipeline now supports full end-to-end detection and tracking using the TensorRT and ONNX backends with mathematically verified kernels.
Note for Reviewers: This repository is currently under active development. The pipeline is being implemented in stages to ensure memory safety and zero-host-copy verification.
| Module / Stage | Status | Notes |
|---|---|---|
| FFMpeg Source | ✅ Stable | Handles stream connection and packet extraction. |
| Stub Detector | ✅ Stable | Pass-through module, validated for pipeline latency profiling. |
| Output / NVJpeg | ✅ Stable | Saves frames from GPU memory to disk as separate *.jpg images. |
| Inference Pipeline | ✅ Stable | Connects all the stages together. |
| ONNX Detector | ✅ Stable | Implemented with Zero-Copy input. |
| TensorRT Detector | ✅ Stable | Engine builder & enqueueV3 implemented. |
| Object Tracker | ✅ Stable | Kernels for position prediction, IOU matching, velocity filtering. |
| Post-Processing | ✅ Stable | Custom CUDA kernels for YOLOv8 output decoding & NMS (see the IoU sketch after this table). |
| Windows Port | 🚧 WIP | Adapting CMake & CUDA. |
| Jetson Port | 🚧 WIP | ARM64 optimization. |
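Both the tracker's IOU matching and the post-processing NMS hinge on pairwise intersection-over-union between boxes. The snippet below is only a rough illustration of that building block under an assumed `(x1, y1, x2, y2)` box layout; it is not the repository's actual kernel code.

```cpp
// Hypothetical sketch: pairwise IoU on the GPU. The repository's real kernels
// (YOLOv8 decode, NMS, tracker matching) are more involved; the box layout
// and all names below are assumptions.
struct Box { float x1, y1, x2, y2; };

__device__ float iou(const Box& a, const Box& b) {
    // Intersection rectangle.
    float ix1 = fmaxf(a.x1, b.x1);
    float iy1 = fmaxf(a.y1, b.y1);
    float ix2 = fminf(a.x2, b.x2);
    float iy2 = fminf(a.y2, b.y2);
    float inter = fmaxf(ix2 - ix1, 0.0f) * fmaxf(iy2 - iy1, 0.0f);
    // Union = areaA + areaB - intersection.
    float areaA = (a.x2 - a.x1) * (a.y2 - a.y1);
    float areaB = (b.x2 - b.x1) * (b.y2 - b.y1);
    float uni = areaA + areaB - inter;
    return uni > 0.0f ? inter / uni : 0.0f;
}

// One thread per (detection, track) pair; the resulting matrix feeds matching.
__global__ void iouMatrix(const Box* dets, int nDets,
                          const Box* tracks, int nTracks, float* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // detection index
    int j = blockIdx.y * blockDim.y + threadIdx.y;  // track index
    if (i < nDets && j < nTracks) {
        out[i * nTracks + j] = iou(dets[i], tracks[j]);
    }
}
```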
This project implements a high-performance video inference pipeline designed to minimize CPU-GPU bandwidth usage. Unlike standard OpenCV-based implementations, this pipeline keeps data entirely in VRAM (Zero-Host-Copy) from decoding to inference.
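In concrete terms, every stage exchanges device pointers: the decoder writes into VRAM, preprocessing and the detector read that same buffer, and NVJpeg encodes straight from VRAM. The sketch below illustrates such a hand-off; the struct and interface names are assumptions for illustration, not the repository's actual API.

```cpp
// Illustrative sketch only: GpuFrame and IDetector are assumed names, not the
// repository's real types. The point is that frame data never leaves VRAM.
#include <cuda_runtime.h>
#include <cstdint>

struct GpuFrame {
    uint8_t* data = nullptr;   // device pointer to decoded pixels (VRAM)
    int width = 0, height = 0;
    size_t pitch = 0;          // row pitch of the device allocation
    cudaStream_t stream{};     // stream on which the frame was produced
};

class IDetector {
public:
    virtual ~IDetector() = default;
    // Consumes a device-resident frame; no cudaMemcpy back to the host.
    virtual void infer(const GpuFrame& frame) = 0;
};

// Typical flow: the decoder fills GpuFrame::data, TensorRT/ONNX Runtime run
// on the same device buffer, and NVJpeg encodes directly from VRAM before
// the compressed bytes are finally written to disk.
```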
This repository contains the Inference Engine (MIT Licensed). It does not include pre-trained model weights.
To reproduce the demo results (Crop & Weed Detection), you must download the pre-trained YOLOv8 model separately.
The model is hosted in the research repository (AGPL-3.0):
- Download: best.onnx
- License: AGPL-3.0 (Derived from Ultralytics YOLOv8)
- ✅ Linux x64 (Verified on Ubuntu 24.04 / RTX 3060 Ti)
- 🚧 Windows 10/11 (Build scripts implemented, pending validation)
- 🚧 Nvidia Jetson Orin (CMake configuration ready, pending hardware tests)
Note: The CMakeLists.txt contains specific logic for vcpkg (Windows) and aarch64 (Jetson), but these targets are currently experimental.
- CMake 3.19+
- CUDA Toolkit (12.x)
- TensorRT 10.x+
- FFmpeg: Required.
  - Linux Users: Install via package manager or build from source with `--enable-shared`.
- NVIDIA cuDNN: Required by ONNX Runtime CUDA provider.
  - Note: Ensure `libcudnn.so` is in your `LD_LIBRARY_PATH` or installed system-wide.
```bash
git clone https://github.com/Igkho/ZeroHostCopyInference.git
cd ZeroHostCopyInference
mkdir build
mv ~/Downloads/best.onnx ./build/
cd build
cmake ..
make -j$(nproc)
```

Run the pipeline:

```bash
./ZeroCopyInference -i ../video/Moving.mp4 --backend trt --model best.onnx -b 16 -o Moving
```

Run the unit tests:

```bash
./ZeroCopyInferenceTests
```

No C++ compilation required. Requires NVIDIA Container Toolkit.
```bash
git clone https://github.com/Igkho/ZeroHostCopyInference.git
cd ZeroHostCopyInference
mkdir models
mv ~/Downloads/best.onnx ./models/

docker run --rm --gpus all \
    -v $(pwd)/video:/app/video \
    -v $(pwd)/models:/app/models \
    ghcr.io/igkho/zerohostcopyinference:main \
    -i video/Moving.mp4 \
    --backend trt \
    --model /app/models/best.onnx \
    -b 16 \
    -o video/output
```

Run the unit tests in the container:

```bash
docker run --rm --gpus all \
    --entrypoint ./build/ZeroCopyInferenceTests \
    ghcr.io/igkho/zerohostcopyinference:main
```

Benchmarks were performed on an NVIDIA RTX 3060 Ti. Input: 1440p Video Stream. Model: YOLOv8 Medium (YOLOv8m) @ 1024x1024 Resolution.
To measure the raw overhead of the pipeline architecture (I/O latency), a pass-through (Stub) detector should be used.
| Metric | Result | Notes |
|---|---|---|
| Throughput | ~300 FPS | Maximum theoretical speed without an AI model. |
| Latency | 3.3 ms | Combined Decoding + Memory Management overhead. |
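For reference, a pass-through detector of this kind is just an empty implementation of the detector interface, so the measured time isolates decoding, memory management, and output. The sketch below reuses the illustrative `IDetector`/`GpuFrame` types from earlier and is not the repository's actual class.

```cpp
// Hypothetical stub detector: performs no work on the frame, so running the
// full pipeline with it measures pure decoding + memory-management + output
// overhead.
class StubDetector : public IDetector {
public:
    void infer(const GpuFrame& frame) override {
        (void)frame;  // intentionally a no-op; the frame passes through untouched
    }
};
```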
Running YOLOv8m (FP16 optimized) with full object tracking and NVJpeg output.
| Metric | Result | Notes |
|---|---|---|
| Total Throughput | 118.10 FPS | Wall time (End-to-End). 2x Real-Time. |
| Pipeline Latency | ~8.5 ms | Average per frame. |
| Bottleneck | Decoding | Inference is so fast (5.5ms) that Video Decoding (7ms) becomes the primary factor. |
Workload Distribution:
- Decoding: ~7.07 ms/frame (48% load)
- Inference: ~5.58 ms/frame (38% load)
- Storage/IO: ~1.93 ms/frame (13% load)
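These per-stage times sum to roughly 14.6 ms, yet the measured end-to-end cost is only ~8.5 ms per frame, which is consistent with the stages overlapping across consecutive frames so that the slowest stage (decoding) bounds throughput. Below is a back-of-the-envelope check using the numbers above; the perfect-overlap model is an assumption, not a measured property of the pipeline.

```cpp
#include <algorithm>
#include <cstdio>

int main() {
    const double decode_ms = 7.07, infer_ms = 5.58, io_ms = 1.93;
    // If the stages ran strictly back-to-back, each frame would take the sum.
    double serial_ms = decode_ms + infer_ms + io_ms;              // ~14.6 ms -> ~68 FPS
    // With stages overlapped across consecutive frames, the slowest stage
    // bounds throughput.
    double pipelined_ms = std::max({decode_ms, infer_ms, io_ms}); // ~7.07 ms -> ~141 FPS
    printf("serial: %.1f FPS, pipelined bound: %.1f FPS\n",
           1000.0 / serial_ms, 1000.0 / pipelined_ms);
    // The measured 118 FPS sits between these two bounds, consistent with
    // decoding being the dominant (though not perfectly hidden) stage.
    return 0;
}
```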
Both TensorRT (Highly Optimized) and ONNX Runtime (Generic Compatibility) are supported.
Scenario: 1024x1024 Input Resolution on RTX 3060 Ti.
| Backend | FPS | Latency (Inf) | Speedup Factor | Notes |
|---|---|---|---|---|
| TensorRT (FP16) | 118.1 FPS | ~5.6 ms | 1.0x (Ref) | Utilizes Tensor Cores. Recommended. |
| ONNX Runtime | ~10.5 FPS | ~94.8 ms | 0.08x | Generic execution. Useful for testing new models. |
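Both backends sit behind a single detector abstraction chosen by the `--backend` flag. The sketch below shows what such a selection could look like; all class and function names here are illustrative assumptions, not the repository's actual API.

```cpp
// Illustrative only: both backends implement one interface and are picked by
// the --backend value ("trt" or "onnx").
#include <memory>
#include <stdexcept>
#include <string>

struct Detections { /* device-side boxes/scores, omitted in this sketch */ };

class IDetectorBackend {
public:
    virtual ~IDetectorBackend() = default;
    virtual Detections infer(const void* gpuFrame) = 0;  // consumes a VRAM frame
};

class TrtBackend : public IDetectorBackend {              // TensorRT engine + enqueueV3
public:
    explicit TrtBackend(const std::string& model) { (void)model; /* build or load engine */ }
    Detections infer(const void*) override { return {}; }
};

class OnnxBackend : public IDetectorBackend {             // ONNX Runtime CUDA provider
public:
    explicit OnnxBackend(const std::string& model) { (void)model; /* create session */ }
    Detections infer(const void*) override { return {}; }
};

std::unique_ptr<IDetectorBackend> makeBackend(const std::string& name,
                                              const std::string& model) {
    if (name == "trt")  return std::make_unique<TrtBackend>(model);
    if (name == "onnx") return std::make_unique<OnnxBackend>(model);
    throw std::invalid_argument("unknown backend: " + name);
}
```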
The source code of this project is licensed under the MIT License. You are free to use, modify, and distribute this infrastructure code for any purpose, including commercial applications.
While the code is MIT-licensed, the assets and models used in this repository are subject to different terms. Please review them carefully before redistributing:
- Files: Content located in the `video/` directory (e.g., `Moving.mp4`, `Moving_annotated.gif`).
- Source: Generated using KlingAI (Free Tier).
- Terms: These assets are provided for demonstration and educational purposes only. They are strictly non-commercial. You may not use these specific video files in any commercial product or service.
- Attribution: The watermarks on these videos must remain intact as per the platform's Terms of Service.
- Example: If you use YOLOv8 (Ultralytics) with this pipeline, be aware that YOLOv8 is licensed under AGPL-3.0.
- Implication: Integrating an AGPL-3.0 model may legally require your entire combined application to comply with AGPL-3.0 terms (i.e., open-sourcing your entire project).
User Responsibility: This repository provides the execution engine only. No models are bundled. You are responsible for verifying and complying with the license of any specific ONNX/TensorRT model you choose to load.
