Vision Inference Framework

C++ application for computer vision inference, supporting multiple vision tasks and deep learning backends.

🚧 Status: Under Development — expect frequent updates.

Key Features

Multiple Computer Vision Tasks: Supported via vision-core library (Object Detection, Classification, Instance Segmentation, Video Classification, Optical Flow, Pose Estimation, Depth Estimation)
Switchable Inference Backends: OpenCV DNN, ONNX Runtime, TensorRT, Libtorch, OpenVINO, Libtensorflow (via neuriplo library)
Real-time Video Processing: Multiple video backends via VideoCapture library (OpenCV, GStreamer, FFmpeg)
Docker Deployment Ready: Multi-backend container support

Requirements

Core Dependencies

CMake (≥ 3.15)
C++17 compiler (GCC ≥ 8.0)
OpenCV (≥ 4.6)
```
apt install libopencv-dev
```
Google Logging (glog)
```
apt install libgoogle-glog-dev
```

Dependency Management

This project automatically fetches:

vision-core - Contains pre/post-processing and model logic.
neuriplo - Provides inference backend abstractions and version management.
videocapture - Handles video I/O.

Setup

For the selected inference backends, set up the required dependencies first:

ONNX Runtime:

./scripts/setup_dependencies.sh --backend onnx_runtime

TensorRT:

./scripts/setup_dependencies.sh --backend tensorrt

LibTorch (CPU only):

./scripts/setup_dependencies.sh --backend libtorch --compute-platform cpu

LibTorch with GPU support:

./scripts/setup_dependencies.sh --backend libtorch --compute-platform cuda
# Note: the CUDA version is set automatically from `versions.neuriplo.env`

OpenVINO:

./scripts/setup_dependencies.sh --backend openvino

TensorFlow:

./scripts/setup_dependencies.sh --backend tensorflow

All backends:

./scripts/setup_dependencies.sh --backend all

Building

mkdir build && cd build
# <backend> must be one of OPENCV_DNN, ONNX_RUNTIME, LIBTORCH, TENSORRT, OPENVINO, LIBTENSORFLOW
cmake -DDEFAULT_BACKEND=<backend> -DCMAKE_BUILD_TYPE=Release ..
cmake --build .

Enabling Video Backend Support

The VideoCapture library supports multiple video processing backends with the following priority:

FFmpeg (if USE_FFMPEG=ON) - Maximum format/codec compatibility
GStreamer (if USE_GSTREAMER=ON) - Advanced pipeline capabilities
OpenCV (default) - Simple and reliable

# Enable GStreamer support
cmake -DDEFAULT_BACKEND=<backend>  -DUSE_GSTREAMER=ON -DCMAKE_BUILD_TYPE=Release ..
cmake --build .

# Enable FFmpeg support
cmake -DDEFAULT_BACKEND=<backend>  -DUSE_FFMPEG=ON -DCMAKE_BUILD_TYPE=Release ..
cmake --build .

# Enable both (FFmpeg takes priority)
cmake -DDEFAULT_BACKEND=<backend>  -DUSE_GSTREAMER=ON -DUSE_FFMPEG=ON -DCMAKE_BUILD_TYPE=Release ..
cmake --build .

Inference Backend Options

Replace <backend> with one of the supported options. See the Dependency Management Guide for the complete list and details.

Test Build

cmake -DENABLE_APP_TESTS=ON ..

App Usage

Command Line Options

./vision-inference \
  [--help | -h] \
  --type=<model_type> \
  --source=<input_source> \
  --labels=<labels_file> \
  --weights=<model_weights> \
  [--min_confidence=<threshold>] \
  [--nms_threshold=<threshold>] \
  [--mask_threshold=<threshold>] \
  [--batch|-b=<batch_size>] \
  [--input_sizes|-is='<input_sizes>'] \
  [--use-gpu] \
  [--warmup] \
  [--benchmark] \
  [--iterations=<number>]

Required Parameters

--type=<model_type>: Specifies the type of vision model to use. The TaskFactory supports the following model type strings, grouped by task:

Object Detection:

"yolo", "yolov7e2e", "yolov10", "yolo26", "yolov4" - YOLO-based variants
"yolonas" - YOLO-NAS
"rtdetr" - RT-DETR family (RT-DETR v1, v2, and v4; excludes v3; includes D-FINE and DEIM v1/v2)
"rtdetrul" - RT-DETR (Ultralytics implementation)
"rfdetr" - RF-DETR

Instance Segmentation:

"yoloseg" - YOLOv5/YOLOv8/YOLO11
"yolov10seg" - YOLOv10
"yolo26seg" - YOLO26
"rfdetrseg" - RF-DETR

Classification:

"torchvision-classifier" - Torchvision models (ResNet, EfficientNet, etc.)
"tensorflow-classifier" - TensorFlow/Keras models
"vit-classifier" - Vision Transformers

Video Classification:

"videomae" - VideoMAE
"vivit" - ViViT
"timesformer" - TimeSformer

Optical Flow:

"raft" - RAFT optical flow

Pose Estimation:

"vitpose" - ViTPose

Depth Estimation:

"depth_anything_v2", "depth-anything-v2" - Depth Anything V2

Canonical copy: docs/generated/supported-model-types.md.

--source=<input_source>: Defines the input source for inference. It can be:
- A live feed URL, e.g., rtsp://cameraip:port/stream
- A path to a video file, e.g., path/to/video.format
- A path to an image file, e.g., path/to/image.format
--labels=<path/to/labels/file>: Specifies the path to the file containing the class labels. This file should list the labels used by the model, with each label on a new line.
--weights=<path/to/model/weights>: Defines the path to the file containing the model weights.
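
As a rough illustration of the labels file format described above (one label per line, with the line index acting as the class id), the following sketch reads labels from any input stream. `readLabels` and `labelFor` are hypothetical helpers, not functions from this project; stripping a trailing carriage return and skipping empty lines are assumptions about how such a file might reasonably be handled.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Reads one class label per line; the position in the vector is the class id.
// Assumption: a trailing '\r' (Windows line ending) is stripped and empty lines are skipped.
std::vector<std::string> readLabels(std::istream& in) {
    std::vector<std::string> labels;
    std::string line;
    while (std::getline(in, line)) {
        if (!line.empty() && line.back() == '\r') line.pop_back();
        if (!line.empty()) labels.push_back(line);
    }
    return labels;
}

// Looks up the label for a class id predicted by a model.
const std::string& labelFor(const std::vector<std::string>& labels, std::size_t classId) {
    assert(classId < labels.size() && "class id is outside the labels file");
    return labels[classId];
}
```

To read from disk, the same function could be given a `std::ifstream` opened on the path passed to `--labels`.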

Optional Parameters

[--min_confidence=<confidence_value>]: Sets the minimum confidence threshold for detections. Detections with a confidence score below this value will be discarded. The default value is 0.25.
[--nms_threshold=<iou_value>]: IoU threshold used for Non-Maximum Suppression in YOLO-based detectors and segmenters. Higher values keep more overlapping boxes. The default value is 0.45.
[--mask_threshold=<value>]: Binarization threshold applied to predicted masks in instance segmentation models. Pixels above this value are considered foreground. The default value is 0.50.
[--batch | -b=<batch_size>]: Specifies the batch size for inference. The default value is 1; batch sizes larger than 1 are not currently supported.
[--input_sizes | -is=<input_sizes>]: Input sizes for each model input when models have dynamic axes or the backend can't retrieve input layer information (like the OpenCV DNN module). Format: CHW;CHW;.... For example:
- '3,224,224' for a single input
- '3,224,224;3,224,224' for two inputs
- '3,640,640;2' for RT-DETR/RT-DETRv2/D-FINE/DEIM/DEIMv2 models
[--use-gpu]: Activates GPU support for inference. This can significantly speed up the inference process if a compatible GPU is available. Default is false.
[--warmup]: Enables GPU warmup. Warming up the GPU before the actual inference helps achieve more consistent performance. Relevant only when the source is an image. Default is false.
[--benchmark]: Enables benchmarking mode, in which the application runs multiple inference iterations and reports the average inference time. Useful for evaluating the model and the inference setup. Relevant only when the source is an image. Default is false.
[--iterations=<number>]: Specifies the number of iterations for benchmarking. The default value is 10.
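
The `--input_sizes` format described above (inputs separated by `;`, dimensions separated by `,`) could be parsed along the following lines. This is a minimal sketch, not the application's actual parser: `parseInputSizes` and `elementCount` are hypothetical names, and rejecting non-positive or non-numeric dimensions is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Splits "3,224,224;3,224,224" into one shape (list of dimensions) per model input.
std::vector<std::vector<std::int64_t>> parseInputSizes(const std::string& spec) {
    std::vector<std::vector<std::int64_t>> shapes;
    std::istringstream inputs(spec);
    std::string input;
    while (std::getline(inputs, input, ';')) {
        std::vector<std::int64_t> dims;
        std::istringstream fields(input);
        std::string field;
        while (std::getline(fields, field, ',')) {
            std::size_t used = 0;
            const long long value = std::stoll(field, &used);  // throws if not a number
            if (used != field.size() || value <= 0) {
                throw std::invalid_argument("bad dimension: " + field);
            }
            dims.push_back(value);
        }
        if (dims.empty()) throw std::invalid_argument("empty input shape in: " + spec);
        shapes.push_back(std::move(dims));
    }
    return shapes;
}

// Number of elements in one input, e.g. to size a buffer.
std::int64_t elementCount(const std::vector<std::int64_t>& shape) {
    assert(!shape.empty());
    std::int64_t count = 1;
    for (std::int64_t dim : shape) count *= dim;
    return count;
}
```

Under these assumptions, `parseInputSizes("3,640,640;2")` yields two shapes, `{3, 640, 640}` and `{2}`, which matches the RT-DETR example in the list.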

To check all available options:

./vision-inference --help
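
For readers curious how the `--warmup`, `--benchmark`, and `--iterations` options listed under Optional Parameters fit together, the sketch below shows one common way to measure average inference time. It is not the application's implementation: `benchmarkMs` is a hypothetical helper, the `infer` callable stands in for a single inference call, and performing exactly one untimed warmup pass is an assumption.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Returns the mean wall-clock time per call to `infer`, in milliseconds.
double benchmarkMs(const std::function<void()>& infer, int iterations, bool warmup) {
    assert(iterations > 0);
    if (warmup) infer();  // untimed pass so one-off setup costs do not skew the average

    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) infer();
    const std::chrono::duration<double, std::milli> elapsed =
        std::chrono::steady_clock::now() - start;

    return elapsed.count() / iterations;
}
```

A monotonic clock (`std::chrono::steady_clock`) is used here because it is not affected by system time adjustments during the run.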

Common Use Case Examples

# Object Detection - YOLOv8 ONNX Runtime image processing
./vision-inference \
  --type=yolo \
  --source=image.png \
  --weights=models/yolov8s.onnx \
  --labels=data/coco.names

# Object Detection - RT-DETR video processing
./vision-inference \
  --type=rtdetr \
  --source=video.mp4 \
  --weights=models/rtdetr-l.onnx \
  --labels=data/coco.names \
  --min_confidence=0.4

# Classification - Image classification
./vision-inference \
  --type=torchvision-classifier \
  --source=image.png \
  --weights=models/resnet50.onnx \
  --labels=data/imagenet_labels.txt

# Instance Segmentation - YOLO segmentation
./vision-inference \
  --type=yoloseg \
  --source=video.mp4 \
  --weights=models/yolov8s-seg.onnx \
  --labels=data/coco.names \
  --min_confidence=0.4 \
  --nms_threshold=0.5 \
  --mask_threshold=0.5 \
  --use-gpu

# Optical Flow - RAFT model
./vision-inference \
  --type=raft \
  --source=video.mp4 \
  --weights=models/raft-small.onnx

Check the .vscode folder for other examples.
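
The segmentation example above passes `--min_confidence` and `--nms_threshold`. To make their meaning concrete, here is a textbook greedy NMS sketch: detections below the confidence threshold are dropped, then any box whose IoU with a higher-scoring kept box exceeds the NMS threshold is suppressed. This is not necessarily how vision-core implements it; the `Detection` struct and the function names are hypothetical, and the sketch ignores class ids for brevity.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical box type: top-left corner, size, and confidence score.
struct Detection { float x, y, w, h, score; };

// Intersection over Union of two axis-aligned boxes.
float iou(const Detection& a, const Detection& b) {
    const float x1 = std::max(a.x, b.x);
    const float y1 = std::max(a.y, b.y);
    const float x2 = std::min(a.x + a.w, b.x + b.w);
    const float y2 = std::min(a.y + a.h, b.y + b.h);
    const float inter = std::max(0.0f, x2 - x1) * std::max(0.0f, y2 - y1);
    const float uni = a.w * a.h + b.w * b.h - inter;
    return uni > 0.0f ? inter / uni : 0.0f;
}

// Confidence filtering followed by greedy, class-agnostic NMS.
std::vector<Detection> filterDetections(std::vector<Detection> dets,
                                        float minConfidence, float nmsThreshold) {
    assert(nmsThreshold >= 0.0f && nmsThreshold <= 1.0f);
    dets.erase(std::remove_if(dets.begin(), dets.end(),
                              [&](const Detection& d) { return d.score < minConfidence; }),
               dets.end());
    // Highest score first, so stronger detections suppress weaker overlapping ones.
    std::sort(dets.begin(), dets.end(),
              [](const Detection& a, const Detection& b) { return a.score > b.score; });

    std::vector<Detection> kept;
    for (const Detection& d : dets) {
        const bool suppressed = std::any_of(kept.begin(), kept.end(),
            [&](const Detection& k) { return iou(d, k) > nmsThreshold; });
        if (!suppressed) kept.push_back(d);
    }
    return kept;
}
```

Raising the NMS threshold makes suppression less aggressive, which is why higher values keep more overlapping boxes.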

Docker Deployment

Building Images

The docker folder of the project contains a Dockerfile for each inference backend (currently onnxruntime, libtorch, tensorrt, openvino).

# Build for specific backend
docker build --rm -t vision-inference:<backend_tag>  \
    -f docker/Dockerfile.backend .

Running Containers

Replace the placeholders with your desired options and paths:

docker run --rm \
    -v<path_host_data_folder>:/app/data \
    -v<path_host_weights_folder>:/weights \
    -v<path_host_labels_folder>:/labels \
    vision-inference:<backend_tag> \
    --type=<model_type> \
    --weights=<weight_according_your_backend> \
    --source=/app/data/<image_or_video> \
    --labels=/labels/<labels_file>

For GPU support, add --gpus all to the docker run command.

Additional Resources

Detector Architectures Guide
Supported Models
Model Export Guide
Vision-Core Export Tools - Comprehensive export utilities for all supported models

⚠️ Known Limitations

Windows builds not currently supported
Some model/backend combinations may require specific export configurations

🙏 Acknowledgments

References

https://paperswithcode.com/sota/real-time-object-detection-on-coco (no longer available)
https://leaderboard.roboflow.com/

Support

Open an issue for bug reports or feature requests. Contributions, corrections, and suggestions are welcome to keep this repository relevant and useful.
Check existing issues for solutions to common problems.
