This is a template library for TensorRT inference that uses OpenCV's `cv::Mat` as its data type and supports multiple inputs and outputs.
New Features: Now supports async inference with multi-stream concurrent processing for improved performance!
- Windows 11 / Ubuntu 20.04
- Visual Studio 2022 ~ 2026 / GNU
- CMake 3.20+
- TensorRT 10.x
- OpenCV > 4.5
- CUDA 11.x / 12.x
The library now supports asynchronous inference with the following features:
- Multi-Stream Concurrent Inference: Run multiple inference tasks in parallel using multiple CUDA streams
- Future-based Async API: Use `std::future` for non-blocking inference
- Callback Support: Register callback functions to handle results asynchronously
- Backward Compatible: All existing synchronous APIs still work unchanged
- Memory Pool: Pre-allocated GPU memory for reduced allocation overhead
```cpp
// Enable async with 4 CUDA streams
TRTInfer model("yolov8n.engine", 4, true);

// Use default 4 streams
TRTInfer model("yolov8n.engine");

// Disable async (backward compatible)
TRTInfer model("yolov8n.engine", 1, false);
```

```cpp
// Non-blocking inference
auto future = model.infer_async(input_blob);

// Do other work here...

// Wait for result when needed
auto output = future.get();
```

```cpp
model.infer_with_callback(input_blob,
    [](const auto& output) {
        // Process results asynchronously
        process_output(output);
    }
);
// Continue with other work
```

```cpp
std::vector<std::future<OutputType>> futures;

// Submit multiple inference tasks
for (auto& img : images) {
    futures.push_back(model.infer_async(preprocess(img)));
}

// Wait for all results
for (auto& f : futures) {
    auto result = f.get();
    process(result);
}
```

- 2-5x throughput improvement with 4-8 streams
- 70-90% GPU utilization compared to ~30% with a single stream
- Better scalability for batch processing and video inference
```cpp
// Async enabled (default: 4 streams)
TRTInfer(const std::string &engine_path, int num_streams = 4, bool enable_async = true);

// Synchronous mode (backward compatible)
TRTInfer(const std::string &engine_path);

// Future-based async inference
std::future<std::unordered_map<std::string, cv::Mat>>
infer_async(const std::unordered_map<std::string, cv::Mat> &input_blob);

// Callback-based async inference
void infer_with_callback(const std::unordered_map<std::string, cv::Mat> &input_blob,
                         std::function<void(const std::unordered_map<std::string, cv::Mat>&)> callback);

// Wait for all pending async inferences
void wait_all();

// Get number of active streams
int num_streams() const;

// Original synchronous API - still works!
std::unordered_map<std::string, cv::Mat> operator()(const std::unordered_map<std::string, cv::Mat> &input_blob);
```

See YOLO_async.cc for a complete example demonstrating:
- Synchronous inference (original method)
- Future-based async inference
- Callback-based async inference
- Concurrent batch processing
Build and run:
```bash
cmake -S . -B build
cmake --build ./build --config release -j 12
./build/yolo_async
```

Before building, you need to configure the library paths in CMakeLists.txt:
For Windows:
```cmake
set(CUDA_ROOT_DIR "E:/lib/cuda/12.1")                                 # Path to CUDA installation
set(TensorRT_ROOT_DIR "E:/lib/TensorRT/TensorRT-10.10.0.31")          # Path to TensorRT
set(OpenCV_ROOT_DIR "E:/lib/opencv/opencv-4.8.0/build/x64/vc16/lib")  # Path to OpenCV
set(LIB_TYPE SHARED)                                                  # Options: SHARED (DLL) or STATIC
```

For Linux:

```cmake
set(CUDA_ROOT_DIR "/usr/local/cuda")
set(TensorRT_ROOT_DIR "/usr/local/TensorRT-10.10.0.31")
set(LIB_TYPE SHARED)  # Options: SHARED (.so) or STATIC
```

- `LIB_TYPE = SHARED`: Build as a shared library (Windows: `.dll`, Linux: `.so`)
- `LIB_TYPE = STATIC`: Build as a static library (Windows: `.lib`, Linux: `.a`)
- Install Dependencies:
  - Download and install CUDA Toolkit (11.x or 12.x)
  - Download and install TensorRT 10.x for Windows
  - Download and build OpenCV or use prebuilt binaries
- Configure Paths in CMakeLists.txt:
  ```cmake
  set(CUDA_ROOT_DIR "Your/CUDA/Path")
  set(TensorRT_ROOT_DIR "Your/TensorRT/Path")
  set(OpenCV_ROOT_DIR "Your/OpenCV/Path")
  ```
- Build the Project:
  ```bash
  cmake -S . -B build
  cmake --build build --config release
  ```
- Output:
  - Library: `build/Release/trtemplate.dll` and `build/Release/trtemplate.lib`
  - Executables: `build/Release/yolo.exe`, `build/Release/fcn.exe`, etc.
- Install Dependencies:
  ```bash
  sudo apt update
  sudo apt install cuda-toolkit-12-x   # or cuda-toolkit-11-x
  sudo apt install libopencv-dev       # for OpenCV
  ```
- Install TensorRT: Download TensorRT for Linux from the NVIDIA website and follow the installation guide.
- Configure Paths in CMakeLists.txt:
  ```cmake
  set(CUDA_ROOT_DIR "/usr/local/cuda")
  set(TensorRT_ROOT_DIR "/path/to/TensorRT")
  ```
- Build the Project:
  ```bash
  cmake -S . -B build
  cmake --build build --config release -j 12
  ```
- Output:
  - Library: `build/libtrtemplate.so`
  - Executables: `build/yolo`, `build/fcn`, etc.
- TensorRT Version Compatibility:
  - TensorRT 8.x is not compatible
  - You must use TensorRT 10.x for this project
- CUDA Version Mismatch:
  - Ensure the CUDA version matches TensorRT requirements
  - TensorRT 10.x requires CUDA 11.8 or 12.x
- OpenCV Path Issues:
  - Windows: Point to the `lib` directory containing the `*.lib` files
  - Linux: Ensure `pkg-config opencv4 --cflags --libs` works
- Build Type:
  - Use Release builds for production (`--config release`)
  - Use Debug builds for development and debugging
First, download the ONNX files of YOLOv8 and FCN from the GoogleDrive link.

Convert the ONNX file to an engine file:

```bash
trtexec \
    --onnx=./pretrain/yolov8n.onnx \
    --saveEngine=./yolov8n.engine
```

Build this example with:

```bash
cmake -S . -B build
cmake --build ./build --config release -j 12
```

Run:

```bash
./build/Release/yolo.exe
# or on Linux
./build/yolo
```

For more detail, see YOLO.cc.
Convert the ONNX file to an engine file:

```bash
trtexec \
    --onnx=./pretrain/fcn.onnx \
    --saveEngine=./fcn.engine
```

Build this example with:

```bash
cmake -S . -B build
cmake --build ./build --config release -j 12
```

Run:

```bash
./build/Release/fcn.exe
# or on Linux
./build/fcn
```

For more detail, see FCN.cc. Segformer works the same way.
If you want to accelerate inference for your own model, please follow the steps below:
- Export the model to ONNX
- Convert ONNX to engine
- Write the preprocess and postprocess code
The following is a preprocess and postprocess code template:
Synchronous Inference (Original):
#include "TRTinfer.h"
#include <opencv2/opencv.hpp>
namespace model{
// Preprocess for the input blob
std::unordered_map<std::string, cv::Mat> preprocess(cv::Mat &left, cv::Mat &right)
{
...
}
// Postprocess for the input blob
std::unordered_map<std::string, cv::Mat> postprocess(std::unordered_map<std::string, cv::Mat> )
{
...
}
}
int main(int argc, char *argv[])
{
cv::Mat tensor1 = cv::imread("...");
cv::Mat tensor2 = cv::imread("...");
// Preprocess
auto input_blob = model::preprocess(tensor1, tensor2);
// Model inference
TRTInfer model("*.engine");
// Output
auto output_blob = model(input_blob);
cv::Mat dst;
model::postprocess(output_blob);
// Visualization
...
return 1;
}Asynchronous Inference (New):
#include "TRTinfer.h"
#include <opencv2/opencv.hpp>
#include <vector>
namespace model{
std::unordered_map<std::string, cv::Mat> preprocess(cv::Mat &img) { ... }
void postprocess(std::unordered_map<std::string, cv::Mat> output) { ... }
}
int main(int argc, char *argv[])
{
// Initialize model with 4 streams for async
TRTInfer model("*.engine", 4, true);
// Load images
std::vector<cv::Mat> images = {...};
std::vector<std::future<std::unordered_map<std::string, cv::Mat>>> futures;
// Submit all async inference tasks
for (auto& img : images) {
auto input_blob = model::preprocess(img);
futures.push_back(model.infer_async(input_blob));
}
// Wait for all results and process
for (auto& f : futures) {
auto output = f.get();
model::postprocess(output);
}
return 0;
}Here are a few points that may likely cause errors:
- Tensor Names
  - Some weight files have inconsistent input tensor names – handle with care
  - Use `polygraphy` to verify tensor names before implementation (or query them from C++, as in the sketch after this list)
- Data Types
  - Internal data type conversions may be involved – proceed with caution
  - Common conversions: float32 ↔ float16 (FP16), int8, uint8
- Input Data Shape
  - While preprocessing includes resizing, errors may still occur inside the model – pay close attention
  - Ensure the NCHW vs. NHWC format matches the model's expectations
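If you'd rather check from C++ than Python, here is a minimal sketch that dumps every IO tensor's name, data type, and shape from a deserialized engine using the TensorRT 10 runtime API. The engine path and the `Logger` class are illustrative, not part of this library:

```cpp
#include <NvInfer.h>
#include <fstream>
#include <iostream>
#include <iterator>
#include <memory>
#include <vector>

// Minimal logger required by the TensorRT runtime.
class Logger : public nvinfer1::ILogger {
    void log(Severity severity, const char* msg) noexcept override {
        if (severity <= Severity::kWARNING) std::cerr << msg << std::endl;
    }
};

int main()
{
    // Load a serialized engine (path is illustrative).
    std::ifstream file("yolov8n.engine", std::ios::binary);
    std::vector<char> data((std::istreambuf_iterator<char>(file)),
                            std::istreambuf_iterator<char>());

    Logger logger;
    std::unique_ptr<nvinfer1::IRuntime> runtime(nvinfer1::createInferRuntime(logger));
    std::unique_ptr<nvinfer1::ICudaEngine> engine(
        runtime->deserializeCudaEngine(data.data(), data.size()));

    // Print every IO tensor's name, direction, dtype enum value, and shape.
    for (int i = 0; i < engine->getNbIOTensors(); ++i) {
        const char* name = engine->getIOTensorName(i);
        auto mode = engine->getTensorIOMode(name);   // kINPUT or kOUTPUT
        auto dims = engine->getTensorShape(name);
        std::cout << (mode == nvinfer1::TensorIOMode::kINPUT ? "input  " : "output ")
                  << name << " dtype=" << static_cast<int>(engine->getTensorDataType(name))
                  << " shape=[";
        for (int d = 0; d < dims.nbDims; ++d)
            std::cout << dims.d[d] << (d + 1 < dims.nbDims ? "," : "");
        std::cout << "]\n";
    }
    return 0;
}
```

Comparing this output with polygraphy's is a quick way to catch name or dtype mismatches before wiring up pre/postprocessing.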
It's best to run a check with the Python library `polygraphy` before execution. For example, here's the command for checking YOLOv8:
```bash
$ polygraphy run yolov8n.onnx --onnxrt
[I] RUNNING | Command: /root/miniconda3/envs/dlpy310/bin/polygraphy run yolov8n.onnx --onnxrt
[I] onnxrt-runner-N0-11/21/25-13:22:08  | Activating and starting inference
[I] Creating ONNX-Runtime Inference Session with providers: ['CPUExecutionProvider']
[I] onnxrt-runner-N0-11/21/25-13:22:08
    ---- Inference Input(s) ----
    {images [dtype=float32, shape=(1, 3, 640, 480)]}
[I] onnxrt-runner-N0-11/21/25-13:22:08
    ---- Inference Output(s) ----
    {output0 [dtype=float32, shape=(1, 84, 6300)]}
[I] onnxrt-runner-N0-11/21/25-13:22:08  | Completed 1 iteration(s) in 59.11 ms | Average inference time: 59.11 ms.
[I] PASSED | Runtime: 1.037s | Command: /root/miniconda3/envs/dlpy310/bin/polygraphy run yolov8n.onnx --onnxrt
```

The name, type, and dimensions of each tensor can be clearly seen.
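As a concrete illustration, a preprocess matching the signature reported above (input tensor `images`, float32, shape 1x3x640x480) could be sketched as follows. This is only a minimal example: the helper name is hypothetical, it assumes the blob map is keyed by tensor name as the notes above suggest, and it omits the letterbox padding that YOLOv8 pipelines normally apply:

```cpp
#include <opencv2/opencv.hpp>
#include <string>
#include <unordered_map>

// Hypothetical helper: builds the {"images": blob} map expected by
// operator()/infer_async(). Adapt normalization to your model.
std::unordered_map<std::string, cv::Mat> preprocess_yolov8(const cv::Mat &bgr)
{
    // cv::Size is (width, height), so (480, 640) yields a 1x3x640x480 NCHW float32 blob.
    cv::Mat blob = cv::dnn::blobFromImage(bgr, 1.0 / 255.0, cv::Size(480, 640),
                                          cv::Scalar(), /*swapRB=*/true, /*crop=*/false);
    return {{"images", blob}};
}
```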
When converting ONNX to a TensorRT engine, you can use various optimization options:

Basic Conversion:

```bash
trtexec --onnx=model.onnx --saveEngine=model.engine
```

FP16 Precision (faster, slightly lower accuracy):

```bash
trtexec --onnx=model.onnx --saveEngine=model.engine --fp16
```

INT8 Precision (fastest, requires calibration):

```bash
trtexec --onnx=model.onnx --saveEngine=model.engine --int8 --calib=calibration.cache
```

Batch Size Configuration:

```bash
trtexec --onnx=model.onnx --saveEngine=model.engine --minShapes=input:1x3x640x640 --optShapes=input:1x3x640x640 --maxShapes=input:1x3x640x640
```

Workspace Size:

```bash
trtexec --onnx=model.onnx --saveEngine=model.engine --workspace=4096  # in MB
```

Verbose Output:

```bash
trtexec --onnx=model.onnx --saveEngine=model.engine --verbose
```

The async inference system uses the following components:
- StreamPool: Manages a pool of CUDA streams for concurrent execution
- MemoryPool: Pre-allocates GPU memory for each stream to avoid runtime allocation
- AsyncInfer: Template class providing async inference capabilities
- InferenceTask: Encapsulates individual inference tasks (illustrated in the sketch below)
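Purely as an illustration of how these pieces relate (the real definitions live in the headers listed under the project structure below), an inference task can be thought of as bundling its input blob, its assigned stream, and the promise that backs the `std::future` returned by `infer_async()`:

```cpp
#include <future>
#include <string>
#include <unordered_map>
#include <opencv2/core.hpp>

// Illustrative sketch only -- not the library's actual InferenceTask.
struct InferenceTaskSketch {
    std::unordered_map<std::string, cv::Mat> input;                  // host-side input blob
    int stream_index = -1;                                           // stream (and memory slot) taken from the pool
    std::promise<std::unordered_map<std::string, cv::Mat>> promise;  // fulfilled once the output copy completes
};
```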
- Start with 4 streams for most models
- Increase to 8 for small/fast models (YOLOv8n)
- Use 2-4 for large models (Segformer, FCN)
- Monitor GPU utilization to find the optimal value (see the benchmark sketch below)
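A simple way to pick the stream count is to sweep it and measure throughput with the public API shown above. A minimal sketch, assuming a `preprocess()` helper of your own and an `images` vector you already have:

```cpp
#include "TRTinfer.h"
#include <opencv2/opencv.hpp>
#include <chrono>
#include <future>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

// Assumed: your own preprocessing (not part of this library).
std::unordered_map<std::string, cv::Mat> preprocess(const cv::Mat &img);

// Measures end-to-end throughput for one stream count using the public
// constructor and infer_async() documented above.
double images_per_second(const std::string &engine_path, int num_streams,
                         const std::vector<cv::Mat> &images)
{
    TRTInfer model(engine_path, num_streams, true);
    const auto t0 = std::chrono::steady_clock::now();

    std::vector<std::future<std::unordered_map<std::string, cv::Mat>>> futures;
    for (const auto &img : images)
        futures.push_back(model.infer_async(preprocess(img)));
    for (auto &f : futures)
        f.get();   // wait for every result

    const double seconds =
        std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
    return images.size() / seconds;
}

// Example sweep:
// for (int s : {1, 2, 4, 8})
//     std::cout << s << " streams: "
//               << images_per_second("yolov8n.engine", s, images) << " img/s\n";
```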
```cpp
// Memory usage = (input + output) size × num_streams
// Example: YOLOv8n (640x640) × 4 streams ≈ 200MB
```

- The `TRTInfer` class is thread-safe for concurrent `infer_async()` calls
- Each call gets its own stream from the pool
- Synchronization is managed automatically
- Use async for batch processing: When processing multiple images/videos
- Use sync for single inference: When latency is critical and throughput doesn't matter
- Reuse model instances: Avoid creating multiple `TRTInfer` instances
- Profile your use case: Measure performance to find the optimal stream count
```cpp
// Reduce the number of streams
TRTInfer model("engine", 2, true);  // Use fewer streams
```

- Ensure your GPU supports concurrent kernel execution
- Check that the model is not too small to benefit from concurrency
- Verify `enable_async` is set to `true`

- Use `wait_all()` strategically to batch sync operations
- Consider the callback-based API for pipeline parallelism (see the video-pipeline sketch after this list)
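To tie these points together, here is a sketch of a video pipeline that reuses a single `TRTInfer` instance, overlaps postprocessing with GPU work via `infer_with_callback()`, and uses one `wait_all()` as the only sync point. `preprocess()`, `postprocess()`, and the file names are placeholders for your own code:

```cpp
#include "TRTinfer.h"
#include <opencv2/opencv.hpp>
#include <string>
#include <unordered_map>

// Assumed: your own pre/postprocessing (not part of this library).
std::unordered_map<std::string, cv::Mat> preprocess(const cv::Mat &frame);
void postprocess(const std::unordered_map<std::string, cv::Mat> &output);

int main()
{
    TRTInfer model("yolov8n.engine", 4, true);   // one instance, reused for every frame
    cv::VideoCapture cap("input.mp4");

    cv::Mat frame;
    while (cap.read(frame)) {
        auto input_blob = preprocess(frame);
        // Postprocessing of earlier frames overlaps with GPU work on later ones.
        model.infer_with_callback(input_blob,
            [](const std::unordered_map<std::string, cv::Mat> &output) {
                postprocess(output);
            });
    }

    model.wait_all();   // single sync point: flush all in-flight frames
    return 0;
}
```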
```
TRTInfer/
├── TRTinfer.h/cc        # Main inference class
├── inference_config.h   # Configuration constants
├── stream_pool.h/cc     # CUDA stream management
├── memory_pool.h/cc     # GPU memory management
├── async_infer.h/cc     # Async inference implementation
├── inference_task.h     # Task encapsulation
├── utility.h/cc         # Utility functions
└── config.h             # Build configuration
```
- Pre-allocates memory for each stream during initialization
- Each tensor has N allocations (N = num_streams)
- No runtime cudaMalloc/cudaFree during inference
- Automatic memory cleanup on destruction (the sketch below illustrates this pattern)
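A minimal sketch of that pre-allocation pattern, using plain CUDA runtime calls (illustrative only; it is not the library's actual `MemoryPool`):

```cpp
#include <cuda_runtime.h>
#include <cstddef>
#include <vector>

// Every tensor gets num_streams device buffers up front, so the hot path
// never calls cudaMalloc/cudaFree.
struct DeviceBufferPool {
    std::vector<void *> slots;   // one device buffer per stream

    DeviceBufferPool(size_t bytes_per_slot, int num_streams)
        : slots(num_streams, nullptr)
    {
        for (auto &p : slots)
            cudaMalloc(&p, bytes_per_slot);   // allocate once at initialization
    }
    ~DeviceBufferPool()
    {
        for (auto p : slots)
            cudaFree(p);                      // automatic cleanup on destruction
    }
    void *slot(int stream_index) const { return slots[stream_index]; }
};
```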
- Round-robin stream allocation
- Blocking acquire when all streams are busy
- Automatic release after inference completes
- Supports 1-16 concurrent streams (a minimal acquire/release sketch follows)
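A minimal sketch of round-robin allocation with blocking acquire (illustrative only; the library's actual `StreamPool` lives in `stream_pool.h/cc`):

```cpp
#include <cuda_runtime.h>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <vector>

class StreamPoolSketch {
public:
    explicit StreamPoolSketch(int n) : streams_(n), busy_(n, false)
    {
        for (auto &s : streams_)
            cudaStreamCreate(&s);
    }
    ~StreamPoolSketch()
    {
        for (auto &s : streams_)
            cudaStreamDestroy(s);
    }

    // Blocks until a stream is free, then returns its index (round-robin scan).
    int acquire()
    {
        std::unique_lock<std::mutex> lock(m_);
        for (;;) {
            for (size_t i = 0; i < busy_.size(); ++i) {
                const size_t idx = (next_ + i) % busy_.size();
                if (!busy_[idx]) {
                    busy_[idx] = true;
                    next_ = idx + 1;
                    return static_cast<int>(idx);
                }
            }
            cv_.wait(lock);   // all streams busy: block until release()
        }
    }

    // In the real library, release happens automatically after inference completes.
    void release(int idx)
    {
        {
            std::lock_guard<std::mutex> lock(m_);
            busy_[idx] = false;
        }
        cv_.notify_one();
    }

    cudaStream_t stream(int idx) const { return streams_[idx]; }

private:
    std::vector<cudaStream_t> streams_;
    std::vector<bool> busy_;
    size_t next_ = 0;
    std::mutex m_;
    std::condition_variable cv_;
};
```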
- User calls `infer_async(input)`
- A stream is acquired from the pool
- Input copied to GPU (async)
- Inference enqueued (async)
- Output copied to CPU (async)
- Future returned to user
- Stream released back to pool (the full sequence is sketched below)
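The per-call flow above could be sketched roughly as follows, assuming one execution context and one pre-allocated device buffer pair per stream, and a single input/output tensor named `images`/`output0`. This illustrates the sequence only; it is not the library's actual implementation:

```cpp
#include <NvInfer.h>
#include <cuda_runtime.h>
#include <cstddef>
#include <future>
#include <opencv2/core.hpp>

// stream comes from the stream pool; d_in/d_out are that stream's
// pre-allocated device buffers; output_host is a pre-sized host cv::Mat.
std::future<cv::Mat> infer_async_sketch(nvinfer1::IExecutionContext *ctx,
                                        cudaStream_t stream,
                                        void *d_in, void *d_out,
                                        cv::Mat input, cv::Mat output_host,
                                        size_t in_bytes, size_t out_bytes)
{
    return std::async(std::launch::async, [=]() mutable {
        cudaMemcpyAsync(d_in, input.data, in_bytes,
                        cudaMemcpyHostToDevice, stream);        // input -> GPU (async)
        ctx->setTensorAddress("images", d_in);                  // tensor names are model-specific
        ctx->setTensorAddress("output0", d_out);
        ctx->enqueueV3(stream);                                 // inference enqueued (async)
        cudaMemcpyAsync(output_host.data, d_out, out_bytes,
                        cudaMemcpyDeviceToHost, stream);        // output -> CPU (async)
        cudaStreamSynchronize(stream);   // waits on this stream only; the pool then releases it
        return output_host;              // delivered through the future
    });
}
```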

