Inference Optimization
If you would like to increase your inference speed, some options are:
- Use batched inference with YOLOv5 PyTorch Hub models (https://github.com/ultralytics/yolov5/issues/36); see the sketch after this list
- Reduce --img-size, i.e. 1280 -> 640 -> 320
- Reduce model size, i.e. YOLOv5x -> YOLOv5l -> YOLOv5m -> YOLOv5s -> YOLOv5n
- Use half-precision FP16 inference with `python detect.py --half` and `python val.py --half`
- Use a faster GPU, i.e. P100 -> V100 -> A100
- Export (https://github.com/ultralytics/yolov5/issues/251) to ONNX or OpenVINO for up to 3x CPU speedup (https://github.com/ultralytics/yolov5/pull/6613)
- Export (https://github.com/ultralytics/yolov5/issues/251) to TensorRT for up to 5x GPU speedup
- Use a free GPU backend with up to 16GB of CUDA memory
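As a combined illustration of the first few options (batching, a smaller model, reduced image size, FP16), here is a minimal sketch using the YOLOv5 PyTorch Hub interface; the image paths are placeholders, and the `.half()` step assumes a CUDA device:

```python
import torch

# Load a YOLOv5 model from PyTorch Hub (pick a smaller variant, e.g.
# yolov5n or yolov5s, for faster inference).
model = torch.hub.load('ultralytics/yolov5', 'yolov5s')

# FP16 inference only helps on CUDA devices; keep the default FP32 on CPU.
if torch.cuda.is_available():
    model = model.to('cuda').half()

# Batched inference: pass a list of images so they run in a single forward
# pass instead of one call per image. Paths here are placeholders.
imgs = ['image1.jpg', 'image2.jpg']
results = model(imgs, size=640)  # lower size (e.g. 320) trades accuracy for speed
results.print()
```

With `detect.py` and `val.py`, the same effects come from the `--img-size` and `--half` flags listed above.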
Quantization converts 32-bit floating-point numbers to 8-bit integers and performs some or all operations on 8-bit integers, which can reduce model size and memory requirements by a factor of 4. However, this comes at a cost: to shrink the model and improve execution time, we sacrifice some precision, so there is a trade-off between model accuracy and size/latency.
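As a concrete, framework-level illustration (not specific to YOLOv5), the sketch below applies PyTorch dynamic quantization to a toy model and compares serialized sizes; the layer shapes are arbitrary placeholders:

```python
import io

import torch
import torch.nn as nn

# A toy FP32 model standing in for the real network.
model = nn.Sequential(nn.Linear(640, 256), nn.ReLU(), nn.Linear(256, 80))

# Dynamic quantization: weights of the listed layer types are stored as
# 8-bit integers and their matmuls run in INT8; activations are quantized
# on the fly at inference time.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

def size_bytes(m):
    """Serialize a model's state dict in memory and report its size."""
    buf = io.BytesIO()
    torch.save(m.state_dict(), buf)
    return buf.getbuffer().nbytes

# The INT8 weights should be roughly 4x smaller than the FP32 originals.
print(f"FP32: {size_bytes(model)} bytes, INT8: {size_bytes(quantized)} bytes")
```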

The DeepSparse Platform builds on top of sparsification, enabling you to apply these techniques to your datasets and models using recipe-driven approaches. Recipes are YAML or Markdown files that SparseML uses to define and control the sparsification of a model. Recipes consist of a series of Modifiers that can influence the training process in different ways; a minimal sketch of this workflow follows.
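Here is a minimal sketch of applying a recipe during training, assuming the SparseML PyTorch API (`ScheduledModifierManager`); the model, optimizer, recipe values, and file name are all hypothetical placeholders:

```python
import torch
import torch.nn as nn
from sparseml.pytorch.optim import ScheduledModifierManager

# Placeholder model and optimizer standing in for a real training setup.
model = nn.Sequential(nn.Linear(10, 10), nn.ReLU(), nn.Linear(10, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
steps_per_epoch = 100  # normally len(train_loader)

# recipe.yaml (illustrative Modifier values only), e.g.:
# modifiers:
#     - !EpochRangeModifier
#         start_epoch: 0.0
#         end_epoch: 10.0
#     - !GMPruningModifier
#         start_epoch: 1.0
#         end_epoch: 8.0
#         init_sparsity: 0.05
#         final_sparsity: 0.85
#         params: __ALL_PRUNABLE__

# Load the recipe and wrap the optimizer so the recipe's Modifiers run
# automatically at the scheduled points during training.
manager = ScheduledModifierManager.from_yaml("recipe.yaml")
optimizer = manager.modify(model, optimizer, steps_per_epoch)

# ... standard training loop goes here ...

manager.finalize(model)  # detach modifier hooks once training is done
```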