GPU and CPU benchmark of the SamLowe/roberta-base-go_emotions model on a dataset of 10k random Reddit comments, comparing PyTorch (torch), ONNX Runtime (onnx), and an O4-optimized FP16 ONNX version (onnx-fp16).
GPU insights:
- The FP16-optimized model is up to 3X faster than torch. The gain depends on the GPU's FP32:FP16 throughput ratio.
- Base ONNX with CUDA is up to ~40% faster than torch. In theory, it can be optimized further with TensorRT.
- With FP16 ONNX, the RTX 4090 is roughly 9X faster than the P40 while also being roughly 9X more expensive (~$1800 vs ~$200 used); with the other backends it is only 4-5X faster.
GPU results for the normal dataset
| GPU/batch size | 1 | 2 | 4 | 8 | 16 | 32 |
|---|---|---|---|---|---|---|
| H200 (onnx-fp16) | 778.86 | 1188.25 | 1809.96 | 2253.15 | 2138.84 | 1817.96 |
| H200 (onnx) | 425.58 | 760.48 | 1184.77 | 1465.01 | 1554.82 | 1333.68 |
| H200 (torch) | 267.06 | 406.27 | 698.71 | 928.64 | 923.54 | 773.35 |
| L40S (onnx-fp16) | 874.90 | 1387.24 | 2041.68 | 2312.79 | 2052.96 | 1601.76 |
| L40S (onnx) | 512.86 | 810.75 | 1171.87 | 1185.75 | 917.84 | 618.85 |
| L40S (torch) | 271.24 | 426.93 | 700.68 | 812.81 | 697.44 | 548.10 |
| RTX 4090 (onnx-fp16) | 1042.47 | 1042.47 | 2280.61 | 2551.59 | 2346.59 | 2346.59 |
| RTX 4090 (onnx) | 595.40 | 963.06 | 1232.12 | 1183.82 | 919.05 | 646.79 |
| RTX 4090 (torch) | 323.75 | 564.39 | 857.28 | 876.10 | 668.70 | 462.63 |
| Tesla A10G (onnx-fp16) | 600.00 | 879.20 | 1094.11 | 1082.87 | 943.09 | 767.02 |
| Tesla A10G (onnx) | 326.58 | 476.80 | 556.52 | 473.00 | 365.13 | 281.95 |
| Tesla A10G (torch) | 131.10 | 236.48 | 385.63 | 402.36 | 310.15 | 231.54 |
| Tesla P40 (onnx-fp16) | 263.18 | 286.72 | 255.36 | 200.65 | 148.89 | 108.92 |
| Tesla P40 (onnx) | 212.35 | 260.29 | 247.01 | 202.54 | 155.42 | 119.59 |
| Tesla P40 (torch) | 162.19 | 218.12 | 221.68 | 177.85 | 124.72 | 80.36 |
Table 1: GPU benchmark in messages/s for the normal dataset. Results may vary due to CPU tokenizer performance.
GPU results for the filtered (>200 characters) dataset
| GPU/batch size | 1 | 2 | 4 | 8 | 16 | 32 |
|---|---|---|---|---|---|---|
| H200 (onnx-fp16) | 643.63 | 875.59 | 1199.81 | 1302.29 | 1246.55 | 1208.13 |
| H200 (onnx) | 412.22 | 598.89 | 804.16 | 950.46 | 950.46 | 901.41 |
| H200 (torch) | 240.53 | 371.92 | 544.06 | 599.08 | 550.58 | 517.23 |
| L40S (onnx-fp16) | 726.27 | 961.86 | 1273.63 | 1305.42 | 1255.20 | 1079.12 |
| L40S (onnx) | 436.19 | 630.20 | 750.88 | 631.47 | 464.44 | 359.88 |
| L40S (torch) | 255.08 | 380.23 | 490.16 | 451.38 | 392.96 | 340.52 |
| RTX 4090 (onnx-fp16) | 856.65 | 1209.98 | 1438.25 | 1513.05 | 1395.42 | 1221.52 |
| RTX 4090 (onnx) | 494.28 | 673.83 | 740.03 | 610.06 | 472.35 | 382.72 |
| RTX 4090 (torch) | 302.38 | 476.46 | 548.32 | 450.82 | 338.37 | 273.01 |
| Tesla A10G (onnx-fp16) | 463.21 | 584.19 | 624.32 | 612.12 | 554.00 | 498.06 |
| Tesla A10G (onnx) | 255.55 | 312.77 | 290.70 | 239.00 | 200.90 | 176.20 |
| Tesla A10G (torch) | 126.82 | 209.08 | 245.60 | 205.70 | 167.53 | 141.90 |
| Tesla P40 (onnx-fp16) | 154.33 | 150.74 | 126.01 | 101.90 | 81.77 | 68.15 |
| Tesla P40 (onnx) | 138.25 | 142.59 | 125.45 | 103.09 | 86.84 | 75.27 |
| Tesla P40 (torch) | 117.11 | 128.19 | 113.87 | 88.03 | 64.88 | 47.76 |
Table 2: GPU benchmark in messages/s for the filtered dataset. Results may vary due to CPU tokenizer performance.
CPU results for the normal dataset
| CPU/batch size @threads | 1 @1T | 2 @1T | 4 @1T | 1 @4T | 2 @4T | 4 @4T | @max cores* |
|---|---|---|---|---|---|---|---|
Table 3: CPU benchmark in messages/thread/s. *(@max cores) = (performance @1T) × (number of cores). This estimate underestimates performance by disregarding hyperthreading, but overestimates it by assuming the same clock frequency at single-threaded and all-core load.
CPU results for the filtered dataset
| CPU/batch size @threads | 1 @1T | 2 @1T | 4 @1T | 1 @4T | 2 @4T | 4 @4T | @max cores* |
|---|---|---|---|---|---|---|---|
Table 4: CPU benchmark in messages/thread/s. *(@max cores) = (performance @1T) × (number of cores). This estimate underestimates performance by disregarding hyperthreading, but overestimates it by assuming the same clock frequency at single-threaded and all-core load.
The benchmark requires a working torch installation with CUDA support, as well as transformers, optimum, pandas and tqdm. These can be installed with
pip install transformers optimum[onnxruntime-gpu] pandas tqdm --extra-index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/onnxruntime-cuda-12/pypi/simple/
Alternatively, a conda environment bench with all the requirements can be created with
conda env create -f environment.yml
conda activate bench
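For reference, the onnx and onnx-fp16 variants compared above can be produced with optimum's ONNX export and O4 optimization. A minimal sketch, with illustrative output directory names (the benchmark scripts may use different paths):

```python
from optimum.onnxruntime import ORTModelForSequenceClassification, ORTOptimizer
from optimum.onnxruntime.configuration import AutoOptimizationConfig

# Export the model to ONNX (the "onnx" variant).
model = ORTModelForSequenceClassification.from_pretrained(
    "SamLowe/roberta-base-go_emotions", export=True
)
model.save_pretrained("roberta-go_emotions-onnx")  # illustrative output directory

# Apply O4 graph optimization with FP16 conversion (the "onnx-fp16" variant).
# O4 includes FP16 and is intended for GPU execution.
optimizer = ORTOptimizer.from_pretrained(model)
optimizer.optimize(
    save_dir="roberta-go_emotions-onnx-fp16",  # illustrative output directory
    optimization_config=AutoOptimizationConfig.O4(),
)
```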
The dataset consists of 10K randomly sampled Reddit comments from 12/2005-03/2023, from the Pushshift data dumps. It excludes comments with empty, [deleted] or [removed] content. Two options are provided:
- normal: as described above.
- filtered: contains only comments with >200 characters.
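As an illustration, the filtered variant corresponds to a simple length filter on the comment text. A pandas sketch, assuming the comments sit in a CSV with a body column (file and column names are hypothetical, the repository's actual dataset files may differ):

```python
import pandas as pd

# Hypothetical file/column names.
df = pd.read_csv("comments.csv")

# Drop empty, [deleted] and [removed] comments (as in the normal dataset) ...
df = df[~df["body"].isin(["", "[deleted]", "[removed]"])]

# ... and keep only comments longer than 200 characters for the filtered dataset.
filtered = df[df["body"].str.len() > 200]
```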
To run the benchmarks, use the run_benchmark.py script:
python run_benchmark.py --model [torch, onnx or onnx-fp16] --device [gpu or cpu]
Arguments:
- model (required): Model backend to use: "torch" for PyTorch, "onnx" for ONNX Runtime, or "onnx-fp16" for the FP16-optimized ONNX model.
- device (required): Device type to use, either "gpu" or "cpu".
- dataset: Dataset variant to use, either "normal" or "filtered" (default: "normal").
- gpu: ID of the GPU to use (default: 0).
- batches: Comma-separated batch sizes to run (default: "1,2,4,8,16,32").
- threads: Number of CPU threads to use (default: 1).
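For example, to benchmark the FP16 ONNX model on GPU 1 with the filtered dataset:
python run_benchmark.py --model onnx-fp16 --device gpu --dataset filtered --gpu 1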
The script outputs the number of messages processed per second for each batch size.
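For orientation, the throughput measurement is conceptually equivalent to the sketch below (this is not the actual code of run_benchmark.py; directory names, batch size, and data are illustrative):

```python
import time

from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForSequenceClassification

# Illustrative: load an exported ONNX model on the GPU via ONNX Runtime's CUDA provider.
model_dir = "roberta-go_emotions-onnx-fp16"  # hypothetical local export directory
tokenizer = AutoTokenizer.from_pretrained("SamLowe/roberta-base-go_emotions")
model = ORTModelForSequenceClassification.from_pretrained(
    model_dir, provider="CUDAExecutionProvider"
)
classifier = pipeline(
    "text-classification", model=model, tokenizer=tokenizer,
    batch_size=8, truncation=True,
)

messages = ["placeholder comment"] * 1000  # stand-in for the Reddit comments
start = time.perf_counter()
classifier(messages)
elapsed = time.perf_counter() - start
print(f"{len(messages) / elapsed:.2f} messages/s")
```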