GPU and CPU benchmark of the SamLowe/roberta-base-go_emotions model on a dataset of 10k random Reddit comments, comparing PyTorch (torch), ONNX Runtime (onnx), and an O4-optimized FP16 ONNX version (onnx-fp16).
GPU insights:
- The FP16-optimized model is up to 3X faster than torch. The gain depends on the GPU's FP32:FP16 throughput ratio.
- Base ONNX with CUDA is up to ~40% faster than torch. In theory, it can be optimized further with TensorRT.
- With FP16 ONNX, the RTX 4090 is roughly 9X faster than the P40 while also being roughly 9X more expensive (~$1800 vs ~$200 used); with the other backends it is only 4-5X faster.
GPU results for the normal dataset
| GPU/batch size | 1 | 2 | 4 | 8 | 16 | 32 |
|---|---|---|---|---|---|---|
| H200 (onnx-fp16) | 778.86 | 1188.25 | 1809.96 | 2253.15 | 2138.84 | 1817.96 |
| H200 (onnx) | 425.58 | 760.48 | 1184.77 | 1465.01 | 1554.82 | 1333.68 |
| H200 (torch) | 267.06 | 406.27 | 698.71 | 928.64 | 923.54 | 773.35 |
| L40S (onnx-fp16) | 874.90 | 1387.24 | 2041.68 | 2312.79 | 2052.96 | 1601.76 |
| L40S (onnx) | 512.86 | 810.75 | 1171.87 | 1185.75 | 917.84 | 618.85 |
| L40S (torch) | 271.24 | 426.93 | 700.68 | 812.81 | 697.44 | 548.10 |
| RTX 4090 (onnx-fp16) | 1042.47 | 1042.47 | 2280.61 | 2551.59 | 2346.59 | 2346.59 |
| RTX 4090 (onnx) | 595.40 | 963.06 | 1232.12 | 1183.82 | 919.05 | 646.79 |
| RTX 4090 (torch) | 323.75 | 564.39 | 857.28 | 876.10 | 668.70 | 462.63 |
| Tesla A10G (onnx-fp16) | 600.00 | 879.20 | 1094.11 | 1082.87 | 943.09 | 767.02 |
| Tesla A10G (onnx) | 326.58 | 476.80 | 556.52 | 473.00 | 365.13 | 281.95 |
| Tesla A10G (torch) | 131.10 | 236.48 | 385.63 | 402.36 | 310.15 | 231.54 |
| Tesla P40 (onnx-fp16) | 263.18 | 286.72 | 255.36 | 200.65 | 148.89 | 108.92 |
| Tesla P40 (onnx) | 212.35 | 260.29 | 247.01 | 202.54 | 155.42 | 119.59 |
| Tesla P40 (torch) | 162.19 | 218.12 | 221.68 | 177.85 | 124.72 | 80.36 |
Table 1: GPU benchmark in messages/s for the normal dataset. Results may vary due to CPU tokenizer performance.
GPU results for the filtered (>200 characters) dataset
| GPU/batch size | 1 | 2 | 4 | 8 | 16 | 32 |
|---|---|---|---|---|---|---|
| H200 (onnx-fp16) | 643.63 | 875.59 | 1199.81 | 1302.29 | 1246.55 | 1208.13 |
| H200 (onnx) | 412.22 | 598.89 | 804.16 | 950.46 | 950.46 | 901.41 |
| H200 (torch) | 240.53 | 371.92 | 544.06 | 599.08 | 550.58 | 517.23 |
| L40S (onnx-fp16) | 726.27 | 961.86 | 1273.63 | 1305.42 | 1255.20 | 1079.12 |
| L40S (onnx) | 436.19 | 630.20 | 750.88 | 631.47 | 464.44 | 359.88 |
| L40S (torch) | 255.08 | 380.23 | 490.16 | 451.38 | 392.96 | 340.52 |
| RTX 4090 (onnx-fp16) | 856.65 | 1209.98 | 1438.25 | 1513.05 | 1395.42 | 1221.52 |
| RTX 4090 (onnx) | 494.28 | 673.83 | 740.03 | 610.06 | 472.35 | 382.72 |
| RTX 4090 (torch) | 302.38 | 476.46 | 548.32 | 450.82 | 338.37 | 273.01 |
| Tesla A10G (onnx-fp16) | 463.21 | 584.19 | 624.32 | 612.12 | 554.00 | 498.06 |
| Tesla A10G (onnx) | 255.55 | 312.77 | 290.70 | 239.00 | 200.90 | 176.20 |
| Tesla A10G (torch) | 126.82 | 209.08 | 245.60 | 205.70 | 167.53 | 141.90 |
| Tesla P40 (onnx-fp16) | 154.33 | 150.74 | 126.01 | 101.90 | 81.77 | 68.15 |
| Tesla P40 (onnx) | 138.25 | 142.59 | 125.45 | 103.09 | 86.84 | 75.27 |
| Tesla P40 (torch) | 117.11 | 128.19 | 113.87 | 88.03 | 64.88 | 47.76 |
Table 2: GPU benchmark in messages/s for the filtered dataset. Results may vary due to CPU tokenizer performance.
CPU results for the normal dataset
| CPU/batch size @threads | 1 @1T | 2 @1T | 4 @1T | 1 @4T | 2 @4T | 4 @4T | @max cores* |
|---|---|---|---|---|---|---|---|
Table 3: CPU benchmark in messages/thread/s. *(@max cores) = (performance @1T) × (number of cores). This estimate underestimates performance by disregarding hyperthreading, but overestimates it by assuming the same clock frequency at single-threaded and all-core load.
CPU results for the filtered dataset
| CPU/batch size @threads | 1 @1T | 2 @1T | 4 @1T | 1 @4T | 2 @4T | 4 @4T | @max cores* |
|---|---|---|---|---|---|---|---|
Table 4: CPU benchmark in messages/thread/s. *(@max cores) = (performance @1T) × (number of cores). This estimate underestimates performance by disregarding hyperthreading, but overestimates it by assuming the same clock frequency at single-threaded and all-core load.
The benchmark requires a working torch installation with CUDA support, as well as transformers, optimum, pandas and tqdm. These can be installed with
pip install transformers optimum[onnxruntime-gpu] pandas tqdm --extra-index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/onnxruntime-cuda-12/pypi/simple/
Alternatively, a conda environment bench with all the requirements can be created with
conda env create -f environment.yml
conda activate bench
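For reference, the onnx and onnx-fp16 variants compared above can be produced with optimum's ONNX export and O4 optimization. A minimal sketch, with illustrative output directory names (the benchmark scripts may use different paths):

```python
from optimum.onnxruntime import ORTModelForSequenceClassification, ORTOptimizer
from optimum.onnxruntime.configuration import AutoOptimizationConfig

# Export the model to ONNX (the "onnx" variant).
model = ORTModelForSequenceClassification.from_pretrained(
    "SamLowe/roberta-base-go_emotions", export=True
)
model.save_pretrained("roberta-go_emotions-onnx")  # illustrative output directory

# Apply O4 graph optimization with FP16 conversion (the "onnx-fp16" variant).
# O4 includes FP16 and is intended for GPU execution.
optimizer = ORTOptimizer.from_pretrained(model)
optimizer.optimize(
    save_dir="roberta-go_emotions-onnx-fp16",  # illustrative output directory
    optimization_config=AutoOptimizationConfig.O4(),
)
```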
The dataset consists of 10K randomly sampled Reddit comments from 12/2005-03/2023, from the Pushshift data dumps. It excludes comments with empty, [deleted] or [removed] content. Two options are provided:
- normal: as described above.
- filtered: contains only comments with >200 characters.
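As an illustration, the filtered variant corresponds to a simple length filter on the comment text. A pandas sketch, assuming the comments sit in a CSV with a body column (file and column names are hypothetical, the repository's actual dataset files may differ):

```python
import pandas as pd

# Hypothetical file/column names.
df = pd.read_csv("comments.csv")

# Drop empty, [deleted] and [removed] comments (as in the normal dataset) ...
df = df[~df["body"].isin(["", "[deleted]", "[removed]"])]

# ... and keep only comments longer than 200 characters for the filtered dataset.
filtered = df[df["body"].str.len() > 200]
```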
To run the benchmarks, use the run_benchmark.py script:
python run_benchmark.py --model [torch, onnx or onnx-fp16] --device [gpu or cpu]
Arguments:
- model (required): Model backend to use: "torch" for PyTorch, "onnx" for ONNX Runtime, or "onnx-fp16" for the FP16-optimized ONNX model.
- device (required): Device type to use, either "gpu" or "cpu".
- dataset: Dataset variant to use, either "normal" or "filtered" (default: "normal").
- gpu: ID of the GPU to use (default: 0).
- batches: Comma-separated batch sizes to run (default: "1,2,4,8,16,32").
- threads: Number of CPU threads to use (default: 1).
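For example, to benchmark the FP16 ONNX model on GPU 1 with the filtered dataset:
python run_benchmark.py --model onnx-fp16 --device gpu --dataset filtered --gpu 1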
The script outputs the number of messages processed per second for each batch size.
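For orientation, the throughput measurement is conceptually equivalent to the sketch below (this is not the actual code of run_benchmark.py; directory names, batch size, and data are illustrative):

```python
import time

from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForSequenceClassification

# Illustrative: load an exported ONNX model on the GPU via ONNX Runtime's CUDA provider.
model_dir = "roberta-go_emotions-onnx-fp16"  # hypothetical local export directory
tokenizer = AutoTokenizer.from_pretrained("SamLowe/roberta-base-go_emotions")
model = ORTModelForSequenceClassification.from_pretrained(
    model_dir, provider="CUDAExecutionProvider"
)
classifier = pipeline(
    "text-classification", model=model, tokenizer=tokenizer,
    batch_size=8, truncation=True,
)

messages = ["placeholder comment"] * 1000  # stand-in for the Reddit comments
start = time.perf_counter()
classifier(messages)
elapsed = time.perf_counter() - start
print(f"{len(messages) / elapsed:.2f} messages/s")
```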