OpenCL Multi‑Device Memory Bandwidth Analyzer is a C++ benchmarking tool that measures memory bandwidth performance across all OpenCL devices available on a system.
The program automatically detects all OpenCL platforms and devices (GPU, CPU, accelerators) and performs several tests to evaluate:
• Host → Device memory bandwidth
• Device → Host memory bandwidth
• Kernel global memory throughput
This allows developers and researchers to quickly identify which compute device provides the best OpenCL memory performance.
The project is lightweight, dependency‑minimal, and designed for reproducible benchmarking.
Below is example output from a real run on a laptop GPU system.
```text
====================================================================================================
 PLATFORM #0
====================================================================================================
  Name          : NVIDIA CUDA
  Vendor        : NVIDIA Corporation
  Version       : OpenCL 3.0 CUDA 13.1.86
  Devices       : 1

  DEVICE #0
    Name          : NVIDIA GeForce RTX 4070 Laptop GPU
    Type          : GPU
    Version       : OpenCL 3.0 CUDA
    Driver        : 591.44
    Compute Units : 36
    Global Mem    : 8187 MB
    Write BW      : 11.39 GB/s
    Read BW       : 12.27 GB/s
    Kernel BW     : 13343.97 GB/s
    Status        : PASS
```
The program evaluates multiple devices and reports the measured bandwidth and status.
OpenCL (Open Computing Language) is an open standard for parallel computing across heterogeneous hardware.
OpenCL allows programs to run compute workloads on:
• GPUs
• CPUs
• integrated GPUs
• FPGAs
• accelerators
OpenCL separates programs into two parts: host code and device code.

**Host code** runs on the CPU and is responsible for:
• discovering OpenCL platforms and devices
• allocating memory buffers
• compiling kernels
• launching compute kernels

**Device code** (kernels) runs on the compute device (GPU / CPU) and performs massively parallel operations.
OpenCL is used in many real‑world applications and frameworks.
Examples include:
| Software | Use Case |
|---|---|
| Blender | GPU rendering |
| DaVinci Resolve | Video processing |
| Darktable | Photo processing |
| OpenCV | Image processing |
| Intel oneAPI | Heterogeneous computing |
| AMD ROCm | GPU compute |
| Scientific HPC tools | Simulations |
This tool performs three different measurements.

**Host → Device bandwidth** measures transfer speed from CPU to GPU, implemented with `clEnqueueWriteBuffer`.

**Device → Host bandwidth** measures transfer speed from GPU to CPU, implemented with `clEnqueueReadBuffer`.
**Kernel global memory bandwidth** uses a custom OpenCL kernel that repeatedly reads and writes global memory, simulating heavy GPU memory traffic.

Example kernel:

```c
__kernel void memory_copy_test(__global const uchar* src,
                               __global uchar* dst,
                               const uint iterations)
{
    /* Illustrative body: each work-item copies its byte repeatedly. */
    size_t gid = get_global_id(0);
    for (uint i = 0; i < iterations; ++i)
        dst[gid] = src[gid];
}
```
The project intentionally uses minimal dependencies.
Main API used: `CL/cl.h`
Used for:
• platform enumeration
• device discovery
• memory allocation
• kernel compilation
• kernel execution
| Library | Purpose |
|---|---|
| `iostream` | console output |
| `vector` | data containers |
| `string` | device information |
| `algorithm` | sorting results |
| `numeric` | averaging |
| `chrono` | performance timing |
| `iomanip` | formatted printing |
```text
opencl-multidevice-bandwidth-analyzer
│
├── src
│   └── main.cpp
│
├── doc
│   └── image1.png
│
├── include
│   └── CL
│       └── cl.h
│
├── lib
│   └── OpenCL.lib
│
├── README.md
├── LICENSE
└── .gitignore
```
`src/main.cpp` contains the C++ benchmark implementation.

Main responsibilities:
• OpenCL platform discovery
• device enumeration
• memory transfer benchmarks
• kernel execution
• device ranking

`doc/` contains documentation assets such as screenshots used in the README.
Install OpenCL drivers appropriate for your hardware:

• NVIDIA: install the latest GPU driver (https://developer.nvidia.com/opencl)
• Intel: install the Intel oneAPI Base Toolkit
• AMD: install ROCm or the AMD GPU drivers
Clone the repository:

```shell
git clone https://github.com/YOUR_USERNAME/opencl-multidevice-bandwidth-analyzer.git
cd opencl-multidevice-bandwidth-analyzer
```

Build on Linux:

```shell
g++ src/main.cpp -O2 -lOpenCL -o bandwidth_analyzer
```

Build on Windows (MSVC):

```shell
cl src\main.cpp OpenCL.lib
```

Run:

```shell
./bandwidth_analyzer
```

or on Windows:

```shell
bandwidth_analyzer.exe
```
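As an alternative to the one-line compiler invocations, a minimal CMake build could look like the following. This is a sketch under the assumption that the repository layout matches the tree above; the project does not ship a `CMakeLists.txt`, so all names here are suggestions:

```cmake
cmake_minimum_required(VERSION 3.10)
project(bandwidth_analyzer CXX)

# Locate the OpenCL headers and library via CMake's built-in FindOpenCL module.
find_package(OpenCL REQUIRED)

add_executable(bandwidth_analyzer src/main.cpp)
target_link_libraries(bandwidth_analyzer PRIVATE OpenCL::OpenCL)
```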
The program will automatically detect all OpenCL devices and run the benchmark.
Current limitations:
• Only global memory bandwidth is tested
• No local/shared memory benchmarks
• No compute FLOPS test
• No multi‑GPU concurrent benchmarking
• Results may vary due to PCIe bandwidth or driver differences
Possible future extensions:
• GPU compute FLOPS benchmark
• shared/local memory benchmark
• OpenCL event profiling
• CSV export of results
• graphical charts for comparison
• CUDA vs OpenCL comparison mode
• multi‑GPU parallel testing
Sayed Ahmadreza Razian, PhD

• LinkedIn: https://www.linkedin.com/in/ahmadrezarazian/
• Google Scholar: https://scholar.google.com/citations?user=Dh9Iy2YAAAAJ
• Email: AhmadrezaRazian@gmail.com

Feel free to contact me for collaboration or questions.
