This repository provides the compilation scripts for Neko (v0.8.0-rc1) and compares its execution efficiency and performance on AMD MI210 and NVIDIA H100 GPUs.
The experiment focuses on the tgv_Re1600 test case. The AMD platform used MI210 GPUs, while the NVIDIA platform (Nano5) used H100 GPUs.
The test cases used in this repository are derived from the benchmark problem provided in the ISC24 Student Cluster Competition (SCC).
Reference:
ISC High Performance 2024 SCC – Neko Benchmark
https://hpcadvisorycouncil.atlassian.net/wiki/spaces/HPCWORKS/pages/3101687809/Getting+started+with+Neko+for+ISC24+SCC+In-Person
The datasets are used here solely for benchmarking and performance evaluation. All original materials belong to the ISC High Performance and HPC Advisory Council.
| Item | MI210 | H100 |
|---|---|---|
| CPU | AMD EPYC 9654 | Intel Xeon 8480CL |
| Autotune choice | 2 (KSTEP) | 1 (1D) |
| Timestep 1 step time | 26.50 s | 12.10 s |
| Average step time (step 200000) | 0.1684 s/step | 0.06732 s/step |
| Total elapsed time (step 200000) | 36,978.70 s | 14,824.39 s |
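
The average step time and the total elapsed time in the table can be cross-checked with simple arithmetic. The sketch below assumes that "step 200000" means 200,000 timesteps were executed and that the average is taken over all of them; the README does not state how either figure was measured, so the gap it reports is only an indication of time spent outside the averaged steps (for example initialization or I/O). The names `STEPS`, `runs`, and `unaccounted_time` are illustrative and not part of Neko.

```python
# Values copied from the table above.
STEPS = 200_000  # assumption: number of timesteps covered by the averages

runs = {
    "MI210": {"avg_step_s": 0.1684, "total_s": 36_978.70},
    "H100": {"avg_step_s": 0.06732, "total_s": 14_824.39},
}


def unaccounted_time(avg_step_s, total_s, steps=STEPS):
    """Total elapsed time minus (average step time x number of steps)."""
    return total_s - avg_step_s * steps


for name, r in runs.items():
    gap = unaccounted_time(r["avg_step_s"], r["total_s"])
    share = gap / r["total_s"]
    print(f"{name}: {gap:,.2f} s not covered by average x steps ({share:.1%} of total)")
```

Under these assumptions, roughly 9% of the total elapsed time on each platform is not explained by the average step time alone.
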
The profiling results show that the execution efficiency is mainly limited by CPU–GPU synchronization overhead rather than kernel execution.
- `cudaStreamSynchronize` dominates the CUDA API time, accounting for 82.3%, while `cudaEventSynchronize` contributes 14.6%. In contrast, `cudaLaunchKernel` represents only 1.7%, indicating that the CPU spends most of the time waiting for GPU completion instead of launching kernels.
- The workload consists of many short GPU kernels executed repeatedly. The most time-consuming kernels include `scatter_kernel` (13.1%), `ax_helm_kernel` (8.7%), and `dudxyz_kernel` (7.4%), which correspond to core numerical operators in the Neko spectral element solver.
- Despite the synchronization overhead, GPU utilization remained above 50% throughout the profiling window, indicating that the application is primarily GPU-bound.
- The H100 platform achieved significantly better performance than the MI210 in the tgv_Re1600 test case. The average step time was 0.06732 s/step on H100 compared to 0.1684 s/step on MI210, which makes the H100 roughly 2.5× faster per step.
- Profiling results suggest that the solver launches a large number of small kernels, which leads to frequent CPU–GPU synchronization. Techniques such as CUDA Graphs, kernel fusion, or reducing synchronization frequency could further improve performance.
- Overall, while both platforms perform well for the Neko solver, the H100 benefits from higher memory bandwidth and a newer GPU architecture, resulting in significantly better execution efficiency for this CFD workload.
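
The headline figures quoted above can be reproduced from the reported measurements. The following is a minimal sketch that recomputes the H100-over-MI210 speedup (from both the average step time and the total elapsed time) and the combined share of CUDA API time spent in the two synchronization calls. All numbers are copied from this README; the variable names are illustrative only.

```python
# Reported measurements, copied from this README.
avg_step_s = {"MI210": 0.1684, "H100": 0.06732}
total_s = {"MI210": 36_978.70, "H100": 14_824.39}

# Share of CUDA API time per call on the H100 run, in percent.
api_share_pct = {
    "cudaStreamSynchronize": 82.3,
    "cudaEventSynchronize": 14.6,
    "cudaLaunchKernel": 1.7,
}

# Speedup of the H100 relative to the MI210 (larger means H100 is faster).
step_speedup = avg_step_s["MI210"] / avg_step_s["H100"]
total_speedup = total_s["MI210"] / total_s["H100"]

# Combined share of API time spent waiting on the GPU.
sync_share = (
    api_share_pct["cudaStreamSynchronize"] + api_share_pct["cudaEventSynchronize"]
)

print(f"Speedup from average step time:  {step_speedup:.2f}x")
print(f"Speedup from total elapsed time: {total_speedup:.2f}x")
print(f"Synchronization share of CUDA API time: {sync_share:.1f}%")
```

Both speedup estimates come out close to 2.5×, and the two synchronization calls together account for about 96.9% of the CUDA API time.
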