NTHU-SC/AMD_Neko

Neko Performance Evaluation

This repository provides the compilation scripts for Neko (v0.8.0-rc1), comparing execution efficiency and performance between AMD MI210 and NVIDIA H100.

Performance Comparison

The experiment focuses on the tgv_Re1600 test case. The AMD platform used MI210 GPUs, while the NVIDIA platform (Nano5) used H100 GPUs.

Testcase Source

The testcases used in this repository are derived from the benchmark problem provided in the ISC24 Student Cluster Competition (SCC).

Reference: ISC High Performance 2024 SCC – Neko Benchmark
https://hpcadvisorycouncil.atlassian.net/wiki/spaces/HPCWORKS/pages/3101687809/Getting+started+with+Neko+for+ISC24+SCC+In-Person

The datasets are used here solely for benchmarking and performance evaluation. All original materials belong to the ISC High Performance and HPC Advisory Council.

Results

| Item | MI210 | H100 |
|---|---|---|
| CPU | AMD EPYC 9654 | Intel Xeon 8480CL |
| Autotune choice | 2 (KSTEP) | 1 (1D) |
| Step time at timestep 1 | 26.50 s | 12.10 s |
| Average step time (step 200000) | 0.1684 s/step | 0.06732 s/step |
| Total elapsed time (step 200000) | 36,978.70 s | 14,824.39 s |

Profile Results

The profiling results show that the execution efficiency is mainly limited by CPU–GPU synchronization overhead rather than kernel execution.

  • cudaStreamSynchronize dominates the CUDA API time, accounting for 82.3%, while cudaEventSynchronize contributes 14.6%. In contrast, cudaLaunchKernel represents only 1.7%, indicating that the CPU spends most of the time waiting for GPU completion instead of launching kernels.
  • The workload consists of many short GPU kernels executed repeatedly. The most time-consuming kernels include scatter_kernel (13.1%), ax_helm_kernel (8.7%), and dudxyz_kernel (7.4%), which correspond to core numerical operators in the Neko spectral element solver.
  • Despite the synchronization overhead, GPU utilization remained above 50% throughout the profiling window, indicating that the application is primarily GPU-bound.
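The API-time shares quoted above can be combined to show how much of the CUDA API time goes to synchronization as a whole. The following is a minimal sketch that only restates the percentages from this section; the dictionary and the helper name `sync_share` are illustrative and not part of Neko or any profiler tooling.

```python
# Shares of total CUDA API time reported in the profile above (percent).
api_time_share = {
    "cudaStreamSynchronize": 82.3,
    "cudaEventSynchronize": 14.6,
    "cudaLaunchKernel": 1.7,
}


def sync_share(shares):
    """Sum the shares of all API calls whose name ends with 'Synchronize'."""
    return sum(pct for name, pct in shares.items() if name.endswith("Synchronize"))


sync_total = sync_share(api_time_share)        # both synchronization calls combined
launch = api_time_share["cudaLaunchKernel"]    # kernel launch share
other = 100.0 - sync_total - launch            # everything not listed above

print(f"synchronization: {sync_total:.1f}% of CUDA API time")
print(f"kernel launch:   {launch:.1f}% of CUDA API time")
print(f"other API calls: {other:.1f}% of CUDA API time")
print(f"sync / launch ratio: {sync_total / launch:.0f}x")
```

Note that these are shares of CUDA API time on the host, not of GPU kernel time, so they describe where the CPU thread waits rather than how busy the GPU is.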

Discussion and Conclusion

  • The H100 platform achieved significantly better performance than the MI210 in the tgv_Re1600 testcase. The average step time was 0.06732 s/step on H100 compared to 0.1684 s/step on MI210, a speedup of roughly 2.5×.
  • Profiling results suggest that the solver launches a large number of small kernels, which leads to frequent CPU–GPU synchronization. Techniques such as CUDA Graphs, kernel fusion, or reducing synchronization frequency could further improve performance.
  • Overall, while both platforms perform well for the Neko solver, the H100 likely benefits from its higher memory bandwidth and newer GPU architecture, resulting in significantly better execution efficiency for this CFD workload.
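The roughly 2.5× figure can be checked directly from the numbers in the Results table. The sketch below recomputes the MI210-to-H100 ratio for each timing metric; the dictionary layout and the `speedup` helper are illustrative names, and the values are copied from the table.

```python
# Timing values copied from the Results table (seconds).
results = {
    "MI210": {"first_step_s": 26.50, "avg_step_s": 0.1684, "total_s": 36978.70},
    "H100": {"first_step_s": 12.10, "avg_step_s": 0.06732, "total_s": 14824.39},
}


def speedup(metric):
    """Return how many times faster the H100 is than the MI210 for a metric."""
    return results["MI210"][metric] / results["H100"][metric]


for metric in ("first_step_s", "avg_step_s", "total_s"):
    print(f"{metric}: H100 is {speedup(metric):.2f}x faster than MI210")
```

The average step time and the total elapsed time both give a ratio close to 2.5, while the first timestep gives a smaller ratio of about 2.2.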

About

Build and run Neko (a portable framework for high-order spectral element flow simulations) on the AMD MI210
