Neko Performance Evaluation

This repository provides compilation scripts for Neko (v0.8.0-rc1) and compares its execution efficiency and performance on AMD MI210 and NVIDIA H100 GPUs.

Performance Comparison

The experiment focuses on the tgv_Re1600 test case. The AMD platform used MI210 GPUs, while the NVIDIA platform (Nano5) used H100 GPUs.

Testcase Source

The testcases used in this repository are derived from the benchmark problem provided in the ISC24 Student Cluster Competition (SCC).

Reference: ISC High Performance 2024 SCC – Neko Benchmark
https://hpcadvisorycouncil.atlassian.net/wiki/spaces/HPCWORKS/pages/3101687809/Getting+started+with+Neko+for+ISC24+SCC+In-Person

The datasets are used here solely for benchmarking and performance evaluation. All original materials belong to ISC High Performance and the HPC Advisory Council.

Results

Item	MI210	H100
CPU	AMD EPYC 9654	Intel Xeon 8480CL
Autotune choice	`2 (KSTEP)`	`1 (1D)`
Step time at timestep 1	26.50 s	12.10 s
Average step time (step 200000)	0.1684 s/step	0.06732 s/step
Total elapsed time (step 200000)	36,978.70 s	14,824.39 s

Profile Results

The profiling results suggest that execution efficiency is limited mainly by CPU–GPU synchronization overhead rather than by kernel execution.

cudaStreamSynchronize dominates the CUDA API time, accounting for 82.3%, while cudaEventSynchronize contributes 14.6%. In contrast, cudaLaunchKernel represents only 1.7%, indicating that the CPU spends most of the time waiting for GPU completion instead of launching kernels.
The workload consists of many short GPU kernels executed repeatedly. The most time-consuming kernels include scatter_kernel (13.1%), ax_helm_kernel (8.7%), and dudxyz_kernel (7.4%), which correspond to core numerical operators in the Neko spectral element solver.
Despite the synchronization overhead, GPU utilization remained above 50% throughout the profiling window, which suggests that the GPU stays reasonably busy and that the application remains largely GPU-bound.
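
The shares quoted above can be combined with a few lines of arithmetic. The sketch below is illustrative only: the percentages are copied from this section, the variable names are made up for the example, and it assumes the kernel percentages are shares of total GPU kernel time, which the profile summary does not state explicitly.

```python
# Shares of CUDA API time reported in this section (percent).
cuda_api_share = {
    "cudaStreamSynchronize": 82.3,
    "cudaEventSynchronize": 14.6,
    "cudaLaunchKernel": 1.7,
}

# Shares of the three most time-consuming kernels (percent).
# Assumption: these are fractions of total GPU kernel time.
kernel_share = {
    "scatter_kernel": 13.1,
    "ax_helm_kernel": 8.7,
    "dudxyz_kernel": 7.4,
}

# Combined share of the two synchronization calls.
sync_share = round(
    sum(v for name, v in cuda_api_share.items() if name.endswith("Synchronize")), 1
)

# Combined share of the three listed kernels.
top3_kernel_share = round(sum(kernel_share.values()), 1)

print(f"Synchronization calls: {sync_share}% of CUDA API time")
print(f"Top three kernels:     {top3_kernel_share}% of kernel time")
```

Under that assumption, the three listed kernels together cover well under half of the kernel time, which is consistent with a workload spread across many short kernels rather than dominated by a single one.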

Discussion and Conclusion

The H100 platform achieved significantly better performance than the MI210 on the tgv_Re1600 test case. The average step time was 0.06732 s/step on H100 compared to 0.1684 s/step on MI210, a speedup of roughly 2.5×.
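
As a quick check of that ratio, the following sketch recomputes the speedup from the values in the Results table, once from the average step time and once from the total elapsed time. The dictionary and function names are illustrative and not part of the repository.

```python
# Values copied from the Results table (seconds).
avg_step_time_s = {"MI210": 0.1684, "H100": 0.06732}
total_elapsed_s = {"MI210": 36978.70, "H100": 14824.39}


def speedup(slow: float, fast: float) -> float:
    """Return how many times faster the 'fast' measurement is."""
    return slow / fast


step_speedup = speedup(avg_step_time_s["MI210"], avg_step_time_s["H100"])
total_speedup = speedup(total_elapsed_s["MI210"], total_elapsed_s["H100"])

print(f"Speedup from average step time:  {step_speedup:.2f}x")
print(f"Speedup from total elapsed time: {total_speedup:.2f}x")
```

Both ratios come out close to 2.5, so the conclusion does not depend on which of the two timing rows is used.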
Profiling results suggest that the solver launches a large number of small kernels, which leads to frequent CPU–GPU synchronization. Techniques such as CUDA Graphs, kernel fusion, or reducing synchronization frequency could further improve performance.
Overall, while both platforms perform well for the Neko solver, the H100 benefits from higher memory bandwidth and newer GPU architecture, resulting in significantly better execution efficiency for this CFD workload.
