DSEAdns is a direct numerical simulation (DNS) implementation within the DSEA (Data Streaming for Explicit Algorithms) framework. It solves the three-dimensional incompressible Navier–Stokes equations using a fourth-order central differencing scheme for spatial discretization and a third-order Runge-Kutta method for temporal integration.
The simulation setup follows the Taylor-Green vortex example presented by Jacobs et al. [1], serving as a reference case for accuracy and performance evaluation.
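To illustrate what a fourth-order central differencing scheme computes, here is a minimal host-side C++ sketch of the standard five-point first-derivative stencil on a periodic 1D grid. It is not taken from the DSEAdns kernels: the function names `central_diff4_periodic` and `max_error_sin` are hypothetical, and the 1D periodic setup is an assumption chosen to keep the example short.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Fourth-order central approximation of the first derivative on a
// periodic, uniformly spaced 1D grid (illustrative, not DSEAdns code):
//   f'(x_i) ~ (f[i-2] - 8 f[i-1] + 8 f[i+1] - f[i+2]) / (12 h)
std::vector<double> central_diff4_periodic(const std::vector<double>& f, double h) {
    const std::size_t n = f.size();
    assert(n >= 5 && h > 0.0);
    std::vector<double> df(n);
    for (std::size_t i = 0; i < n; ++i) {
        // Periodic wrap-around for the two neighbours on each side.
        const double fm2 = f[(i + n - 2) % n];
        const double fm1 = f[(i + n - 1) % n];
        const double fp1 = f[(i + 1) % n];
        const double fp2 = f[(i + 2) % n];
        df[i] = (fm2 - 8.0 * fm1 + 8.0 * fp1 - fp2) / (12.0 * h);
    }
    return df;
}

// Maximum error of the stencil for f(x) = sin(x) sampled at n points on [0, 2*pi).
double max_error_sin(std::size_t n) {
    const double two_pi = 2.0 * std::acos(-1.0);
    const double h = two_pi / static_cast<double>(n);
    std::vector<double> f(n);
    for (std::size_t i = 0; i < n; ++i) {
        f[i] = std::sin(static_cast<double>(i) * h);
    }
    const std::vector<double> df = central_diff4_periodic(f, h);
    double err = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        err = std::max(err, std::fabs(df[i] - std::cos(static_cast<double>(i) * h)));
    }
    return err;
}
```

Halving the grid spacing should reduce the error by roughly a factor of 16, which is what fourth-order accuracy means; comparing `max_error_sin(32)` with `max_error_sin(64)` is a quick way to check this.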
To build and run DSEAdns, the following tools are required:
- A C++ compiler with C++17 support
- CUDA 12.0 or newer
- An MPI library with MPI 2.0 support
Optional (for multi-rail communication across compute nodes):
- UCX (Unified Communication X) version 1.17 or newer (both headers and libraries)
- Adjust paths in the `Makefile` to match your system’s CUDA, MPI, and (optionally) UCX installation.
- Configure the simulation case by editing `run_case.sh`:
  - Select a predefined problem size
  - Choose the kernel file to be used
  - Set the number of workers per GPU
  - Define the number of communication rails
  - Specify the number of supercycles to compute
- Execute the `run_case.sh` script to start the simulation.
To enable simulation output, set the DOUTPUT macro to a positive integer. This value defines the interval (in supercycles) between consecutive outputs.
To change the output directory, modify the path in the `DS::write_vtr` function in the selected kernel file.
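The output interval can be pictured with a small C++ sketch. It rests on assumptions not confirmed by the source: that supercycles are counted from 1, that output is written whenever the supercycle index is a multiple of the interval, and that a non-positive interval disables output. The helper names `should_write_output` and `output_supercycles` are hypothetical and do not appear in DSEAdns.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: decides whether output is written in a given supercycle.
// Assumption: a non-positive interval means output is disabled.
bool should_write_output(int supercycle, int interval) {
    if (interval <= 0) {
        return false;
    }
    return supercycle % interval == 0;
}

// Lists the supercycles (assumed to be counted 1..total) that would produce output.
std::vector<int> output_supercycles(int total, int interval) {
    std::vector<int> result;
    for (int s = 1; s <= total; ++s) {
        if (should_write_output(s, interval)) {
            result.push_back(s);
        }
    }
    return result;
}
```

Under these assumptions, a run of 12 supercycles with an interval of 5 would write output in supercycles 5 and 10.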
Each kernel file corresponds to a specific optimization cycle. Below is an overview of the implemented cycles:
- `dsea_kernel_cycle00_base.cu`: Baseline implementation based on Jacobs et al. [1]. Uses fixed kernel configurations with a 1D block size of 128 threads.
- `dsea_kernel_cycle01_0_fusing.cu`: Introduces kernel fusion, transitioning from task-specific kernels to data-centric kernels. Reduces the number of kernels to five and adds support for configurable thread block sizes.
- `dsea_kernel_cycle01_1_fusing+temporal_derivative.cu`: Moves the temporal derivative calculation out of `dns_Res_StageAdvance`, reducing overall memory traffic.
- `dsea_kernel_cycle02_rhoETp_optimized.cu`: Splits `dns_rhoETpdxyz` into two kernels to improve memory coalescing. Uses shared memory and optimized floating-point division to reduce register pressure.
- `dsea_kernel_cycle03_Res_StageAdvance_optimized.cu`: Further reduces memory traffic by distributing the temporal derivative computation across other kernels, saving one global store/load per result.
- `dsea_kernel_cycle_04_0_shared.cu`: Leverages shared memory in additional kernels to eliminate redundant memory accesses.
- `dsea_kernel_cycle_04_1_shared_schedule.cu`: Leverages instruction-level scheduling, overlapping shared memory loads with computations to hide memory latency.
- `dsea_kernel_cycle04_2_scaling.cu`: Final version with hardcoded, empirically optimized thread block configurations for maximum performance.
The final implementation achieves a 3× speedup compared to the baseline.
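The memory-traffic argument behind the fusion steps can be sketched on the host with plain C++ loops. This is an illustration of the general idea only, not the DSEAdns kernels: the function names, the pointwise right-hand side `-k * u`, and the single explicit update step are all assumptions made to keep the example small.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Unfused variant: the residual is written to an intermediate array in one
// pass and read back in a second pass (one extra store and load per value).
std::vector<double> advance_unfused(const std::vector<double>& u, double k, double dt) {
    std::vector<double> residual(u.size());
    for (std::size_t i = 0; i < u.size(); ++i) {
        residual[i] = -k * u[i];  // pass 1: compute and store the residual
    }
    std::vector<double> u_new(u.size());
    for (std::size_t i = 0; i < u.size(); ++i) {
        u_new[i] = u[i] + dt * residual[i];  // pass 2: reload it and update
    }
    return u_new;
}

// Fused variant: the residual stays in a local variable, so the intermediate
// array and its store/load round trip disappear.
std::vector<double> advance_fused(const std::vector<double>& u, double k, double dt) {
    std::vector<double> u_new(u.size());
    for (std::size_t i = 0; i < u.size(); ++i) {
        const double residual = -k * u[i];
        u_new[i] = u[i] + dt * residual;
    }
    return u_new;
}
```

Both variants compute the same update; the fused one simply touches memory less often. On a GPU the intermediate array would live in global memory, which is why removing such round trips tends to matter for bandwidth-bound kernels.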
[1] Christian T. Jacobs, Satya P. Jammy, Neil D. Sandham, OpenSBLI: A framework for the automated derivation and parallel execution of finite difference solvers on a range of computer architectures, Journal of Computational Science, Volume 18, 2017, Pages 12–23. https://doi.org/10.1016/j.jocs.2016.11.001
If you use this work in academic or scientific contexts, please cite:
M. Rose, S. Homes, L. Ramsperger, J. Gracia, C. Niethammer, and J. Vrabec. Cyclic Data Streaming on GPUs for Short Range Stencils Applied to Molecular Dynamics. HeteroPar 2025, accepted.