From 1d3a6b295e449f136d222a7ecfa27a5ff9bd5eff Mon Sep 17 00:00:00 2001
From: Premas
Date: Fri, 26 Jul 2024 12:36:57 -0500
Subject: [PATCH 01/15] intro section parallelization

---
 docs/cheaha/tutorial/index.md       |  7 +++
 docs/cheaha/tutorial/parallelism.md | 73 +++++++++++++++++++++++++++++
 mkdocs.yml                          |  1 +
 3 files changed, 81 insertions(+)
 create mode 100644 docs/cheaha/tutorial/parallelism.md

diff --git a/docs/cheaha/tutorial/index.md b/docs/cheaha/tutorial/index.md
index 4dfff74ba..c934828ea 100644
--- a/docs/cheaha/tutorial/index.md
+++ b/docs/cheaha/tutorial/index.md
@@ -7,3 +7,10 @@ Have you encountered problems while using Anaconda on Cheaha? We have provided t
 Below is a list of Tutorials we currently have Using Anaconda on Cheaha;
 
 1. Using PyTorch and TensorFlow with Anaconda on Cheaha, click [here.](../tutorial/pytorch_tensorflow.md)
+
+## Getting Started with Parallelism on Cheaha
+
+This tutorial covers usage of the following parallelization-based software installed on Cheaha. To use parallelization-based software on Cheaha, [click here](./parallelism.md).
+
+1. [Gromacs](https://www.gromacs.org/): Used for high-performance molecular dynamics and analysis.
+2. [Quantum Espresso](https://www.quantum-espresso.org/): Used for electronic-structure calculations and materials modeling.

diff --git a/docs/cheaha/tutorial/parallelism.md b/docs/cheaha/tutorial/parallelism.md
new file mode 100644
index 000000000..fe0507c34
--- /dev/null
+++ b/docs/cheaha/tutorial/parallelism.md
@@ -0,0 +1,73 @@
# Tutorial on Usage of Parallelization-based Software on Cheaha

## Sequential Execution

Sequential execution breaks a larger problem into a set of tasks and executes those tasks one by one on a single CPU.

## Parallel Execution

Parallel execution divides a larger problem into a series of smaller tasks and executes those tasks on multiple CPUs at the same time. The application must leverage a parallelization technology to use multicore processors, multiple nodes, multiple GPUs, or a hybrid approach (such as combining CPUs and GPUs).

## Parallelization Categories

### Shared Memory Parallelization

In this method, your job executes independent tasks on separate cores within the same compute node. These tasks share the node's resources and communicate by reading from and writing to shared memory.

For instance, OpenMP is a directive-based programming model that supports shared memory parallelization.

### Distributed Memory Parallelization

Tasks can be distributed across different compute nodes and executed there. The tasks communicate with each other using message passing. A widely used standard for achieving this kind of parallelism is the Message Passing Interface (MPI).

## Software Packages that Support Parallelization

### Gromacs

Gromacs is designed for high-performance molecular dynamics simulation and analysis.

The recent Gromacs version available on Cheaha can be loaded as,

```bash
$module load rc/GROMACS/2022.3-gpu
```

Some of the simulation examples are covered in the [Gromacs Tutorial Page](http://www.mdtutorials.com/gmx/). We will take the example of Lysozyme in water as a case study.
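
If you prefer to run the case study non-interactively, the steps below can also be wrapped in a Slurm batch script. The following is only a minimal sketch: the job name, partition, GPU count, wall time, and memory values are illustrative assumptions and should be adjusted to your own allocation on Cheaha.

```bash
#!/bin/bash
#SBATCH --job-name=gromacs-water   # illustrative job name
#SBATCH --nodes=1                  # thread-MPI runs within a single node
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=40         # 4 thread-MPI ranks x 10 OpenMP threads
#SBATCH --gres=gpu:1               # one GPU for the -nb gpu offload
#SBATCH --partition=pascalnodes    # assumed GPU partition; adjust as needed
#SBATCH --time=02:00:00            # assumed wall time
#SBATCH --mem=40G                  # assumed memory request

# Load the same GROMACS module used in this tutorial
module load rc/GROMACS/2022.3-gpu

# Preprocess and run, mirroring the commands shown in the steps below
cd ./water-cut1.0_GMX50_bare/1536
gmx grompp -f pme.mdp
gmx mdrun -ntmpi 4 -nb gpu -pin on -nsteps 5000 -ntomp 10 -s topol.tpr
```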

Download and extract the example `pdb` dataset using the following commands,

```bash
$DATA_SET=water_GMX50_bare
$wget -c https://ftp.gromacs.org/pub/benchmarks/${DATA_SET}.tar.gz
$tar xf ${DATA_SET}.tar.gz
```

To run the simulation using the `gmx` executable,

```bash
$cd ./water-cut1.0_GMX50_bare/1536
$gmx grompp -f pme.mdp
$gmx mdrun -ntmpi 4 -nb gpu -pin on -nsteps 5000 -ntomp 10 -s topol.tpr
```

In the above run command,

- `gmx` is the executable that runs the pipeline.
- `-ntmpi` sets the number of MPI processes.
- `-nb gpu` runs the nonbonded computations on the GPU.
- `-pin on` pins the threads to CPU cores.
- `-nsteps 5000` sets the number of steps to run the simulation.
- `-ntomp 10` sets the number of OpenMP threads per MPI process, which gives 10 threads * 4 MPI processes, i.e., 40 cores.
- `-s topol.tpr` is the input run file.

#### Performance Analysis

### Quantum Espresso

Quantum Espresso (QE) is an open-source suite of codes for electronic-structure calculations and materials modeling based on density functional theory (DFT), plane waves, and pseudopotentials. It is used to study the properties of materials at the atomic scale.

Quantum Espresso is available as a module on Cheaha and can be loaded as,

```bash
$module load QuantumESPRESSO/6.3-foss-2018b
```

diff --git a/mkdocs.yml b/mkdocs.yml
index cae3896bf..d63a78875 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -116,6 +116,7 @@ nav:
       - Tutorials:
           - cheaha/tutorial/index.md
           - Anaconda Environment Tutorial: cheaha/tutorial/pytorch_tensorflow.md
+          - Parallelism Tutorial: cheaha/tutorial/parallelism.md
       - Cheaha Web Portal:
           - cheaha/open_ondemand/index.md
           - Using the Web Portal: cheaha/open_ondemand/ood_layout.md

From d4270d5d5ec40c105459f0ed1af51792ec38759a Mon Sep 17 00:00:00 2001
From: Premas
Date: Tue, 13 Aug 2024 13:32:10 -0500
Subject: [PATCH 02/15] section heading changed

---
 docs/cheaha/tutorial/parallelism.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/cheaha/tutorial/parallelism.md b/docs/cheaha/tutorial/parallelism.md
index fe0507c34..a8284ea47 100644
--- a/docs/cheaha/tutorial/parallelism.md
+++ b/docs/cheaha/tutorial/parallelism.md
@@ -8,7 +8,7 @@ Breaking a bigger problem into set of tasks, and executing these tasks one by on
 Divide a larger problem into a series of smaller tasks and execute these tasks using multiple CPUs. The application should leverage parallelization technology to utilize multicore processors, multiple nodes, multiple GPUs, or a hybrid approach (such as combining CPUs and GPUs).
 
-## Parallelization Categories
+## Types of Parallelization
 
 ### Shared Memory Parallelization
 
 In this method, your job executes independent tasks on separate cores within the same compute node. These tasks share the node's resources and communicate by reading from and writing to shared memory.
@@ -20,7 +20,7 @@ For instance, OpenMP is a parallel directive that supports shared memory paralle
 Tasks can be distributed on different compute nodes and executed. The tasks communicate with each other using message passing. A widely used standard to achieve this kind of parallelism is the Message Passing Interface (MPI).
 
-## Software Packages that Support Parallelization
+## A Sampling of Parallel Software
 
 ### Gromacs
 

From b73ece431212c804811a69d16ceffd4e37c49376 Mon Sep 17 00:00:00 2001
From: Premas
Date: Wed, 11 Sep 2024 09:51:46 -0500
Subject: [PATCH 03/15] typos

---
 docs/cheaha/tutorial/index.md       | 2 +-
 docs/cheaha/tutorial/parallelism.md | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/cheaha/tutorial/index.md b/docs/cheaha/tutorial/index.md
index c934828ea..cc232c7b5 100644
--- a/docs/cheaha/tutorial/index.md
+++ b/docs/cheaha/tutorial/index.md
@@ -13,4 +13,4 @@ Below is a list of Tutorials we currently have Using Anaconda on Cheaha;
 The tutorial covers usage of the following parallelization-based software installed on Cheaha. To use parallelization-based software on Cheaha, [click here](./parallelism.md).
 
 1. [Gromacs](https://www.gromacs.org/): Used for high-performance molecular dynamics and analysis.
-2. [Quantum Expresso](https://www.quantum-espresso.org/): Used for electronic-structure calculations and materials modeling.
+1. [Quantum Espresso](https://www.quantum-espresso.org/): Used for electronic-structure calculations and materials modeling.
diff --git a/docs/cheaha/tutorial/parallelism.md b/docs/cheaha/tutorial/parallelism.md
index a8284ea47..370406256 100644
--- a/docs/cheaha/tutorial/parallelism.md
+++ b/docs/cheaha/tutorial/parallelism.md
@@ -59,7 +59,7 @@ In the above run command,
 - `-nsteps 5000` sets the number of steps to run the simulation.
 - `-ntomp 10` sets the number of OpenMP threads per MPI process, which gives 10 threads * 4 MPI processes, i.e., 40 cores.
 - `-s topol.tpr` is the input run file.
- 
+
 #### Performance Analysis
 
 ### Quantum Espresso

From 6c0d397e8729cb7d4626a301634d55370b38c9cc Mon Sep 17 00:00:00 2001
From: Premas
Date: Wed, 11 Sep 2024 09:57:28 -0500
Subject: [PATCH 04/15] add index

---
 mkdocs.yml | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mkdocs.yml b/mkdocs.yml
index 5563138b7..7d629e7e1 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -118,6 +118,7 @@ nav:
       - Tutorials:
          - cheaha/tutorial/index.md
          - Anaconda Environment Tutorial: cheaha/tutorial/pytorch_tensorflow.md
+         - Parallelism Tutorial: cheaha/tutorial/parallelism.md
      - Cheaha Web Portal:
          - cheaha/open_ondemand/index.md
          - Using the Web Portal: cheaha/open_ondemand/ood_layout.md

From 88664b2e3fa18205a18eb4c7104ed1aa6ed89692 Mon Sep 17 00:00:00 2001
From: Premas
Date: Tue, 1 Oct 2024 11:14:51 -0500
Subject: [PATCH 05/15] inputs and perf

---
 .../software/res/parallelism_gromacs.csv |   8 ++
 docs/cheaha/tutorial/parallelism.md      | 127 ++++++++++++++++--
 2 files changed, 123 insertions(+), 12 deletions(-)
 create mode 100644 docs/cheaha/software/res/parallelism_gromacs.csv

diff --git a/docs/cheaha/software/res/parallelism_gromacs.csv b/docs/cheaha/software/res/parallelism_gromacs.csv
new file mode 100644
index 000000000..eb8e96b29
--- /dev/null
+++ b/docs/cheaha/software/res/parallelism_gromacs.csv
@@ -0,0 +1,8 @@
+"tmpi","omp","Trial 1","Trial 2","Trial 3","Average","Speedup"
+"1","1","5335.744","5339.013","5443.12","5372.625667","1"
+"24","1","327.59","321.911","321.894","323.7983333","16.59250562"
+"12","2","344.966","347.613","346.746","346.4416667","15.50802396"
+"6","4","354.287","354.183","355.017","354.4956667","15.15568785"
+"4","6","370.1","370.199","371.161","370.4866667","14.50153582"
+"2","12","401.381","401.141","400.759","401.0936667","13.39494017"
+"1","24","350.487","349.396","349.96","349.9476667","15.35265464"
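
The `Speedup` column in the CSV above appears to be the average single-core runtime divided by the average runtime of each configuration. As a quick check against the values in the table, for the 24-rank run:

```latex
\mathrm{Speedup}_{(\mathrm{tmpi}=24,\ \mathrm{omp}=1)}
  = \frac{T_{(1,1)}}{T_{(24,1)}}
  \approx \frac{5372.63}{323.80}
  \approx 16.59
```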
diff --git a/docs/cheaha/tutorial/parallelism.md b/docs/cheaha/tutorial/parallelism.md
index 370406256..7dd74d697 100644
--- a/docs/cheaha/tutorial/parallelism.md
+++ b/docs/cheaha/tutorial/parallelism.md
@@ -20,21 +20,35 @@ For instance, OpenMP is a parallel directive that supports shared memory paralle
 Tasks can be distributed on different compute nodes and executed. The tasks communicate with each other using message passing. A widely used standard to achieve this kind of parallelism is the Message Passing Interface (MPI).
 
+### GPU Parallelization
+
+GPU parallelization divides large computational tasks into smaller subtasks that can be executed concurrently on different GPU cores. This results in significant acceleration compared to traditional CPU-based computations. CUDA (Compute Unified Device Architecture) is a widely used parallel computing platform and programming model developed by NVIDIA that enables users to utilize the power of GPUs effectively.
+
+### Hybrid Parallelization
+
+Hybrid parallelization integrates several parallelization techniques or models to make effective use of a heterogeneous environment, such as a mix of CPUs and GPUs. For instance, it can combine shared-memory (OpenMP) and distributed-memory (MPI) parallelization to optimize application execution.
+
 ## A Sampling of Parallel Software
 
 ### Gromacs
 
 Gromacs is specifically meant for high-performance molecular dynamics and analysis.
 
+First, request an interactive session with the resources needed for the run, for example,
+
+```bash
+srun --nodes=1 --ntasks-per-node=24 --mem=120GB --time=10:00:00 --partition=intel-dcb --pty /bin/bash
+```
+
 Recent Gromacs version available in Cheaha can be loaded as,
 
 ```bash
-$module load rc/GROMACS/2022.3-gpu
+$module load GROMACS/2019-fosscuda-2018b
 ```
 
-Some of the simulation examples are covered in the [Gromacs Tutorial Page](http://www.mdtutorials.com/gmx/).We will take the example of Lysozyme in water as a case study.
+#### Study of a Water System - An Example
+
+Some of the simulation examples are covered in the [Gromacs Tutorial Page](http://www.mdtutorials.com/gmx/). We will take the example of a water system as a case study. Let us consider the [Gromacs Molecular Example](http://ftp.gromacs.org/pub/benchmarks/water_GMX50_bare.tar.gz). The simulation studies a water system, for instance to understand water properties and molecular dynamics.
 
-Download and extract the example `pdb` dataset using the following command,
+Download and extract the example `pdb` dataset using the following commands,
 
 ```bash
 $DATA_SET=water_GMX50_bare
@@ -42,25 +56,113 @@ $wget -c https://ftp.gromacs.org/pub/benchmarks/${DATA_SET}.tar.gz
 $tar xf ${DATA_SET}.tar.gz
 ```
 
-To run the simulation using `gmx` executable,
+##### Input Parameters
+
+Let us consider the `1536` subset of the data in the `water-cut1.0_GMX50_bare` directory.
 
 ```bash
-$cd ./water-cut1.0_GMX50_bare/1536
-$gmx grompp -f pme.mdp
-$gmx mdrun -ntmpi 4 -nb gpu -pin on -nsteps 5000 -ntomp 10 -s topol.tpr
+$cd water-cut1.0_GMX50_bare/1536
+$ls
+conf.gro pme.mdp rf.mdp topol.top
+```
+
+- The `.mdp` file is a molecular dynamics parameter (MDP) input file.
+- The `.gro` file is a coordinate file that contains the spatial coordinates of the atoms in the system, used for visualization and structure analysis. It also includes the box dimensions and the number of atoms.
+- The `.top` file is a topology file that describes the molecular system's structure, including the atom types, bonds, angles, force field parameters, etc.
+
+!!! note
+    The `rf.mdp` input file is not used, because this case study focuses only on the PME method.
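
As a quick way to see what the PME parameter file actually sets for this case study, it can be inspected directly. The option names below are standard GROMACS `.mdp` keys; this is only a quick-look sketch, not a listing of the file's actual contents.

```bash
# Show the main run-control and electrostatics settings in the PME parameter file
grep -E "integrator|dt|nsteps|cutoff-scheme|coulombtype" pme.mdp
```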
+
+##### Preprocessing
+
+The initial step is to preprocess the input file `pme.mdp` using the command below,
+
+```bash
+gmx grompp -f pme.mdp
+```
+
+The above command prepares the input for the simulation: it reads the parameters and combines them into a single output file named `topol.tpr`.
+
+- `gmx` is the Gromacs executable.
+- `grompp` stands for "GROMACS PreProcessor". It combines the simulation parameters specified in the `.mdp` file with the other input files (such as the topology file and the coordinate file) to generate a binary input file (`.tpr`) that is used for the actual simulation.
+- `.tpr` is the portable binary run input file.
+
+The output file `topol.tpr` contains the settings and parameters for the simulation, such as the number of steps (nsteps), the number of atoms (natoms), the temperature, and the coulombtype; this example uses the Particle-Mesh Ewald (PME) coulombtype. Because this file is in binary format, it cannot be read with a normal editor. You can read a portable binary run input file using the command below.
+
+```bash
+gmx dump -s topol.tpr > topol.out
+```
+
+##### Execution and Scalability
+
+Executing the simulation reports detailed information about the settings of the molecular dynamics run for the water model, summarized below.
+
+##### Water System Inputs
+
+Total Atoms (natoms): 1,536,000
+Number of Steps: 5,000
+Distribution: Domain Decomposition
+MPI Ranks: 6
+Average Atoms per Domain: 256,000
+Domain Decomposition Grid: 6 x 1 x 1
+
+```bash
+$export OMP_NUM_THREADS=6
+$gmx mdrun -ntmpi 6 -nsteps 5000 -ntomp 4 -s topol.tpr -deffnm md_output.log
+```
 
 In the above run command,
 
 - `gmx` is the executable to run the pipeline.
-- `-ntmpi` Number of MPI processes.
-- `-nb gpu`defines the computation to use GPUs.
-- `-pin on` binds CPU to core.
+- `-ntmpi 6` is the number of MPI processes.
 - `-nsteps 5000` is the number of steps to run the simulation.
--`-ntomp 10` Number of cores, which will be 10*4 MPI processes i.e, 40 cores
+- `-ntomp 4` is the number of OpenMP threads per MPI process, which gives 6 MPI processes * 4 threads, i.e., 24 cores in total.
 - `-s topol.tpr` is the input run file.
 
-#### Performance Analysis
+##### Performance Analysis
+
+{{ read_csv('../software/res/parallelism_gromacs.csv', keep_default_na=False) }}
+
+The majority of the computational time and resources were spent on force calculations and PME mesh operations, indicating that these are the most computationally intensive tasks in this simulation. The system achieved good load balancing, reflected in the distribution of tasks across the available ranks and threads.
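
The tmpi/omp combinations in the table above can be regenerated by sweeping the same run command over the `topol.tpr` produced earlier. The loop below is only a sketch of that idea; the pairs and the output-file prefixes are illustrative and should be adapted to your own experiment.

```bash
# Sweep (tmpi, omp) pairs that keep 24 cores busy; each run writes its own output prefix
for pair in "1 24" "2 12" "4 6" "6 4" "12 2" "24 1"; do
    read -r ntmpi ntomp <<< "${pair}"
    export OMP_NUM_THREADS="${ntomp}"
    gmx mdrun -ntmpi "${ntmpi}" -ntomp "${ntomp}" -nsteps 5000 -s topol.tpr -deffnm "water_${ntmpi}x${ntomp}"
done
```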

From c51984e9a0e475408c17448547533405cf0b654e Mon Sep 17 00:00:00 2001
From: Premas
Date: Fri, 1 Nov 2024 11:03:47 -0500
Subject: [PATCH 06/15] adding gpu results

---
 .../cheaha/software/res/parallelism_gromacs.csv |  8 --------
 .../software/res/parallelism_gromacs_cpu.csv    |  8 ++++++++
 .../software/res/parallelism_gromacs_gpu.csv    |  5 +++++
 .../res/parallelism_gromacs_gpu_hybrid.csv      |  9 +++++++++
 docs/cheaha/tutorial/parallelism.md             | 17 ++++++++++-------
 mkdocs.yml                                      |  2 +-
 6 files changed, 33 insertions(+), 16 deletions(-)
 delete mode 100644 docs/cheaha/software/res/parallelism_gromacs.csv
 create mode 100644 docs/cheaha/software/res/parallelism_gromacs_cpu.csv
 create mode 100644 docs/cheaha/software/res/parallelism_gromacs_gpu.csv
 create mode 100644 docs/cheaha/software/res/parallelism_gromacs_gpu_hybrid.csv

diff --git a/docs/cheaha/software/res/parallelism_gromacs.csv b/docs/cheaha/software/res/parallelism_gromacs.csv
deleted file mode 100644
index eb8e96b29..000000000
--- a/docs/cheaha/software/res/parallelism_gromacs.csv
+++ /dev/null
@@ -1,8 +0,0 @@
-"tmpi","omp","Trial1","Trial 2","Trial3","Average","Speedup"
-"1","1","5335.744","5339.013","5443.12","5372.625667","1"
-"24","1","327.59","321.911","321.894","323.7983333","16.59250562"
-"12","2","344.966","347.613","346.746","346.4416667","15.50802396"
-"6","4","354.287","354.183","355.017","354.4956667","15.15568785"
-"4","6","370.1","370.199","371.161","370.4866667","14.50153582"
-"2","12","401.381","401.141","400.759","401.0936667","13.39494017"
-"1","24","350.487","349.396","349.96","349.9476667","15.35265464"
diff --git a/docs/cheaha/software/res/parallelism_gromacs_cpu.csv b/docs/cheaha/software/res/parallelism_gromacs_cpu.csv
new file mode 100644
index 000000000..22ebe9f62
--- /dev/null
+++ b/docs/cheaha/software/res/parallelism_gromacs_cpu.csv
@@ -0,0 +1,8 @@
+"tmpi","omp","Execution Time","Speedup"
+"1","1","5372.63","1"
+"1","24","349.95","15.35"
+"2","12","401.09","13.39"
+"4","6","370.49","14.50"
+"6","4","354.50","15.16"
+"12","2","346.44","15.51"
+"24","1","323.80","16.59"
diff --git a/docs/cheaha/software/res/parallelism_gromacs_gpu.csv b/docs/cheaha/software/res/parallelism_gromacs_gpu.csv
new file mode 100644
index 000000000..5b0b7c8a2
--- /dev/null
+++ b/docs/cheaha/software/res/parallelism_gromacs_gpu.csv
@@ -0,0 +1,5 @@
+"GPU","tmpi","omp","Execution Time","Speedup","Partition"
+"1","1","1","230.62","23.30","amperenodes"
+"2","1","1","230.30","23.33","amperenodes"
+"1","1","1","337.53","15.92","pascalnodes"
+"2","1","1","346.84","15.49","pascalnodes"
diff --git a/docs/cheaha/software/res/parallelism_gromacs_gpu_hybrid.csv b/docs/cheaha/software/res/parallelism_gromacs_gpu_hybrid.csv
new file mode 100644
index 000000000..e95ea1667
--- /dev/null
+++ b/docs/cheaha/software/res/parallelism_gromacs_gpu_hybrid.csv
@@ -0,0 +1,9 @@
+"GPU","tmpi","omp","Execution Time","Speedup","Partition"
+"1","1","2","144.95","37.07","amperenodes"
+"1","1","4","92.87","57.85","amperenodes"
+"2","2","2","270.63","19.85","amperenodes"
+"2","2","4","167.00","32.17","amperenodes"
+"1","1","2","222.63","24.13","pascalnodes"
+"1","1","4","176.62","30.42","pascalnodes"
+"2","2","2","416.54","12.90","pascalnodes"
+"2","2","4","261.52","20.54","pascalnodes"
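
The GPU and hybrid rows above were gathered on the `amperenodes` and `pascalnodes` partitions. As a rough sketch of how one such measurement might be taken, the commands below reuse the partition names, module, and flags that appear elsewhere in this tutorial; the resource amounts and thread counts are adjustable assumptions, not a prescription.

```bash
# Request a single GPU interactively, load a CUDA-enabled GROMACS build, and offload the nonbonded work
srun --ntasks=1 --cpus-per-task=4 --gres=gpu:1 --partition=amperenodes --time=02:00:00 --pty /bin/bash
module load GROMACS/2019-fosscuda-2018b
gmx mdrun -ntmpi 1 -ntomp 4 -nb gpu -pin on -nsteps 5000 -s topol.tpr
```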
diff --git a/docs/cheaha/tutorial/parallelism.md b/docs/cheaha/tutorial/parallelism.md
index 7dd74d697..086f75f8d 100644
--- a/docs/cheaha/tutorial/parallelism.md
+++ b/docs/cheaha/tutorial/parallelism.md
@@ -101,12 +101,12 @@ Executing the simulation reports detailed information about the settings of the m
 
 ##### Water System Inputs
 
-Total Atoms (natoms): 1,536,000
-Number of Steps: 5,000
-Distribution: Domain Decomposition
-MPI Ranks: 6
-Average Atoms per Domain: 256,000
-Domain Decomposition Grid: 6 x 1 x 1
+- Total Atoms (natoms): 1,536,000
+- Number of Steps: 5,000
+- Distribution: Domain Decomposition
+- MPI Ranks: 6
+- Average Atoms per Domain: 256,000
+- Domain Decomposition Grid: 6 x 1 x 1
 
 ```bash
 $export OMP_NUM_THREADS=6
@@ -123,10 +123,13 @@ In the above run command,
 
 ##### Performance Analysis
 
-{{ read_csv('../software/res/parallelism_gromacs.csv', keep_default_na=False) }}
+{{ read_csv('../software/res/parallelism_gromacs_cpu.csv', keep_default_na=False) }}
 
 The majority of the computational time and resources were spent on force calculations and PME mesh operations, indicating that these are the most computationally intensive tasks in this simulation. The system achieved good load balancing, reflected in the distribution of tasks across the available ranks and threads.
 
+{{ read_csv('../software/res/parallelism_gromacs_gpu.csv', keep_default_na=False) }}
+
+{{ read_csv('../software/res/parallelism_gromacs_gpu_hybrid.csv', keep_default_na=False) }}

",
+        "$TM_SELECTED_TEXT",
+        ""
+    ],
+    "description": "Disables warning Markdown Lint MD045 for the selected DrawIO Diagram."
+  },
   "Stub Page": {
     "prefix": "stub",
     "body": [

diff --git a/docs/cheaha/tutorial/parallelism.md b/docs/cheaha/tutorial/parallelism.md
index 56cd1fc05..3f9ed3637 100644
--- a/docs/cheaha/tutorial/parallelism.md
+++ b/docs/cheaha/tutorial/parallelism.md
@@ -4,13 +4,17 @@
 Breaking a bigger problem into set of tasks, and executing these tasks one by one on a single CPU.
 
-![](../tutorial/images/parallelism_serial_execution.drawio)
+
+![](images/parallelism_serial_execution.drawio)
+
 
 ## Parallel Execution
 
 Divide a larger problem into a series of smaller tasks and execute these tasks using multiple CPUs. The application should leverage parallelization technology to utilize multicore processors, multiple nodes, multiple GPUs, or a hybrid approach (such as combining CPUs and GPUs).
 
-![](../tutorial/images/parallelism_parallel_execution.drawio)
+
+![](images/parallelism_parallel_execution.drawio)
+
 
 ## Types of Parallelization

From 88c78f1fc17b21952326ea823c50c9b098d69bc9 Mon Sep 17 00:00:00 2001
From: Premas
Date: Fri, 10 Jan 2025 13:06:13 -0600
Subject: [PATCH 12/15] QE

---
 docs/cheaha/tutorial/parallelism.md | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/docs/cheaha/tutorial/parallelism.md b/docs/cheaha/tutorial/parallelism.md
index 3f9ed3637..a7736beec 100644
--- a/docs/cheaha/tutorial/parallelism.md
+++ b/docs/cheaha/tutorial/parallelism.md
@@ -139,6 +139,15 @@ The majority of the computational time and resources were spent on force calcula
 {{ read_csv('../software/res/parallelism_gromacs_gpu_hybrid.csv', keep_default_na=False) }}
 
+### Quantum Espresso
+
+Quantum Espresso (QE) is an open-source suite of codes for electronic-structure calculations and materials modeling based on density functional theory (DFT), plane waves, and pseudopotentials. It is used to study the properties of materials at the atomic scale.
+
+Quantum Espresso is available as a module on Cheaha and can be loaded as,
+
+```bash
+$module load QuantumESPRESSO/6.3-foss-2018b
+```
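
The patch stops at loading the module. As a rough sketch of the next step, a plane-wave SCF calculation would typically be launched with QE's `pw.x` executable under MPI; the input file name and the rank count below are placeholders to replace with your own.

```bash
# Run the plane-wave code pw.x on 8 MPI ranks; scf.in is a placeholder input file
mpirun -np 8 pw.x -in scf.in > scf.out
```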