From 20775ab4d8d0be43c9a4b87281be97b3872cf7b4 Mon Sep 17 00:00:00 2001
From: Tom Papatheodore
Date: Thu, 21 Sep 2023 12:54:31 -0400
Subject: [PATCH 1/2] added container docs

---
 docs/jobs.md | 286 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 286 insertions(+)

diff --git a/docs/jobs.md b/docs/jobs.md
index ed17f4c..aa4a52a 100644
--- a/docs/jobs.md
+++ b/docs/jobs.md
@@ -252,6 +252,292 @@ a new (or open an existing) notebook and access the GPUs on the compute node:

```{tip}
Please see the [Python Environment](./software.md#python-environment) section to understand how the base Python environment and `pytorch` and `tensorflow` modules can be customized.
```

## Containers
The container platform available on the HPC Fund cluster is [Singularity/Apptainer](https://apptainer.org/docs/user/main/), which can build and run Singularity containers or transparently convert Docker images into the Singularity image format.

```{note}
Apptainer is the new name for Singularity, so it will be referred to as Apptainer in the remainder of these docs.
```

### Simple Ubuntu example
This example shows how to use Apptainer to pull the latest base Ubuntu container from Docker Hub and run it on the cluster. Notice that the Docker container is converted to the Singularity image format (SIF) transparently during the `pull`.

```
$ apptainer pull docker://ubuntu
INFO: Converting OCI blobs to SIF format
INFO: Starting build...
Getting image source signatures
Copying blob 445a6a12be2b done
Copying config c6b84b685f done
Writing manifest to image destination
Storing signatures
2023/09/21 08:39:13 info unpack layer: sha256:445a6a12be2be54b4da18d7c77d4a41bc4746bc422f1f4325a60ff4fc7ea2e5d
INFO: Creating SIF file...
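# (Illustrative extra step, not part of the original session.) A single
# command can also be run in the image without opening an interactive shell,
# using the standard `apptainer exec` subcommand:
$ apptainer exec ubuntu_latest.sif echo "hello from the container"
hello from the container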
$ apptainer run ubuntu_latest.sif

Apptainer> cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.3 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.3 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
```

### ROCm-enabled PyTorch example
This example shows how to use Apptainer to pull the latest ROCm-enabled PyTorch container from Docker Hub and run it on the cluster.

```{note}
This container is much larger (~13 GB) than the Ubuntu container (~29 MB), so it should be run on a compute node from your project `$WORK` directory.

* Project `$WORK` directories have a larger quota (2 TB shared among all users of the project), whereas user `$HOME` directories only have a quota of 25 GB.

* A compute node is needed to avoid running out of shared resources on the login node while pulling down the container.
```

```
$ salloc -A -N 1 -t 60 -p mi1004x
salloc: ---------------------------------------------------------------
salloc: AMD HPC Fund Job Submission Filter
salloc: ---------------------------------------------------------------
salloc: --> ok: runtime limit specified
salloc: --> ok: using default qos
salloc: --> ok: Billing account-> /
salloc: --> checking job limits...
salloc: --> requested runlimit = 1.0 hours (ok)
salloc: --> checking partition restrictions...
salloc: --> ok: partition = mi1004x
salloc: Granted job allocation

$ apptainer pull docker://rocm/pytorch:latest
INFO: Converting OCI blobs to SIF format
INFO: Starting build...
Getting image source signatures
...
...
INFO: Creating SIF file...
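# (Illustrative note, not part of the original session.) Pulled OCI layers
# are cached under $HOME/.apptainer/cache by default, which counts against
# the small $HOME quota; inspect the cache with `apptainer cache list` and
# relocate it by setting APPTAINER_CACHEDIR (e.g., to a path under $WORK):
$ apptainer cache list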
$ apptainer run pytorch_latest.sif

Apptainer> rocm-smi
========================= ROCm System Management Interface =========================
=================================== Concise Info ===================================
GPU  Temp (DieEdge)  AvgPwr  SCLK    MCLK     Fan  Perf  PwrCap  VRAM%  GPU%
0    35.0c           34.0W   300Mhz  1200Mhz  0%   auto  290.0W    0%   0%
1    36.0c           34.0W   300Mhz  1200Mhz  0%   auto  290.0W    0%   0%
2    34.0c           30.0W   300Mhz  1200Mhz  0%   auto  290.0W    0%   0%
3    34.0c           32.0W   300Mhz  1200Mhz  0%   auto  290.0W    0%   0%
====================================================================================
=============================== End of ROCm SMI Log ================================

Apptainer> rocminfo | head
ROCk module is loaded
=====================
HSA System Attributes
=====================
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE
System Endianness:       LITTLE

Apptainer> python3
Python 3.8.16 (default, Jun 12 2023, 18:09:05)
[GCC 11.2.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> import torch

>>> print("GPU(s) available:", torch.cuda.is_available())
GPU(s) available: True

>>> print("Number of available GPUs:", torch.cuda.device_count())
Number of available GPUs: 4
```

```{note}
* PyTorch uses `cuda` even when targeting ROCm devices.
* By default, your `$HOME` and `$WORK` directories are bind-mounted into the container.
```

### ROCm-enabled TensorFlow example
This example shows how to use Apptainer to pull the latest ROCm-enabled TensorFlow container from Docker Hub and run it on the cluster.

```{note}
Similar to the PyTorch container above, this container is much larger (~11 GB) than the Ubuntu container (~29 MB), so it should be run on a compute node from your project `$WORK` directory.
* Project `$WORK` directories have a larger quota (2 TB shared among all users of the project), whereas user `$HOME` directories only have a quota of 25 GB.

* A compute node is needed to avoid running out of shared resources on the login node while pulling down the container.
```

```
$ salloc -A -N 1 -t 60 -p mi1004x
salloc: ---------------------------------------------------------------
salloc: AMD HPC Fund Job Submission Filter
salloc: ---------------------------------------------------------------
salloc: --> ok: runtime limit specified
salloc: --> ok: using default qos
salloc: --> ok: Billing account-> /
salloc: --> checking job limits...
salloc: --> requested runlimit = 1.0 hours (ok)
salloc: --> checking partition restrictions...
salloc: --> ok: partition = mi1004x
salloc: Granted job allocation

$ apptainer pull docker://rocm/tensorflow:latest
INFO: Converting OCI blobs to SIF format
INFO: Starting build...
Getting image source signatures
...
...
INFO: Creating SIF file...

$ apptainer run --containall --bind=${HOME},${WORK},/dev/kfd,/dev/dri tensorflow_latest.sif

Apptainer> rocm-smi
========================= ROCm System Management Interface =========================
=================================== Concise Info ===================================
GPU  Temp (DieEdge)  AvgPwr  SCLK    MCLK     Fan  Perf  PwrCap  VRAM%  GPU%
0    35.0c           34.0W   300Mhz  1200Mhz  0%   auto  290.0W    0%   0%
1    35.0c           34.0W   300Mhz  1200Mhz  0%   auto  290.0W    0%   0%
2    34.0c           30.0W   300Mhz  1200Mhz  0%   auto  290.0W    0%   0%
3    34.0c           32.0W   300Mhz  1200Mhz  0%   auto  290.0W    0%   0%
====================================================================================
=============================== End of ROCm SMI Log ================================

Apptainer> rocminfo | head
ROCk module is loaded
=====================
HSA System Attributes
=====================
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE
System Endianness:       LITTLE

Apptainer> python3
Python 3.9.17 (main, Jun 6 2023, 20:11:04)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> import tensorflow as tf
2023-09-21 09:57:29.569788: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.

>>> gpu_list = tf.config.list_physical_devices('GPU')

>>> for gpu in gpu_list:
...     print(gpu)
...
PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')
PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU')
PhysicalDevice(name='/physical_device:GPU:2', device_type='GPU')
PhysicalDevice(name='/physical_device:GPU:3', device_type='GPU')
```

```{note}
By default, Apptainer brings environment variables from the host (i.e., compute node) environment into the container. Unlike the PyTorch container above, this TensorFlow container does not appear to set some environment variables (e.g., `LANG`), so the values from the host environment remain in effect, which can cause some warnings about `hipcc`.

To resolve this, the `--containall` flag was used to ensure nothing from the host environment gets brought in. This means some important directories must be bind-mounted manually: `/dev/kfd` and `/dev/dri`, so the ROCm driver can collect information about the GPUs, as well as the `$HOME` and `$WORK` directories.
```

### Extending Docker containers with Apptainer
If users need to build on top of existing containers (e.g., installing additional packages), they can do so with Apptainer definition files.
For example, the following definition file can be used to build a container with an upgraded `scipy` and a newly installed `pandas` package, using the ROCm-enabled PyTorch container as a starting point. It also sets an environment variable inside the container.

```
$ cat rocm_pt.def
Bootstrap: docker
From: rocm/pytorch:latest

%environment
    export MY_ENV_VAR="This is my environment variable"

%post
    pip3 install --upgrade pip
    pip3 install scipy --upgrade
    pip3 install pandas

$ apptainer build rocm_pt.sif rocm_pt.def
...
...
Successfully installed numpy-1.24.4 scipy-1.10.1
...
Successfully installed pandas-2.0.3 pytz-2023.3.post1 tzdata-2023.3
...
INFO: Adding environment to container
INFO: Creating SIF file...
INFO: Build complete: rocm_pt.sif

$ apptainer run rocm_pt.sif

Apptainer> echo $MY_ENV_VAR
This is my environment variable

Apptainer> pip list | grep pandas
pandas 2.0.3

Apptainer> python3
Python 3.8.16 (default, Jun 12 2023, 18:09:05)
[GCC 11.2.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> import pandas as pd

>>> data = {
...     "numbers" : [2, 4, 6],
...     "letters" : ['b', 'd', 'f']
... }

>>> df = pd.DataFrame(data)

>>> print(df)
   numbers letters
0        2       b
1        4       d
2        6       f
```

As we can see, the `pandas` package is now available inside the container, and the ROCm and PyTorch functionality still works as before.

For more detailed information on using Apptainer definition files, please see [this section](https://apptainer.org/docs/user/main/definition_files.html) of the Apptainer user docs.
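Once an image has been pulled to `$WORK`, the same workflow can also be run non-interactively in a batch job. The following is a minimal sketch, not part of the original docs: it reuses the `mi1004x` partition and the `pytorch_latest.sif` image from the examples above, and the account flag is left for you to fill in, mirroring the interactive `salloc` flags.

```shell
#!/bin/bash
#SBATCH -N 1
#SBATCH -t 60
#SBATCH -p mi1004x

cd $WORK

# Run a one-line GPU check inside the container; `apptainer exec` runs a
# single command instead of the image's default runscript
srun apptainer exec pytorch_latest.sif \
    python3 -c "import torch; print('GPUs:', torch.cuda.device_count())"
```

Submit the script with `sbatch`; output lands in the usual `slurm-<jobid>.out` file in the submission directory.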