From 95e89f790519b183bfa80a8f646b459403c58bd4 Mon Sep 17 00:00:00 2001 From: Luna Morrow Date: Mon, 9 Feb 2026 15:16:54 +1000 Subject: [PATCH 01/23] Add baseline Slurm documentation formatted for markdown. Documentation will need some further updates to align better with the Tutorial (e.g. changing IP addresses, adjusting comments to support bare-metal and cloud setups, etc.) and to ensure the documented approach is sufficiently broad for general purpose. Signed-off-by: Luna Morrow --- content/docs/guides/install_slurm.md | 1221 +++++++++++++++++++++++++- 1 file changed, 1220 insertions(+), 1 deletion(-) diff --git a/content/docs/guides/install_slurm.md b/content/docs/guides/install_slurm.md index 2a5d226..ec380e4 100644 --- a/content/docs/guides/install_slurm.md +++ b/content/docs/guides/install_slurm.md @@ -14,4 +14,1223 @@ seo: noindex: false # false (default) or true --- -Coming soon! +# 0 Overview + +This guide walks through setting up Slurm on a OpenCHAMI cluster. This guide will assume you have already setup an OpenCHAMI cluster per Sections 1-2.4.2 in the [OpenCHAMI Tutorial](https://openchami.org/docs/tutorial/). The only other requirement is to run a webserver to serve a Slurm repo for the image builder to use, and this guide assumes Podman is present to use. This guide will only walk through Slurm setup for a cluster with one head node and one compute node, but is easily expanded to multiple compute nodes by updating the node list with ochami. + +## 0.1 Prerequisites + +{{< callout context="note" title="Note" icon="outline/info-circle" >}} +This guide assumes you have setup an OpenCHAMI cluster per Sections 1-2.4.2 in the OpenCHAMI tutorial. 
+{{< /callout >}} + +## 0.2 Contents + +- [0 Overview](#0-overview) + - [0.1 Prerequisites](#01-prerequisites) + - [0.2 Contents](#02-contents) +- [1 Setup and Configure Slurm](#1-setup-and-configure-slurm) + - [1.1 Setup Slurm Build/Installation as a Local Repository](#11-setup-slurm-build/installation-as-a-local-repository) + - [1.2 Configure Slurm and Slurm Services](#12-configure-slurm-and-slurm-services) + - [1.3 Install Slurm and Setup Configuration Files](#13-install-slurm-and-setup-configuration-files) + - [1.4 Make a Local Slurm Repository and Serve it with Nginx](#14-make-a-local-slurm-repository-and-serve-it-with-nginx) + - [1.5 Configure the Boot Script Service and Cloud-Init](#15-configure-the-boot-script-service-and-cloud-init) + - [1.6 Configure and Start Slurm in the Compute Node](#16-configure-and-start-slurm-in-the-compute-node) + + +# 1 Setup and Configure Slurm + +Steps in this section occur on the head node created in the OpenCHAMI tutorial (or otherwise). + +## 1.1 Setup Slurm Build/Installation as a Local Repository + +Download Slurm pre-requisite sources compatible with Rocky 9 OS: + +``` +sudo dnf -y update && \ +sudo dnf clean all && \ +sudo dnf -y install epel-release && \ +sudo dnf -y install dnf-plugins-core && \ +sudo dnf config-manager --set-enabled devel && \ +sudo dnf config-manager --set-enabled crb && \ +sudo dnf groupinstall -y 'Development Tools' && \ +sudo dnf install -y createrepo freeipmi freeipmi-devel dbus-devel gtk2-devel hdf5 hdf5-devel http-parser-devel \ + hwloc hwloc-devel jq json-c-devel libaec libconfuse libcurl-devel libevent-devel \ + libyaml libyaml-devel lua-devel lua-filesystem lua-json lua-lpeg lua-posix lua-term mariadb mariadb-devel \ + munge munge-devel munge-libs ncurses-devel numactl numactl-devel oniguruma openssl-devel pam-devel \ + perl-DBI perl-ExtUtils-MakeMaker perl-Switch pigz python3 python3-devel readline-devel \ + lsb_release rrdtool rrdtool-devel tcl tcl-devel ucx ucx-cma ucx-devel ucx-ib wget \ 
lz4-devel s2n-tls-devel libjwt-devel librdkafka-devel && \ +sudo dnf clean all +``` + +Create build script to install Slurm 24.05.5 and PMIX 4.2.9-1: + +{{< callout context="note" title="Note" icon="outline/info-circle" >}} +This guide installs Slurm 24.05.5 and PMIX 4.2.9-1 to ensure compatibility. Other versions can be installed instead, but make sure to check version compatibility first. +{{< /callout >}} + +``` +cat < /home/rocky/build.sh +SLURMVERSION=${1:-24.05.5} +PMIXVERSION=${2:-4.2.9-1} +ELRELEASE=${3:-el9} #Rocky 9 + +subversions=( ${PMIXVERSION//-/ } ) +pmixmajor=${subversions[0]} +export LC_ALL="C" +OSVERSION=$(lsb_release -r | gawk '{print $2}') +CDIR=$(pwd) +SDIR="slurm/$OSVERSION/$SLURMVERSION" +mkdir -p ${SDIR} +if [[ -e ${SDIR}/pmix-${PMIXVERSION}.${ELRELEASE}.x86_64.rpm ]]; then + echo "The RPM of PMIX version ${PMIXVERSION} is already available." +else + cd slurm + wget https://github.com/openpmix/openpmix/releases/download/v${pmixmajor}/pmix-${PMIXVERSION}.src.rpm || { + echo "$? pmix-${PMIXVERSION}.src.rpm not downloaded" + exit + } + rpmbuild --rebuild ./pmix-${PMIXVERSION}.src.rpm &> rpmbuild-pmix-${PMIXVERSION}.log || { + echo "$? pmix-${PMIXVERSION}.src.rpm not built, review rpmbuild-pmix-${PMIXVERSION}.log" + exit + } + cd ${CDIR} + mv /root/rpmbuild/RPMS/x86_64/pmix-${PMIXVERSION}.${ELRELEASE}.x86_64.rpm ${SDIR} + dnf -y install ${SDIR}/pmix-${PMIXVERSION}.${ELRELEASE}.x86_64.rpm +fi +if [[ -e ${SDIR}/slurm-${SLURMVERSION}-*.rpm ]]; then + echo "The RPMs of slurm ${SLURMVERSION} are already available." +else + cd slurm + wget https://download.schedmd.com/slurm/slurm-${SLURMVERSION}.tar.bz2 || wget http://www.schedmd.com/download/archive/slurm-${SLURMVERSION}.tar.bz2 || { + echo "$? slurm-${SLURMVERSION}.tar.bz2 not downloaded" + exit + } + rpmbuild -ta --with pmix --with lua --with pam --with mysql --with ucx --with slurmrestd slurm-${SLURMVERSION}.tar.bz2 &> rpmbuild-slurm-${SLURMVERSION}.log || { + echo "$? 
slurm-${SLURMVERSION}.tar.bz2 not built, review rpmbuild-slurm-${SLURMVERSION}.log" + exit + } + grep 'configure: WARNING:' rpmbuild-slurm-${SLURMVERSION}.log + cd ${CDIR} + mv /root/rpmbuild/RPMS/x86_64/slurm*-${SLURMVERSION}-*.rpm ${SDIR} +fi +EOF +``` + +Adjust permissions for build script so that it is executable, and execute it with **root** privileges: + +``` +chmod 755 /home/rocky/build.sh +sudo ./build.sh +``` + +{{< callout context="note" title="Note" icon="outline/info-circle" >}} +The following warnings are normal: +``` +configure: WARNING: unable to locate libnvidia-ml.so and/or nvml.h +configure: WARNING: unable to locate librocm_smi64.so and/or rocm_smi.h +configure: WARNING: unable to locate libze_loader.so and/or ze_api.h +configure: WARNING: HPE Slingshot: unable to locate libcxi/libcxi.h +configure: WARNING: unable to build man page html files without man2html +``` +{{< /callout >}} + +Copy the Slurm packages to the desired location to create the local repository: + +``` +sudo mkdir -p /install/osupdates/rocky9/x86_64/ +sudo cp -r slurm/9.7/24.05.5 /install/osupdates/rocky9/x86_64/slurm-24.05.5 +``` + +Create the local repository (this will be used for installation and images later): + +``` +sudo createrepo /install/osupdates/rocky9/x86_64/slurm-24.05.5 +``` + +## 1.2 Configure Slurm and Slurm Services + +Create user and group ‘slurm’ with specified UID/GID: + +``` +SLURMID=666 +sudo groupadd -g $SLURMID slurm +sudo useradd -m -c "Slurm workload manager" -d /var/lib/slurm -u $SLURMID -g slurm -s /sbin/nologin slurm +``` + +Update the UID and GID of ‘munge’ user and group to 616, update directory ownership, create munge key and restart the munge service: + +``` +# Update UID and GID +sudo usermod -u 616 munge +sudo groupmod -g 616 munge + +# Fix user and group ownership +sudo chown munge:munge /var/log/munge/ +sudo chown munge:munge /var/lib/munge/ +sudo chown munge:munge /etc/munge/ + +# Create munge key +sudo create-munge-key + +# Start munge 
again +sudo systemctl enable --now munge +``` + +Install mariaDB: + +``` +sudo dnf -y install mariadb-server +``` + +Tune mariaDB with the Slurm recommended options for the compute node where mariaDB will be running: + +``` +cat <}} +We are assigning 5GB to the `innodb_buffer_pool_size`. The pool size should be 5-50% of the available memory of the head node and at least 4GB. +{{< /callout >}} + +Enable and start the mariaDB service as this is a single node cluster (so we aren't enabling High Availability): + +``` +sudo systemctl enable --now mariadb +``` + +Secure the mariaDB installation with a strong root password. Use ‘pwgen’ to generate a password and store this password securely: + +``` +sudo dnf -y install pwgen +pwgen 20 1 # generates 1 password of length 20 characters + +sudo mysql_secure_installation +``` + +**MariaDB setup/settings should be done as follows** +``` +Enter current password for root (enter for none): # enter rocky password: 'rocky' + +Switch to unix_socket authentication [Y/n] Y + +Change the root password? [Y/n] Y +New password: # use the password from pwgen +Re-enter new password: # use the password from pwgen + +Remove anonymous users? [Y/n] n + +Disallow root login remotely? [Y/n] Y + +Remove test database and access to it? [Y/n] n + +Reload privilege tables now? 
[Y/n] Y +``` + +Create the database and grant access to localhost, the head node and the compute node: + +``` +mysql -u root -p # enter the password from pwgen + +create database slurm_acct_db; +grant all on slurm_acct_db.* to slurm@'localhost' identified by ''; +grant all on slurm_acct_db.* to slurm@'master' identified by ''; +grant all on slurm_acct_db.* to slurm@'compute1' identified by ''; +grant all on slurm_acct_db.* to slurm@'head' identified by ''; +exit +``` + +Install a few more dependencies that are required: + +``` +sudo dnf -y install jq libconfuse numactl parallel perl-DBI perl-Switch +``` + +Setup directory structure for the Slurm database and controller daemon services: + +``` +sudo mkdir -p /var/spool/slurmctld /var/log/slurm /run/slurm +sudo chown -R slurm. /var/spool/slurmctld /var/log/slurm /run/slurm +echo "d /run/slurm 0755 slurm slurm -" | sudo tee /usr/lib/tmpfiles.d/slurm.conf +``` + +## 1.3 Install Slurm and Setup Configuration Files + +Add the Slurm repo created earlier to install from it (will ensure we get the correct package versions): + +``` +# Create local repo file +SLURMVERSION=24.05.5 +RELEASE=rocky9 + +echo "[slurm-local] +name=Slurm ${SLURMVERSION} - Local +baseurl=file:///install/osupdates/${RELEASE}/x86_64/slurm-${SLURMVERSION} +gpgcheck=0 +enabled=1 +countme=1" | sudo tee /etc/yum.repos.d/slurm-local.repo + +# Install from local repo file +sudo dnf -y install slurm slurm-contribs slurm-example-configs slurm-libpmi slurm-pam_slurm slurm-perlapi slurm-slurmctld slurm-slurmdbd pmix +``` + +Create configuration files by copying the example files, and then modify the directory and file ownership: + +``` +# Copy configuration files +sudo cp -p /etc/slurm/slurmdbd.conf.example /etc/slurm/slurmdbd.conf +sudo cp -p /etc/slurm/cgroup.conf.example /etc/slurm/cgroup.conf + +# Set directory and file ownership to slurm +sudo chown -R slurm. 
/etc/slurm/ +``` + +Modify the SlurmDB config: + +``` +DBHOST=head +DBPASSWORD= # EDIT TO THE PASSWORD SET IN THE MARIADB CONFIGURATION SECTION +SLURMDBHOST1=head + +sudo sed -i "s|DbdAddr.*|DbdAddr=${SLURMDBHOST1}|g" /etc/slurm/slurmdbd.conf +sudo sed -i "s|DbdHost.*|DbdHost=${SLURMDBHOST1}|g" /etc/slurm/slurmdbd.conf +sudo sed -i "s|PidFile.*|PidFile=/var/run/slurm/slurmdbd.pid|g" /etc/slurm/slurmdbd.conf + +sudo sed -i "s|#StorageHost.*|StorageHost=${DBHOST}|g" /etc/slurm/slurmdbd.conf +sudo sed -i "s|#StoragePort.*|StoragePort=3306|g" /etc/slurm/slurmdbd.conf +sudo sed -i "s|StoragePass.*|StoragePass=${DBPASSWORD}|g" /etc/slurm/slurmdbd.conf +sudo sed -i "s|SlurmUser.*|SlurmUser=slurm|g" /etc/slurm/slurmdbd.conf +sudo sed -i "s|PidFile.*|PidFile=/var/run/slurm/slurmdbd.pid|g" /etc/slurm/slurmdbd.conf + +sudo sed -i "s|#StorageLoc.*|StorageLoc=slurm_acct_db|g" /etc/slurm/slurmdbd.conf +``` + +Create the Slurm config file, which will be used by SlurmCTL: + +``` +cat <&1 | awk -F: '/configure arguments/ {print $2}' | xargs -n1 +``` + +Edit the Nginx config file: + +``` +cat <&2; + return 1; + fi; + if [ ! -f "$1" ]; then + echo "$1 does not exist." 1>&2; + return 1; + fi; + podman run \ + --rm \ + --device /dev/fuse \ + -e S3_ACCESS=admin \ + -e S3_SECRET=admin123 \ + -v "$(realpath $1)":/home/builder/config.yaml:Z \ + ${EXTRA_PODMAN_ARGS} \ + ghcr.io/openchami/image-build-el9:v0.1.2 \ + image-build \ + --config config.yaml \ + --log-level DEBUG +} + +build-image-rh8() +{ + if [ -z "$1" ]; then + echo 'Path to image config file required.' 1>&2; + return 1; + fi; + if [ ! -f "$1" ]; then + echo "$1 does not exist." 
1>&2; + return 1; + fi; + podman run \ + --rm \ + --device /dev/fuse \ + -e S3_ACCESS=admin \ + -e S3_SECRET=admin123 \ + -v "$(realpath $1)":/home/builder/config.yaml:Z \ + ${EXTRA_PODMAN_ARGS} \ + ghcr.io/openchami/image-build:v0.1.2 \ + image-build \ + --config config.yaml \ + --log-level DEBUG +} +alias build-image=build-image-rh9 +EOF +``` + +Apply simplified command to current session. Note that the command will be automatically applied during later logins, so you will not need to source it again: + +``` +source /etc/profile.d/build-image.sh +``` + +Note: For future, you will be able to build images in the following way: + +``` +build-image /path/to/image/config.yaml +``` + +Check that the alias is being used: + +``` +which build-image +``` + +Output should look like the following: + +``` +alias build-image='build-image-rh9' + build-image-rh9 () + { + if [ -z "$1" ]; then + echo 'Path to image config file required.' 1>&2; + return 1; + fi; + if [ ! -f "$1" ]; then + echo "$1 does not exist." 1>&2; + return 1; + fi; + podman run --rm --device /dev/fuse -e S3_ACCESS=admin -e S3_SECRET=admin123 -v "$(realpath $1)":/home/builder/config.yaml:Z ${EXTRA_PODMAN_ARGS} ghcr.io/openchami/image-build-el9:v0.1.2 image-build --config config.yaml --log-level DEBUG + } +``` + +## 1.5 Configure the Boot Script Service and Cloud-Init. 
+ +Get a fresh access token for ochami: + +``` +export TEST_ACCESS_TOKEN=$(sudo bash -lc 'gen_access_token') +``` + +Create payload for boot script service with URIs for slurm compute boot artefacts: + +```yaml +sudo mkdir /etc/openchami/data/boot/ + +URIS=$(s3cmd ls -Hr s3://boot-images | grep compute/slurm | awk '{print $4}' | sed 's-s3://-http://172.16.0.254:9000/-' | xargs) +URI_IMG=$(echo "$URIS" | cut -d' ' -f1) +URI_INITRAMFS=$(echo "$URIS" | cut -d' ' -f2) +URI_KERNEL=$(echo "$URIS" | cut -d' ' -f3) +cat <}} +If you make a mistake and need to reboot the compute node VM with an updated image, do the following: + +In another window in the host, destroy existing compute node VM. + +`sudo virsh destroy compute1` + +Attach to the console to watch compute1 boot again. + +`sudo virsh start --console compute1` +{{< /callout >}} + +## 1.6 Configure and Start Slurm in the Compute Node + +**From inside the head node VM**, log into the compute node: + +``` +ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no root@172.16.0.1 +``` + +Check all of the required packages were installed and from the correct sources: + +``` +dnf list installed +``` + +Create slurm config file that is identical to that of the head node VM: + +``` +cat <}} +Use `find / -name "slurm"` to make sure everything that needs to be changed is identified (note that not all results need ownership modified though!!) 
+{{< /callout >}} + +Create the directory /var/log/slurm as it doesn't exist yet, and set ownership to Slurm: + +``` +mkdir /var/log/slurm +chown slurm:slurm /var/log/slurm +``` + +Creating job_container.conf file that matches the one in the head node VM: + +``` +cat <}} +If you get the following error: +`usermod: user munge is currently used by process ` + +Kill the process and repeat above two commands: +`kill -15 ` +{{< /callout >}} + +Update munge file/directory ownership: + +``` +chown -R munge:munge /var/log/munge/ +chown -R munge:munge /var/lib/munge/ +chown -R munge:munge /etc/munge/ +``` + +{{< callout context="note" title="Note" icon="outline/info-circle" >}} +Find all directories owned by old munge UID/GID with the following command: +`find / -uid 991 -type d` +{{< /callout >}} + +Copy munge key from the head node VM to the compute node VM: + +``` +# Inside the head node VM +cd ~ +sudo cp /etc/munge/munge.key ./ +sudo chown rocky:rocky munge.key +scp ./munge.key root@172.16.0.1:~/ + +# Inside the compute node VM +mv munge.key /etc/munge/munge.key +chown munge:munge /etc/munge/munge.key +``` + +{{< callout context="note" title="Note" icon="outline/info-circle" >}} +In the case of an error about "Offending ECDSA key in /home/rocky/.ssh/known_hosts:3", wipe the contents of the known hosts file and try the 'scp' command again: +`> /home/rocky/.ssh/known_hosts` +{{< /callout >}} + +Enable and start munge service: + +``` +systemctl enable munge.service +systemctl start munge.service +systemctl status munge.service +``` + +Enable and start slurmd: + +``` +systemctl enable slurmd +systemctl start slurmd +systemctl status slurmd +``` + +Disable the firewall in the compute node: + +``` +systemctl stop firewalld +systemctl disable firewalld + +nft flush ruleset +nft list ruleset +``` + +Restart Slurm service daemons to ensure changes are applied: + +``` +# In the compute node: +systemctl restart slurmd + +# In the head node: +sudo systemctl restart slurmctld +sudo 
systemctl restart slurmdbd +``` + +Test munge on the **head node VM**: + +``` +# Try to munge and unmunge to access the compute node +munge -n | ssh root@172.16.0.1 unmunge +``` + +{{< callout context="note" title="Note" icon="outline/info-circle" >}} +In the case of an error about "Offending ECDSA key in /home/rocky/.ssh/known_hosts:3", wipe the contents of the known hosts file and try the 'munge' command again: +`> /home/rocky/.ssh/known_hosts` +{{< /callout >}} + +Quickly test that you can submit a job from the head node VM: + +``` +# Check that node is present and idle +sinfo + +# Create user with Slurm account +sudo useradd -m -s /bin/bash testuser +sudo usermod -aG wheel testuser +sudo sacctmgr create user testuser defaultaccount=root +sudo su - testuser + +# Run a test job as the user 'testuser' +srun hostname +``` + +If something goes wrong and your compute node goes down, restart it with this command: +`sudo scontrol update NodeName=de01 State=RESUME` + From 2b2e7a5e27cab6ee13f84d3ad9096d28e7726625 Mon Sep 17 00:00:00 2001 From: Luna Morrow Date: Tue, 10 Feb 2026 09:41:52 +1000 Subject: [PATCH 02/23] Convert all relevant code blocks to bash format and adjust the method for creating some files from cat to copy-paste to prevent issues with bash command/variable processing Signed-off-by: Luna Morrow --- content/docs/guides/install_slurm.md | 143 +++++++++++++-------------- 1 file changed, 69 insertions(+), 74 deletions(-) diff --git a/content/docs/guides/install_slurm.md b/content/docs/guides/install_slurm.md index ec380e4..d998f0a 100644 --- a/content/docs/guides/install_slurm.md +++ b/content/docs/guides/install_slurm.md @@ -46,7 +46,7 @@ Steps in this section occur on the head node created in the OpenCHAMI tutorial ( Download Slurm pre-requisite sources compatible with Rocky 9 OS: -``` +```bash sudo dnf -y update && \ sudo dnf clean all && \ sudo dnf -y install epel-release && \ @@ -70,8 +70,8 @@ Create build script to install Slurm 24.05.5 and PMIX 
4.2.9-1: This guide installs Slurm 24.05.5 and PMIX 4.2.9-1 to ensure compatibility. Other versions can be installed instead, but make sure to check version compatibility first. {{< /callout >}} -``` -cat < /home/rocky/build.sh +**Create file as rocky user: 'home/rocky/build.sh'** +```bash SLURMVERSION=${1:-24.05.5} PMIXVERSION=${2:-4.2.9-1} ELRELEASE=${3:-el9} #Rocky 9 @@ -115,19 +115,18 @@ else cd ${CDIR} mv /root/rpmbuild/RPMS/x86_64/slurm*-${SLURMVERSION}-*.rpm ${SDIR} fi -EOF ``` Adjust permissions for build script so that it is executable, and execute it with **root** privileges: -``` +```bash chmod 755 /home/rocky/build.sh sudo ./build.sh ``` {{< callout context="note" title="Note" icon="outline/info-circle" >}} The following warnings are normal: -``` +```bash configure: WARNING: unable to locate libnvidia-ml.so and/or nvml.h configure: WARNING: unable to locate librocm_smi64.so and/or rocm_smi.h configure: WARNING: unable to locate libze_loader.so and/or ze_api.h @@ -138,14 +137,14 @@ configure: WARNING: unable to build man page html files without man2html Copy the Slurm packages to the desired location to create the local repository: -``` +```bash sudo mkdir -p /install/osupdates/rocky9/x86_64/ sudo cp -r slurm/9.7/24.05.5 /install/osupdates/rocky9/x86_64/slurm-24.05.5 ``` Create the local repository (this will be used for installation and images later): -``` +```bash sudo createrepo /install/osupdates/rocky9/x86_64/slurm-24.05.5 ``` @@ -153,7 +152,7 @@ sudo createrepo /install/osupdates/rocky9/x86_64/slurm-24.05.5 Create user and group ‘slurm’ with specified UID/GID: -``` +```bash SLURMID=666 sudo groupadd -g $SLURMID slurm sudo useradd -m -c "Slurm workload manager" -d /var/lib/slurm -u $SLURMID -g slurm -s /sbin/nologin slurm @@ -161,7 +160,7 @@ sudo useradd -m -c "Slurm workload manager" -d /var/lib/slurm -u $SLURMID -g slu Update the UID and GID of ‘munge’ user and group to 616, update directory ownership, create munge key and restart the munge 
service: -``` +```bash # Update UID and GID sudo usermod -u 616 munge sudo groupmod -g 616 munge @@ -180,13 +179,13 @@ sudo systemctl enable --now munge Install mariaDB: -``` +```bash sudo dnf -y install mariadb-server ``` Tune mariaDB with the Slurm recommended options for the compute node where mariaDB will be running: -``` +```bash cat < # EDIT TO THE PASSWORD SET IN THE MARIADB CONFIGURATION SECTION SLURMDBHOST1=head @@ -314,7 +313,7 @@ sudo sed -i "s|#StorageLoc.*|StorageLoc=slurm_acct_db|g" /etc/slurm/slurmdbd.con Create the Slurm config file, which will be used by SlurmCTL: -``` +```bash cat <&1 | awk -F: '/configure arguments/ {print $2}' | xargs -n1 ``` -Edit the Nginx config file: - -``` -cat < Date: Tue, 10 Feb 2026 10:49:00 +1000 Subject: [PATCH 03/23] Update hostnames in tutorial to align with those used in the tutorial - this should make this guide easy to follow on with after the tutorial Signed-off-by: Luna Morrow --- content/docs/guides/install_slurm.md | 23 ++++++++++------------- 1 file changed, 10 insertions(+), 13 deletions(-) diff --git a/content/docs/guides/install_slurm.md b/content/docs/guides/install_slurm.md index d998f0a..8113182 100644 --- a/content/docs/guides/install_slurm.md +++ b/content/docs/guides/install_slurm.md @@ -233,15 +233,14 @@ Remove test database and access to it? [Y/n] n Reload privilege tables now? 
[Y/n] Y ``` -Create the database and grant access to localhost, the head node and the compute node: +Create the database and grant access to localhost and the head node: ```bash mysql -u root -p # enter the password from pwgen create database slurm_acct_db; grant all on slurm_acct_db.* to slurm@'localhost' identified by ''; -grant all on slurm_acct_db.* to slurm@'master' identified by ''; -grant all on slurm_acct_db.* to slurm@'compute1' identified by ''; +grant all on slurm_acct_db.* to slurm@'demo.openchami.cluster' identified by ''; grant all on slurm_acct_db.* to slurm@'head' identified by ''; exit ``` @@ -316,9 +315,8 @@ Create the Slurm config file, which will be used by SlurmCTL: ```bash cat < Date: Tue, 10 Feb 2026 11:07:27 +1000 Subject: [PATCH 04/23] Small tweaks to markdown file to fix some typos and poor formatting. Next step will be expanding comments/explanations to provide more context to users, as well as providing more code blocks to show expected output of commands that produce output. Signed-off-by: Luna Morrow --- content/docs/guides/install_slurm.md | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/content/docs/guides/install_slurm.md b/content/docs/guides/install_slurm.md index 8113182..baf1185 100644 --- a/content/docs/guides/install_slurm.md +++ b/content/docs/guides/install_slurm.md @@ -71,7 +71,7 @@ This guide installs Slurm 24.05.5 and PMIX 4.2.9-1 to ensure compatibility. 
Othe {{< /callout >}} **Create file as rocky user: 'home/rocky/build.sh'** -```bash +```bash {title="home/rocky/build.sh"} SLURMVERSION=${1:-24.05.5} PMIXVERSION=${2:-4.2.9-1} ELRELEASE=${3:-el9} #Rocky 9 @@ -214,7 +214,7 @@ pwgen 20 1 # generates 1 password of length 20 characters sudo mysql_secure_installation ``` -**MariaDB setup/settings should be done as follows** +**MariaDB setup/settings should be done as follows:** ``` Enter current password for root (enter for none): # enter rocky password: 'rocky' @@ -820,7 +820,6 @@ Configure cloud-init for compute group: **Edit as root: `/etc/openchami/data/cloud-init/ci-group-compute.yaml`** ```yaml {title="/etc/openchami/data/cloud-init/ci-group-compute.yaml"} -cat < Date: Wed, 11 Feb 2026 14:13:58 +1000 Subject: [PATCH 05/23] Update 'Install Slurm' documentation with review suggestions from David. Changes include making it more clear when pwgen password is used, correcting the file creation step for slurm.conf to prevent errors, removing instructions for aliasing the build commend (and instead redirecting to the appropriate tutorial section), updating instructions inline with a recent PR to replace MinIO with Versity S3 and some minor typo fixes Signed-off-by: Luna Morrow --- content/docs/guides/install_slurm.md | 135 ++++++--------------------- 1 file changed, 26 insertions(+), 109 deletions(-) diff --git a/content/docs/guides/install_slurm.md b/content/docs/guides/install_slurm.md index baf1185..09ce53f 100644 --- a/content/docs/guides/install_slurm.md +++ b/content/docs/guides/install_slurm.md @@ -16,12 +16,12 @@ seo: # 0 Overview -This guide walks through setting up Slurm on a OpenCHAMI cluster. This guide will assume you have already setup an OpenCHAMI cluster per Sections 1-2.4.2 in the [OpenCHAMI Tutorial](https://openchami.org/docs/tutorial/). The only other requirement is to run a webserver to serve a Slurm repo for the image builder to use, and this guide assumes Podman is present to use. 
This guide will only walk through Slurm setup for a cluster with one head node and one compute node, but is easily expanded to multiple compute nodes by updating the node list with ochami. +This guide walks through setting up Slurm on a OpenCHAMI cluster. This guide will assume you have already setup an OpenCHAMI cluster per Sections 1-2.6 in the [OpenCHAMI Tutorial](https://openchami.org/docs/tutorial/). The only other requirement is to run a webserver to serve a Slurm repo for the image builder to use, and this guide assumes Podman is present to use. This guide will only walk through Slurm setup for a cluster with one head node and one compute node, but is easily expanded to multiple compute nodes by updating the node list with ochami. ## 0.1 Prerequisites {{< callout context="note" title="Note" icon="outline/info-circle" >}} -This guide assumes you have setup an OpenCHAMI cluster per Sections 1-2.4.2 in the OpenCHAMI tutorial. +This guide assumes you have setup an OpenCHAMI cluster per Sections 1-2.6 in the OpenCHAMI tutorial. {{< /callout >}} ## 0.2 Contents @@ -205,7 +205,7 @@ Enable and start the mariaDB service as this is a single node cluster (so we are sudo systemctl enable --now mariadb ``` -Secure the mariaDB installation with a strong root password. Use ‘pwgen’ to generate a password and store this password securely: +Secure the mariaDB installation with a strong root password. Use `pwgen` to generate a password and store this password securely. You will use the `pwgen` password to setup and configure MariaDB, as well as to create a database for Slurm to access the head node: ```bash sudo dnf -y install pwgen @@ -233,10 +233,10 @@ Remove test database and access to it? [Y/n] n Reload privilege tables now? [Y/n] Y ``` -Create the database and grant access to localhost and the head node: +Create the database and grant access to localhost and the head node. You will need the password you generated with `pwgen` in the above step. 
Make sure you edit the bash code provided below to replace `` with the actual password: ```bash -mysql -u root -p # enter the password from pwgen +mysql -u root -p # when prompted, enter the password from pwgen create database slurm_acct_db; grant all on slurm_acct_db.* to slurm@'localhost' identified by ''; @@ -290,7 +290,7 @@ sudo cp -p /etc/slurm/cgroup.conf.example /etc/slurm/cgroup.conf sudo chown -R slurm. /etc/slurm/ ``` -Modify the SlurmDB config: +Modify the SlurmDB config. You will need the `pwgen` generated password generated earlier when setting up MariaDB for this section: ```bash DBHOST=head @@ -310,10 +310,10 @@ sudo sed -i "s|PidFile.*|PidFile=/var/run/slurm/slurmdbd.pid|g" /etc/slurm/slurm sudo sed -i "s|#StorageLoc.*|StorageLoc=slurm_acct_db|g" /etc/slurm/slurmdbd.conf ``` -Create the Slurm config file, which will be used by SlurmCTL: +Create the Slurm config file, which will be used by SlurmCTL. Note that you may need to update the `NodeName` info depending on the configuration of your compute node. -```bash -cat <&1 | awk -F: '/configure arguments/ {print $2}' | xargs -n1 +nginx -V 2>&1 | awk -F: '/configure arguments/ {print $2}' | xargs -n1 | grep conf-path ``` -**Edit the Nginx config file as the rocky user: `/etc/nginx/nginx.conf`** +**Edit the Nginx config file as root: `/etc/nginx/nginx.conf`** ```bash {title="/etc/nginx/nginx.conf"} user nginx; worker_processes auto; @@ -582,11 +581,11 @@ options: - 'rocky9' pkg_manager: dnf gpgcheck: False - parent: '172.16.0.254:5000/demo/rocky-base:9' + parent: 'demo.openchami.cluster:5000/demo/rocky-base:9' registry_opts_pull: - '--tls-verify=false' - publish_s3: 'http://172.16.0.254:9000' + publish_s3: 'http://demo.openchami.cluster:7070' s3_prefix: 'compute/slurm/' s3_bucket: 'boot-images' @@ -628,15 +627,15 @@ packages: - slurm-torque-24.05.5 ``` -Run podman container to run image build command. +Run podman container to run image build command. 
The S3_ACCESS and S3_SECRET tokens are set in the tutorial [here](https://openchami.org/docs/tutorial/#233-install-and-configure-s3-clients). ```bash podman run \ --rm \ --device /dev/fuse \ --network host \ - -e S3_ACCESS=admin \ - -e S3_SECRET=admin123 \ + -e S3_ACCESS=${ROOT_ACCESS_KEY} \ + -e S3_SECRET=${ROOT_SECRET_KEY} \ -v /etc/openchami/data/images/compute-slurm-rocky9.yaml:/home/builder/config.yaml \ ghcr.io/openchami/image-build-el9:v0.1.2 \ image-build \ @@ -644,98 +643,17 @@ podman run \ --log-level DEBUG ``` -Check that the images built. - -```bash -s3cmd ls -Hr s3://boot-images/ | cut -d' ' -f 4- -``` - -Simplify the image build command for future image building if desired: - -**Edit as root: `/etc/profile.d/build-image.sh`** -```bash {title="/etc/profile.d/build-image.sh"} -build-image-rh9() -{ - if [ -z "$1" ]; then - echo 'Path to image config file required.' 1>&2; - return 1; - fi; - if [ ! -f "$1" ]; then - echo "$1 does not exist." 1>&2; - return 1; - fi; - podman run \ - --rm \ - --device /dev/fuse \ - -e S3_ACCESS=admin \ - -e S3_SECRET=admin123 \ - -v "$(realpath $1)":/home/builder/config.yaml:Z \ - ${EXTRA_PODMAN_ARGS} \ - ghcr.io/openchami/image-build-el9:v0.1.2 \ - image-build \ - --config config.yaml \ - --log-level DEBUG -} - -build-image-rh8() -{ - if [ -z "$1" ]; then - echo 'Path to image config file required.' 1>&2; - return 1; - fi; - if [ ! -f "$1" ]; then - echo "$1 does not exist." 1>&2; - return 1; - fi; - podman run \ - --rm \ - --device /dev/fuse \ - -e S3_ACCESS=admin \ - -e S3_SECRET=admin123 \ - -v "$(realpath $1)":/home/builder/config.yaml:Z \ - ${EXTRA_PODMAN_ARGS} \ - ghcr.io/openchami/image-build:v0.1.2 \ - image-build \ - --config config.yaml \ - --log-level DEBUG -} -alias build-image=build-image-rh9 -``` - -Apply simplified command to current session. 
Note that the command will be automatically applied during later logins, so you will not need to source it again: - -```bash -source /etc/profile.d/build-image.sh -``` +{{< callout context="note" title="Note" icon="outline/info-circle" >}} +If you have already aliased the image build command per the [tutorial](https://openchami.org/docs/tutorial/#233-install-and-configure-s3-clients), you can instead run: -Note: For future, you will be able to build images in the following way: +`build-image /etc/openchami/data/images/compute-slurm-rocky9.yaml` +{{< /callout >}} -```bash -build-image /path/to/image/config.yaml -``` -Check that the alias is being used: +Check that the images built. ```bash -which build-image -``` - -Output should look like the following: - -``` -alias build-image='build-image-rh9' - build-image-rh9 () - { - if [ -z "$1" ]; then - echo 'Path to image config file required.' 1>&2; - return 1; - fi; - if [ ! -f "$1" ]; then - echo "$1 does not exist." 1>&2; - return 1; - fi; - podman run --rm --device /dev/fuse -e S3_ACCESS=admin -e S3_SECRET=admin123 -v "$(realpath $1)":/home/builder/config.yaml:Z ${EXTRA_PODMAN_ARGS} ghcr.io/openchami/image-build-el9:v0.1.2 image-build --config config.yaml --log-level DEBUG - } +s3cmd ls -Hr s3://boot-images/ | cut -d' ' -f 4- ``` ## 1.5 Configure the Boot Script Service and Cloud-Init. @@ -898,10 +816,10 @@ Check all of the required packages were installed and from the correct sources: dnf list installed ``` -Create slurm config file that is identical to that of the head node VM: +Create slurm config file that is identical to that of the head node VM. Note that you may need to update the `NodeName` info depending on the configuration of your compute node: -```bash -cat < Date: Thu, 12 Feb 2026 13:32:39 +1000 Subject: [PATCH 06/23] Minor changes to 'Install Slurm' documentation based on review feedback from David. 
Signed-off-by: Luna Morrow 
---
 content/docs/guides/install_slurm.md | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/content/docs/guides/install_slurm.md b/content/docs/guides/install_slurm.md
index 09ce53f..0ce00e2 100644
--- a/content/docs/guides/install_slurm.md
+++ b/content/docs/guides/install_slurm.md
@@ -669,7 +669,7 @@ Create payload for boot script service with URIs for slurm compute boot artefact
 
 ```bash
 sudo mkdir /etc/openchami/data/boot/
-URIS=$(s3cmd ls -Hr s3://boot-images | grep compute/slurm | awk '{print $4}' | sed 's-s3://-http://172.16.0.254:9000/-' | xargs)
+URIS=$(s3cmd ls -Hr s3://boot-images | grep compute/slurm | awk '{print $4}' | sed 's-s3://-http://172.16.0.254:7070/-' | xargs)
 URI_IMG=$(echo "$URIS" | cut -d' ' -f1)
 URI_INITRAMFS=$(echo "$URIS" | cut -d' ' -f2)
 URI_KERNEL=$(echo "$URIS" | cut -d' ' -f3)
@@ -1055,16 +1055,20 @@ Find all directories owned by old munge UID/GID with the following command:
 `find / -uid 991 -type d`
 {{< /callout >}}
 
-Copy munge key from the head node VM to the compute node VM:
+Copy the munge key from the head node to the compute node.
+
+**Inside the head node:**
 
 ```bash
-# Inside the head node VM
 cd ~
 sudo cp /etc/munge/munge.key ./
 sudo chown rocky:rocky munge.key
 scp ./munge.key root@172.16.0.1:~/
+```
+
+**Inside the compute node:**
 
-# Inside the compute node VM
+```bash
 mv munge.key /etc/munge/munge.key
 chown munge:munge /etc/munge/munge.key
 ```
@@ -1074,6 +1078,8 @@ In the case of an error about "Offending ECDSA key in /home/rocky/.ssh/known_hos
 `> /home/rocky/.ssh/known_hosts`
 {{< /callout >}}
 
+Continuing **inside the compute node**, set up and start the services for Slurm.
+
 Enable and start munge service:
 
 ```bash

From 3b1acbe13815bf2dbf84ee32bb83f2a9aa35871e Mon Sep 17 00:00:00 2001
From: Luna Morrow 
Date: Fri, 13 Feb 2026 14:49:02 +1000
Subject: [PATCH 07/23] Minor changes to formatting following feedback from
 David and Devon.
Some reviews are still pending as I figure out the source of the problem and a solution, and I will address these in a later commit. Signed-off-by: Luna Morrow --- content/docs/guides/install_slurm.md | 12 ++++++++---- 1 file changed, 8 insertions(+), 4 deletions(-) diff --git a/content/docs/guides/install_slurm.md b/content/docs/guides/install_slurm.md index 0ce00e2..73d5b6d 100644 --- a/content/docs/guides/install_slurm.md +++ b/content/docs/guides/install_slurm.md @@ -1108,15 +1108,19 @@ nft list ruleset Restart Slurm service daemons to ensure changes are applied: -```bash -# In the compute node: -systemctl restart slurmd +**Inside the head node:** -# In the head node: +```bash sudo systemctl restart slurmctld sudo systemctl restart slurmdbd ``` +**Inside the compute node:** + +```bash +systemctl restart slurmd +``` + Test munge on the **head node VM**: ```bash From 5fb73fb4b162f8d01237c52e3402829256850408 Mon Sep 17 00:00:00 2001 From: Luna Morrow Date: Fri, 13 Feb 2026 15:30:43 +1000 Subject: [PATCH 08/23] Add support for bare-metal and cloud instance head nodes, in addition to VM head nodes. Signed-off-by: Luna Morrow --- content/docs/guides/install_slurm.md | 149 ++++++++++++++++++++++++--- 1 file changed, 135 insertions(+), 14 deletions(-) diff --git a/content/docs/guides/install_slurm.md b/content/docs/guides/install_slurm.md index 73d5b6d..b42ba04 100644 --- a/content/docs/guides/install_slurm.md +++ b/content/docs/guides/install_slurm.md @@ -572,6 +572,14 @@ curl http://localhost:8080/slurm-24.05.5/repodata/repomd.xml Create the compute Slurm image config file (uses the base image created in the tutorial as the parent layer): +{{< callout context="caution" title="Warning" icon="outline/alert-triangle" >}} +When writing YAML, it's important to be consistent with spacing. 
**It is +recommended to use spaces for all indentation instead of tabs.** + +When pasting, you may have to configure your editor to not apply indentation +rules (`:set paste` in Vim, `:set nopaste` to switch back). +{{< /callout >}} + **Edit as root: `/etc/openchami/data/images/compute-slurm-rocky9.yaml`** ```yaml {title="/etc/openchami/data/images/compute-slurm-rocky9.yaml"} options: @@ -769,8 +777,51 @@ ochami cloud-init group get config compute ochami cloud-init group render compute x1000c0s0b0n0 ``` -**In another window inside the VM host**, create compute1 compute node VM. Do NOT run the below command from inside the head node VM: +Boot the compute1 compute node VM from the compute Slurm image: +{{< callout context="note" title="Note" icon="outline/info-circle" >}} +If the head node is in a VM (see [**Head Node: Using Virtual +Machine**](https://openchami.org/docs/tutorial/#05-head-node-using-virtual-machine)), make sure to run the +`virt-install` command on the host! +{{< /callout >}} + +{{< tabs "install-compute-vm" >}} +{{< tab "Bare Metal Head" >}} + +```bash +sudo virt-install \ + --name compute1 \ + --memory 4096 \ + --vcpus 1 \ + --disk none \ + --pxe \ + --os-variant rocky9 \ + --network network=openchami-net,model=virtio,mac=52:54:00:be:ef:01 \ + --graphics none \ + --console pty,target_type=serial \ + --boot network,hd \ + --boot loader=/usr/share/OVMF/OVMF_CODE.secboot.fd,loader.readonly=yes,loader.type=pflash,nvram.template=/usr/share/OVMF/OVMF_VARS.fd,loader_secure=no \ + --virt-type kvm +``` +{{< /tab >}} +{{< tab "Cloud Instance Head" >}} +```bash +sudo virt-install \ + --name compute1 \ + --memory 4096 \ + --vcpus 1 \ + --disk none \ + --pxe \ + --os-variant rocky9 \ + --network network=openchami-net,model=virtio,mac=52:54:00:be:ef:01 \ + --graphics none \ + --console pty,target_type=serial \ + --boot network,hd \ + --boot 
loader=/usr/share/OVMF/OVMF_CODE.secboot.fd,loader.readonly=yes,loader.type=pflash,nvram.template=/usr/share/OVMF/OVMF_VARS.fd,loader_secure=no \
+  --virt-type kvm
+```
+{{< /tab >}}
+{{< tab "VM Head" >}}
 ```bash
 sudo virt-install \
   --name compute1 \
@@ -783,28 +834,98 @@ sudo virt-install \
   --graphics none \
   --console pty,target_type=serial \
   --boot network,hd \
-  --boot loader=/usr/share/OVMF/OVMF_CODE_4M.secboot.fd,loader.readonly=yes,loader.type=pflash,nvram.template=/usr/share/OVMF/OVMF_VARS_4M.fd,loader_secure=no \
+  --boot loader=/usr/share/OVMF/OVMF_CODE.secboot.fd,loader.readonly=yes,loader.type=pflash,nvram.template=/usr/share/OVMF/OVMF_VARS.fd,loader_secure=no \
   --virt-type kvm
 ```
+{{< /tab >}}
+{{< /tabs >}}
 
-Once PXE boot process is done, detach from the VM with `ctrl+]`. Log back into the virsh console if desired with `virsh console compute1`.
+{{< callout context="note" title="Note" icon="outline/info-circle" >}}
+If you receive the following error:
 
+`ERROR Failed to open file '/usr/share/OVMF/OVMF_VARS.fd': No such file or directory`
 
-{{< callout context="note" title="Note" icon="outline/info-circle" >}}
-If you make a mistake and need to reboot the compute node VM with an updated image, do the following:
+Repeat the command, but replace `OVMF_VARS.fd` with `OVMF_VARS_4M.fd` and replace `OVMF_CODE.secboot.fd` with `OVMF_CODE_4M.secboot.fd`.
+
+If this still fails, look under **/usr/share/OVMF** to check the names of the files there, as some distros store them under variant names.
+{{< /callout >}}
+
+
+Watch it boot. First, it should PXE:
+
+```
+>>Start PXE over IPv4.
+  Station IP address is 172.16.0.1
+
+  Server IP address is 172.16.0.254
+  NBP filename is ipxe-x86_64.efi
+  NBP filesize is 1079296 Bytes
+ Downloading NBP file...
+
+  NBP file downloaded successfully.
+BdsDxe: loading Boot0001 "UEFI PXEv4 (MAC:525400BEEF01)" from PciRoot(0x0)/Pci(0x1,0x0)/Pci(0x0,0x0)/MAC(525400BEEF01,0x1)/IPv4(0.0.0.0,0x0,DHCP,0.0.0.0,0.0.0.0,0.0.0.0)
+BdsDxe: starting Boot0001 "UEFI PXEv4 (MAC:525400BEEF01)" from PciRoot(0x0)/Pci(0x1,0x0)/Pci(0x0,0x0)/MAC(525400BEEF01,0x1)/IPv4(0.0.0.0,0x0,DHCP,0.0.0.0,0.0.0.0,0.0.0.0)
+iPXE initialising devices...
+autoexec.ipxe... Not found (https://ipxe.org/2d12618e)
+
+
+
+iPXE 1.21.1+ (ge9a2) -- Open Source Network Boot Firmware -- https://ipxe.org
+Features: DNS HTTP HTTPS iSCSI TFTP VLAN SRP AoE EFI Menu
+```
+
+Then, we should see it get its boot script from TFTP, then BSS (the `/boot/v1` URL), then download its kernel/initramfs and boot into Linux.
 
-In another window in the host, destroy existing compute node VM.
 
+```
+Configuring (net0 52:54:00:be:ef:01)...... ok
+tftp://172.16.0.254:69/config.ipxe... ok
+Booting from http://172.16.0.254:8081/boot/v1/bootscript?mac=52:54:00:be:ef:01
+http://172.16.0.254:8081/boot/v1/bootscript... ok
+http://172.16.0.254:7070/boot-images/efi-images/compute/debug/vmlinuz-5.14.0-611.24.1.el9_7.x86_64... ok
+http://172.16.0.254:7070/boot-images/efi-images/compute/debug/initramfs-5.14.0-611.24.1.el9_7.x86_64.img... ok
+```
+
+During Linux boot, output should indicate that the SquashFS image gets downloaded and loaded.
+
+```
+[    2.169210] dracut-initqueue[545]: % Total % Received % Xferd Average Speed Time Time Time Current
+[    2.170532] dracut-initqueue[545]: Dload Upload Total Spent Left Speed
+100 1356M  100 1356M    0     0  1037M      0  0:00:01  0:00:01 --:--:-- 1038M
+[    3.627908] squashfs: version 4.0 (2009/01/31) Phillip Lougher
+```
+
+Once PXE boot process is done, detach from the VM with `ctrl+]`. Log back into the virsh console if desired with `virsh console compute1`.
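+After detaching, the node may still be finishing its boot, so later `ssh`/`scp` steps can fail if run too early. The helper below is a sketch that is not part of the original guide; the compute address `172.16.0.1` and port 22 are assumptions matching this guide's addressing:
+
+```bash
+#!/usr/bin/env bash
+# Sketch: poll a TCP port until it opens, so follow-up ssh/scp steps are not
+# attempted while the compute node is still booting. Prints "up" on success
+# and "timed out" on failure.
+wait_for_port() {
+  local host="$1" port="$2" tries="${3:-60}"
+  local i
+  for ((i = 0; i < tries; i++)); do
+    # bash's built-in /dev/tcp redirection attempts a TCP connection
+    if (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null; then
+      echo "up"
+      return 0
+    fi
+    sleep 2
+  done
+  echo "timed out"
+  return 1
+}
+
+# Example (assumed cluster addressing):
+# wait_for_port 172.16.0.1 22 && ssh root@172.16.0.1
+```
+
+If the node never comes up, reattach with `virsh console compute1` to watch where the boot stalls.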
-`sudo virsh destroy compute1` +{{< callout context="tip" title="Tip" icon="outline/bulb" >}} +If the VM installation fails for any reason, it can be destroyed and undefined so that the install command can be run again. -Attach to the console to watch compute1 boot again. +1. Shut down ("destroy") the VM: + ```bash + sudo virsh destroy compute1 + ``` +1. Undefine the VM: + ```bash + sudo virsh undefine --nvram compute1 + ``` +1. Rerun the `virt-install` command above. -`sudo virsh start --console compute1` + +**Alternatively**, if you want to reboot the compute node VM with an updated image, do the following: + +```bash +sudo virsh destroy compute1 +sudo virsh start --console compute1 +``` {{< /callout >}} + ## 1.6 Configure and Start Slurm in the Compute Node -**From inside the head node VM**, log into the compute node: +Login as root to the compute node, ignoring its host key: + +{{< callout context="note" title="Note" icon="outline/info-circle" >}} +If using a VM head node, login from there. Else, login from host. +{{< /callout >}} ```bash ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no root@172.16.0.1 @@ -816,7 +937,7 @@ Check all of the required packages were installed and from the correct sources: dnf list installed ``` -Create slurm config file that is identical to that of the head node VM. Note that you may need to update the `NodeName` info depending on the configuration of your compute node: +Create slurm config file that is identical to that of the head node. 
Note that you may need to update the `NodeName` info depending on the configuration of your compute node:
 
 **Edit the Slurm config file as root: `/etc/slurm/slurm.conf`**
 ```bash {title="/etc/slurm/slurm.conf"}
@@ -1009,7 +1130,7 @@ mkdir /var/log/slurm
 chown slurm:slurm /var/log/slurm
 ```
 
-Creating job_container.conf file that matches the one in the head node VM:
+Creating job_container.conf file that matches the one in the head node:
 
 ```bash
 cat < /home/rocky/.ssh/known_hosts`
 {{< /callout >}}
 
-Quickly test that you can submit a job from the head node VM:
+Quickly test that you can submit a job from the head node:
 
 ```bash
 # Check that node is present and idle

From e48ad49ae6841056343ac23825582d0a476a32a1 Mon Sep 17 00:00:00 2001
From: Luna Morrow 
Date: Fri, 13 Feb 2026 16:32:53 +1000
Subject: [PATCH 09/23] Include 'expected output' code blocks where useful to
 show users how certain commands should behave and/or the output they should
 produce.

Signed-off-by: Luna Morrow 
---
 content/docs/guides/install_slurm.md | 295 +++++++++++++++++++++++++--
 1 file changed, 273 insertions(+), 22 deletions(-)

diff --git a/content/docs/guides/install_slurm.md b/content/docs/guides/install_slurm.md
index b42ba04..b2f073e 100644
--- a/content/docs/guides/install_slurm.md
+++ b/content/docs/guides/install_slurm.md
@@ -35,9 +35,11 @@ This guide assumes you have setup an OpenCHAMI cluster per Sections 1-2.6 in the
  - [1.3 Install Slurm and Setup Configuration Files](#13-install-slurm-and-setup-configuration-files)
  - [1.4 Make a Local Slurm Repository and Serve it with Nginx](#14-make-a-local-slurm-repository-and-serve-it-with-nginx)
  - [1.5 Configure the Boot Script Service and Cloud-Init](#15-configure-the-boot-script-service-and-cloud-init)
-  - [1.6 Configure and Start Slurm in the Compute Node](#16-configure-and-start-slurm-in-the-compute-node)
- 
+  - [1.6 Boot the Compute Node with the Slurm Compute Image](#16-boot-the-compute-node-with-the-slurm-compute-image)
+  - [1.7 Configure and
Start Slurm in the Compute Node](#17-configure-and-start-slurm-in-the-compute-node)
+  - [1.8 Test Munge and Slurm](#18-test-munge-and-slurm)
+ 
 
 # 1 Setup and Configure Slurm
 
 Steps in this section occur on the head node created in the OpenCHAMI tutorial (or otherwise).
@@ -148,6 +150,17 @@ Create the local repository (this will be used for installation and images later
 sudo createrepo /install/osupdates/rocky9/x86_64/slurm-24.05.5
 ```
 
+The output should be:
+
+```
+Directory walk started
+Directory walk done - 15 packages
+Temporary output repo path: /install/osupdates/rocky9/x86_64/slurm-24.05.5/.repodata/
+Preparing sqlite DBs
+Pool started (with 5 workers)
+Pool finished
+```
+
 ## 1.2 Configure Slurm and Slurm Services
 
 Create user and group ‘slurm’ with specified UID/GID:
@@ -564,12 +577,72 @@ http {
 
 Detach from the container with: `ctrl-P, then ctrl-Q`.
 
-Check everything is working by grabbing the repodata file from the head node:
+Check everything is working by grabbing the repodata file from inside the head node:
 
 ```bash
 curl http://localhost:8080/slurm-24.05.5/repodata/repomd.xml
 ```
 
+The output should be:
+
+```
+<?xml version="1.0" encoding="UTF-8"?>
+<repomd xmlns="http://linux.duke.edu/metadata/repo" xmlns:rpm="http://linux.duke.edu/metadata/rpm">
+  <revision>1770960915</revision>
+  <data type="primary">
+    <checksum type="sha256">4670c00aed4cc64e542e8b76f4f59ec4dd333a2e02258ddab5b7604874915dff</checksum>
+    <open-checksum type="sha256">04f66940b8479413f57cf15aa66d56624aede301f064356ee667ccf4594470ef</open-checksum>
+    <location href="repodata/4670c00aed4cc64e542e8b76f4f59ec4dd333a2e02258ddab5b7604874915dff-primary.xml.gz"/>
+    <timestamp>1770960914</timestamp>
+    <size>5336</size>
+    <open-size>33064</open-size>
+  </data>
+  <data type="filelists">
+    <checksum type="sha256">11b43e8e70d418dbe78a8c064ca42e18d63397729ba0710323034597f681d0a4</checksum>
+    <open-checksum type="sha256">1f2b8e754a2db5c26557ad2a7b9c8b6a210115a4263fb153bc0445dc8210b59c</open-checksum>
+    <location href="repodata/11b43e8e70d418dbe78a8c064ca42e18d63397729ba0710323034597f681d0a4-filelists.xml.gz"/>
+    <timestamp>1770960914</timestamp>
+    <size>11154</size>
+    <open-size>68224</open-size>
+  </data>
+  <data type="other">
+    <checksum type="sha256">a7f25375920bf5d30d9de42a6f4aeaa8105b1150bc9bef1440e700369bcdcf53</checksum>
+    <open-checksum type="sha256">da1da29e2d02a626986c3647032c175e0cb768d4d643c9020e2ccc343ced93e4</open-checksum>
+    <location href="repodata/a7f25375920bf5d30d9de42a6f4aeaa8105b1150bc9bef1440e700369bcdcf53-other.xml.gz"/>
+    <timestamp>1770960914</timestamp>
+    <size>1229</size>
+    <open-size>3354</open-size>
+  </data>
+  <data type="primary_db">
+    <checksum type="sha256">34f7acb86f91ab845250ed939181b88acb0be454e9c42eb99cb871e3241f75e4</checksum>
+    <open-checksum type="sha256">038901ed7c43b991becd931370b539c29ad5c7abffefc1ce6fc20cb8e1c1b7c7</open-checksum>
+    <location href="repodata/34f7acb86f91ab845250ed939181b88acb0be454e9c42eb99cb871e3241f75e4-primary.sqlite.bz2"/>
+    <timestamp>1770960915</timestamp>
+    <size>12132</size>
+    <open-size>131072</open-size>
+    <database_version>10</database_version>
+  </data>
+  <data type="filelists_db">
+    <checksum type="sha256">1912b17f136f28e892c9591e34abc1c8ef5b466df8eed7d6d1c5adadb200c6ad</checksum>
+    <open-checksum type="sha256">9eb023458e4570a8c3d9407e24ee52a94befc93785e71b1f72a5d90f314762e2</open-checksum>
+    <location href="repodata/1912b17f136f28e892c9591e34abc1c8ef5b466df8eed7d6d1c5adadb200c6ad-filelists.sqlite.bz2"/>
+    <timestamp>1770960915</timestamp>
+    <size>15917</size>
+    <open-size>73728</open-size>
+    <database_version>10</database_version>
+  </data>
+  <data type="other_db">
+    <checksum type="sha256">2872ebc347c2e5fe166907ba8341dc10ef9d0419261fac253cb6bab0d1eb046f</checksum>
+    <open-checksum type="sha256">5db7c12e76bde1a6b5739ad5c52481633d1dd87599e86ce4d84bae8fe4504db1</open-checksum>
+    <location href="repodata/2872ebc347c2e5fe166907ba8341dc10ef9d0419261fac253cb6bab0d1eb046f-other.sqlite.bz2"/>
+    <timestamp>1770960915</timestamp>
+    <size>1940</size>
+    <open-size>24576</open-size>
+    <database_version>10</database_version>
+  </data>
+</repomd>
+```
+
 Create the compute Slurm image config file (uses the base image created in the tutorial as the parent layer):
 
 {{< callout context="caution" title="Warning" icon="outline/alert-triangle" >}}
@@ -664,6 +737,14 @@ Check that the images built.
 s3cmd ls -Hr s3://boot-images/ | cut -d' ' -f 4-
 ```
 
+The output should be:
+
+```
+1615M s3://boot-images/compute/slurm/rocky9.7-compute-slurm-rocky9
+  84M s3://boot-images/efi-images/compute/slurm/initramfs-5.14.0-611.20.1.el9_7.x86_64.img
+  14M s3://boot-images/efi-images/compute/slurm/vmlinuz-5.14.0-611.20.1.el9_7.x86_64
+```
+
 ## 1.5 Configure the Boot Script Service and Cloud-Init.
 
 Get a fresh access token for ochami:
@@ -703,6 +784,26 @@ Check the BSS boot parameters were added:
 ochami bss boot params get -F yaml
 ```
 
+The output should be:
+
+```
+- cloud-init:
+    meta-data: null
+    phone-home:
+      fqdn: ""
+      hostname: ""
+      instance_id: ""
+      pub_key_dsa: ""
+      pub_key_ecdsa: ""
+      pub_key_rsa: ""
+    user-data: null
+  initrd: http://172.16.0.254:9000/boot-images/efi-images/compute/slurm/initramfs-5.14.0-611.20.1.el9_7.x86_64.img
+  kernel: http://172.16.0.254:9000/boot-images/efi-images/compute/slurm/vmlinuz-5.14.0-611.20.1.el9_7.x86_64
+  macs:
+  - 52:54:00:be:ef:01
+  params: nomodeset ro root=live:http://172.16.0.254:9000/boot-images/compute/slurm/rocky9.7-compute-slurm-rocky9 ip=dhcp overlayroot=tmpfs overlayroot_cfgdisk=disabled apparmor=0 selinux=0 console=ttyS0,115200 ip6=off cloud-init=enabled ds=nocloud-net;s=http://172.16.0.254:8081/cloud-init
+```
+
 Create new directory for setting up cloud-init configuration:
 
 ```bash
 sudo mkdir -p /etc/openchami/data/cloud-init
 cd /etc/openchami/data/cloud-init
 ```
 
-Create new ssh key:
+Create new ssh key on the head node and press **Enter** for all of the prompts:
 
 ```bash
ssh-keygen -t ed25519
 ```
-Note: press `Enter` for all prompts to make ssh straightforward.
 
-Setup the cloud-init configuration:
+The new key that was generated can be found in `~/.ssh/id_ed25519.pub`. This key will need to be used in the cloud-init meta-data configured below.
+
+Setup the cloud-init configuration by creating `ci-defaults.yaml`:
 
 ```bash
 cat <"
+  ],
+  "short-name": "nid"
+}
+```
+
 Configure cloud-init for compute group:
 
 **Edit as root: `/etc/openchami/data/cloud-init/ci-group-compute.yaml`**
@@ -764,19 +882,65 @@ Configure cloud-init for compute group:
     disable_root: false
 ```
 
-Set config for compute group with ochami and check they are set:
+Now, set this configuration for the compute group:
 
 ```bash
-# Set compute group config with ochami
 ochami cloud-init group set -f yaml -d @/etc/openchami/data/cloud-init/ci-group-compute.yaml
+```
+
+Check that it got added with:
 
-# Check compute group config set
+```bash
 ochami cloud-init group get config compute
+```
+
+The cloud-config file created within the YAML above should be printed:
+
+```yaml
+## template: jinja
+#cloud-config
+merge_how:
+- name: list
+  settings: [append]
+- name: dict
+  settings: [no_replace, recurse_list]
+users:
+  - name: root
+    ssh_authorized_keys: {{ ds.meta_data.instance_data.v1.public_keys }}
+disable_root: false
+```
 
-# Check jinja2 rendering properly for the compute node
+`ochami` has basic per-group template rendering available that can be used to
+check that the Jinja2 is rendering properly for a node. Check it for the first
+compute node (x1000c0s0b0n0):
+
+```bash
 ochami cloud-init group render compute x1000c0s0b0n0
 ```
 
+{{< callout context="note" title="Note" icon="outline/info-circle" >}}
+This feature requires that impersonation is enabled with cloud-init. Check and
+make sure that the `IMPERSONATION` environment variable is set in
+`/etc/openchami/configs/openchami.env`.
+{{< /callout >}} + +The SSH key that was created above should appear in the config: + +```yaml +#cloud-config +merge_how: +- name: list + settings: [append] +- name: dict + settings: [no_replace, recurse_list] +users: + - name: root + ssh_authorized_keys: [''] +``` + + +## 1.6 Boot the Compute Node with the Slurm Compute Image + Boot the compute1 compute node VM from the compute Slurm image: {{< callout context="note" title="Note" icon="outline/info-circle" >}} @@ -919,7 +1083,7 @@ sudo virsh start --console compute1 {{< /callout >}} -## 1.6 Configure and Start Slurm in the Compute Node +## 1.7 Configure and Start Slurm in the Compute Node Login as root to the compute node, ignoring its host key: @@ -931,7 +1095,7 @@ If using a VM head node, login from there. Else, login from host. ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no root@172.16.0.1 ``` -Check all of the required packages were installed and from the correct sources: +Check all of the required packages were installed and from the correct sources (e.g. slurm packages should be installed from `@slurm-local` repo): ```bash dnf list installed @@ -1120,7 +1284,7 @@ chown -R slurm:slurm /var/lib/slurm ``` {{< callout context="note" title="Note" icon="outline/info-circle" >}} -Use `find / -name "slurm"` to make sure everything that needs to be changed is identified (note that not all results need ownership modified though!!) +Use `find / -name "slurm"` to make sure everything that needs to be changed is identified. Note that not all results need ownership modified though, such as directories under `/run/`, `/usr/` or `/var/`! 
{{< /callout >}} Create the directory /var/log/slurm as it doesn't exist yet, and set ownership to Slurm: @@ -1173,6 +1337,7 @@ chown -R munge:munge /etc/munge/ {{< callout context="note" title="Note" icon="outline/info-circle" >}} Find all directories owned by old munge UID/GID with the following command: + `find / -uid 991 -type d` {{< /callout >}} @@ -1196,6 +1361,7 @@ chown munge:munge /etc/munge/munge.key {{< callout context="note" title="Note" icon="outline/info-circle" >}} In the case of an error about "Offending ECDSA key in /home/rocky/.ssh/known_hosts:3", wipe the contents of the known hosts file and try the 'scp' command again: + `> /home/rocky/.ssh/known_hosts` {{< /callout >}} @@ -1209,6 +1375,23 @@ systemctl start munge.service systemctl status munge.service ``` +The output should be: + +``` +● munge.service - MUNGE authentication service + Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; preset: disabled) + Active: active (running) since Wed 2026-02-04 00:55:55 UTC; 1 week 2 days ago + Docs: man:munged(8) + Main PID: 1451 (munged) + Tasks: 4 (limit: 24335) + Memory: 2.2M (peak: 2.5M) + CPU: 4.710s + CGroup: /system.slice/munge.service + └─1451 /usr/sbin/munged + +Feb 04 00:55:55 de01 systemd[1]: Started MUNGE authentication service. +``` + Enable and start slurmd: ```bash @@ -1217,7 +1400,32 @@ systemctl start slurmd systemctl status slurmd ``` -Disable the firewall in the compute node: +The output should be: + +``` +● slurmd.service - Slurm node daemon + Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; preset: disabled) + Active: active (running) since Fri 2026-02-13 05:59:32 UTC; 4s ago + Main PID: 30727 (slurmd) + Tasks: 1 + Memory: 1.3M (peak: 1.5M) + CPU: 16ms + CGroup: /system.slice/slurmd.service + └─30727 /usr/sbin/slurmd --systemd + +Feb 13 05:59:32 de01.openchami.cluster systemd[1]: Stopped Slurm node daemon. 
+Feb 13 05:59:32 de01.openchami.cluster systemd[1]: slurmd.service: Consumed 3.533s CPU time, 3.0M memory peak. +Feb 13 05:59:32 de01.openchami.cluster systemd[1]: Starting Slurm node daemon... +Feb 13 05:59:32 de01.openchami.cluster slurmd[30727]: slurmd: _read_slurm_cgroup_conf: No cgroup.conf file (/etc/slurm/cgroup.conf), using defaults +Feb 13 05:59:32 de01.openchami.cluster slurmd[30727]: _read_slurm_cgroup_conf: No cgroup.conf file (/etc/slurm/cgroup.conf), using defaults +Feb 13 05:59:32 de01.openchami.cluster slurmd[30727]: slurmd: CPU frequency setting not configured for this node +Feb 13 05:59:32 de01.openchami.cluster slurmd[30727]: slurmd: slurmd version 24.05.5 started +Feb 13 05:59:32 de01.openchami.cluster slurmd[30727]: slurmd: slurmd started on Fri, 13 Feb 2026 05:59:32 +0000 +Feb 13 05:59:32 de01.openchami.cluster systemd[1]: Started Slurm node daemon. +Feb 13 05:59:32 de01.openchami.cluster slurmd[30727]: slurmd: CPUs=1 Boards=1 Sockets=1 Cores=1 Threads=1 Memory=3892 TmpDisk=778 Uptime=796812 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null) +``` + +Disable the firewall and reset the nft ruleset in the compute node: ```bash systemctl stop firewalld @@ -1242,6 +1450,8 @@ sudo systemctl restart slurmdbd systemctl restart slurmd ``` +## 1.8 Test Munge and Slurm + Test munge on the **head node**: ```bash @@ -1249,27 +1459,68 @@ Test munge on the **head node**: munge -n | ssh root@172.16.0.1 unmunge ``` +The output should be: + +``` +STATUS: Success (0) +ENCODE_HOST: ??? (192.168.200.2) +ENCODE_TIME: 2026-02-13 05:33:34 +0000 (1770960814) +DECODE_TIME: 2026-02-13 05:33:34 +0000 (1770960814) +TTL: 300 +CIPHER: aes128 (4) +MAC: sha256 (5) +ZIP: none (0) +UID: ??? (1000) +GID: ??? 
(1000) +LENGTH: 0 +``` + {{< callout context="note" title="Note" icon="outline/info-circle" >}} In the case of an error about "Offending ECDSA key in /home/rocky/.ssh/known_hosts:3", wipe the contents of the known hosts file and try the 'munge' command again: + `> /home/rocky/.ssh/known_hosts` {{< /callout >}} -Quickly test that you can submit a job from the head node: +Test that you can submit a job from the **head node**. + +Check that the node is present and idle: ```bash -# Check that node is present and idle sinfo +``` + +The output should be: + +``` +PARTITION AVAIL TIMELIMIT NODES STATE NODELIST +main* up infinite 1 idle de01 +``` + +Create user with a Slurm account: -# Create user with Slurm account +```bash sudo useradd -m -s /bin/bash testuser sudo usermod -aG wheel testuser sudo sacctmgr create user testuser defaultaccount=root sudo su - testuser +``` + +Run a test job as the user 'testuser': -# Run a test job as the user 'testuser' +```bash srun hostname ``` +The output should be: + +``` +de01 +``` + If something goes wrong and your compute node goes down, restart it with this command: -`sudo scontrol update NodeName=de01 State=RESUME` + +```bash +sudo scontrol update NodeName=de01 State=RESUME +``` + From 9831dbcd85e9adcd52f89512e1e2ba1d6a2c0239 Mon Sep 17 00:00:00 2001 From: Luna Morrow Date: Thu, 19 Feb 2026 16:10:06 +1000 Subject: [PATCH 10/23] Add support for installing version 0.5.18 of munge to prevent known security vulnerabilities with versions 0.5-0.5.17 Signed-off-by: Luna Morrow --- content/docs/guides/install_slurm.md | 49 ++++++++++++++++++++++++++-- 1 file changed, 47 insertions(+), 2 deletions(-) diff --git a/content/docs/guides/install_slurm.md b/content/docs/guides/install_slurm.md index b2f073e..dbd3c7c 100644 --- a/content/docs/guides/install_slurm.md +++ b/content/docs/guides/install_slurm.md @@ -46,6 +46,51 @@ Steps in this section occur on the head node created in the OpenCHAMI tutorial ( ## 1.1 Setup Slurm Build/Installation as a 
Local Repository
 
+Install version 0.5.18 of munge. Versions 0.5-0.5.17 have a significant security vulnerability, so it is important to use version 0.5.18 instead of 0.5.13, the version available through dnf for Rocky Linux 9. For more information see: [https://nvd.nist.gov/vuln/detail/CVE-2026-25506](https://nvd.nist.gov/vuln/detail/CVE-2026-25506)
+
+
+Grab the munge 0.5.18 release tarball from GitHub:
+
+```bash
+curl -sL https://github.com/dun/munge/releases/download/munge-0.5.18/munge-0.5.18.tar.xz -o munge-0.5.18.tar.xz
+```
+
+Build a source RPM from the tarball, install its build dependencies, then build the binary packages:
+
+```bash
+rpmbuild -ts munge-0.5.18.tar.xz
+
+sudo dnf builddep /home/rocky/rpmbuild/SRPMS/munge-0.5.18-1.el9.src.rpm
+
+rpmbuild -tb munge-0.5.18.tar.xz
+```
+
+Install the RPMs created by rpmbuild:
+
+```bash
+cd ~/rpmbuild
+
+sudo rpm --install --verbose --force \
+  RPMS/x86_64/munge-0.5.18-1.el9.x86_64.rpm \
+  RPMS/x86_64/munge-debuginfo-0.5.18-1.el9.x86_64.rpm \
+  RPMS/x86_64/munge-debugsource-0.5.18-1.el9.x86_64.rpm \
+  RPMS/x86_64/munge-devel-0.5.18-1.el9.x86_64.rpm \
+  RPMS/x86_64/munge-libs-0.5.18-1.el9.x86_64.rpm \
+  RPMS/x86_64/munge-libs-debuginfo-0.5.18-1.el9.x86_64.rpm
+```
+
+Check that munge was installed correctly:
+
+```bash
+munge --version
+```
+
+The output should be:
+
+```
+munge-0.5.18 (2026-02-10)
+```
+
 Download Slurm pre-requisite sources compatible with Rocky 9 OS:
 
 ```bash
@@ -59,7 +104,7 @@ sudo dnf groupinstall -y 'Development Tools' && \
 sudo dnf install -y createrepo freeipmi freeipmi-devel dbus-devel gtk2-devel hdf5 hdf5-devel http-parser-devel \
   hwloc hwloc-devel jq json-c-devel libaec libconfuse libcurl-devel libevent-devel \
   libyaml libyaml-devel lua-devel lua-filesystem lua-json lua-lpeg lua-posix lua-term mariadb mariadb-devel \
-  munge munge-devel munge-libs ncurses-devel numactl numactl-devel oniguruma openssl-devel pam-devel \
+  ncurses-devel numactl numactl-devel oniguruma openssl-devel pam-devel \
   perl-DBI
perl-ExtUtils-MakeMaker perl-Switch pigz python3 python3-devel readline-devel \ lsb_release rrdtool rrdtool-devel tcl tcl-devel ucx ucx-cma ucx-devel ucx-ib wget \ lz4-devel s2n-tls-devel libjwt-devel librdkafka-devel && \ @@ -184,7 +229,7 @@ sudo chown munge:munge /var/lib/munge/ sudo chown munge:munge /etc/munge/ # Create munge key -sudo create-munge-key +sudo -u munge /usr/sbin/mungekey -v # Start munge again sudo systemctl enable --now munge From 5adfb38f79ce5a6bb6e9f76b9d6d9c6700300600 Mon Sep 17 00:00:00 2001 From: Luna Morrow Date: Fri, 20 Feb 2026 15:53:59 +1000 Subject: [PATCH 11/23] Update image build command to install munge version 0.5.18 into the compute node. Additionally made some tweaks to the documentation to make the workflow more robust after repeating it on a fresh node. Signed-off-by: Luna Morrow --- content/docs/guides/install_slurm.md | 44 ++++++++++++++++++---------- 1 file changed, 29 insertions(+), 15 deletions(-) diff --git a/content/docs/guides/install_slurm.md b/content/docs/guides/install_slurm.md index dbd3c7c..5deac7f 100644 --- a/content/docs/guides/install_slurm.md +++ b/content/docs/guides/install_slurm.md @@ -374,7 +374,7 @@ Create the Slurm config file, which will be used by SlurmCTL. 
Note that you may ```bash {title="/etc/slurm/slurm.conf"} # ClusterName=demo -SlurmctldHost=demo.openchami.cluster +SlurmctldHost=demo # #DisableRootJobs=NO EnforcePartLimits=ALL @@ -707,11 +707,11 @@ options: - 'rocky9' pkg_manager: dnf gpgcheck: False - parent: 'demo.openchami.cluster:5000/demo/rocky-base:9' + parent: 'master.openchami.cluster:5000/demo/rocky-base:9' registry_opts_pull: - '--tls-verify=false' - publish_s3: 'http://demo.openchami.cluster:7070' + publish_s3: 'http://master.openchami.cluster:7070' s3_prefix: 'compute/slurm/' s3_bucket: 'boot-images' @@ -722,6 +722,7 @@ repos: - alias: 'Slurm' url: 'http://localhost:8080/slurm-24.05.5' + packages: - boxes - figlet @@ -730,6 +731,8 @@ packages: - tcpdump - traceroute - vim + - curl + - rpm-build - shadow-utils - pwgen - jq @@ -751,6 +754,15 @@ packages: - slurm-slurmdbd-24.05.5 - slurm-slurmrestd-24.05.5 - slurm-torque-24.05.5 + +cmds: + - cmd: 'curl -sL https://github.com/dun/munge/releases/download/munge-0.5.18/munge-0.5.18.tar.xz -o munge-0.5.18.tar.xz' + - cmd: 'rpmbuild -ts munge-0.5.18.tar.xz' + - cmd: 'dnf builddep -y /root/rpmbuild/SRPMS/munge-0.5.18-1.el9.src.rpm' + - cmd: 'rpmbuild -tb munge-0.5.18.tar.xz' + - cmd: 'cd /root/rpmbuild' + - cmd: 'rpm --install --verbose --force /root/rpmbuild/RPMS/x86_64/munge-0.5.18-1.el9.x86_64.rpm /root/rpmbuild/RPMS/x86_64/munge-debuginfo-0.5.18-1.el9.x86_64.rpm /root/rpmbuild/RPMS/x86_64/munge-debugsource-0.5.18-1.el9.x86_64.rpm /root/rpmbuild/RPMS/x86_64/munge-devel-0.5.18-1.el9.x86_64.rpm /root/rpmbuild/RPMS/x86_64/munge-libs-0.5.18-1.el9.x86_64.rpm /root/rpmbuild/RPMS/x86_64/munge-libs-debuginfo-0.5.18-1.el9.x86_64.rpm' + - cmd: 'dnf remove -y munge-libs-0.5.13-13.el9 munge-0.5.13-13.el9' ``` Run podman container to run image build command. The S3_ACCESS and S3_SECRET tokens are set in the tutorial [here](https://openchami.org/docs/tutorial/#233-install-and-configure-s3-clients). 
@@ -1152,7 +1164,7 @@ Create slurm config file that is identical to that of the head node. Note that y ```bash {title="/etc/slurm/slurm.conf"} # ClusterName=demo -SlurmctldHost=demo.openchami.cluster +SlurmctldHost=demo # #DisableRootJobs=NO EnforcePartLimits=ALL @@ -1342,13 +1354,13 @@ chown slurm:slurm /var/log/slurm Creating job_container.conf file that matches the one in the head node: ```bash -cat < Date: Tue, 24 Feb 2026 10:10:41 +1000 Subject: [PATCH 12/23] Update short hostname from 'head' to 'demo', as I missed this update in a few places Signed-off-by: Luna Morrow --- content/docs/guides/install_slurm.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/content/docs/guides/install_slurm.md b/content/docs/guides/install_slurm.md index 5deac7f..9d28597 100644 --- a/content/docs/guides/install_slurm.md +++ b/content/docs/guides/install_slurm.md @@ -299,7 +299,7 @@ mysql -u root -p # when prompted, enter the password from pwgen create database slurm_acct_db; grant all on slurm_acct_db.* to slurm@'localhost' identified by ''; grant all on slurm_acct_db.* to slurm@'demo.openchami.cluster' identified by ''; -grant all on slurm_acct_db.* to slurm@'head' identified by ''; +grant all on slurm_acct_db.* to slurm@'demo' identified by ''; exit ``` @@ -351,9 +351,9 @@ sudo chown -R slurm. /etc/slurm/ Modify the SlurmDB config. 
You will need the `pwgen` generated password generated earlier when setting up MariaDB for this section: ```bash -DBHOST=head +DBHOST=demo DBPASSWORD= # EDIT TO THE PASSWORD SET IN THE MARIADB CONFIGURATION SECTION -SLURMDBHOST1=head +SLURMDBHOST1=demo sudo sed -i "s|DbdAddr.*|DbdAddr=${SLURMDBHOST1}|g" /etc/slurm/slurmdbd.conf sudo sed -i "s|DbdHost.*|DbdHost=${SLURMDBHOST1}|g" /etc/slurm/slurmdbd.conf @@ -542,7 +542,7 @@ Configure the hosts file with addresses for both the head node and the compute n ```bash cat < Date: Wed, 4 Mar 2026 11:45:51 +1000 Subject: [PATCH 13/23] Update 'Choose Your Own Adventure' section of tutorial to include reference to the 'Install Slurm' guide Signed-off-by: Luna Morrow --- content/docs/tutorial.md | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/content/docs/tutorial.md b/content/docs/tutorial.md index bd6af5e..1f7caf5 100644 --- a/content/docs/tutorial.md +++ b/content/docs/tutorial.md @@ -3147,3 +3147,10 @@ This can be done in at least two ways here: - Alternatively, the necessary SLURM and MPI packages can be installed via the cloud-init config. + +### 3.6 Deploy Slurm Workload Manager + +Follow the guide for [Slurm Installation](https://openchami.org/docs/guides/install-slurm/) +to set up and configure the Slurm Workload Manager on the OpenCHAMI cluster +created during this tutorial. Complete up to [Section 2.6](https://openchami.org/docs/tutorial/#26-boot-the-compute-node-with-the-debug-image) +of this tutorial, then begin the "Install Slurm" guide. From 77a36e93deda365902bfad1bf83b8d975a478cab Mon Sep 17 00:00:00 2001 From: Luna Morrow Date: Wed, 4 Mar 2026 15:51:24 +1000 Subject: [PATCH 14/23] Apply reviewed suggestions from Devon, excluding leveraging Cloud-Init and the image config to reduce the number of commands needing to be run on the compute node. We are waiting on feedback from David and Alex before potentially implementing a more persistent Slurm configuration on the compute node/s.
Signed-off-by: Luna Morrow --- content/docs/guides/install_slurm.md | 210 +++++++++++++++------------ 1 file changed, 121 insertions(+), 89 deletions(-) diff --git a/content/docs/guides/install_slurm.md b/content/docs/guides/install_slurm.md index 9d28597..50c3339 100644 --- a/content/docs/guides/install_slurm.md +++ b/content/docs/guides/install_slurm.md @@ -16,7 +16,7 @@ seo: # 0 Overview -This guide walks through setting up Slurm on a OpenCHAMI cluster. This guide will assume you have already setup an OpenCHAMI cluster per Sections 1-2.6 in the [OpenCHAMI Tutorial](https://openchami.org/docs/tutorial/). The only other requirement is to run a webserver to serve a Slurm repo for the image builder to use, and this guide assumes Podman is present to use. This guide will only walk through Slurm setup for a cluster with one head node and one compute node, but is easily expanded to multiple compute nodes by updating the node list with ochami. +This guide walks through setting up Slurm on an OpenCHAMI cluster. It assumes you have already set up an OpenCHAMI cluster per Sections 1-2.6 in the [OpenCHAMI Tutorial](https://openchami.org/docs/tutorial/), and therefore assumes the rocky user on a Rocky Linux 9 system; substitute your normal user for the rocky user if you set up an OpenCHAMI cluster outside of the tutorial. The only other requirement is a webserver to serve a Slurm repo for the image builder, and this guide assumes Podman is available for that purpose. This guide only walks through Slurm setup for a cluster with one head node and one compute node, but it is easily expanded to multiple compute nodes by updating the node list with ochami. ## 0.1 Prerequisites @@ -48,6 +48,11 @@ Steps in this section occur on the head node created in the OpenCHAMI tutorial ( Install version 0.5.18 of munge.
Versions 0.5-0.5.17 have a significant security vulnerability, so it is important that version 0.5.18 is used instead of 0.5.13 which is available through dnf for Rocky Linux 9. For more information see: [https://nvd.nist.gov/vuln/detail/CVE-2026-25506](https://nvd.nist.gov/vuln/detail/CVE-2026-25506) +Change into working directory (created in [Section 1.1](https://openchami.org/docs/tutorial/#11-set-up-storage-directories) of the Tutorial), so that any files that are created are put here. + +```bash +cd /opt/workdir +``` Grab munge version 0.5.18 release tarball from GitHub: @@ -58,9 +63,11 @@ curl -sL https://github.com/dun/munge/releases/download/munge-0.5.18/munge-0.5.1 Convert tarball to rpm package, build dependencies and build binary package: ```bash +sudo dnf install -y rpm-build rpmdevtools + rpmbuild -ts munge-0.5.18.tar.xz -sudo dnf builddep /home/rocky/rpmbuild/SRPMS/munge-0.5.18-1.el9.src.rpm +sudo dnf builddep /opt/workdir/rpmbuild/SRPMS/munge-0.5.18-1.el9.src.rpm rpmbuild -tb munge-0.5.18.tar.xz ``` @@ -68,15 +75,13 @@ rpmbuild -tb munge-0.5.18.tar.xz Install rpms created by rpmbuild: ```bash -cd ~/rpmbuild - sudo rpm --install --verbose --force \ - RPMS/x86_64/munge-0.5.18-1.el9.x86_64.rpm \ - RPMS/x86_64/munge-debuginfo-0.5.18-1.el9.x86_64.rpm \ - RPMS/x86_64/munge-debugsource-0.5.18-1.el9.x86_64.rpm \ - RPMS/x86_64/munge-devel-0.5.18-1.el9.x86_64.rpm \ - RPMS/x86_64/munge-libs-0.5.18-1.el9.x86_64.rpm \ - RPMS/x86_64/munge-libs-debuginfo-0.5.18-1.el9.x86_64.rpm + rpmbuild/RPMS/x86_64/munge-0.5.18-1.el9.x86_64.rpm \ + rpmbuild/RPMS/x86_64/munge-debuginfo-0.5.18-1.el9.x86_64.rpm \ + rpmbuild/RPMS/x86_64/munge-debugsource-0.5.18-1.el9.x86_64.rpm \ + rpmbuild/RPMS/x86_64/munge-devel-0.5.18-1.el9.x86_64.rpm \ + rpmbuild/RPMS/x86_64/munge-libs-0.5.18-1.el9.x86_64.rpm \ + rpmbuild/RPMS/x86_64/munge-libs-debuginfo-0.5.18-1.el9.x86_64.rpm ``` Check that munge was installed correctly: @@ -117,8 +122,9 @@ Create build script to install Slurm 24.05.5 and PMIX 
4.2.9-1: This guide installs Slurm 24.05.5 and PMIX 4.2.9-1 to ensure compatibility. Other versions can be installed instead, but make sure to check version compatibility first. {{< /callout >}} -**Create file as rocky user: 'home/rocky/build.sh'** -```bash {title="home/rocky/build.sh"} +**Edit as normal user: `/opt/workdir/build.sh`** + +```bash {title="/opt/workdir/build.sh"} SLURMVERSION=${1:-24.05.5} PMIXVERSION=${2:-4.2.9-1} ELRELEASE=${3:-el9} #Rocky 9 @@ -167,8 +173,8 @@ fi Adjust permissions for build script so that it is executable, and execute it with **root** privileges: ```bash -chmod 755 /home/rocky/build.sh -sudo ./build.sh +chmod 755 /opt/workdir/build.sh +sudo /opt/workdir/build.sh ``` {{< callout context="note" title="Note" icon="outline/info-circle" >}} @@ -185,14 +191,14 @@ configure: WARNING: unable to build man page html files without man2html Copy the Slurm packages to the desired location to create the local repository: ```bash -sudo mkdir -p /install/osupdates/rocky9/x86_64/ -sudo cp -r slurm/9.7/24.05.5 /install/osupdates/rocky9/x86_64/slurm-24.05.5 +sudo mkdir -p /srv/repo/rocky/9/x86_64/ +sudo cp -r /opt/workdir/slurm/9.7/24.05.5 /srv/repo/rocky/9/x86_64/slurm-24.05.5 ``` Create the local repository (this will be used for installation and images later): ```bash -sudo createrepo /install/osupdates/rocky9/x86_64/slurm-24.05.5 +sudo createrepo /srv/repo/rocky/9/x86_64/slurm-24.05.5 ``` The output should be: @@ -200,7 +206,7 @@ The output should be: ``` Directory walk started Directory walk done - 15 packages -Temporary output repo path: /install/osupdates/rocky9/x86_64/slurm-24.05.5/.repodata/ +Temporary output repo path: /srv/repo/rocky/9/x86_64/slurm-24.05.5/.repodata/ Preparing sqlite DBs Pool started (with 5 workers) Pool finished @@ -216,6 +222,14 @@ sudo groupadd -g $SLURMID slurm sudo useradd -m -c "Slurm workload manager" -d /var/lib/slurm -u $SLURMID -g slurm -s /sbin/nologin slurm ``` +{{< callout context="note" title="Note" 
icon="outline/info-circle" >}} +The following warning is expected and can be ignored, as the 'slurm' user is a system service account and may therefore have a UID below 1000. + +``` +useradd warning: slurm's uid 666 outside of the UID_MIN 1000 and UID_MAX 60000 range. +``` +{{< /callout >}} + Update the UID and GID of ‘munge’ user and group to 616, update directory ownership, create munge key and restart the munge service: ```bash @@ -263,18 +277,19 @@ Enable and start the mariaDB service as this is a single node cluster (so we are sudo systemctl enable --now mariadb ``` -Secure the mariaDB installation with a strong root password. Use `pwgen` to generate a password and store this password securely. You will use the `pwgen` password to setup and configure MariaDB, as well as to create a database for Slurm to access the head node: +Secure the mariaDB installation with a strong root password. Use `pwgen` to generate a password, and make sure to store this password securely. You will use the `pwgen` password to set up and configure MariaDB, as well as to create a database for Slurm to access the head node: ```bash sudo dnf -y install pwgen -pwgen 20 1 # generates 1 password of length 20 characters +export SQL_PWORD="$(pwgen 20 1)" +echo "${SQL_PWORD}" # copy output for interactive prompts and so you can store it somewhere securely sudo mysql_secure_installation ``` **MariaDB setup/settings should be done as follows:** ``` -Enter current password for root (enter for none): # enter rocky password: 'rocky' +Enter current password for root (enter for none): # enter user password (e.g. "rocky" if following tutorial) Switch to unix_socket authentication [Y/n] Y @@ -291,16 +306,16 @@ Remove test database and access to it? [Y/n] n Reload privilege tables now? [Y/n] Y ``` -Create the database and grant access to localhost and the head node. You will need the password you generated with `pwgen` in the above step.
Make sure you edit the bash code provided below to replace `` with the actual password: +Create the database and grant access to localhost and the head node. You will need the password you generated with `pwgen` in the above step when prompted to "Enter password:": ```bash -mysql -u root -p # when prompted, enter the password from pwgen - +cat <'; -grant all on slurm_acct_db.* to slurm@'demo.openchami.cluster' identified by ''; -grant all on slurm_acct_db.* to slurm@'demo' identified by ''; +grant all on slurm_acct_db.* to slurm@'localhost' identified by "${SQL_PWORD}"; +grant all on slurm_acct_db.* to slurm@'demo.openchami.cluster' identified by "${SQL_PWORD}"; +grant all on slurm_acct_db.* to slurm@'demo' identified by "${SQL_PWORD}"; exit +EOF ``` Install a few more dependencies that are required: @@ -326,12 +341,14 @@ Add the Slurm repo created earlier to install from it (will ensure we get the co SLURMVERSION=24.05.5 RELEASE=rocky9 -echo "[slurm-local] +cat < # EDIT TO THE PASSWORD SET IN THE MARIADB CONFIGURATION SECTION +DBPASSWORD="${SQL_PWORD}" # EDIT TO THE PASSWORD SET IN THE MARIADB CONFIGURATION SECTION SLURMDBHOST1=demo sudo sed -i "s|DbdAddr.*|DbdAddr=${SLURMDBHOST1}|g" /etc/slurm/slurmdbd.conf @@ -368,6 +385,12 @@ sudo sed -i "s|PidFile.*|PidFile=/var/run/slurm/slurmdbd.pid|g" /etc/slurm/slurm sudo sed -i "s|#StorageLoc.*|StorageLoc=slurm_acct_db|g" /etc/slurm/slurmdbd.conf ``` +The environment variable we set earlier to store the password for SQL should not be unset for security: + +```bash +unset SQL_PWORD +``` + Create the Slurm config file, which will be used by SlurmCTL. Note that you may need to update the `NodeName` info depending on the configuration of your compute node. 
**Edit the Slurm config file as root: `/etc/slurm/slurm.conf`** @@ -531,11 +554,13 @@ Add job container config file to Slurm config directory: ```bash SLURMTMPDIR=/lscratch -echo "# Job /tmp on a local volume mounted on ${SLURMTMPDIR} +cat <&1 | awk -F: '/configure arguments/ {print $2}' | xargs -n1 | grep conf-path -``` - -**Edit the Nginx config file as root: `/etc/nginx/nginx.conf`** -```bash {title="/etc/nginx/nginx.conf"} +**Edit as normal user: `/opt/workdir/nginx.conf`** +```bash {title="/opt/workdir/nginx.conf"} user nginx; worker_processes auto; @@ -614,13 +615,20 @@ http { # configuration for processing URIs for local Slurm repo # serve static files from this path # such that a request for /slurm-24.05.5/repodata/repomd.xml will be served /usr/share/nginx/html/slurm-24.05.5/repodata/repomd.xml - root /usr/share/nginx/html + root /usr/share/nginx/html; } } } ``` -Detach from the container with: `ctrl-P, then ctrl-Q`. +Use Podman to run Nginx in a container that has the local Slurm repository and the Nginx configuration file mounted into it: + +```bash +podman run --name serve-slurm \ + -v /opt/workdir/nginx.conf:/etc/nginx/nginx.conf \ + --mount type=bind,source=/install/osupdates/rocky9/x86_64/slurm-24.05.5,target=/usr/share/nginx/html/slurm-24.05.5,readonly \ + -p 8080:80 -d nginx +``` Check everything is working by grabbing the repodata file from inside the head node: @@ -630,7 +638,7 @@ curl http://localhost:8080/slurm-24.05.5/repodata/repomd.xml The output should be: -``` +```xml 1770960915 @@ -699,6 +707,7 @@ rules (`:set paste` in Vim, `:set nopaste` to switch back). 
{{< /callout >}} **Edit as root: `/etc/openchami/data/images/compute-slurm-rocky9.yaml`** + ```yaml {title="/etc/openchami/data/images/compute-slurm-rocky9.yaml"} options: layer_type: base @@ -707,11 +716,11 @@ options: - 'rocky9' pkg_manager: dnf gpgcheck: False - parent: 'master.openchami.cluster:5000/demo/rocky-base:9' + parent: 'demo.openchami.cluster:5000/demo/rocky-base:9' registry_opts_pull: - '--tls-verify=false' - publish_s3: 'http://master.openchami.cluster:7070' + publish_s3: 'http://demo.openchami.cluster:7070' s3_prefix: 'compute/slurm/' s3_bucket: 'boot-images' @@ -791,7 +800,7 @@ If you have already aliased the image build command per the [tutorial](https://o Check that the images built. ```bash -s3cmd ls -Hr s3://boot-images/ | cut -d' ' -f 4- +s3cmd ls -Hr s3://boot-images/ | cut -d' ' -f 4- | grep slurm ``` The output should be: @@ -802,7 +811,7 @@ The output should be: 14M s3://boot-images/efi-images/compute/slurm/vmlinuz-5.14.0-611.20.1.el9_7.x86_64 ``` -## 1.5 Configure the Boot Script Service and Cloud-Init. +## 1.5 Configure the Boot Script Service and Cloud-Init Get a fresh access token for ochami: @@ -813,7 +822,7 @@ export DEMO_ACCESS_TOKEN=$(sudo bash -lc 'gen_access_token') Create payload for boot script service with URIs for slurm compute boot artefacts: ```bash -sudo mkdir /etc/openchami/data/boot/ +sudo mkdir -p /etc/openchami/data/boot/bss URIS=$(s3cmd ls -Hr s3://boot-images | grep compute/slurm | awk '{print $4}' | sed 's-s3://-http://172.16.0.254:7070/-' | xargs) URI_IMG=$(echo "$URIS" | cut -d' ' -f1) @@ -1102,8 +1111,8 @@ Configuring (net0 52:54:00:be:ef:01)...... ok tftp://172.16.0.254:69/config.ipxe... ok Booting from http://172.16.0.254:8081/boot/v1/bootscript?mac=52:54:00:be:ef:01 http://172.16.0.254:8081/boot/v1/bootscript... ok -http://172.16.0.254:7070/boot-images/efi-images/compute/debug/vmlinuz-5.14.0-611.24.1.el9_7.x86_64... 
ok -http://172.16.0.254:7070/boot-images/efi-images/compute/debug/initramfs-5.14.0-611.24.1.el9_7.x86_64.img... ok +http://172.16.0.254:7070/boot-images/efi-images/compute/slurm/vmlinuz-5.14.0-611.24.1.el9_7.x86_64... ok +http://172.16.0.254:7070/boot-images/efi-images/compute/slurm/initramfs-5.14.0-611.24.1.el9_7.x86_64.img... ok ``` During Linux boot, output should indicate that the SquashFS image gets downloaded and loaded. @@ -1152,11 +1161,36 @@ If using a VM head node, login from there. Else, login from host. ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no root@172.16.0.1 ``` -Check all of the required packages were installed and from the correct sources (e.g. slurm packages should be installed from `@slurm-local` repo): +Check that the munge and slurm packages were installed from the correct sources (e.g. slurm packages should be installed from `@slurm-local` repo): ```bash -dnf list installed +dnf list installed | grep -e munge -e slurm +``` + +The output should be: + ``` +munge.x86_64 0.5.18-1.el9 @System +munge-debuginfo.x86_64 0.5.18-1.el9 @System +munge-debugsource.x86_64 0.5.18-1.el9 @System +munge-devel.x86_64 0.5.18-1.el9 @System +munge-libs.x86_64 0.5.18-1.el9 @System +munge-libs-debuginfo.x86_64 0.5.18-1.el9 @System +pmix.x86_64 4.2.9-1.el9 @8080_slurm-24.05.5 +slurm.x86_64 24.05.5-1.el9 @8080_slurm-24.05.5 +slurm-contribs.x86_64 24.05.5-1.el9 @8080_slurm-24.05.5 +slurm-devel.x86_64 24.05.5-1.el9 @8080_slurm-24.05.5 +slurm-example-configs.x86_64 24.05.5-1.el9 @8080_slurm-24.05.5 +slurm-libpmi.x86_64 24.05.5-1.el9 @8080_slurm-24.05.5 +slurm-pam_slurm.x86_64 24.05.5-1.el9 @8080_slurm-24.05.5 +slurm-perlapi.x86_64 24.05.5-1.el9 @8080_slurm-24.05.5 +slurm-sackd.x86_64 24.05.5-1.el9 @8080_slurm-24.05.5 +slurm-slurmctld.x86_64 24.05.5-1.el9 @8080_slurm-24.05.5 +slurm-slurmd.x86_64 24.05.5-1.el9 @8080_slurm-24.05.5 +slurm-slurmdbd.x86_64 24.05.5-1.el9 @8080_slurm-24.05.5 +slurm-slurmrestd.x86_64 24.05.5-1.el9 @8080_slurm-24.05.5 
+slurm-torque.x86_64 24.05.5-1.el9 @8080_slurm-24.05.5 +``` Create slurm config file that is identical to that of the head node. Note that you may need to update the `NodeName` info depending on the configuration of your compute node: @@ -1356,11 +1390,13 @@ Creating job_container.conf file that matches the one in the head node: ```bash SLURMTMPDIR=/lscratch -echo "# Job /tmp on a local volume mounted on ${SLURMTMPDIR} +cat <}} -Find all directories owned by old munge UID/GID with the following command: - -`find / -uid 991 -type d` -{{< /callout >}} - Copy the munge key from the head node to the compute node. **Inside the head node:** @@ -1405,7 +1433,7 @@ Copy the munge key from the head node to the compute node. ```bash cd ~ sudo cp /etc/munge/munge.key ./ -sudo chown rocky:rocky munge.key +sudo chown "$(id -u):$(id -g)" munge.key scp ./munge.key root@172.16.0.1:~/ ``` @@ -1417,9 +1445,11 @@ chown munge:munge /etc/munge/munge.key ``` {{< callout context="note" title="Note" icon="outline/info-circle" >}} -In the case of an error about "Offending ECDSA key in /home/rocky/.ssh/known_hosts:3", wipe the contents of the known hosts file and try the 'scp' command again: +In the case of an error about "Offending ECDSA key in ~/.ssh/known_hosts:3", remove the compute node from the known hosts file and try the 'scp' command again: + +`ssh-keygen -R 172.16.0.1` -`> /home/rocky/.ssh/known_hosts` +Alternatively, set up an `ignore.conf` file per [Section 2.8.3](https://openchami.org/docs/tutorial/#283-logging-into-the-compute-node) of the tutorial, to prevent this issue. {{< /callout >}} Continuing **inside the compute node**, setup and start the services for Slurm.
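As a cross-check of the `dnf list installed` output above, a short script can flag any munge, slurm, or pmix package that did not come from the expected repository. This is an illustrative sketch only: the function name is hypothetical, and the repo labels (`@System` for the locally built munge RPMs, `@8080_slurm-24.05.5` for the local Slurm repo) are taken from the sample output and should be adjusted to your repo alias:

```python
def packages_from_wrong_repo(dnf_lines, prefixes=("munge", "slurm", "pmix"),
                             allowed=("@System", "@8080_slurm-24.05.5")):
    """Return (name, repo) pairs for matching packages that were
    installed from a repo outside the allowed set."""
    bad = []
    for line in dnf_lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip headers and wrapped lines
        name, _version, repo = parts
        if name.startswith(prefixes) and repo not in allowed:
            bad.append((name, repo))
    return bad

# Sample lines in the shape of the `dnf list installed` output above;
# the @appstream entry is a deliberately wrong example.
sample = [
    "munge.x86_64                0.5.18-1.el9   @System",
    "slurm.x86_64                24.05.5-1.el9  @8080_slurm-24.05.5",
    "slurm-slurmd.x86_64         24.05.5-1.el9  @appstream",
]
print(packages_from_wrong_repo(sample))
# flags slurm-slurmd.x86_64 as installed from @appstream
```

An empty list means every matched package came from an allowed source; anything else is worth investigating before starting the Slurm services.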
@@ -1531,9 +1561,11 @@ LENGTH: 0 ``` {{< callout context="note" title="Note" icon="outline/info-circle" >}} -In the case of an error about "Offending ECDSA key in /home/rocky/.ssh/known_hosts:3", wipe the contents of the known hosts file and try the 'munge' command again: +In the case of an error about "Offending ECDSA key in ~/.ssh/known_hosts:3", remove the compute node from the known hosts file and try the 'munge' command again: + +`ssh-keygen -R 172.16.0.1` -`> /home/rocky/.ssh/known_hosts` +Alternatively, set up an `ignore.conf` file per [Section 2.8.3](https://openchami.org/docs/tutorial/#283-logging-into-the-compute-node) of the tutorial, to prevent this issue. {{< /callout >}} Test that you can submit a job from the **head node**. @@ -1571,8 +1603,8 @@ The output should be: ``` srun: job 1 queued and waiting for resources srun: job 1 has been allocated resources -slurmstepd: error: couldn't chdir to `/home/lmorrow': No such file or directory: going to /tmp instead -slurmstepd: error: couldn't chdir to `/home/lmorrow': No such file or directory: going to /tmp instead +slurmstepd: error: couldn't chdir to `/home/testuser': No such file or directory: going to /tmp instead +slurmstepd: error: couldn't chdir to `/home/testuser': No such file or directory: going to /tmp instead de01 ``` From 28bf97a31112b941b121db55b672e4ba06ba05f9 Mon Sep 17 00:00:00 2001 From: Luna Morrow Date: Thu, 5 Mar 2026 11:07:15 +1000 Subject: [PATCH 15/23] Minor syntax changes and typo fixes to address review feedback from Devon Signed-off-by: Luna Morrow --- content/docs/guides/install_slurm.md | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/content/docs/guides/install_slurm.md b/content/docs/guides/install_slurm.md index 50c3339..d904ef8 100644 --- a/content/docs/guides/install_slurm.md +++ b/content/docs/guides/install_slurm.md @@ -385,7 +385,7 @@ sudo sed -i "s|PidFile.*|PidFile=/var/run/slurm/slurmdbd.pid|g" /etc/slurm/slurm sudo sed -i
"s|#StorageLoc.*|StorageLoc=slurm_acct_db|g" /etc/slurm/slurmdbd.conf ``` -The environment variable we set earlier to store the password for SQL should not be unset for security: +The environment variable we set earlier to store the password for SQL should now be unset for security: ```bash unset SQL_PWORD @@ -1447,7 +1447,9 @@ chown munge:munge /etc/munge/munge.key {{< callout context="note" title="Note" icon="outline/info-circle" >}} In the case of an error about "Offending ECDSA key in ~/.ssh/known_hosts:3", remove the compute node from the known hosts file and try the 'scp' command again: -`ssh-keygen -R 172.16.0.1` +``` +ssh-keygen -R 172.16.0.1 +``` Alternatively, set up an `ignore.conf` file per [Section 2.8.3](https://openchami.org/docs/tutorial/#283-logging-into-the-compute-node) of the tutorial, to prevent this issue. {{< /callout >}} @@ -1563,7 +1565,9 @@ LENGTH: 0 {{< callout context="note" title="Note" icon="outline/info-circle" >}} In the case of an error about "Offending ECDSA key in ~/.ssh/known_hosts:3", remove the compute node from the known hosts file and try the 'munge' command again: -`ssh-keygen -R 172.16.0.1` +``` +ssh-keygen -R 172.16.0.1 +``` Alternatively, set up an `ignore.conf` file per [Section 2.8.3](https://openchami.org/docs/tutorial/#283-logging-into-the-compute-node) of the tutorial, to prevent this issue.
{{< /callout >}} From 88643949fd2af95d43e09cd86b3043a68cb40430 Mon Sep 17 00:00:00 2001 From: Luna Morrow Date: Thu, 5 Mar 2026 12:35:14 +1000 Subject: [PATCH 16/23] Correct missing argument to rpmbuild so that the munge RPMs are built in the working directory '/opt/workdir' (as desired) and not the user's home directory Signed-off-by: Luna Morrow --- content/docs/guides/install_slurm.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/content/docs/guides/install_slurm.md b/content/docs/guides/install_slurm.md index d904ef8..d4053bd 100644 --- a/content/docs/guides/install_slurm.md +++ b/content/docs/guides/install_slurm.md @@ -65,11 +65,11 @@ Convert tarball to rpm package, build dependencies and build binary package: ```bash sudo dnf install -y rpm-build rpmdevtools -rpmbuild -ts munge-0.5.18.tar.xz +rpmbuild -ts munge-0.5.18.tar.xz --define "_topdir /opt/workdir/rpmbuild" -sudo dnf builddep /opt/workdir/rpmbuild/SRPMS/munge-0.5.18-1.el9.src.rpm +sudo dnf builddep -y /opt/workdir/rpmbuild/SRPMS/munge-0.5.18-1.el9.src.rpm -rpmbuild -tb munge-0.5.18.tar.xz +rpmbuild -tb munge-0.5.18.tar.xz --define "_topdir /opt/workdir/rpmbuild" ``` Install rpms created by rpmbuild: @@ -369,7 +369,7 @@ Modify the SlurmDB config. 
You will need the `pwgen` generated password generate ```bash DBHOST=demo -DBPASSWORD="${SQL_PWORD}" # EDIT TO THE PASSWORD SET IN THE MARIADB CONFIGURATION SECTION +DBPASSWORD="${SQL_PWORD}" SLURMDBHOST1=demo sudo sed -i "s|DbdAddr.*|DbdAddr=${SLURMDBHOST1}|g" /etc/slurm/slurmdbd.conf From 9358c4f77e0e5732d77bcf319c0250d92c7c0bc1 Mon Sep 17 00:00:00 2001 From: Luna Morrow Date: Thu, 5 Mar 2026 13:02:47 +1000 Subject: [PATCH 17/23] Update baseurl filepath to new location of slurm RPMs in '/opt/workdir' in the slurm-local.repo file Signed-off-by: Luna Morrow --- content/docs/guides/install_slurm.md | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/content/docs/guides/install_slurm.md b/content/docs/guides/install_slurm.md index d4053bd..8131095 100644 --- a/content/docs/guides/install_slurm.md +++ b/content/docs/guides/install_slurm.md @@ -339,12 +339,11 @@ Add the Slurm repo created earlier to install from it (will ensure we get the co ```bash # Create local repo file SLURMVERSION=24.05.5 -RELEASE=rocky9 cat < Date: Thu, 5 Mar 2026 13:08:54 +1000 Subject: [PATCH 18/23] Mount slurm RPMs in Podman container from the correct new location of slurm RPMs in '/opt/workdir' Signed-off-by: Luna Morrow --- content/docs/guides/install_slurm.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/content/docs/guides/install_slurm.md b/content/docs/guides/install_slurm.md index 8131095..acc35d4 100644 --- a/content/docs/guides/install_slurm.md +++ b/content/docs/guides/install_slurm.md @@ -343,7 +343,7 @@ SLURMVERSION=24.05.5 cat < Date: Tue, 10 Mar 2026 11:47:12 +1000 Subject: [PATCH 19/23] Adjust path to boot image and update munge file/directory ownership command Signed-off-by: Luna Morrow --- content/docs/guides/install_slurm.md | 14 +++++++++++--- 1 file changed, 11 insertions(+), 3 deletions(-) diff --git a/content/docs/guides/install_slurm.md b/content/docs/guides/install_slurm.md index acc35d4..6a6ac2b 100644 ---
a/content/docs/guides/install_slurm.md +++ b/content/docs/guides/install_slurm.md @@ -827,7 +827,7 @@ URIS=$(s3cmd ls -Hr s3://boot-images | grep compute/slurm | awk '{print $4}' | s URI_IMG=$(echo "$URIS" | cut -d' ' -f1) URI_INITRAMFS=$(echo "$URIS" | cut -d' ' -f2) URI_KERNEL=$(echo "$URIS" | cut -d' ' -f3) -cat <}} +If there is a version 0.5.13 of munge currently installed and present in the output from the above command, remove it to ensure that version 0.5.18 is used. + +``` +dnf remove -y munge-libs-0.5.13- munge-0.5.13- +``` +{{< /callout >}} + Create slurm config file that is identical to that of the head node. Note that you may need to update the `NodeName` info depending on the configuration of your compute node: **Edit the Slurm config file as root: `/etc/slurm/slurm.conf`** @@ -1422,7 +1430,7 @@ Kill the process and repeat above two commands: Update munge file/directory ownership: ```bash -find / -uid 991 -type d -mount -writable -exec chown -R munge:munge \{\} \; +find / -mount -writable -type d -uid 991 -exec chown -R munge:munge \{\} \; ``` Copy the munge key from the head node to the compute node. From 146a4e5337a3150e2558d6e6756731d1e66e0926 Mon Sep 17 00:00:00 2001 From: Luna Morrow Date: Wed, 11 Mar 2026 12:17:17 +1000 Subject: [PATCH 20/23] Add tabs for slurm.conf file for different head node instances and an explanation that the SlurmctldHost must be 'head' instead of 'demo' when the head node is a VM Signed-off-by: Luna Morrow --- content/docs/guides/install_slurm.md | 710 +++++++++++++++++++++++++-- 1 file changed, 678 insertions(+), 32 deletions(-) diff --git a/content/docs/guides/install_slurm.md b/content/docs/guides/install_slurm.md index 6a6ac2b..099cc24 100644 --- a/content/docs/guides/install_slurm.md +++ b/content/docs/guides/install_slurm.md @@ -392,7 +392,16 @@ unset SQL_PWORD Create the Slurm config file, which will be used by SlurmCTL. 
Note that you may need to update the `NodeName` info depending on the configuration of your compute node. +{{< callout context="note" title="Note" icon="outline/info-circle" >}} +If the head node is in a VM (see [**Head Node: Using Virtual +Machine**](https://openchami.org/docs/tutorial/#05-head-node-using-virtual-machine)), +the `SlurmctldHost` will be `head` instead of `demo`. +{{< /callout >}} + **Edit the Slurm config file as root: `/etc/slurm/slurm.conf`** + +{{< tabs "slurm-config-headnode" >}} +{{< tab "Bare Metal Head" >}} ```bash {title="/etc/slurm/slurm.conf"} # ClusterName=demo @@ -547,6 +556,320 @@ NodeName=de01 CPUs=1 Boards=1 SocketsPerBoard=1 CoresPerSocket=1 ThreadsPerCore= PartitionName=main Nodes=de01 Default=YES State=UP OverSubscribe=NO PreemptMode=OFF ``` +{{< /tab >}} +{{< tab "Cloud Instance Head" >}} +```bash {title="/etc/slurm/slurm.conf"} +# +ClusterName=demo +SlurmctldHost=demo +# +#DisableRootJobs=NO +EnforcePartLimits=ALL +#Epilog= +#EpilogSlurmctld= +#FirstJobId=1 +#MaxJobId=67043328 +#GresTypes= +#GroupUpdateForce=0 +#GroupUpdateTime=600 +#JobFileAppend=0 +JobRequeue=0 +#JobSubmitPlugins=lua +KillOnBadExit=1 +#LaunchType=launch/slurm +#Licenses=foo*4,bar +#MailProg=/bin/mail +#MaxJobCount=10000 +#MaxStepCount=40000 +#MaxTasksPerNode=512 +MpiDefault=pmix +#MpiParams=ports=#-# +#PluginDir= +#PlugStackConfig= +PrivateData=accounts,jobs,reservations,usage,users +ProctrackType=proctrack/linuxproc +#Prolog= +PrologFlags=Contain +#PrologSlurmctld= +#PropagatePrioProcess=0 +PropagateResourceLimits=NONE +#PropagateResourceLimitsExcept= +#RebootProgram= +ReturnToService=2 +SlurmctldPidFile=/var/run/slurm/slurmctld.pid +SlurmctldPort=6817 +SlurmdPidFile=/var/run/slurm/slurmd.pid +SlurmdPort=6818 +SlurmdSpoolDir=/var/spool/slurmd +SlurmUser=slurm +SlurmdUser=root +#SrunEpilog= +#SrunProlog= +StateSaveLocation=/var/spool/slurmctld +SwitchType=switch/none +#TaskEpilog= +TaskPlugin=task/none +#TaskProlog= +#TopologyPlugin=topology/tree +#TmpFS=/tmp 
+#TrackWCKey=no +#TreeWidth= +#UnkillableStepProgram= +#UsePAM=0 +# +# +# TIMERS +#BatchStartTimeout=10 +CompleteWait=32 +#EpilogMsgTime=2000 +#GetEnvTimeout=2 +#HealthCheckInterval=0 +#HealthCheckProgram= +InactiveLimit=300 +KillWait=30 +MessageTimeout=30 +#ResvOverRun=0 +MinJobAge=300 +#OverTimeLimit=0 +SlurmctldTimeout=120 +SlurmdTimeout=300 +#UnkillableStepTimeout=60 +#VSizeFactor=0 +Waittime=0 +# +# +# SCHEDULING +DefMemPerCPU=2048 +#MaxMemPerCPU=0 +#SchedulerTimeSlice=30 +SchedulerType=sched/backfill +SelectType=select/cons_tres +SelectTypeParameters=CR_Core_Memory +SchedulerParameters=defer,bf_continue,bf_interval=60,bf_resolution=300,bf_window=1440,bf_busy_nodes,default_queue_depth=1000,bf_max_job_start=200,bf_max_job_test=500,max_switch_wait=1800 +DependencyParameters=kill_invalid_depend +# +# +# JOB PRIORITY +#PriorityFlags= +#PriorityType=priority/multifactor +#PriorityDecayHalfLife= +#PriorityCalcPeriod= +#PriorityFavorSmall= +#PriorityMaxAge= +#PriorityUsageResetPeriod= +#PriorityWeightAge= +#PriorityWeightFairshare= +#PriorityWeightJobSize= +#PriorityWeightPartition= +#PriorityWeightQOS= +# +# +# LOGGING AND ACCOUNTING +AccountingStorageEnforce=safe,associations,limits,qos +#AccountingStorageHost= +#AccountingStoragePass= +#AccountingStoragePort= +AccountingStorageType=accounting_storage/slurmdbd +#AccountingStorageUser= +#AccountingStoreFlags= +#JobCompHost= +#JobCompLoc= +#JobCompPass= +#JobCompPort= +JobCompType=jobcomp/none +#JobCompUser= +JobContainerType=job_container/tmpfs +JobAcctGatherFrequency=30 +JobAcctGatherType=jobacct_gather/cgroup +SlurmctldDebug=info +SlurmctldLogFile=/var/log/slurm/slurmctld.log +SlurmdDebug=info +SlurmdLogFile=/var/log/slurm/slurmd.log +#SlurmSchedLogFile= +#SlurmSchedLogLevel= +#DebugFlags= +# +# +# POWER SAVE SUPPORT FOR IDLE NODES (optional) +#SuspendProgram= +#ResumeProgram= +#SuspendTimeout= +#ResumeTimeout= +#ResumeRate= +#SuspendExcNodes= +#SuspendExcParts= +#SuspendRate= +#SuspendTime= +# +# +# CUSTOM 
CONFIGS +LaunchParameters=use_interactive_step +#SlurmctldParameters=enable_configless +# +# +# COMPUTE NODES ## GET CONF WITH `slurmd -C` +NodeName=de01 CPUs=1 Boards=1 SocketsPerBoard=1 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=3892 + +PartitionName=main Nodes=de01 Default=YES State=UP OverSubscribe=NO PreemptMode=OFF +``` +{{< /tab >}} +{{< tab "VM Head" >}} +```bash {title="/etc/slurm/slurm.conf"} +# +ClusterName=demo +SlurmctldHost=head +# +#DisableRootJobs=NO +EnforcePartLimits=ALL +#Epilog= +#EpilogSlurmctld= +#FirstJobId=1 +#MaxJobId=67043328 +#GresTypes= +#GroupUpdateForce=0 +#GroupUpdateTime=600 +#JobFileAppend=0 +JobRequeue=0 +#JobSubmitPlugins=lua +KillOnBadExit=1 +#LaunchType=launch/slurm +#Licenses=foo*4,bar +#MailProg=/bin/mail +#MaxJobCount=10000 +#MaxStepCount=40000 +#MaxTasksPerNode=512 +MpiDefault=pmix +#MpiParams=ports=#-# +#PluginDir= +#PlugStackConfig= +PrivateData=accounts,jobs,reservations,usage,users +ProctrackType=proctrack/linuxproc +#Prolog= +PrologFlags=Contain +#PrologSlurmctld= +#PropagatePrioProcess=0 +PropagateResourceLimits=NONE +#PropagateResourceLimitsExcept= +#RebootProgram= +ReturnToService=2 +SlurmctldPidFile=/var/run/slurm/slurmctld.pid +SlurmctldPort=6817 +SlurmdPidFile=/var/run/slurm/slurmd.pid +SlurmdPort=6818 +SlurmdSpoolDir=/var/spool/slurmd +SlurmUser=slurm +SlurmdUser=root +#SrunEpilog= +#SrunProlog= +StateSaveLocation=/var/spool/slurmctld +SwitchType=switch/none +#TaskEpilog= +TaskPlugin=task/none +#TaskProlog= +#TopologyPlugin=topology/tree +#TmpFS=/tmp +#TrackWCKey=no +#TreeWidth= +#UnkillableStepProgram= +#UsePAM=0 +# +# +# TIMERS +#BatchStartTimeout=10 +CompleteWait=32 +#EpilogMsgTime=2000 +#GetEnvTimeout=2 +#HealthCheckInterval=0 +#HealthCheckProgram= +InactiveLimit=300 +KillWait=30 +MessageTimeout=30 +#ResvOverRun=0 +MinJobAge=300 +#OverTimeLimit=0 +SlurmctldTimeout=120 +SlurmdTimeout=300 +#UnkillableStepTimeout=60 +#VSizeFactor=0 +Waittime=0 +# +# +# SCHEDULING +DefMemPerCPU=2048 +#MaxMemPerCPU=0 
+#SchedulerTimeSlice=30 +SchedulerType=sched/backfill +SelectType=select/cons_tres +SelectTypeParameters=CR_Core_Memory +SchedulerParameters=defer,bf_continue,bf_interval=60,bf_resolution=300,bf_window=1440,bf_busy_nodes,default_queue_depth=1000,bf_max_job_start=200,bf_max_job_test=500,max_switch_wait=1800 +DependencyParameters=kill_invalid_depend +# +# +# JOB PRIORITY +#PriorityFlags= +#PriorityType=priority/multifactor +#PriorityDecayHalfLife= +#PriorityCalcPeriod= +#PriorityFavorSmall= +#PriorityMaxAge= +#PriorityUsageResetPeriod= +#PriorityWeightAge= +#PriorityWeightFairshare= +#PriorityWeightJobSize= +#PriorityWeightPartition= +#PriorityWeightQOS= +# +# +# LOGGING AND ACCOUNTING +AccountingStorageEnforce=safe,associations,limits,qos +#AccountingStorageHost= +#AccountingStoragePass= +#AccountingStoragePort= +AccountingStorageType=accounting_storage/slurmdbd +#AccountingStorageUser= +#AccountingStoreFlags= +#JobCompHost= +#JobCompLoc= +#JobCompPass= +#JobCompPort= +JobCompType=jobcomp/none +#JobCompUser= +JobContainerType=job_container/tmpfs +JobAcctGatherFrequency=30 +JobAcctGatherType=jobacct_gather/cgroup +SlurmctldDebug=info +SlurmctldLogFile=/var/log/slurm/slurmctld.log +SlurmdDebug=info +SlurmdLogFile=/var/log/slurm/slurmd.log +#SlurmSchedLogFile= +#SlurmSchedLogLevel= +#DebugFlags= +# +# +# POWER SAVE SUPPORT FOR IDLE NODES (optional) +#SuspendProgram= +#ResumeProgram= +#SuspendTimeout= +#ResumeTimeout= +#ResumeRate= +#SuspendExcNodes= +#SuspendExcParts= +#SuspendRate= +#SuspendTime= +# +# +# CUSTOM CONFIGS +LaunchParameters=use_interactive_step +#SlurmctldParameters=enable_configless +# +# +# COMPUTE NODES ## GET CONF WITH `slurmd -C` +NodeName=de01 CPUs=1 Boards=1 SocketsPerBoard=1 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=3892 + +PartitionName=main Nodes=de01 Default=YES State=UP OverSubscribe=NO PreemptMode=OFF +``` +{{< /tab >}} +{{< /tabs >}} Add job container config file to Slurm config directory: @@ -1169,43 +1492,364 @@ dnf list installed | 
grep -e munge -e slurm The output should be: ``` -munge.x86_64 0.5.18-1.el9 @System -munge-debuginfo.x86_64 0.5.18-1.el9 @System -munge-debugsource.x86_64 0.5.18-1.el9 @System -munge-devel.x86_64 0.5.18-1.el9 @System -munge-libs.x86_64 0.5.18-1.el9 @System -munge-libs-debuginfo.x86_64 0.5.18-1.el9 @System -pmix.x86_64 4.2.9-1.el9 @8080_slurm-24.05.5 -slurm.x86_64 24.05.5-1.el9 @8080_slurm-24.05.5 -slurm-contribs.x86_64 24.05.5-1.el9 @8080_slurm-24.05.5 -slurm-devel.x86_64 24.05.5-1.el9 @8080_slurm-24.05.5 -slurm-example-configs.x86_64 24.05.5-1.el9 @8080_slurm-24.05.5 -slurm-libpmi.x86_64 24.05.5-1.el9 @8080_slurm-24.05.5 -slurm-pam_slurm.x86_64 24.05.5-1.el9 @8080_slurm-24.05.5 -slurm-perlapi.x86_64 24.05.5-1.el9 @8080_slurm-24.05.5 -slurm-sackd.x86_64 24.05.5-1.el9 @8080_slurm-24.05.5 -slurm-slurmctld.x86_64 24.05.5-1.el9 @8080_slurm-24.05.5 -slurm-slurmd.x86_64 24.05.5-1.el9 @8080_slurm-24.05.5 -slurm-slurmdbd.x86_64 24.05.5-1.el9 @8080_slurm-24.05.5 -slurm-slurmrestd.x86_64 24.05.5-1.el9 @8080_slurm-24.05.5 -slurm-torque.x86_64 24.05.5-1.el9 @8080_slurm-24.05.5 -``` - -{{< callout context="note" title="Note" icon="outline/info-circle" >}} -If there is a version 0.5.13 of munge currently installed and present in the output from the above command, remove it to ensure that version 0.5.18 is used. 
+munge.x86_64 0.5.18-1.el9 @System +munge-debuginfo.x86_64 0.5.18-1.el9 @System +munge-debugsource.x86_64 0.5.18-1.el9 @System +munge-devel.x86_64 0.5.18-1.el9 @System +munge-libs.x86_64 0.5.18-1.el9 @System +munge-libs-debuginfo.x86_64 0.5.18-1.el9 @System +pmix.x86_64 4.2.9-1.el9 @8080_slurm-24.05.5 +slurm.x86_64 24.05.5-1.el9 @8080_slurm-24.05.5 +slurm-contribs.x86_64 24.05.5-1.el9 @8080_slurm-24.05.5 +slurm-devel.x86_64 24.05.5-1.el9 @8080_slurm-24.05.5 +slurm-example-configs.x86_64 24.05.5-1.el9 @8080_slurm-24.05.5 +slurm-libpmi.x86_64 24.05.5-1.el9 @8080_slurm-24.05.5 +slurm-pam_slurm.x86_64 24.05.5-1.el9 @8080_slurm-24.05.5 +slurm-perlapi.x86_64 24.05.5-1.el9 @8080_slurm-24.05.5 +slurm-sackd.x86_64 24.05.5-1.el9 @8080_slurm-24.05.5 +slurm-slurmctld.x86_64 24.05.5-1.el9 @8080_slurm-24.05.5 +slurm-slurmd.x86_64 24.05.5-1.el9 @8080_slurm-24.05.5 +slurm-slurmdbd.x86_64 24.05.5-1.el9 @8080_slurm-24.05.5 +slurm-slurmrestd.x86_64 24.05.5-1.el9 @8080_slurm-24.05.5 +slurm-torque.x86_64 24.05.5-1.el9 @8080_slurm-24.05.5 +``` + +{{< callout context="note" title="Note" icon="outline/info-circle" >}} +If there is a version 0.5.13 of munge currently installed and present in the output from the above command, remove it to ensure that version 0.5.18 is used. + +``` +dnf remove -y munge-libs-0.5.13- munge-0.5.13- +``` +{{< /callout >}} + +Create slurm config file that is identical to that of the head node. Note that you may need to update the `NodeName` info depending on the configuration of your compute node: + +{{< callout context="note" title="Note" icon="outline/info-circle" >}} +If the head node is in a VM (see [**Head Node: Using Virtual +Machine**](https://openchami.org/docs/tutorial/#05-head-node-using-virtual-machine)), +the `SlurmctldHost` will be `head` instead of `demo`. 
+{{< /callout >}} + +**Edit the Slurm config file as root: `/etc/slurm/slurm.conf`** + +{{< tabs "slurm-config-computenode" >}} +{{< tab "Bare Metal Head" >}} +```bash {title="/etc/slurm/slurm.conf"} +# +ClusterName=demo +SlurmctldHost=demo +# +#DisableRootJobs=NO +EnforcePartLimits=ALL +#Epilog= +#EpilogSlurmctld= +#FirstJobId=1 +#MaxJobId=67043328 +#GresTypes= +#GroupUpdateForce=0 +#GroupUpdateTime=600 +#JobFileAppend=0 +JobRequeue=0 +#JobSubmitPlugins=lua +KillOnBadExit=1 +#LaunchType=launch/slurm +#Licenses=foo*4,bar +#MailProg=/bin/mail +#MaxJobCount=10000 +#MaxStepCount=40000 +#MaxTasksPerNode=512 +MpiDefault=pmix +#MpiParams=ports=#-# +#PluginDir= +#PlugStackConfig= +PrivateData=accounts,jobs,reservations,usage,users +ProctrackType=proctrack/linuxproc +#Prolog= +PrologFlags=Contain +#PrologSlurmctld= +#PropagatePrioProcess=0 +PropagateResourceLimits=NONE +#PropagateResourceLimitsExcept= +#RebootProgram= +ReturnToService=2 +SlurmctldPidFile=/var/run/slurm/slurmctld.pid +SlurmctldPort=6817 +SlurmdPidFile=/var/run/slurm/slurmd.pid +SlurmdPort=6818 +SlurmdSpoolDir=/var/spool/slurmd +SlurmUser=slurm +SlurmdUser=root +#SrunEpilog= +#SrunProlog= +StateSaveLocation=/var/spool/slurmctld +SwitchType=switch/none +#TaskEpilog= +TaskPlugin=task/none +#TaskProlog= +#TopologyPlugin=topology/tree +#TmpFS=/tmp +#TrackWCKey=no +#TreeWidth= +#UnkillableStepProgram= +#UsePAM=0 +# +# +# TIMERS +#BatchStartTimeout=10 +CompleteWait=32 +#EpilogMsgTime=2000 +#GetEnvTimeout=2 +#HealthCheckInterval=0 +#HealthCheckProgram= +InactiveLimit=300 +KillWait=30 +MessageTimeout=30 +#ResvOverRun=0 +MinJobAge=300 +#OverTimeLimit=0 +SlurmctldTimeout=120 +SlurmdTimeout=300 +#UnkillableStepTimeout=60 +#VSizeFactor=0 +Waittime=0 +# +# +# SCHEDULING +DefMemPerCPU=2048 +#MaxMemPerCPU=0 +#SchedulerTimeSlice=30 +SchedulerType=sched/backfill +SelectType=select/cons_tres +SelectTypeParameters=CR_Core_Memory 
+SchedulerParameters=defer,bf_continue,bf_interval=60,bf_resolution=300,bf_window=1440,bf_busy_nodes,default_queue_depth=1000,bf_max_job_start=200,bf_max_job_test=500,max_switch_wait=1800 +DependencyParameters=kill_invalid_depend +# +# +# JOB PRIORITY +#PriorityFlags= +#PriorityType=priority/multifactor +#PriorityDecayHalfLife= +#PriorityCalcPeriod= +#PriorityFavorSmall= +#PriorityMaxAge= +#PriorityUsageResetPeriod= +#PriorityWeightAge= +#PriorityWeightFairshare= +#PriorityWeightJobSize= +#PriorityWeightPartition= +#PriorityWeightQOS= +# +# +# LOGGING AND ACCOUNTING +AccountingStorageEnforce=safe,associations,limits,qos +#AccountingStorageHost= +#AccountingStoragePass= +#AccountingStoragePort= +AccountingStorageType=accounting_storage/slurmdbd +#AccountingStorageUser= +#AccountingStoreFlags= +#JobCompHost= +#JobCompLoc= +#JobCompPass= +#JobCompPort= +JobCompType=jobcomp/none +#JobCompUser= +JobContainerType=job_container/tmpfs +JobAcctGatherFrequency=30 +JobAcctGatherType=jobacct_gather/cgroup +SlurmctldDebug=info +SlurmctldLogFile=/var/log/slurm/slurmctld.log +SlurmdDebug=info +SlurmdLogFile=/var/log/slurm/slurmd.log +#SlurmSchedLogFile= +#SlurmSchedLogLevel= +#DebugFlags= +# +# +# POWER SAVE SUPPORT FOR IDLE NODES (optional) +#SuspendProgram= +#ResumeProgram= +#SuspendTimeout= +#ResumeTimeout= +#ResumeRate= +#SuspendExcNodes= +#SuspendExcParts= +#SuspendRate= +#SuspendTime= +# +# +# CUSTOM CONFIGS +LaunchParameters=use_interactive_step +#SlurmctldParameters=enable_configless +# +# +# COMPUTE NODES ## GET CONF WITH `slurmd -C` +NodeName=de01 CPUs=1 Boards=1 SocketsPerBoard=1 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=3892 + +PartitionName=main Nodes=de01 Default=YES State=UP OverSubscribe=NO PreemptMode=OFF +``` +{{< /tab >}} +{{< tab "Cloud Instance Head" >}} +```bash {title="/etc/slurm/slurm.conf"} +# +ClusterName=demo +SlurmctldHost=demo +# +#DisableRootJobs=NO +EnforcePartLimits=ALL +#Epilog= +#EpilogSlurmctld= +#FirstJobId=1 +#MaxJobId=67043328 +#GresTypes= 
+#GroupUpdateForce=0 +#GroupUpdateTime=600 +#JobFileAppend=0 +JobRequeue=0 +#JobSubmitPlugins=lua +KillOnBadExit=1 +#LaunchType=launch/slurm +#Licenses=foo*4,bar +#MailProg=/bin/mail +#MaxJobCount=10000 +#MaxStepCount=40000 +#MaxTasksPerNode=512 +MpiDefault=pmix +#MpiParams=ports=#-# +#PluginDir= +#PlugStackConfig= +PrivateData=accounts,jobs,reservations,usage,users +ProctrackType=proctrack/linuxproc +#Prolog= +PrologFlags=Contain +#PrologSlurmctld= +#PropagatePrioProcess=0 +PropagateResourceLimits=NONE +#PropagateResourceLimitsExcept= +#RebootProgram= +ReturnToService=2 +SlurmctldPidFile=/var/run/slurm/slurmctld.pid +SlurmctldPort=6817 +SlurmdPidFile=/var/run/slurm/slurmd.pid +SlurmdPort=6818 +SlurmdSpoolDir=/var/spool/slurmd +SlurmUser=slurm +SlurmdUser=root +#SrunEpilog= +#SrunProlog= +StateSaveLocation=/var/spool/slurmctld +SwitchType=switch/none +#TaskEpilog= +TaskPlugin=task/none +#TaskProlog= +#TopologyPlugin=topology/tree +#TmpFS=/tmp +#TrackWCKey=no +#TreeWidth= +#UnkillableStepProgram= +#UsePAM=0 +# +# +# TIMERS +#BatchStartTimeout=10 +CompleteWait=32 +#EpilogMsgTime=2000 +#GetEnvTimeout=2 +#HealthCheckInterval=0 +#HealthCheckProgram= +InactiveLimit=300 +KillWait=30 +MessageTimeout=30 +#ResvOverRun=0 +MinJobAge=300 +#OverTimeLimit=0 +SlurmctldTimeout=120 +SlurmdTimeout=300 +#UnkillableStepTimeout=60 +#VSizeFactor=0 +Waittime=0 +# +# +# SCHEDULING +DefMemPerCPU=2048 +#MaxMemPerCPU=0 +#SchedulerTimeSlice=30 +SchedulerType=sched/backfill +SelectType=select/cons_tres +SelectTypeParameters=CR_Core_Memory +SchedulerParameters=defer,bf_continue,bf_interval=60,bf_resolution=300,bf_window=1440,bf_busy_nodes,default_queue_depth=1000,bf_max_job_start=200,bf_max_job_test=500,max_switch_wait=1800 +DependencyParameters=kill_invalid_depend +# +# +# JOB PRIORITY +#PriorityFlags= +#PriorityType=priority/multifactor +#PriorityDecayHalfLife= +#PriorityCalcPeriod= +#PriorityFavorSmall= +#PriorityMaxAge= +#PriorityUsageResetPeriod= +#PriorityWeightAge= 
+#PriorityWeightFairshare= +#PriorityWeightJobSize= +#PriorityWeightPartition= +#PriorityWeightQOS= +# +# +# LOGGING AND ACCOUNTING +AccountingStorageEnforce=safe,associations,limits,qos +#AccountingStorageHost= +#AccountingStoragePass= +#AccountingStoragePort= +AccountingStorageType=accounting_storage/slurmdbd +#AccountingStorageUser= +#AccountingStoreFlags= +#JobCompHost= +#JobCompLoc= +#JobCompPass= +#JobCompPort= +JobCompType=jobcomp/none +#JobCompUser= +JobContainerType=job_container/tmpfs +JobAcctGatherFrequency=30 +JobAcctGatherType=jobacct_gather/cgroup +SlurmctldDebug=info +SlurmctldLogFile=/var/log/slurm/slurmctld.log +SlurmdDebug=info +SlurmdLogFile=/var/log/slurm/slurmd.log +#SlurmSchedLogFile= +#SlurmSchedLogLevel= +#DebugFlags= +# +# +# POWER SAVE SUPPORT FOR IDLE NODES (optional) +#SuspendProgram= +#ResumeProgram= +#SuspendTimeout= +#ResumeTimeout= +#ResumeRate= +#SuspendExcNodes= +#SuspendExcParts= +#SuspendRate= +#SuspendTime= +# +# +# CUSTOM CONFIGS +LaunchParameters=use_interactive_step +#SlurmctldParameters=enable_configless +# +# +# COMPUTE NODES ## GET CONF WITH `slurmd -C` +NodeName=de01 CPUs=1 Boards=1 SocketsPerBoard=1 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=3892 +PartitionName=main Nodes=de01 Default=YES State=UP OverSubscribe=NO PreemptMode=OFF ``` -dnf remove -y munge-libs-0.5.13- munge-0.5.13- -``` -{{< /callout >}} - -Create slurm config file that is identical to that of the head node. 
Note that you may need to update the `NodeName` info depending on the configuration of your compute node:
-
-**Edit the Slurm config file as root: `/etc/slurm/slurm.conf`**
+{{< /tab >}}
+{{< tab "VM Head" >}}
```bash {title="/etc/slurm/slurm.conf"}
#
ClusterName=demo
-SlurmctldHost=demo
+SlurmctldHost=head
#
#DisableRootJobs=NO
EnforcePartLimits=ALL
@@ -1356,6 +2000,8 @@ NodeName=de01 CPUs=1 Boards=1 SocketsPerBoard=1 CoresPerSocket=1 ThreadsPerCore=

PartitionName=main Nodes=de01 Default=YES State=UP OverSubscribe=NO PreemptMode=OFF
```
+{{< /tab >}}
+{{< /tabs >}}

Configure the hosts file with addresses for both the head node and the compute node:

From 9045a490e4f6b70ccdc6c5d116b9bd03e6fd8194 Mon Sep 17 00:00:00 2001
From: Luna Morrow
Date: Mon, 16 Mar 2026 13:31:51 +1000
Subject: [PATCH 21/23] Initial update of 'Install Slurm' documentation to
 leverage cloud-init so that compute node Slurm configuration is persistent
 across nodes and on reboot

Signed-off-by: Luna Morrow
---
 content/docs/guides/install_slurm.md | 1262 +++++++++++---------------
 1 file changed, 530 insertions(+), 732 deletions(-)

diff --git a/content/docs/guides/install_slurm.md b/content/docs/guides/install_slurm.md
index 04223da..7d64559 100644
--- a/content/docs/guides/install_slurm.md
+++ b/content/docs/guides/install_slurm.md
@@ -249,6 +264,21 @@ sudo -u munge /usr/sbin/mungekey -v
sudo systemctl enable --now munge
```

+Copy the munge key to the normal user's home directory so that the compute node can fetch it while booting.
+
+```bash
+sudo cp /etc/munge/munge.key ~/
+sudo chown "$(id -u):$(id -g)" ~/munge.key
+```
+
+{{< callout context="note" title="Note" icon="outline/info-circle" >}}
+Make sure to delete the copy of `munge.key` in the normal user's home directory after the compute node is set up!
+ +```bash +rm ~/munge.key +``` +{{< /callout >}} + Install mariaDB: ```bash @@ -920,6 +935,13 @@ EOF {{< /tab >}} {{< /tabs >}} +Start Slurm service daemons: + +```bash +sudo systemctl start slurmdbd +sudo systemctl start slurmctld +``` + ## 1.4 Make a Local Slurm Repository and Serve it with Nginx Create configuration file to mount into Nginx container: @@ -1079,18 +1101,23 @@ repos: - alias: 'Slurm' url: 'http://localhost:8080/slurm-24.05.5' - packages: - boxes - figlet - git - nfs-utils + - bind-utils + - openldap-clients + - sssd + - sssd-ldap + - oddjob-mkhomedir - tcpdump - traceroute - vim - curl - rpm-build - shadow-utils + - sshpass - pwgen - jq - libconfuse @@ -1119,7 +1146,7 @@ cmds: - cmd: 'rpmbuild -tb munge-0.5.18.tar.xz' - cmd: 'cd /root/rpmbuild' - cmd: 'rpm --install --verbose --force /root/rpmbuild/RPMS/x86_64/munge-0.5.18-1.el9.x86_64.rpm /root/rpmbuild/RPMS/x86_64/munge-debuginfo-0.5.18-1.el9.x86_64.rpm /root/rpmbuild/RPMS/x86_64/munge-debugsource-0.5.18-1.el9.x86_64.rpm /root/rpmbuild/RPMS/x86_64/munge-devel-0.5.18-1.el9.x86_64.rpm /root/rpmbuild/RPMS/x86_64/munge-libs-0.5.18-1.el9.x86_64.rpm /root/rpmbuild/RPMS/x86_64/munge-libs-debuginfo-0.5.18-1.el9.x86_64.rpm' - - cmd: 'dnf remove -y munge-libs-0.5.13-13.el9 munge-0.5.13-13.el9' + - cmd: 'dnf remove -y munge-libs-0.5.13-* munge-0.5.13-*' ``` Run podman container to run image build command. The S3_ACCESS and S3_SECRET tokens are set in the tutorial [here](https://openchami.org/docs/tutorial/#233-install-and-configure-s3-clients). 
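
The build-config change above swaps the pinned `munge-libs-0.5.13-13.el9` / `munge-0.5.13-13.el9` names for the globs `munge-libs-0.5.13-*` / `munge-0.5.13-*`, so the cleanup step works regardless of which point release of munge 0.5.13 the base image ships. A minimal illustration of why the glob is more robust, using Python's `fnmatch` to stand in for dnf's shell-style package-spec matching (illustrative only, not dnf's actual resolver; the package names are hypothetical examples):

```python
from fnmatch import fnmatch

# Hypothetical installed package names; the release suffix (-13.el9, -14.el9, ...)
# can differ between base-image builds.
installed = [
    "munge-0.5.13-13.el9",
    "munge-0.5.13-14.el9",
    "munge-libs-0.5.13-13.el9",
]

pinned = "munge-0.5.13-13.el9"  # old spec: matches exactly one release
glob = "munge-0.5.13-*"         # new spec: matches any release of munge 0.5.13

# The pinned name misses -14.el9; the glob catches every 0.5.13 release.
print([p for p in installed if p == pinned])
print([p for p in installed if fnmatch(p, glob)])
```

Note that `munge-0.5.13-*` does not match `munge-libs-0.5.13-13.el9`, which is why the command lists a separate `munge-libs-0.5.13-*` glob as well.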
@@ -1277,6 +1304,127 @@ Configure cloud-init for compute group: **Edit as root: `/etc/openchami/data/cloud-init/ci-group-compute.yaml`** +{{< tabs "cloud-init-compute-configs" >}} +{{< tab "Bare Metal Head" >}} + +```yaml {title="/etc/openchami/data/cloud-init/ci-group-compute.yaml"} +- name: compute + description: "compute config" + file: + encoding: plain + content: | + ## template: jinja + #cloud-config + merge_how: + - name: list + settings: [append] + - name: dict + settings: [no_replace, recurse_list] + users: + - name: root + ssh_authorized_keys: {{ ds.meta_data.instance_data.v1.public_keys }} + disable_root: false + + write_files: + - path: /etc/hosts + append: true + content: | + 172.16.0.254 demo.openchami.cluster demo + 172.16.0.1 de01.openchami.cluster de01 + + - path: /etc/fstab + content: | + demo.openchami.cluster:/home /home nfs defaults 0 0 + demo.openchami.cluster:/etc/slurm /etc/slurm nfs defaults 0 0 + + bootcmd: + - hostnamectl set-hostname de01.openchami.cluster + - groupadd -g 666 slurm + - useradd -m -c "Slurm workload manager" -d /var/lib/slurm -u 666 -g slurm -s /sbin/nologin slurm + + runcmd: + - nmcli connection modify "System enp1s0" ipv4.dns "172.16.0.254 8.8.8.8" + - systemctl restart NetworkManager + - systemctl daemon-reload + - mount -a + - chown -R slurm:slurm /var/lib/slurm + - mkdir /var/log/slurm + - chown slurm:slurm /var/log/slurm + - usermod -u 616 munge + - groupmod -g 616 munge + - find / -writable -uid 991 -type d -exec chown -R munge:munge \{\} \; + - sshpass -p rocky scp -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null rocky@172.16.0.254:/home/rocky/munge.key /etc/munge/munge.key + - chown munge:munge /etc/munge/munge.key + - systemctl enable --now munge + - systemctl enable --now slurmd + - systemctl stop firewalld + - systemctl disable firewalld + - nft flush ruleset +``` + +{{< /tab >}} +{{< tab "Cloud Instance Head" >}} + +```yaml {title="/etc/openchami/data/cloud-init/ci-group-compute.yaml"} +- name: 
compute + description: "compute config" + file: + encoding: plain + content: | + ## template: jinja + #cloud-config + merge_how: + - name: list + settings: [append] + - name: dict + settings: [no_replace, recurse_list] + users: + - name: root + ssh_authorized_keys: {{ ds.meta_data.instance_data.v1.public_keys }} + disable_root: false + + write_files: + - path: /etc/hosts + append: true + content: | + 172.16.0.254 demo.openchami.cluster demo + 172.16.0.1 de01.openchami.cluster de01 + + - path: /etc/fstab + content: | + demo.openchami.cluster:/home /home nfs defaults 0 0 + demo.openchami.cluster:/etc/slurm /etc/slurm nfs defaults 0 0 + demo.openchami.cluster:/etc/sssd /etc/sssd nfs defaults 0 0 + demo.openchami.cluster:/etc/openldap /etc/openldap nfs defaults 0 0 + + bootcmd: + - hostnamectl set-hostname de01.openchami.cluster + - groupadd -g 666 slurm + - useradd -m -c "Slurm workload manager" -d /var/lib/slurm -u 666 -g slurm -s /sbin/nologin slurm + + runcmd: + - nmcli connection modify "System enp1s0" ipv4.dns "172.16.0.254 8.8.8.8" + - systemctl restart NetworkManager + - systemctl daemon-reload + - mount -a + - chown -R slurm:slurm /var/lib/slurm + - mkdir /var/log/slurm + - chown slurm:slurm /var/log/slurm + - usermod -u 616 munge + - groupmod -g 616 munge + - find / -writable -uid 991 -type d -exec chown -R munge:munge \{\} \; + - sshpass -p rocky scp -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null rocky@172.16.0.254:/home/rocky/munge.key /etc/munge/munge.key + - chown munge:munge /etc/munge/munge.key + - systemctl enable --now munge + - systemctl enable --now slurmd + - systemctl stop firewalld + - systemctl disable firewalld + - nft flush ruleset +``` + +{{< /tab >}} +{{< tab "VM Head" >}} + ```yaml {title="/etc/openchami/data/cloud-init/ci-group-compute.yaml"} - name: compute description: "compute config" @@ -1293,9 +1441,51 @@ Configure cloud-init for compute group: users: - name: root ssh_authorized_keys: {{ 
ds.meta_data.instance_data.v1.public_keys }} - disable_root: false + disable_root: false + + write_files: + - path: /etc/hosts + append: true + content: | + 172.16.0.254 demo.openchami.cluster demo head + 172.16.0.1 de01.openchami.cluster de01 + + - path: /etc/fstab + content: | + demo.openchami.cluster:/home /home nfs defaults 0 0 + demo.openchami.cluster:/etc/slurm /etc/slurm nfs defaults 0 0 + demo.openchami.cluster:/etc/sssd /etc/sssd nfs defaults 0 0 + demo.openchami.cluster:/etc/openldap /etc/openldap nfs defaults 0 0 + + bootcmd: + - hostnamectl set-hostname de01.openchami.cluster + - groupadd -g 666 slurm + - useradd -m -c "Slurm workload manager" -d /var/lib/slurm -u 666 -g slurm -s /sbin/nologin slurm + + runcmd: + - nmcli connection modify "System enp1s0" ipv4.dns "172.16.0.254 8.8.8.8" + - systemctl restart NetworkManager + - systemctl daemon-reload + - mount -a + - chown -R slurm:slurm /var/lib/slurm + - mkdir /var/log/slurm + - chown slurm:slurm /var/log/slurm + - usermod -u 616 munge + - groupmod -g 616 munge + - find / -writable -uid 991 -type d -exec chown -R munge:munge \{\} \; + - sshpass -p rocky scp -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null rocky@172.16.0.254:/home/rocky/munge.key /etc/munge/munge.key + - chown munge:munge /etc/munge/munge.key + - systemctl enable --now munge + - systemctl enable --now slurmd + - systemctl stop firewalld + - systemctl disable firewalld + - nft flush ruleset ``` +{{< /tab >}} +{{< /tabs >}} + + Now, set this configuration for the compute group: ```bash @@ -1310,6 +1500,115 @@ ochami cloud-init group get config compute The cloud-config file created within the YAML above should get print out: +{{< tabs "compute-config-outputs" >}} +{{< tab "Bare Metal Head" >}} + +```yaml +## template: jinja +#cloud-config +merge_how: +- name: list + settings: [append] +- name: dict + settings: [no_replace, recurse_list] +users: + - name: root + ssh_authorized_keys: {{ ds.meta_data.instance_data.v1.public_keys }} 
+disable_root: false + +write_files: +- path: /etc/hosts + append: true + content: | + 172.16.0.254 demo.openchami.cluster demo + 172.16.0.1 de01.openchami.cluster de01 + +- path: /etc/fstab + content: | + demo.openchami.cluster:/home /home nfs defaults 0 0 + demo.openchami.cluster:/etc/slurm /etc/slurm nfs defaults 0 0 + +bootcmd: + - hostnamectl set-hostname de01.openchami.cluster + - groupadd -g 666 slurm + - useradd -m -c "Slurm workload manager" -d /var/lib/slurm -u 666 -g slurm -s /sbin/nologin slurm + +runcmd: + - nmcli connection modify "System enp1s0" ipv4.dns "172.16.0.254 8.8.8.8" + - systemctl restart NetworkManager + - systemctl daemon-reload + - mount -a + - chown -R slurm:slurm /var/lib/slurm + - mkdir /var/log/slurm + - chown slurm:slurm /var/log/slurm + - usermod -u 616 munge + - groupmod -g 616 munge + - find / -writable -uid 991 -type d -exec chown -R munge:munge \{\} \; + - sshpass -p rocky scp -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null rocky@172.16.0.254:/home/rocky/munge.key /etc/munge/munge.key + - chown munge:munge /etc/munge/munge.key + - systemctl enable --now munge + - systemctl enable --now slurmd + - systemctl stop firewalld + - systemctl disable firewalld + - nft flush ruleset +``` + +{{< /tab >}} +{{< tab "Cloud Instance Head" >}} + +```yaml +## template: jinja +#cloud-config +merge_how: +- name: list + settings: [append] +- name: dict + settings: [no_replace, recurse_list] +users: + - name: root + ssh_authorized_keys: {{ ds.meta_data.instance_data.v1.public_keys }} +disable_root: false + +write_files: +- path: /etc/hosts + append: true + content: | + 172.16.0.254 demo.openchami.cluster demo + 172.16.0.1 de01.openchami.cluster de01 + +- path: /etc/fstab + content: | + demo.openchami.cluster:/home /home nfs defaults 0 0 + demo.openchami.cluster:/etc/slurm /etc/slurm nfs defaults 0 0 + +bootcmd: + - hostnamectl set-hostname de01.openchami.cluster + - groupadd -g 666 slurm + - useradd -m -c "Slurm workload manager" -d 
/var/lib/slurm -u 666 -g slurm -s /sbin/nologin slurm + +runcmd: + - nmcli connection modify "System enp1s0" ipv4.dns "172.16.0.254 8.8.8.8" + - systemctl restart NetworkManager + - systemctl daemon-reload + - mount -a + - chown -R slurm:slurm /var/lib/slurm + - mkdir /var/log/slurm + - chown slurm:slurm /var/log/slurm + - usermod -u 616 munge + - groupmod -g 616 munge + - find / -writable -uid 991 -type d -exec chown -R munge:munge \{\} \; + - sshpass -p rocky scp -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null rocky@172.16.0.254:/home/rocky/munge.key /etc/munge/munge.key + - chown munge:munge /etc/munge/munge.key + - systemctl enable --now munge + - systemctl enable --now slurmd + - systemctl stop firewalld + - systemctl disable firewalld + - nft flush ruleset +``` + +{{< /tab >}} +{{< tab "VM Head" >}} + ```yaml ## template: jinja #cloud-config @@ -1322,10 +1621,49 @@ users: - name: root ssh_authorized_keys: {{ ds.meta_data.instance_data.v1.public_keys }} disable_root: false + +write_files: +- path: /etc/hosts + append: true + content: | + 172.16.0.254 demo.openchami.cluster demo head + 172.16.0.1 de01.openchami.cluster de01 + +- path: /etc/fstab + content: | + demo.openchami.cluster:/home /home nfs defaults 0 0 + demo.openchami.cluster:/etc/slurm /etc/slurm nfs defaults 0 0 + +bootcmd: + - hostnamectl set-hostname de01.openchami.cluster + - groupadd -g 666 slurm + - useradd -m -c "Slurm workload manager" -d /var/lib/slurm -u 666 -g slurm -s /sbin/nologin slurm + +runcmd: + - nmcli connection modify "System enp1s0" ipv4.dns "172.16.0.254 8.8.8.8" + - systemctl restart NetworkManager + - systemctl daemon-reload + - mount -a + - chown -R slurm:slurm /var/lib/slurm + - mkdir /var/log/slurm + - chown slurm:slurm /var/log/slurm + - usermod -u 616 munge + - groupmod -g 616 munge + - find / -writable -uid 991 -type d -exec chown -R munge:munge \{\} \; + - sshpass -p rocky scp -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null 
rocky@172.16.0.254:/home/rocky/munge.key /etc/munge/munge.key + - chown munge:munge /etc/munge/munge.key + - systemctl enable --now munge + - systemctl enable --now slurmd + - systemctl stop firewalld + - systemctl disable firewalld + - nft flush ruleset ``` +{{< /tab >}} +{{< /tabs >}} + `ochami` has basic per-group template rendering available that can be used to -check that the Jinja2 is rendering properly for a node. Check if for the first +check that the Jinja2 is rendering properly for a node. Check it for the first compute node (x1000c0s0b0n0): ```bash @@ -1340,7 +1678,117 @@ make sure that the `IMPERSONATION` environment variable is set in The SSH key that was created above should appear in the config: +{{< tabs "compute-node-config-outputs" >}} +{{< tab "Bare Metal Head" >}} + +```yaml +## template: jinja +#cloud-config +merge_how: +- name: list + settings: [append] +- name: dict + settings: [no_replace, recurse_list] +users: + - name: root + ssh_authorized_keys: [''] +disable_root: false + +write_files: +- path: /etc/hosts + append: true + content: | + 172.16.0.254 demo.openchami.cluster demo + 172.16.0.1 de01.openchami.cluster de01 + +- path: /etc/fstab + content: | + demo.openchami.cluster:/home /home nfs defaults 0 0 + demo.openchami.cluster:/etc/slurm /etc/slurm nfs defaults 0 0 + +bootcmd: + - hostnamectl set-hostname de01.openchami.cluster + - groupadd -g 666 slurm + - useradd -m -c "Slurm workload manager" -d /var/lib/slurm -u 666 -g slurm -s /sbin/nologin slurm + +runcmd: + - nmcli connection modify "System enp1s0" ipv4.dns "172.16.0.254 8.8.8.8" + - systemctl restart NetworkManager + - systemctl daemon-reload + - mount -a + - chown -R slurm:slurm /var/lib/slurm + - mkdir /var/log/slurm + - chown slurm:slurm /var/log/slurm + - usermod -u 616 munge + - groupmod -g 616 munge + - find / -writable -uid 991 -type d -exec chown -R munge:munge \{\} \; + - sshpass -p rocky scp -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null 
rocky@172.16.0.254:/home/rocky/munge.key /etc/munge/munge.key + - chown munge:munge /etc/munge/munge.key + - systemctl enable --now munge + - systemctl enable --now slurmd + - systemctl stop firewalld + - systemctl disable firewalld + - nft flush ruleset +``` + +{{< /tab >}} +{{< tab "Cloud Instance Head" >}} + ```yaml +## template: jinja +#cloud-config +merge_how: +- name: list + settings: [append] +- name: dict + settings: [no_replace, recurse_list] +users: + - name: root + ssh_authorized_keys: [''] +disable_root: false + +write_files: +- path: /etc/hosts + append: true + content: | + 172.16.0.254 demo.openchami.cluster demo + 172.16.0.1 de01.openchami.cluster de01 + +- path: /etc/fstab + content: | + demo.openchami.cluster:/home /home nfs defaults 0 0 + demo.openchami.cluster:/etc/slurm /etc/slurm nfs defaults 0 0 + +bootcmd: + - hostnamectl set-hostname de01.openchami.cluster + - groupadd -g 666 slurm + - useradd -m -c "Slurm workload manager" -d /var/lib/slurm -u 666 -g slurm -s /sbin/nologin slurm + +runcmd: + - nmcli connection modify "System enp1s0" ipv4.dns "172.16.0.254 8.8.8.8" + - systemctl restart NetworkManager + - systemctl daemon-reload + - mount -a + - chown -R slurm:slurm /var/lib/slurm + - mkdir /var/log/slurm + - chown slurm:slurm /var/log/slurm + - usermod -u 616 munge + - groupmod -g 616 munge + - find / -writable -uid 991 -type d -exec chown -R munge:munge \{\} \; + - sshpass -p rocky scp -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null rocky@172.16.0.254:/home/rocky/munge.key /etc/munge/munge.key + - chown munge:munge /etc/munge/munge.key + - systemctl enable --now munge + - systemctl enable --now slurmd + - systemctl stop firewalld + - systemctl disable firewalld + - nft flush ruleset +``` + +{{< /tab >}} +{{< tab "VM Head" >}} + +```yaml +## template: jinja #cloud-config merge_how: - name: list @@ -1350,8 +1798,47 @@ merge_how: users: - name: root ssh_authorized_keys: [''] +disable_root: false + +write_files: +- path: /etc/hosts 
+ append: true + content: | + 172.16.0.254 demo.openchami.cluster demo head + 172.16.0.1 de01.openchami.cluster de01 + +- path: /etc/fstab + content: | + demo.openchami.cluster:/home /home nfs defaults 0 0 + demo.openchami.cluster:/etc/slurm /etc/slurm nfs defaults 0 0 + +bootcmd: + - hostnamectl set-hostname de01.openchami.cluster + - groupadd -g 666 slurm + - useradd -m -c "Slurm workload manager" -d /var/lib/slurm -u 666 -g slurm -s /sbin/nologin slurm + +runcmd: + - nmcli connection modify "System enp1s0" ipv4.dns "172.16.0.254 8.8.8.8" + - systemctl restart NetworkManager + - systemctl daemon-reload + - mount -a + - chown -R slurm:slurm /var/lib/slurm + - mkdir /var/log/slurm + - chown slurm:slurm /var/log/slurm + - usermod -u 616 munge + - groupmod -g 616 munge + - find / -writable -uid 991 -type d -exec chown -R munge:munge \{\} \; + - sshpass -p rocky scp -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null rocky@172.16.0.254:/home/rocky/munge.key /etc/munge/munge.key + - chown munge:munge /etc/munge/munge.key + - systemctl enable --now munge + - systemctl enable --now slurmd + - systemctl stop firewalld + - systemctl disable firewalld + - nft flush ruleset ``` +{{< /tab >}} +{{< /tabs >}} ## 1.6 Boot the Compute Node with the Slurm Compute Image @@ -1548,736 +2035,49 @@ dnf remove -y munge-libs-0.5.13- munge-0.5.13- ``` {{< /callout >}} -Create slurm config file that is identical to that of the head node. Note that you may need to update the `NodeName` info depending on the configuration of your compute node: +Restart Slurm service daemons in the **head node**: + +```bash +sudo systemctl restart slurmdbd +sudo systemctl restart slurmctld +``` + +## 1.8 Test Munge and Slurm + +Test munge on the **head node**: + +```bash +# Try to munge and unmunge to access the compute node +munge -n | ssh root@172.16.0.1 unmunge +``` + +The output should be: + +``` +STATUS: Success (0) +ENCODE_HOST: ??? 
(192.168.200.2)
+ENCODE_TIME: 2026-02-13 05:33:34 +0000 (1770960814)
+DECODE_TIME: 2026-02-13 05:33:34 +0000 (1770960814)
+TTL: 300
+CIPHER: aes128 (4)
+MAC: sha256 (5)
+ZIP: none (0)
+UID: ??? (1000)
+GID: ??? (1000)
+LENGTH: 0
+```

{{< callout context="note" title="Note" icon="outline/info-circle" >}}
-If the head node is in a VM (see [**Head Node: Using Virtual
-Machine**](https://openchami.org/docs/tutorial/#05-head-node-using-virtual-machine)),
-the `SlurmctldHost` will be `head` instead of `demo`.
+In the case of an error about "Offending ECDSA key in ~/.ssh/known_hosts:3", remove the compute node's entry from the known hosts file and try the command again:
+
+```
+ssh-keygen -R 172.16.0.1
+```
+
+Alternatively, set up an `ignore.conf` file per [Section 2.8.3](https://openchami.org/docs/tutorial/#283-logging-into-the-compute-node) of the tutorial to prevent this issue.
{{< /callout >}}

-**Edit the Slurm config file as root: `/etc/slurm/slurm.conf`**
-
-{{< tabs "slurm-config-computenode" >}}
-{{< tab "Bare Metal Head" >}}
-```bash {title="/etc/slurm/slurm.conf"}
-#
-ClusterName=demo
-SlurmctldHost=demo
-#
-#DisableRootJobs=NO
-EnforcePartLimits=ALL
-#Epilog=
-#EpilogSlurmctld=
-#FirstJobId=1
-#MaxJobId=67043328
-#GresTypes=
-#GroupUpdateForce=0
-#GroupUpdateTime=600
-#JobFileAppend=0
-JobRequeue=0
-#JobSubmitPlugins=lua
-KillOnBadExit=1
-#LaunchType=launch/slurm
-#Licenses=foo*4,bar
-#MailProg=/bin/mail
-#MaxJobCount=10000
-#MaxStepCount=40000
-#MaxTasksPerNode=512
-MpiDefault=pmix
-#MpiParams=ports=#-#
-#PluginDir=
-#PlugStackConfig=
-PrivateData=accounts,jobs,reservations,usage,users
-ProctrackType=proctrack/linuxproc
-#Prolog=
-PrologFlags=Contain
-#PrologSlurmctld=
-#PropagatePrioProcess=0
-PropagateResourceLimits=NONE
-#PropagateResourceLimitsExcept=
-#RebootProgram=
-ReturnToService=2
-SlurmctldPidFile=/var/run/slurm/slurmctld.pid
-SlurmctldPort=6817
-SlurmdPidFile=/var/run/slurm/slurmd.pid
-SlurmdPort=6818
-SlurmdSpoolDir=/var/spool/slurmd
-SlurmUser=slurm -SlurmdUser=root -#SrunEpilog= -#SrunProlog= -StateSaveLocation=/var/spool/slurmctld -SwitchType=switch/none -#TaskEpilog= -TaskPlugin=task/none -#TaskProlog= -#TopologyPlugin=topology/tree -#TmpFS=/tmp -#TrackWCKey=no -#TreeWidth= -#UnkillableStepProgram= -#UsePAM=0 -# -# -# TIMERS -#BatchStartTimeout=10 -CompleteWait=32 -#EpilogMsgTime=2000 -#GetEnvTimeout=2 -#HealthCheckInterval=0 -#HealthCheckProgram= -InactiveLimit=300 -KillWait=30 -MessageTimeout=30 -#ResvOverRun=0 -MinJobAge=300 -#OverTimeLimit=0 -SlurmctldTimeout=120 -SlurmdTimeout=300 -#UnkillableStepTimeout=60 -#VSizeFactor=0 -Waittime=0 -# -# -# SCHEDULING -DefMemPerCPU=2048 -#MaxMemPerCPU=0 -#SchedulerTimeSlice=30 -SchedulerType=sched/backfill -SelectType=select/cons_tres -SelectTypeParameters=CR_Core_Memory -SchedulerParameters=defer,bf_continue,bf_interval=60,bf_resolution=300,bf_window=1440,bf_busy_nodes,default_queue_depth=1000,bf_max_job_start=200,bf_max_job_test=500,max_switch_wait=1800 -DependencyParameters=kill_invalid_depend -# -# -# JOB PRIORITY -#PriorityFlags= -#PriorityType=priority/multifactor -#PriorityDecayHalfLife= -#PriorityCalcPeriod= -#PriorityFavorSmall= -#PriorityMaxAge= -#PriorityUsageResetPeriod= -#PriorityWeightAge= -#PriorityWeightFairshare= -#PriorityWeightJobSize= -#PriorityWeightPartition= -#PriorityWeightQOS= -# -# -# LOGGING AND ACCOUNTING -AccountingStorageEnforce=safe,associations,limits,qos -#AccountingStorageHost= -#AccountingStoragePass= -#AccountingStoragePort= -AccountingStorageType=accounting_storage/slurmdbd -#AccountingStorageUser= -#AccountingStoreFlags= -#JobCompHost= -#JobCompLoc= -#JobCompPass= -#JobCompPort= -JobCompType=jobcomp/none -#JobCompUser= -JobContainerType=job_container/tmpfs -JobAcctGatherFrequency=30 -JobAcctGatherType=jobacct_gather/cgroup -SlurmctldDebug=info -SlurmctldLogFile=/var/log/slurm/slurmctld.log -SlurmdDebug=info -SlurmdLogFile=/var/log/slurm/slurmd.log -#SlurmSchedLogFile= -#SlurmSchedLogLevel= -#DebugFlags= -# -# -# 
POWER SAVE SUPPORT FOR IDLE NODES (optional) -#SuspendProgram= -#ResumeProgram= -#SuspendTimeout= -#ResumeTimeout= -#ResumeRate= -#SuspendExcNodes= -#SuspendExcParts= -#SuspendRate= -#SuspendTime= -# -# -# CUSTOM CONFIGS -LaunchParameters=use_interactive_step -#SlurmctldParameters=enable_configless -# -# -# COMPUTE NODES ## GET CONF WITH `slurmd -C` -NodeName=de01 CPUs=1 Boards=1 SocketsPerBoard=1 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=3892 - -PartitionName=main Nodes=de01 Default=YES State=UP OverSubscribe=NO PreemptMode=OFF -``` -{{< /tab >}} -{{< tab "Cloud Instance Head" >}} -```bash {title="/etc/slurm/slurm.conf"} -# -ClusterName=demo -SlurmctldHost=demo -# -#DisableRootJobs=NO -EnforcePartLimits=ALL -#Epilog= -#EpilogSlurmctld= -#FirstJobId=1 -#MaxJobId=67043328 -#GresTypes= -#GroupUpdateForce=0 -#GroupUpdateTime=600 -#JobFileAppend=0 -JobRequeue=0 -#JobSubmitPlugins=lua -KillOnBadExit=1 -#LaunchType=launch/slurm -#Licenses=foo*4,bar -#MailProg=/bin/mail -#MaxJobCount=10000 -#MaxStepCount=40000 -#MaxTasksPerNode=512 -MpiDefault=pmix -#MpiParams=ports=#-# -#PluginDir= -#PlugStackConfig= -PrivateData=accounts,jobs,reservations,usage,users -ProctrackType=proctrack/linuxproc -#Prolog= -PrologFlags=Contain -#PrologSlurmctld= -#PropagatePrioProcess=0 -PropagateResourceLimits=NONE -#PropagateResourceLimitsExcept= -#RebootProgram= -ReturnToService=2 -SlurmctldPidFile=/var/run/slurm/slurmctld.pid -SlurmctldPort=6817 -SlurmdPidFile=/var/run/slurm/slurmd.pid -SlurmdPort=6818 -SlurmdSpoolDir=/var/spool/slurmd -SlurmUser=slurm -SlurmdUser=root -#SrunEpilog= -#SrunProlog= -StateSaveLocation=/var/spool/slurmctld -SwitchType=switch/none -#TaskEpilog= -TaskPlugin=task/none -#TaskProlog= -#TopologyPlugin=topology/tree -#TmpFS=/tmp -#TrackWCKey=no -#TreeWidth= -#UnkillableStepProgram= -#UsePAM=0 -# -# -# TIMERS -#BatchStartTimeout=10 -CompleteWait=32 -#EpilogMsgTime=2000 -#GetEnvTimeout=2 -#HealthCheckInterval=0 -#HealthCheckProgram= -InactiveLimit=300 -KillWait=30 
-MessageTimeout=30 -#ResvOverRun=0 -MinJobAge=300 -#OverTimeLimit=0 -SlurmctldTimeout=120 -SlurmdTimeout=300 -#UnkillableStepTimeout=60 -#VSizeFactor=0 -Waittime=0 -# -# -# SCHEDULING -DefMemPerCPU=2048 -#MaxMemPerCPU=0 -#SchedulerTimeSlice=30 -SchedulerType=sched/backfill -SelectType=select/cons_tres -SelectTypeParameters=CR_Core_Memory -SchedulerParameters=defer,bf_continue,bf_interval=60,bf_resolution=300,bf_window=1440,bf_busy_nodes,default_queue_depth=1000,bf_max_job_start=200,bf_max_job_test=500,max_switch_wait=1800 -DependencyParameters=kill_invalid_depend -# -# -# JOB PRIORITY -#PriorityFlags= -#PriorityType=priority/multifactor -#PriorityDecayHalfLife= -#PriorityCalcPeriod= -#PriorityFavorSmall= -#PriorityMaxAge= -#PriorityUsageResetPeriod= -#PriorityWeightAge= -#PriorityWeightFairshare= -#PriorityWeightJobSize= -#PriorityWeightPartition= -#PriorityWeightQOS= -# -# -# LOGGING AND ACCOUNTING -AccountingStorageEnforce=safe,associations,limits,qos -#AccountingStorageHost= -#AccountingStoragePass= -#AccountingStoragePort= -AccountingStorageType=accounting_storage/slurmdbd -#AccountingStorageUser= -#AccountingStoreFlags= -#JobCompHost= -#JobCompLoc= -#JobCompPass= -#JobCompPort= -JobCompType=jobcomp/none -#JobCompUser= -JobContainerType=job_container/tmpfs -JobAcctGatherFrequency=30 -JobAcctGatherType=jobacct_gather/cgroup -SlurmctldDebug=info -SlurmctldLogFile=/var/log/slurm/slurmctld.log -SlurmdDebug=info -SlurmdLogFile=/var/log/slurm/slurmd.log -#SlurmSchedLogFile= -#SlurmSchedLogLevel= -#DebugFlags= -# -# -# POWER SAVE SUPPORT FOR IDLE NODES (optional) -#SuspendProgram= -#ResumeProgram= -#SuspendTimeout= -#ResumeTimeout= -#ResumeRate= -#SuspendExcNodes= -#SuspendExcParts= -#SuspendRate= -#SuspendTime= -# -# -# CUSTOM CONFIGS -LaunchParameters=use_interactive_step -#SlurmctldParameters=enable_configless -# -# -# COMPUTE NODES ## GET CONF WITH `slurmd -C` -NodeName=de01 CPUs=1 Boards=1 SocketsPerBoard=1 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=3892 - 
-PartitionName=main Nodes=de01 Default=YES State=UP OverSubscribe=NO PreemptMode=OFF -``` -{{< /tab >}} -{{< tab "VM Head" >}} -```bash {title="/etc/slurm/slurm.conf"} -# -ClusterName=demo -SlurmctldHost=head -# -#DisableRootJobs=NO -EnforcePartLimits=ALL -#Epilog= -#EpilogSlurmctld= -#FirstJobId=1 -#MaxJobId=67043328 -#GresTypes= -#GroupUpdateForce=0 -#GroupUpdateTime=600 -#JobFileAppend=0 -JobRequeue=0 -#JobSubmitPlugins=lua -KillOnBadExit=1 -#LaunchType=launch/slurm -#Licenses=foo*4,bar -#MailProg=/bin/mail -#MaxJobCount=10000 -#MaxStepCount=40000 -#MaxTasksPerNode=512 -MpiDefault=pmix -#MpiParams=ports=#-# -#PluginDir= -#PlugStackConfig= -PrivateData=accounts,jobs,reservations,usage,users -ProctrackType=proctrack/linuxproc -#Prolog= -PrologFlags=Contain -#PrologSlurmctld= -#PropagatePrioProcess=0 -PropagateResourceLimits=NONE -#PropagateResourceLimitsExcept= -#RebootProgram= -ReturnToService=2 -SlurmctldPidFile=/var/run/slurm/slurmctld.pid -SlurmctldPort=6817 -SlurmdPidFile=/var/run/slurm/slurmd.pid -SlurmdPort=6818 -SlurmdSpoolDir=/var/spool/slurmd -SlurmUser=slurm -SlurmdUser=root -#SrunEpilog= -#SrunProlog= -StateSaveLocation=/var/spool/slurmctld -SwitchType=switch/none -#TaskEpilog= -TaskPlugin=task/none -#TaskProlog= -#TopologyPlugin=topology/tree -#TmpFS=/tmp -#TrackWCKey=no -#TreeWidth= -#UnkillableStepProgram= -#UsePAM=0 -# -# -# TIMERS -#BatchStartTimeout=10 -CompleteWait=32 -#EpilogMsgTime=2000 -#GetEnvTimeout=2 -#HealthCheckInterval=0 -#HealthCheckProgram= -InactiveLimit=300 -KillWait=30 -MessageTimeout=30 -#ResvOverRun=0 -MinJobAge=300 -#OverTimeLimit=0 -SlurmctldTimeout=120 -SlurmdTimeout=300 -#UnkillableStepTimeout=60 -#VSizeFactor=0 -Waittime=0 -# -# -# SCHEDULING -DefMemPerCPU=2048 -#MaxMemPerCPU=0 -#SchedulerTimeSlice=30 -SchedulerType=sched/backfill -SelectType=select/cons_tres -SelectTypeParameters=CR_Core_Memory 
-SchedulerParameters=defer,bf_continue,bf_interval=60,bf_resolution=300,bf_window=1440,bf_busy_nodes,default_queue_depth=1000,bf_max_job_start=200,bf_max_job_test=500,max_switch_wait=1800
-DependencyParameters=kill_invalid_depend
-#
-#
-# JOB PRIORITY
-#PriorityFlags=
-#PriorityType=priority/multifactor
-#PriorityDecayHalfLife=
-#PriorityCalcPeriod=
-#PriorityFavorSmall=
-#PriorityMaxAge=
-#PriorityUsageResetPeriod=
-#PriorityWeightAge=
-#PriorityWeightFairshare=
-#PriorityWeightJobSize=
-#PriorityWeightPartition=
-#PriorityWeightQOS=
-#
-#
-# LOGGING AND ACCOUNTING
-AccountingStorageEnforce=safe,associations,limits,qos
-#AccountingStorageHost=
-#AccountingStoragePass=
-#AccountingStoragePort=
-AccountingStorageType=accounting_storage/slurmdbd
-#AccountingStorageUser=
-#AccountingStoreFlags=
-#JobCompHost=
-#JobCompLoc=
-#JobCompPass=
-#JobCompPort=
-JobCompType=jobcomp/none
-#JobCompUser=
-JobContainerType=job_container/tmpfs
-JobAcctGatherFrequency=30
-JobAcctGatherType=jobacct_gather/cgroup
-SlurmctldDebug=info
-SlurmctldLogFile=/var/log/slurm/slurmctld.log
-SlurmdDebug=info
-SlurmdLogFile=/var/log/slurm/slurmd.log
-#SlurmSchedLogFile=
-#SlurmSchedLogLevel=
-#DebugFlags=
-#
-#
-# POWER SAVE SUPPORT FOR IDLE NODES (optional)
-#SuspendProgram=
-#ResumeProgram=
-#SuspendTimeout=
-#ResumeTimeout=
-#ResumeRate=
-#SuspendExcNodes=
-#SuspendExcParts=
-#SuspendRate=
-#SuspendTime=
-#
-#
-# CUSTOM CONFIGS
-LaunchParameters=use_interactive_step
-#SlurmctldParameters=enable_configless
-#
-#
-# COMPUTE NODES ## GET CONF WITH `slurmd -C`
-NodeName=de01 CPUs=1 Boards=1 SocketsPerBoard=1 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=3892
-
-PartitionName=main Nodes=de01 Default=YES State=UP OverSubscribe=NO PreemptMode=OFF
-```
-{{< /tab >}}
-{{< /tabs >}}
-
-Configure the hosts file with addresses for both the head node and the compute node:
-
-{{< tabs "compute-hosts" >}}
-{{< tab "Bare Metal Head" >}}
-
-```bash
-cat <<EOF >> /etc/hosts
-172.16.0.254 demo.openchami.cluster demo head
-172.16.0.1 de01.openchami.cluster de01
-EOF
-```
-{{< /tab >}}
-{{< tab "Cloud Instance Head" >}}
-
-```bash
-cat <<EOF >> /etc/hosts
-172.16.0.254 demo.openchami.cluster demo head
-172.16.0.1 de01.openchami.cluster de01
-EOF
-```
-{{< /tab >}}
-{{< tab "VM Head" >}}
-
-```bash
-cat <<EOF >> /etc/hosts
-172.16.0.254 demo.openchami.cluster demo head
-172.16.0.1 de01.openchami.cluster de01
-EOF
-```
-{{< /tab >}}
-{{< /tabs >}}
-
-Create the Slurm user on the compute node:
-
-```bash
-SLURMID=666
-groupadd -g $SLURMID slurm
-useradd -m -c "Slurm workload manager" -d /var/lib/slurm -u $SLURMID -g slurm -s /sbin/nologin slurm
-```
-
-Update Slurm file and directory ownership:
-
-```bash
-chown -R slurm:slurm /etc/slurm/
-chown -R slurm:slurm /var/lib/slurm
-```
-
-{{< callout context="note" title="Note" icon="outline/info-circle" >}}
-Use `find / -name "slurm"` to make sure everything that needs to be changed is identified. Note that not all results need ownership modified though, such as directories under `/run/`, `/usr/` or `/var/`!
-{{< /callout >}}
-
-Create the directory `/var/log/slurm`, which doesn't exist yet, and set its ownership to the `slurm` user:
-
-```bash
-mkdir /var/log/slurm
-chown slurm:slurm /var/log/slurm
-```
-
-Create a `job_container.conf` file that matches the one on the head node:
-
-```bash
-SLURMTMPDIR=/lscratch
-
-cat <<EOF > /etc/slurm/job_container.conf
-AutoBasePath=true
-BasePath=$SLURMTMPDIR
-EOF
-```
-
-Update the munge user and group IDs to match the head node:
-
-```bash
-usermod -u 616 munge
-groupmod -g 616 munge
-```
-
-{{< callout context="note" title="Note" icon="outline/info-circle" >}}
-If you get the following error:
-`usermod: user munge is currently used by process <PID>`
-
-Kill the process and repeat the above two commands:
-`kill -15 <PID>`
-{{< /callout >}}
-
-Update munge file/directory ownership:
-
-```bash
-find / -mount -writable -type d -uid 991 -exec chown -R munge:munge \{\} \;
-```
-
-Copy the munge key from the head node to the compute node.
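Munge authenticates peers with this shared key, so the file must end up byte-identical on every node. A quick sketch for spotting a mismatch (run on each node once the copy below is done; `sha256sum` is assumed to be installed):

```shell
# Print the key's SHA-256 hash; the output must be identical on the head
# node and the compute node, otherwise munge decode will fail.
sha256sum /etc/munge/munge.key 2>/dev/null || echo "munge.key not present on this machine"
```

If the hashes differ, repeat the copy and re-check the file's ownership (`munge:munge`) and permissions.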
- -**Inside the head node:** - -```bash -cd ~ -sudo cp /etc/munge/munge.key ./ -sudo chown "$(id -u):$(id -u)" munge.key -scp ./munge.key root@172.16.0.1:~/ -``` - -**Inside the compute node:** - -```bash -mv munge.key /etc/munge/munge.key -chown munge:munge /etc/munge/munge.key -``` - -{{< callout context="note" title="Note" icon="outline/info-circle" >}} -In the case of an error about "Offending ECDSA key in ~/.ssh/known_hosts:3", remove the compute node from the known hosts file and try the 'scp' command again: - -``` -ssh-keygen -R 172.16.0.1 -``` - -Alternatively, setup an `ignore.conf` file per [Section 2.8.3](https://openchami.org/docs/tutorial/#283-logging-into-the-compute-node) of the tutorial, to prevent this issue. -{{< /callout >}} - -Continuing **inside the compute node**, setup and start the services for Slurm. - -Enable and start munge service: - -```bash -systemctl enable munge.service -systemctl start munge.service -systemctl status munge.service -``` - -The output should be: - -``` -● munge.service - MUNGE authentication service - Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; preset: disabled) - Active: active (running) since Wed 2026-02-04 00:55:55 UTC; 1 week 2 days ago - Docs: man:munged(8) - Main PID: 1451 (munged) - Tasks: 4 (limit: 24335) - Memory: 2.2M (peak: 2.5M) - CPU: 4.710s - CGroup: /system.slice/munge.service - └─1451 /usr/sbin/munged - -Feb 04 00:55:55 de01 systemd[1]: Started MUNGE authentication service. 
-``` - -Enable and start slurmd: - -```bash -systemctl enable slurmd -systemctl start slurmd -systemctl status slurmd -``` - -The output should be: - -``` -● slurmd.service - Slurm node daemon - Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; preset: disabled) - Active: active (running) since Fri 2026-02-13 05:59:32 UTC; 4s ago - Main PID: 30727 (slurmd) - Tasks: 1 - Memory: 1.3M (peak: 1.5M) - CPU: 16ms - CGroup: /system.slice/slurmd.service - └─30727 /usr/sbin/slurmd --systemd - -Feb 13 05:59:32 de01.openchami.cluster systemd[1]: Stopped Slurm node daemon. -Feb 13 05:59:32 de01.openchami.cluster systemd[1]: slurmd.service: Consumed 3.533s CPU time, 3.0M memory peak. -Feb 13 05:59:32 de01.openchami.cluster systemd[1]: Starting Slurm node daemon... -Feb 13 05:59:32 de01.openchami.cluster slurmd[30727]: slurmd: _read_slurm_cgroup_conf: No cgroup.conf file (/etc/slurm/cgroup.conf), using defaults -Feb 13 05:59:32 de01.openchami.cluster slurmd[30727]: _read_slurm_cgroup_conf: No cgroup.conf file (/etc/slurm/cgroup.conf), using defaults -Feb 13 05:59:32 de01.openchami.cluster slurmd[30727]: slurmd: CPU frequency setting not configured for this node -Feb 13 05:59:32 de01.openchami.cluster slurmd[30727]: slurmd: slurmd version 24.05.5 started -Feb 13 05:59:32 de01.openchami.cluster slurmd[30727]: slurmd: slurmd started on Fri, 13 Feb 2026 05:59:32 +0000 -Feb 13 05:59:32 de01.openchami.cluster systemd[1]: Started Slurm node daemon. 
-Feb 13 05:59:32 de01.openchami.cluster slurmd[30727]: slurmd: CPUs=1 Boards=1 Sockets=1 Cores=1 Threads=1 Memory=3892 TmpDisk=778 Uptime=796812 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null) -``` - -Disable the firewall and reset the nft ruleset in the compute node: - -```bash -systemctl stop firewalld -systemctl disable firewalld - -nft flush ruleset -nft list ruleset -``` - -Start Slurm service daemons in the **head node**: - -```bash -sudo systemctl start slurmdbd -sudo systemctl start slurmctld -``` - -Restart Slurm service daemons in the **compute node** to ensure changes are applied: - -```bash -systemctl restart slurmd -``` - -## 1.8 Test Munge and Slurm - -Test munge on the **head node**: - -```bash -# Try to munge and unmunge to access the compute node -munge -n | ssh root@172.16.0.1 unmunge -``` - -The output should be: - -``` -STATUS: Success (0) -ENCODE_HOST: ??? (192.168.200.2) -ENCODE_TIME: 2026-02-13 05:33:34 +0000 (1770960814) -DECODE_TIME: 2026-02-13 05:33:34 +0000 (1770960814) -TTL: 300 -CIPHER: aes128 (4) -MAC: sha256 (5) -ZIP: none (0) -UID: ??? (1000) -GID: ??? (1000) -LENGTH: 0 -``` - -{{< callout context="note" title="Note" icon="outline/info-circle" >}} -In the case of an error about "Offending ECDSA key in ~/.ssh/known_hosts:3", remove the compute node from the known hosts file and try the 'scp' command again: - -``` -ssh-keygen -R 172.16.0.1 -``` - -Alternatively, setup an `ignore.conf` file per [Section 2.8.3](https://openchami.org/docs/tutorial/#283-logging-into-the-compute-node) of the tutorial, to prevent this issue. -{{< /callout >}} - -Test that you can submit a job from the **head node**. +Test that you can submit a job from the **head node**. 
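A one-line interactive job is the quickest end-to-end check (a sketch; it assumes the `main` partition and `de01` node configured in `slurm.conf` earlier):

```shell
# Launch a single task on one node; with munge, slurmctld, and slurmd all
# healthy, this prints the compute node's hostname.
srun -N 1 -n 1 hostname || echo "srun failed -- check sinfo output and the slurmd log"
```

If the job stays pending, `squeue` and the slurmd log on the compute node (`/var/log/slurm/slurmd.log`) are the first places to look.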
Check that the node is present and idle: @@ -2322,5 +2122,3 @@ If something goes wrong and your compute node goes down, restart it with this co ```bash sudo scontrol update NodeName=de01 State=RESUME ``` - - From 19909cece2d8b859edc28d3fb11bcf92023fddc5 Mon Sep 17 00:00:00 2001 From: Luna Morrow Date: Wed, 18 Mar 2026 12:22:42 +1000 Subject: [PATCH 22/23] Apply minor formatting suggestions from David and update incorrect 'chown' command Signed-off-by: Luna Morrow --- content/docs/guides/install_slurm.md | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/content/docs/guides/install_slurm.md b/content/docs/guides/install_slurm.md index 7d64559..ba0636c 100644 --- a/content/docs/guides/install_slurm.md +++ b/content/docs/guides/install_slurm.md @@ -253,7 +253,7 @@ Copy the munge key to the normal user's home directory, so that the compute node ```bash sudo cp /etc/munge/munge.key ~/ -sudo chown "$(id -u):$(id -u)" munge.key +sudo chown "$(id -u):$(id -u)" ~/munge.key ``` {{< callout context="note" title="Note" icon="outline/info-circle" >}} @@ -2042,6 +2042,13 @@ sudo systemctl restart slurmdbd sudo systemctl restart slurmctld ``` +Now is the time to delete the copy of `munge.key` in the normal user's home directory on the **head node**: + +```bash +rm ~/munge.key +``` + + ## 1.8 Test Munge and Slurm Test munge on the **head node**: From dd1508888556a8bc100493b967f6c3ec29448fe2 Mon Sep 17 00:00:00 2001 From: Luna Morrow Date: Tue, 24 Mar 2026 12:08:09 +1000 Subject: [PATCH 23/23] Update documentation with small tweaks from David's suggestions - mainly fixing the name of the ACCESS and SECRET tokens for S3 and making a comment into a note to improve visibility Signed-off-by: Luna Morrow --- content/docs/guides/install_slurm.md | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/content/docs/guides/install_slurm.md b/content/docs/guides/install_slurm.md index ba0636c..8e8729a 100644 --- a/content/docs/guides/install_slurm.md +++ 
b/content/docs/guides/install_slurm.md
@@ -1149,7 +1149,11 @@ cmds:
 - cmd: 'dnf remove -y munge-libs-0.5.13-* munge-0.5.13-*'
 ```
 
-Run podman container to run image build command. The S3_ACCESS and S3_SECRET tokens are set in the tutorial [here](https://openchami.org/docs/tutorial/#233-install-and-configure-s3-clients).
+Run a Podman container to execute the image build command.
+
+{{< callout context="note" title="Note" icon="outline/info-circle" >}}
+The `ROOT_ACCESS_KEY` and `ROOT_SECRET_KEY` tokens are set in the tutorial [here](https://openchami.org/docs/tutorial/#233-install-and-configure-s3-clients).
+{{< /callout >}}
 
 ```bash
 podman run \