diff --git a/content/docs/guides/install_slurm.md b/content/docs/guides/install_slurm.md index 04223da..8e8729a 100644 --- a/content/docs/guides/install_slurm.md +++ b/content/docs/guides/install_slurm.md @@ -249,6 +249,21 @@ sudo -u munge /usr/sbin/mungekey -v sudo systemctl enable --now munge ``` +Copy the munge key to the normal user's home directory, so that the compute node can fetch it while booting. + +```bash +sudo cp /etc/munge/munge.key ~/ +sudo chown "$(id -u):$(id -u)" ~/munge.key +``` + +{{< callout context="note" title="Note" icon="outline/info-circle" >}} +Make sure to delete the copy of `munge.key` in the normal user's home directory after the compute node is setup! + +```bash +rm ~/munge.key +``` +{{< /callout >}} + Install mariaDB: ```bash @@ -920,6 +935,13 @@ EOF {{< /tab >}} {{< /tabs >}} +Start Slurm service daemons: + +```bash +sudo systemctl start slurmdbd +sudo systemctl start slurmctld +``` + ## 1.4 Make a Local Slurm Repository and Serve it with Nginx Create configuration file to mount into Nginx container: @@ -1079,18 +1101,23 @@ repos: - alias: 'Slurm' url: 'http://localhost:8080/slurm-24.05.5' - packages: - boxes - figlet - git - nfs-utils + - bind-utils + - openldap-clients + - sssd + - sssd-ldap + - oddjob-mkhomedir - tcpdump - traceroute - vim - curl - rpm-build - shadow-utils + - sshpass - pwgen - jq - libconfuse @@ -1119,10 +1146,14 @@ cmds: - cmd: 'rpmbuild -tb munge-0.5.18.tar.xz' - cmd: 'cd /root/rpmbuild' - cmd: 'rpm --install --verbose --force /root/rpmbuild/RPMS/x86_64/munge-0.5.18-1.el9.x86_64.rpm /root/rpmbuild/RPMS/x86_64/munge-debuginfo-0.5.18-1.el9.x86_64.rpm /root/rpmbuild/RPMS/x86_64/munge-debugsource-0.5.18-1.el9.x86_64.rpm /root/rpmbuild/RPMS/x86_64/munge-devel-0.5.18-1.el9.x86_64.rpm /root/rpmbuild/RPMS/x86_64/munge-libs-0.5.18-1.el9.x86_64.rpm /root/rpmbuild/RPMS/x86_64/munge-libs-debuginfo-0.5.18-1.el9.x86_64.rpm' - - cmd: 'dnf remove -y munge-libs-0.5.13-13.el9 munge-0.5.13-13.el9' + - cmd: 'dnf remove -y munge-libs-0.5.13-* munge-0.5.13-*' ``` -Run podman container to run image build command. The S3_ACCESS and S3_SECRET tokens are set in the tutorial [here](https://openchami.org/docs/tutorial/#233-install-and-configure-s3-clients). +Run podman container to run image build command. + +{{< callout context="note" title="Note" icon="outline/info-circle" >}} +The ROOT_ACCESS_KEY and ROOT_SECRET_KEY tokens are set in the tutorial [here](https://openchami.org/docs/tutorial/#233-install-and-configure-s3-clients). 
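+
+If the keys were exported as environment variables in the current shell (an assumption; the linked tutorial may store them elsewhere), a quick sanity check before running the build is:
+
+```bash
+# Abort early with a clear message if either key is missing from the environment.
+: "${ROOT_ACCESS_KEY:?ROOT_ACCESS_KEY is not set}"
+: "${ROOT_SECRET_KEY:?ROOT_SECRET_KEY is not set}"
+```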
+{{< /callout >}} ```bash podman run \ @@ -1277,6 +1308,127 @@ Configure cloud-init for compute group: **Edit as root: `/etc/openchami/data/cloud-init/ci-group-compute.yaml`** +{{< tabs "cloud-init-compute-configs" >}} +{{< tab "Bare Metal Head" >}} + +```yaml {title="/etc/openchami/data/cloud-init/ci-group-compute.yaml"} +- name: compute + description: "compute config" + file: + encoding: plain + content: | + ## template: jinja + #cloud-config + merge_how: + - name: list + settings: [append] + - name: dict + settings: [no_replace, recurse_list] + users: + - name: root + ssh_authorized_keys: {{ ds.meta_data.instance_data.v1.public_keys }} + disable_root: false + + write_files: + - path: /etc/hosts + append: true + content: | + 172.16.0.254 demo.openchami.cluster demo + 172.16.0.1 de01.openchami.cluster de01 + + - path: /etc/fstab + content: | + demo.openchami.cluster:/home /home nfs defaults 0 0 + demo.openchami.cluster:/etc/slurm /etc/slurm nfs defaults 0 0 + + bootcmd: + - hostnamectl set-hostname de01.openchami.cluster + - groupadd -g 666 slurm + - useradd -m -c "Slurm workload manager" -d /var/lib/slurm -u 666 -g slurm -s /sbin/nologin slurm + + runcmd: + - nmcli connection modify "System enp1s0" ipv4.dns "172.16.0.254 8.8.8.8" + - systemctl restart NetworkManager + - systemctl daemon-reload + - mount -a + - chown -R slurm:slurm /var/lib/slurm + - mkdir /var/log/slurm + - chown slurm:slurm /var/log/slurm + - usermod -u 616 munge + - groupmod -g 616 munge + - find / -writable -uid 991 -type d -exec chown -R munge:munge \{\} \; + - sshpass -p rocky scp -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null rocky@172.16.0.254:/home/rocky/munge.key /etc/munge/munge.key + - chown munge:munge /etc/munge/munge.key + - systemctl enable --now munge + - systemctl enable --now slurmd + - systemctl stop firewalld + - systemctl disable firewalld + - nft flush ruleset +``` + +{{< /tab >}} +{{< tab "Cloud Instance Head" >}} + +```yaml {title="/etc/openchami/data/cloud-init/ci-group-compute.yaml"} +- name: compute + description: "compute config" + file: + encoding: plain + content: | + ## template: jinja + #cloud-config + merge_how: + - name: list + settings: [append] + - name: dict + settings: [no_replace, recurse_list] + users: + - name: root + ssh_authorized_keys: {{ ds.meta_data.instance_data.v1.public_keys }} + disable_root: false + + write_files: + - path: /etc/hosts + append: true + content: | + 172.16.0.254 demo.openchami.cluster demo + 172.16.0.1 de01.openchami.cluster de01 + + - path: /etc/fstab + content: | + demo.openchami.cluster:/home /home nfs defaults 0 0 + demo.openchami.cluster:/etc/slurm /etc/slurm nfs defaults 0 0 + demo.openchami.cluster:/etc/sssd /etc/sssd nfs defaults 0 0 + demo.openchami.cluster:/etc/openldap /etc/openldap nfs defaults 0 0 + + bootcmd: + - hostnamectl set-hostname de01.openchami.cluster + - groupadd -g 666 slurm + - useradd -m -c "Slurm workload manager" -d /var/lib/slurm -u 666 -g slurm -s /sbin/nologin slurm + + runcmd: + - nmcli connection modify "System enp1s0" ipv4.dns "172.16.0.254 8.8.8.8" + - systemctl restart NetworkManager + - systemctl daemon-reload + - mount -a + - chown -R slurm:slurm /var/lib/slurm + - mkdir /var/log/slurm + - chown slurm:slurm /var/log/slurm + - usermod -u 616 munge + - groupmod -g 616 munge + - find / -writable -uid 991 -type d -exec chown -R munge:munge \{\} \; + - sshpass -p rocky scp -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null rocky@172.16.0.254:/home/rocky/munge.key /etc/munge/munge.key + - chown 
munge:munge /etc/munge/munge.key + - systemctl enable --now munge + - systemctl enable --now slurmd + - systemctl stop firewalld + - systemctl disable firewalld + - nft flush ruleset +``` + +{{< /tab >}} +{{< tab "VM Head" >}} + ```yaml {title="/etc/openchami/data/cloud-init/ci-group-compute.yaml"} - name: compute description: "compute config" @@ -1293,9 +1445,51 @@ Configure cloud-init for compute group: users: - name: root ssh_authorized_keys: {{ ds.meta_data.instance_data.v1.public_keys }} - disable_root: false + disable_root: false + + write_files: + - path: /etc/hosts + append: true + content: | + 172.16.0.254 demo.openchami.cluster demo head + 172.16.0.1 de01.openchami.cluster de01 + + - path: /etc/fstab + content: | + demo.openchami.cluster:/home /home nfs defaults 0 0 + demo.openchami.cluster:/etc/slurm /etc/slurm nfs defaults 0 0 + demo.openchami.cluster:/etc/sssd /etc/sssd nfs defaults 0 0 + demo.openchami.cluster:/etc/openldap /etc/openldap nfs defaults 0 0 + + bootcmd: + - hostnamectl set-hostname de01.openchami.cluster + - groupadd -g 666 slurm + - useradd -m -c "Slurm workload manager" -d /var/lib/slurm -u 666 -g slurm -s /sbin/nologin slurm + + runcmd: + - nmcli connection modify "System enp1s0" ipv4.dns "172.16.0.254 8.8.8.8" + - systemctl restart NetworkManager + - systemctl daemon-reload + - mount -a + - chown -R slurm:slurm /var/lib/slurm + - mkdir /var/log/slurm + - chown slurm:slurm /var/log/slurm + - usermod -u 616 munge + - groupmod -g 616 munge + - find / -writable -uid 991 -type d -exec chown -R munge:munge \{\} \; + - sshpass -p rocky scp -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null rocky@172.16.0.254:/home/rocky/munge.key /etc/munge/munge.key + - chown munge:munge /etc/munge/munge.key + - systemctl enable --now munge + - systemctl enable --now slurmd + - systemctl stop firewalld + - systemctl disable firewalld + - nft flush ruleset ``` +{{< /tab >}} +{{< /tabs >}} + + Now, set this configuration for the compute group: ```bash @@ -1310,6 +1504,115 @@ ochami cloud-init group get config compute The cloud-config file created within the YAML above should get print out: +{{< tabs "compute-config-outputs" >}} +{{< tab "Bare Metal Head" >}} + +```yaml +## template: jinja +#cloud-config +merge_how: +- name: list + settings: [append] +- name: dict + settings: [no_replace, recurse_list] +users: + - name: root + ssh_authorized_keys: {{ ds.meta_data.instance_data.v1.public_keys }} +disable_root: false + +write_files: +- path: /etc/hosts + append: true + content: | + 172.16.0.254 demo.openchami.cluster demo + 172.16.0.1 de01.openchami.cluster de01 + +- path: /etc/fstab + content: | + demo.openchami.cluster:/home /home nfs defaults 0 0 + demo.openchami.cluster:/etc/slurm /etc/slurm nfs defaults 0 0 + +bootcmd: + - hostnamectl set-hostname de01.openchami.cluster + - groupadd -g 666 slurm + - useradd -m -c "Slurm workload manager" -d /var/lib/slurm -u 666 -g slurm -s /sbin/nologin slurm + +runcmd: + - nmcli connection modify "System enp1s0" ipv4.dns "172.16.0.254 8.8.8.8" + - systemctl restart NetworkManager + - systemctl daemon-reload + - mount -a + - chown -R slurm:slurm /var/lib/slurm + - mkdir /var/log/slurm + - chown slurm:slurm /var/log/slurm + - usermod -u 616 munge + - groupmod -g 616 munge + - find / -writable -uid 991 -type d -exec chown -R munge:munge \{\} \; + - sshpass -p rocky scp -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null rocky@172.16.0.254:/home/rocky/munge.key /etc/munge/munge.key + - chown munge:munge /etc/munge/munge.key + - 
systemctl enable --now munge + - systemctl enable --now slurmd + - systemctl stop firewalld + - systemctl disable firewalld + - nft flush ruleset +``` + +{{< /tab >}} +{{< tab "Cloud Instance Head" >}} + +```yaml +## template: jinja +#cloud-config +merge_how: +- name: list + settings: [append] +- name: dict + settings: [no_replace, recurse_list] +users: + - name: root + ssh_authorized_keys: {{ ds.meta_data.instance_data.v1.public_keys }} +disable_root: false + +write_files: +- path: /etc/hosts + append: true + content: | + 172.16.0.254 demo.openchami.cluster demo + 172.16.0.1 de01.openchami.cluster de01 + +- path: /etc/fstab + content: | + demo.openchami.cluster:/home /home nfs defaults 0 0 + demo.openchami.cluster:/etc/slurm /etc/slurm nfs defaults 0 0 + +bootcmd: + - hostnamectl set-hostname de01.openchami.cluster + - groupadd -g 666 slurm + - useradd -m -c "Slurm workload manager" -d /var/lib/slurm -u 666 -g slurm -s /sbin/nologin slurm + +runcmd: + - nmcli connection modify "System enp1s0" ipv4.dns "172.16.0.254 8.8.8.8" + - systemctl restart NetworkManager + - systemctl daemon-reload + - mount -a + - chown -R slurm:slurm /var/lib/slurm + - mkdir /var/log/slurm + - chown slurm:slurm /var/log/slurm + - usermod -u 616 munge + - groupmod -g 616 munge + - find / -writable -uid 991 -type d -exec chown -R munge:munge \{\} \; + - sshpass -p rocky scp -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null rocky@172.16.0.254:/home/rocky/munge.key /etc/munge/munge.key + - chown munge:munge /etc/munge/munge.key + - systemctl enable --now munge + - systemctl enable --now slurmd + - systemctl stop firewalld + - systemctl disable firewalld + - nft flush ruleset +``` + +{{< /tab >}} +{{< tab "VM Head" >}} + ```yaml ## template: jinja #cloud-config @@ -1322,10 +1625,49 @@ users: - name: root ssh_authorized_keys: {{ ds.meta_data.instance_data.v1.public_keys }} disable_root: false + +write_files: +- path: /etc/hosts + append: true + content: | + 172.16.0.254 demo.openchami.cluster demo head + 172.16.0.1 de01.openchami.cluster de01 + +- path: /etc/fstab + content: | + demo.openchami.cluster:/home /home nfs defaults 0 0 + demo.openchami.cluster:/etc/slurm /etc/slurm nfs defaults 0 0 + +bootcmd: + - hostnamectl set-hostname de01.openchami.cluster + - groupadd -g 666 slurm + - useradd -m -c "Slurm workload manager" -d /var/lib/slurm -u 666 -g slurm -s /sbin/nologin slurm + +runcmd: + - nmcli connection modify "System enp1s0" ipv4.dns "172.16.0.254 8.8.8.8" + - systemctl restart NetworkManager + - systemctl daemon-reload + - mount -a + - chown -R slurm:slurm /var/lib/slurm + - mkdir /var/log/slurm + - chown slurm:slurm /var/log/slurm + - usermod -u 616 munge + - groupmod -g 616 munge + - find / -writable -uid 991 -type d -exec chown -R munge:munge \{\} \; + - sshpass -p rocky scp -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null rocky@172.16.0.254:/home/rocky/munge.key /etc/munge/munge.key + - chown munge:munge /etc/munge/munge.key + - systemctl enable --now munge + - systemctl enable --now slurmd + - systemctl stop firewalld + - systemctl disable firewalld + - nft flush ruleset ``` +{{< /tab >}} +{{< /tabs >}} + `ochami` has basic per-group template rendering available that can be used to -check that the Jinja2 is rendering properly for a node. Check if for the first +check that the Jinja2 is rendering properly for a node. 
Check it for the first compute node (x1000c0s0b0n0): ```bash @@ -1340,7 +1682,117 @@ make sure that the `IMPERSONATION` environment variable is set in The SSH key that was created above should appear in the config: +{{< tabs "compute-node-config-outputs" >}} +{{< tab "Bare Metal Head" >}} + +```yaml +## template: jinja +#cloud-config +merge_how: +- name: list + settings: [append] +- name: dict + settings: [no_replace, recurse_list] +users: + - name: root + ssh_authorized_keys: [''] +disable_root: false + +write_files: +- path: /etc/hosts + append: true + content: | + 172.16.0.254 demo.openchami.cluster demo + 172.16.0.1 de01.openchami.cluster de01 + +- path: /etc/fstab + content: | + demo.openchami.cluster:/home /home nfs defaults 0 0 + demo.openchami.cluster:/etc/slurm /etc/slurm nfs defaults 0 0 + +bootcmd: + - hostnamectl set-hostname de01.openchami.cluster + - groupadd -g 666 slurm + - useradd -m -c "Slurm workload manager" -d /var/lib/slurm -u 666 -g slurm -s /sbin/nologin slurm + +runcmd: + - nmcli connection modify "System enp1s0" ipv4.dns "172.16.0.254 8.8.8.8" + - systemctl restart NetworkManager + - systemctl daemon-reload + - mount -a + - chown -R slurm:slurm /var/lib/slurm + - mkdir /var/log/slurm + - chown slurm:slurm /var/log/slurm + - usermod -u 616 munge + - groupmod -g 616 munge + - find / -writable -uid 991 -type d -exec chown -R munge:munge \{\} \; + - sshpass -p rocky scp -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null rocky@172.16.0.254:/home/rocky/munge.key /etc/munge/munge.key + - chown munge:munge /etc/munge/munge.key + - systemctl enable --now munge + - systemctl enable --now slurmd + - systemctl stop firewalld + - systemctl disable firewalld + - nft flush ruleset +``` + +{{< /tab >}} +{{< tab "Cloud Instance Head" >}} + +```yaml +## template: jinja +#cloud-config +merge_how: +- name: list + settings: [append] +- name: dict + settings: [no_replace, recurse_list] +users: + - name: root + ssh_authorized_keys: [''] +disable_root: false + +write_files: +- path: /etc/hosts + append: true + content: | + 172.16.0.254 demo.openchami.cluster demo + 172.16.0.1 de01.openchami.cluster de01 + +- path: /etc/fstab + content: | + demo.openchami.cluster:/home /home nfs defaults 0 0 + demo.openchami.cluster:/etc/slurm /etc/slurm nfs defaults 0 0 + +bootcmd: + - hostnamectl set-hostname de01.openchami.cluster + - groupadd -g 666 slurm + - useradd -m -c "Slurm workload manager" -d /var/lib/slurm -u 666 -g slurm -s /sbin/nologin slurm + +runcmd: + - nmcli connection modify "System enp1s0" ipv4.dns "172.16.0.254 8.8.8.8" + - systemctl restart NetworkManager + - systemctl daemon-reload + - mount -a + - chown -R slurm:slurm /var/lib/slurm + - mkdir /var/log/slurm + - chown slurm:slurm /var/log/slurm + - usermod -u 616 munge + - groupmod -g 616 munge + - find / -writable -uid 991 -type d -exec chown -R munge:munge \{\} \; + - sshpass -p rocky scp -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null rocky@172.16.0.254:/home/rocky/munge.key /etc/munge/munge.key + - chown munge:munge /etc/munge/munge.key + - systemctl enable --now munge + - systemctl enable --now slurmd + - systemctl stop firewalld + - systemctl disable firewalld + - nft flush ruleset +``` + +{{< /tab >}} +{{< tab "VM Head" >}} + ```yaml +## template: jinja #cloud-config merge_how: - name: list @@ -1350,8 +1802,47 @@ merge_how: users: - name: root ssh_authorized_keys: [''] +disable_root: false + +write_files: +- path: /etc/hosts + append: true + content: | + 172.16.0.254 demo.openchami.cluster demo head + 
172.16.0.1 de01.openchami.cluster de01 + +- path: /etc/fstab + content: | + demo.openchami.cluster:/home /home nfs defaults 0 0 + demo.openchami.cluster:/etc/slurm /etc/slurm nfs defaults 0 0 + +bootcmd: + - hostnamectl set-hostname de01.openchami.cluster + - groupadd -g 666 slurm + - useradd -m -c "Slurm workload manager" -d /var/lib/slurm -u 666 -g slurm -s /sbin/nologin slurm + +runcmd: + - nmcli connection modify "System enp1s0" ipv4.dns "172.16.0.254 8.8.8.8" + - systemctl restart NetworkManager + - systemctl daemon-reload + - mount -a + - chown -R slurm:slurm /var/lib/slurm + - mkdir /var/log/slurm + - chown slurm:slurm /var/log/slurm + - usermod -u 616 munge + - groupmod -g 616 munge + - find / -writable -uid 991 -type d -exec chown -R munge:munge \{\} \; + - sshpass -p rocky scp -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null rocky@172.16.0.254:/home/rocky/munge.key /etc/munge/munge.key + - chown munge:munge /etc/munge/munge.key + - systemctl enable --now munge + - systemctl enable --now slurmd + - systemctl stop firewalld + - systemctl disable firewalld + - nft flush ruleset ``` +{{< /tab >}} +{{< /tabs >}} ## 1.6 Boot the Compute Node with the Slurm Compute Image @@ -1548,736 +2039,56 @@ dnf remove -y munge-libs-0.5.13- munge-0.5.13- ``` {{< /callout >}} -Create slurm config file that is identical to that of the head node. Note that you may need to update the `NodeName` info depending on the configuration of your compute node: +Restart Slurm service daemons in the **head node**: + +```bash +sudo systemctl restart slurmdbd +sudo systemctl restart slurmctld +``` + +Now is the time to delete the copy of `munge.key` in the normal user's home directory on the **head node**: + +```bash +rm ~/munge.key +``` + + +## 1.8 Test Munge and Slurm + +Test munge on the **head node**: + +```bash +# Try to munge and unmunge to access the compute node +munge -n | ssh root@172.16.0.1 unmunge +``` + +The output should be: + +``` +STATUS: Success (0) +ENCODE_HOST: ??? (192.168.200.2) +ENCODE_TIME: 2026-02-13 05:33:34 +0000 (1770960814) +DECODE_TIME: 2026-02-13 05:33:34 +0000 (1770960814) +TTL: 300 +CIPHER: aes128 (4) +MAC: sha256 (5) +ZIP: none (0) +UID: ??? (1000) +GID: ??? (1000) +LENGTH: 0 +``` {{< callout context="note" title="Note" icon="outline/info-circle" >}} -If the head node is in a VM (see [**Head Node: Using Virtual -Machine**](https://openchami.org/docs/tutorial/#05-head-node-using-virtual-machine)), -the `SlurmctldHost` will be `head` instead of `demo`. +In the case of an error about "Offending ECDSA key in ~/.ssh/known_hosts:3", remove the compute node from the known hosts file and try the 'scp' command again: + +``` +ssh-keygen -R 172.16.0.1 +``` + +Alternatively, setup an `ignore.conf` file per [Section 2.8.3](https://openchami.org/docs/tutorial/#283-logging-into-the-compute-node) of the tutorial, to prevent this issue. 
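+
+As a minimal sketch, such an override could be an SSH client drop-in that skips host-key checking for the compute network (the exact filename, location, and host pattern used in the tutorial may differ):
+
+```
+# Example contents of /etc/ssh/ssh_config.d/ignore.conf (path is an assumption)
+Host 172.16.0.*
+    StrictHostKeyChecking no
+    UserKnownHostsFile /dev/null
+```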
{{< /callout >}} -**Edit the Slurm config file as root: `/etc/slurm/slurm.conf`** - -{{< tabs "slurm-config-computenode" >}} -{{< tab "Bare Metal Head" >}} -```bash {title="/etc/slurm/slurm.conf"} -# -ClusterName=demo -SlurmctldHost=demo -# -#DisableRootJobs=NO -EnforcePartLimits=ALL -#Epilog= -#EpilogSlurmctld= -#FirstJobId=1 -#MaxJobId=67043328 -#GresTypes= -#GroupUpdateForce=0 -#GroupUpdateTime=600 -#JobFileAppend=0 -JobRequeue=0 -#JobSubmitPlugins=lua -KillOnBadExit=1 -#LaunchType=launch/slurm -#Licenses=foo*4,bar -#MailProg=/bin/mail -#MaxJobCount=10000 -#MaxStepCount=40000 -#MaxTasksPerNode=512 -MpiDefault=pmix -#MpiParams=ports=#-# -#PluginDir= -#PlugStackConfig= -PrivateData=accounts,jobs,reservations,usage,users -ProctrackType=proctrack/linuxproc -#Prolog= -PrologFlags=Contain -#PrologSlurmctld= -#PropagatePrioProcess=0 -PropagateResourceLimits=NONE -#PropagateResourceLimitsExcept= -#RebootProgram= -ReturnToService=2 -SlurmctldPidFile=/var/run/slurm/slurmctld.pid -SlurmctldPort=6817 -SlurmdPidFile=/var/run/slurm/slurmd.pid -SlurmdPort=6818 -SlurmdSpoolDir=/var/spool/slurmd -SlurmUser=slurm -SlurmdUser=root -#SrunEpilog= -#SrunProlog= -StateSaveLocation=/var/spool/slurmctld -SwitchType=switch/none -#TaskEpilog= -TaskPlugin=task/none -#TaskProlog= -#TopologyPlugin=topology/tree -#TmpFS=/tmp -#TrackWCKey=no -#TreeWidth= -#UnkillableStepProgram= -#UsePAM=0 -# -# -# TIMERS -#BatchStartTimeout=10 -CompleteWait=32 -#EpilogMsgTime=2000 -#GetEnvTimeout=2 -#HealthCheckInterval=0 -#HealthCheckProgram= -InactiveLimit=300 -KillWait=30 -MessageTimeout=30 -#ResvOverRun=0 -MinJobAge=300 -#OverTimeLimit=0 -SlurmctldTimeout=120 -SlurmdTimeout=300 -#UnkillableStepTimeout=60 -#VSizeFactor=0 -Waittime=0 -# -# -# SCHEDULING -DefMemPerCPU=2048 -#MaxMemPerCPU=0 -#SchedulerTimeSlice=30 -SchedulerType=sched/backfill -SelectType=select/cons_tres -SelectTypeParameters=CR_Core_Memory -SchedulerParameters=defer,bf_continue,bf_interval=60,bf_resolution=300,bf_window=1440,bf_busy_nodes,default_queue_depth=1000,bf_max_job_start=200,bf_max_job_test=500,max_switch_wait=1800 -DependencyParameters=kill_invalid_depend -# -# -# JOB PRIORITY -#PriorityFlags= -#PriorityType=priority/multifactor -#PriorityDecayHalfLife= -#PriorityCalcPeriod= -#PriorityFavorSmall= -#PriorityMaxAge= -#PriorityUsageResetPeriod= -#PriorityWeightAge= -#PriorityWeightFairshare= -#PriorityWeightJobSize= -#PriorityWeightPartition= -#PriorityWeightQOS= -# -# -# LOGGING AND ACCOUNTING -AccountingStorageEnforce=safe,associations,limits,qos -#AccountingStorageHost= -#AccountingStoragePass= -#AccountingStoragePort= -AccountingStorageType=accounting_storage/slurmdbd -#AccountingStorageUser= -#AccountingStoreFlags= -#JobCompHost= -#JobCompLoc= -#JobCompPass= -#JobCompPort= -JobCompType=jobcomp/none -#JobCompUser= -JobContainerType=job_container/tmpfs -JobAcctGatherFrequency=30 -JobAcctGatherType=jobacct_gather/cgroup -SlurmctldDebug=info -SlurmctldLogFile=/var/log/slurm/slurmctld.log -SlurmdDebug=info -SlurmdLogFile=/var/log/slurm/slurmd.log -#SlurmSchedLogFile= -#SlurmSchedLogLevel= -#DebugFlags= -# -# -# POWER SAVE SUPPORT FOR IDLE NODES (optional) -#SuspendProgram= -#ResumeProgram= -#SuspendTimeout= -#ResumeTimeout= -#ResumeRate= -#SuspendExcNodes= -#SuspendExcParts= -#SuspendRate= -#SuspendTime= -# -# -# CUSTOM CONFIGS -LaunchParameters=use_interactive_step -#SlurmctldParameters=enable_configless -# -# -# COMPUTE NODES ## GET CONF WITH `slurmd -C` -NodeName=de01 CPUs=1 Boards=1 SocketsPerBoard=1 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=3892 - 
-PartitionName=main Nodes=de01 Default=YES State=UP OverSubscribe=NO PreemptMode=OFF -``` -{{< /tab >}} -{{< tab "Cloud Instance Head" >}} -```bash {title="/etc/slurm/slurm.conf"} -# -ClusterName=demo -SlurmctldHost=demo -# -#DisableRootJobs=NO -EnforcePartLimits=ALL -#Epilog= -#EpilogSlurmctld= -#FirstJobId=1 -#MaxJobId=67043328 -#GresTypes= -#GroupUpdateForce=0 -#GroupUpdateTime=600 -#JobFileAppend=0 -JobRequeue=0 -#JobSubmitPlugins=lua -KillOnBadExit=1 -#LaunchType=launch/slurm -#Licenses=foo*4,bar -#MailProg=/bin/mail -#MaxJobCount=10000 -#MaxStepCount=40000 -#MaxTasksPerNode=512 -MpiDefault=pmix -#MpiParams=ports=#-# -#PluginDir= -#PlugStackConfig= -PrivateData=accounts,jobs,reservations,usage,users -ProctrackType=proctrack/linuxproc -#Prolog= -PrologFlags=Contain -#PrologSlurmctld= -#PropagatePrioProcess=0 -PropagateResourceLimits=NONE -#PropagateResourceLimitsExcept= -#RebootProgram= -ReturnToService=2 -SlurmctldPidFile=/var/run/slurm/slurmctld.pid -SlurmctldPort=6817 -SlurmdPidFile=/var/run/slurm/slurmd.pid -SlurmdPort=6818 -SlurmdSpoolDir=/var/spool/slurmd -SlurmUser=slurm -SlurmdUser=root -#SrunEpilog= -#SrunProlog= -StateSaveLocation=/var/spool/slurmctld -SwitchType=switch/none -#TaskEpilog= -TaskPlugin=task/none -#TaskProlog= -#TopologyPlugin=topology/tree -#TmpFS=/tmp -#TrackWCKey=no -#TreeWidth= -#UnkillableStepProgram= -#UsePAM=0 -# -# -# TIMERS -#BatchStartTimeout=10 -CompleteWait=32 -#EpilogMsgTime=2000 -#GetEnvTimeout=2 -#HealthCheckInterval=0 -#HealthCheckProgram= -InactiveLimit=300 -KillWait=30 -MessageTimeout=30 -#ResvOverRun=0 -MinJobAge=300 -#OverTimeLimit=0 -SlurmctldTimeout=120 -SlurmdTimeout=300 -#UnkillableStepTimeout=60 -#VSizeFactor=0 -Waittime=0 -# -# -# SCHEDULING -DefMemPerCPU=2048 -#MaxMemPerCPU=0 -#SchedulerTimeSlice=30 -SchedulerType=sched/backfill -SelectType=select/cons_tres -SelectTypeParameters=CR_Core_Memory -SchedulerParameters=defer,bf_continue,bf_interval=60,bf_resolution=300,bf_window=1440,bf_busy_nodes,default_queue_depth=1000,bf_max_job_start=200,bf_max_job_test=500,max_switch_wait=1800 -DependencyParameters=kill_invalid_depend -# -# -# JOB PRIORITY -#PriorityFlags= -#PriorityType=priority/multifactor -#PriorityDecayHalfLife= -#PriorityCalcPeriod= -#PriorityFavorSmall= -#PriorityMaxAge= -#PriorityUsageResetPeriod= -#PriorityWeightAge= -#PriorityWeightFairshare= -#PriorityWeightJobSize= -#PriorityWeightPartition= -#PriorityWeightQOS= -# -# -# LOGGING AND ACCOUNTING -AccountingStorageEnforce=safe,associations,limits,qos -#AccountingStorageHost= -#AccountingStoragePass= -#AccountingStoragePort= -AccountingStorageType=accounting_storage/slurmdbd -#AccountingStorageUser= -#AccountingStoreFlags= -#JobCompHost= -#JobCompLoc= -#JobCompPass= -#JobCompPort= -JobCompType=jobcomp/none -#JobCompUser= -JobContainerType=job_container/tmpfs -JobAcctGatherFrequency=30 -JobAcctGatherType=jobacct_gather/cgroup -SlurmctldDebug=info -SlurmctldLogFile=/var/log/slurm/slurmctld.log -SlurmdDebug=info -SlurmdLogFile=/var/log/slurm/slurmd.log -#SlurmSchedLogFile= -#SlurmSchedLogLevel= -#DebugFlags= -# -# -# POWER SAVE SUPPORT FOR IDLE NODES (optional) -#SuspendProgram= -#ResumeProgram= -#SuspendTimeout= -#ResumeTimeout= -#ResumeRate= -#SuspendExcNodes= -#SuspendExcParts= -#SuspendRate= -#SuspendTime= -# -# -# CUSTOM CONFIGS -LaunchParameters=use_interactive_step -#SlurmctldParameters=enable_configless -# -# -# COMPUTE NODES ## GET CONF WITH `slurmd -C` -NodeName=de01 CPUs=1 Boards=1 SocketsPerBoard=1 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=3892 - -PartitionName=main 
Nodes=de01 Default=YES State=UP OverSubscribe=NO PreemptMode=OFF -``` -{{< /tab >}} -{{< tab "VM Head" >}} -```bash {title="/etc/slurm/slurm.conf"} -# -ClusterName=demo -SlurmctldHost=head -# -#DisableRootJobs=NO -EnforcePartLimits=ALL -#Epilog= -#EpilogSlurmctld= -#FirstJobId=1 -#MaxJobId=67043328 -#GresTypes= -#GroupUpdateForce=0 -#GroupUpdateTime=600 -#JobFileAppend=0 -JobRequeue=0 -#JobSubmitPlugins=lua -KillOnBadExit=1 -#LaunchType=launch/slurm -#Licenses=foo*4,bar -#MailProg=/bin/mail -#MaxJobCount=10000 -#MaxStepCount=40000 -#MaxTasksPerNode=512 -MpiDefault=pmix -#MpiParams=ports=#-# -#PluginDir= -#PlugStackConfig= -PrivateData=accounts,jobs,reservations,usage,users -ProctrackType=proctrack/linuxproc -#Prolog= -PrologFlags=Contain -#PrologSlurmctld= -#PropagatePrioProcess=0 -PropagateResourceLimits=NONE -#PropagateResourceLimitsExcept= -#RebootProgram= -ReturnToService=2 -SlurmctldPidFile=/var/run/slurm/slurmctld.pid -SlurmctldPort=6817 -SlurmdPidFile=/var/run/slurm/slurmd.pid -SlurmdPort=6818 -SlurmdSpoolDir=/var/spool/slurmd -SlurmUser=slurm -SlurmdUser=root -#SrunEpilog= -#SrunProlog= -StateSaveLocation=/var/spool/slurmctld -SwitchType=switch/none -#TaskEpilog= -TaskPlugin=task/none -#TaskProlog= -#TopologyPlugin=topology/tree -#TmpFS=/tmp -#TrackWCKey=no -#TreeWidth= -#UnkillableStepProgram= -#UsePAM=0 -# -# -# TIMERS -#BatchStartTimeout=10 -CompleteWait=32 -#EpilogMsgTime=2000 -#GetEnvTimeout=2 -#HealthCheckInterval=0 -#HealthCheckProgram= -InactiveLimit=300 -KillWait=30 -MessageTimeout=30 -#ResvOverRun=0 -MinJobAge=300 -#OverTimeLimit=0 -SlurmctldTimeout=120 -SlurmdTimeout=300 -#UnkillableStepTimeout=60 -#VSizeFactor=0 -Waittime=0 -# -# -# SCHEDULING -DefMemPerCPU=2048 -#MaxMemPerCPU=0 -#SchedulerTimeSlice=30 -SchedulerType=sched/backfill -SelectType=select/cons_tres -SelectTypeParameters=CR_Core_Memory -SchedulerParameters=defer,bf_continue,bf_interval=60,bf_resolution=300,bf_window=1440,bf_busy_nodes,default_queue_depth=1000,bf_max_job_start=200,bf_max_job_test=500,max_switch_wait=1800 -DependencyParameters=kill_invalid_depend -# -# -# JOB PRIORITY -#PriorityFlags= -#PriorityType=priority/multifactor -#PriorityDecayHalfLife= -#PriorityCalcPeriod= -#PriorityFavorSmall= -#PriorityMaxAge= -#PriorityUsageResetPeriod= -#PriorityWeightAge= -#PriorityWeightFairshare= -#PriorityWeightJobSize= -#PriorityWeightPartition= -#PriorityWeightQOS= -# -# -# LOGGING AND ACCOUNTING -AccountingStorageEnforce=safe,associations,limits,qos -#AccountingStorageHost= -#AccountingStoragePass= -#AccountingStoragePort= -AccountingStorageType=accounting_storage/slurmdbd -#AccountingStorageUser= -#AccountingStoreFlags= -#JobCompHost= -#JobCompLoc= -#JobCompPass= -#JobCompPort= -JobCompType=jobcomp/none -#JobCompUser= -JobContainerType=job_container/tmpfs -JobAcctGatherFrequency=30 -JobAcctGatherType=jobacct_gather/cgroup -SlurmctldDebug=info -SlurmctldLogFile=/var/log/slurm/slurmctld.log -SlurmdDebug=info -SlurmdLogFile=/var/log/slurm/slurmd.log -#SlurmSchedLogFile= -#SlurmSchedLogLevel= -#DebugFlags= -# -# -# POWER SAVE SUPPORT FOR IDLE NODES (optional) -#SuspendProgram= -#ResumeProgram= -#SuspendTimeout= -#ResumeTimeout= -#ResumeRate= -#SuspendExcNodes= -#SuspendExcParts= -#SuspendRate= -#SuspendTime= -# -# -# CUSTOM CONFIGS -LaunchParameters=use_interactive_step -#SlurmctldParameters=enable_configless -# -# -# COMPUTE NODES ## GET CONF WITH `slurmd -C` -NodeName=de01 CPUs=1 Boards=1 SocketsPerBoard=1 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=3892 - -PartitionName=main Nodes=de01 Default=YES State=UP 
OverSubscribe=NO PreemptMode=OFF -``` -{{< /tab >}} -{{< /tabs >}} - -Configure the hosts file with addresses for both the head node and the compute node: - -{{< tabs "compute-hosts" >}} -{{< tab "Bare Metal Head" >}} - -```bash -cat <}} -{{< tab "Cloud Instance Head" >}} - -```bash -cat <}} -{{< tab "VM Head" >}} - -```bash -cat <}} -{{< /tabs >}} - -Create the Slurm user on the compute node: - -```bash -SLURMID=666 -groupadd -g $SLURMID slurm -useradd -m -c "Slurm workload manager" -d /var/lib/slurm -u $SLURMID -g slurm -s /sbin/nologin slurm -``` - -Update Slurm file and directory ownership: - -```bash -chown -R slurm:slurm /etc/slurm/ -chown -R slurm:slurm /var/lib/slurm -``` - -{{< callout context="note" title="Note" icon="outline/info-circle" >}} -Use `find / -name "slurm"` to make sure everything that needs to be changed is identified. Note that not all results need ownership modified though, such as directories under `/run/`, `/usr/` or `/var/`! -{{< /callout >}} - -Create the directory /var/log/slurm as it doesn't exist yet, and set ownership to Slurm: - -```bash -mkdir /var/log/slurm -chown slurm:slurm /var/log/slurm -``` - -Creating job_container.conf file that matches the one in the head node: - -```bash -SLURMTMPDIR=/lscratch - -cat <}} -If you get the following error: -`usermod: user munge is currently used by process ` - -Kill the process and repeat above two commands: -`kill -15 ` -{{< /callout >}} - -Update munge file/directory ownership: - -```bash -find / -mount -writable -type d -uid 991 -exec chown -R munge:munge \{\} \; -``` - -Copy the munge key from the head node to the compute node. - -**Inside the head node:** - -```bash -cd ~ -sudo cp /etc/munge/munge.key ./ -sudo chown "$(id -u):$(id -u)" munge.key -scp ./munge.key root@172.16.0.1:~/ -``` - -**Inside the compute node:** - -```bash -mv munge.key /etc/munge/munge.key -chown munge:munge /etc/munge/munge.key -``` - -{{< callout context="note" title="Note" icon="outline/info-circle" >}} -In the case of an error about "Offending ECDSA key in ~/.ssh/known_hosts:3", remove the compute node from the known hosts file and try the 'scp' command again: - -``` -ssh-keygen -R 172.16.0.1 -``` - -Alternatively, setup an `ignore.conf` file per [Section 2.8.3](https://openchami.org/docs/tutorial/#283-logging-into-the-compute-node) of the tutorial, to prevent this issue. -{{< /callout >}} - -Continuing **inside the compute node**, setup and start the services for Slurm. - -Enable and start munge service: - -```bash -systemctl enable munge.service -systemctl start munge.service -systemctl status munge.service -``` - -The output should be: - -``` -● munge.service - MUNGE authentication service - Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; preset: disabled) - Active: active (running) since Wed 2026-02-04 00:55:55 UTC; 1 week 2 days ago - Docs: man:munged(8) - Main PID: 1451 (munged) - Tasks: 4 (limit: 24335) - Memory: 2.2M (peak: 2.5M) - CPU: 4.710s - CGroup: /system.slice/munge.service - └─1451 /usr/sbin/munged - -Feb 04 00:55:55 de01 systemd[1]: Started MUNGE authentication service. 
-``` - -Enable and start slurmd: - -```bash -systemctl enable slurmd -systemctl start slurmd -systemctl status slurmd -``` - -The output should be: - -``` -● slurmd.service - Slurm node daemon - Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; preset: disabled) - Active: active (running) since Fri 2026-02-13 05:59:32 UTC; 4s ago - Main PID: 30727 (slurmd) - Tasks: 1 - Memory: 1.3M (peak: 1.5M) - CPU: 16ms - CGroup: /system.slice/slurmd.service - └─30727 /usr/sbin/slurmd --systemd - -Feb 13 05:59:32 de01.openchami.cluster systemd[1]: Stopped Slurm node daemon. -Feb 13 05:59:32 de01.openchami.cluster systemd[1]: slurmd.service: Consumed 3.533s CPU time, 3.0M memory peak. -Feb 13 05:59:32 de01.openchami.cluster systemd[1]: Starting Slurm node daemon... -Feb 13 05:59:32 de01.openchami.cluster slurmd[30727]: slurmd: _read_slurm_cgroup_conf: No cgroup.conf file (/etc/slurm/cgroup.conf), using defaults -Feb 13 05:59:32 de01.openchami.cluster slurmd[30727]: _read_slurm_cgroup_conf: No cgroup.conf file (/etc/slurm/cgroup.conf), using defaults -Feb 13 05:59:32 de01.openchami.cluster slurmd[30727]: slurmd: CPU frequency setting not configured for this node -Feb 13 05:59:32 de01.openchami.cluster slurmd[30727]: slurmd: slurmd version 24.05.5 started -Feb 13 05:59:32 de01.openchami.cluster slurmd[30727]: slurmd: slurmd started on Fri, 13 Feb 2026 05:59:32 +0000 -Feb 13 05:59:32 de01.openchami.cluster systemd[1]: Started Slurm node daemon. -Feb 13 05:59:32 de01.openchami.cluster slurmd[30727]: slurmd: CPUs=1 Boards=1 Sockets=1 Cores=1 Threads=1 Memory=3892 TmpDisk=778 Uptime=796812 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null) -``` - -Disable the firewall and reset the nft ruleset in the compute node: - -```bash -systemctl stop firewalld -systemctl disable firewalld - -nft flush ruleset -nft list ruleset -``` - -Start Slurm service daemons in the **head node**: - -```bash -sudo systemctl start slurmdbd -sudo systemctl start slurmctld -``` - -Restart Slurm service daemons in the **compute node** to ensure changes are applied: - -```bash -systemctl restart slurmd -``` - -## 1.8 Test Munge and Slurm - -Test munge on the **head node**: - -```bash -# Try to munge and unmunge to access the compute node -munge -n | ssh root@172.16.0.1 unmunge -``` - -The output should be: - -``` -STATUS: Success (0) -ENCODE_HOST: ??? (192.168.200.2) -ENCODE_TIME: 2026-02-13 05:33:34 +0000 (1770960814) -DECODE_TIME: 2026-02-13 05:33:34 +0000 (1770960814) -TTL: 300 -CIPHER: aes128 (4) -MAC: sha256 (5) -ZIP: none (0) -UID: ??? (1000) -GID: ??? (1000) -LENGTH: 0 -``` - -{{< callout context="note" title="Note" icon="outline/info-circle" >}} -In the case of an error about "Offending ECDSA key in ~/.ssh/known_hosts:3", remove the compute node from the known hosts file and try the 'scp' command again: - -``` -ssh-keygen -R 172.16.0.1 -``` - -Alternatively, setup an `ignore.conf` file per [Section 2.8.3](https://openchami.org/docs/tutorial/#283-logging-into-the-compute-node) of the tutorial, to prevent this issue. -{{< /callout >}} - -Test that you can submit a job from the **head node**. +Test that you can submit a job from the **head node**. Check that the node is present and idle: @@ -2322,5 +2133,3 @@ If something goes wrong and your compute node goes down, restart it with this co ```bash sudo scontrol update NodeName=de01 State=RESUME ``` - -