# SLURM Multinode Docker and Singularity Container Orchestration

- Building Containers with ssh for multinode capability
- Running via Singularity Containers
- Tensorflow multinode using Containers
- PyTorch multinode using Containers
- Advanced usage of `srun`
This repo provides examples and scripts to orchestrate docker and singularity
containers that work across nodes. The docker options and session-launcher
boilerplate are handled by the orchestration helper script srun_docker.sh,
and singularity is handled by srun_singularity.sh.
A typical interactive Slurm run looks like this:

```bash
salloc -N 2 -p <some_partition>  # using two nodes

# docker: the --privileged option is for RDMA support
srun srun_docker.sh \
    --container=<your_container> \
    --privileged \
    --script=./<your_job_script>.sh

# singularity (does not have or need a --privileged option)
srun srun_singularity.sh \
    --container=<your_container> \
    --script=./<your_job_script>.sh
```

## Procedure
- Srun (sbatch also works) the `srun_docker.sh` or `srun_singularity.sh` script on the allocated nodes:
  - The first node acts as the "master"/coordinator node as well as a "worker" node.
  - The remaining nodes are worker nodes.
- Worker nodes:
  - Start the container, run sshd, and sleep/wait within the container.
  - The container keeps running so that processes can be launched within the context of this container service. When mpirun is launched on the master node, it starts processes on these worker nodes.
- Master node:
  - Start the container.
  - Run a loop trying to ssh to the worker nodes to verify that they are up.
  - Start sshd within the container and launch the job script within the container.
  - Once the job script finishes, stop sshd, ssh to the workers and kill their sessions, and finally stop/rm the launched containers.
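The following is a simplified, illustrative sketch of that flow. The variable and container names are placeholders; the real srun_docker.sh/srun_singularity.sh scripts handle many more details (option parsing, user mapping, mounts, and cleanup):

```bash
#!/bin/bash
# Illustrative sketch only. Assumes SLURM provides SLURM_JOB_NODELIST, and that
# container, jobscript, sshdport, and sshconfigdir are set by the launcher.
master=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
workers=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | tail -n +2)

if [ "$(hostname -s)" != "$master" ]; then
    # Worker node: start the container, run sshd inside it, and wait.
    docker run -d --name=mnode_worker "$container" \
        bash -c "/usr/sbin/sshd -p $sshdport -f $sshconfigdir/sshd_config && sleep infinity"
else
    # Master node: wait until each worker's sshd answers, then run the job script.
    for w in $workers; do
        until ssh -p "$sshdport" "$w" true 2>/dev/null; do sleep 2; done
    done
    docker run --rm --name=mnode_master "$container" \
        bash -c "/usr/sbin/sshd -p $sshdport -f $sshconfigdir/sshd_config && $jobscript"
    # Afterwards: ssh to the workers to kill their sessions and stop/rm the containers.
fi
```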
There are numerous published examples of this general approach, such as:
https://github.com/uber/horovod/blob/master/docs/docker.md#running-on-multiple-machines
Most of these published examples run the docker containers as root. This makes it
difficult to work with linux systems where user permissions are relied on
for access to data and code. The srun_docker.sh script with the configuration
described below orchestrates the docker containers to run with the user's id and
privileges.
With singularity, the typical multinode (and particularly MPI) usage model is to
call mpirun from outside the container:
https://www.sylabs.io/guides/2.6/user-guide/faq.html#why-do-we-call-mpirun-from-outside-the-container-rather-than-inside
But it is possible to set up mpirun from within the container as well. The
srun_singularity.sh script takes the "within" approach. The main benefit is that
one does not have to install/set up MPI outside of the container, which can be
burdensome and complicates interoperability with the MPI library inside the container.
Both the docker and singularity approaches used in this manner require setting up
user sshd and ssh configs as described below. The run_dock_asuser setup is only
required for docker, since singularity natively runs as the user, which simplifies
the setup compared to docker in this regard.
Set up an mpisshconfig directory to enable generic sshd user configurations
for containers. A convenience script create_mpisshconfig.sh
is provided to do this. Just run the script and it will generate an mpisshconfig
directory that can be copied somewhere within the user's home directory.
The script srun_docker.sh has an option --sshconfigdir that can be set
to match this location (or modify srun_docker.sh with a desired default
value).
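For example (assuming create_mpisshconfig.sh writes the mpisshconfig directory to the current working directory; adjust the paths to taste):

```bash
# Generate the sshd/ssh config files and copy them into the home directory.
./create_mpisshconfig.sh
cp -r mpisshconfig "$HOME/"

# Point the launcher at that directory (or edit the default inside srun_docker.sh).
srun srun_docker.sh --sshconfigdir="$HOME/mpisshconfig" ...
```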
It is important to have correct permissions on the config files:
```
$ tree -p ~/mpisshconfig/
$HOME/mpisshconfig/
├── [-rw-r--r--]  moduli
├── [-rw-r--r--]  ssh_config
├── [-rw-r--r--]  sshd_config
├── [-rw-------]  ssh_host_dsa_key
├── [-rw-r--r--]  ssh_host_dsa_key.pub
├── [-rw-------]  ssh_host_ecdsa_key
├── [-rw-r--r--]  ssh_host_ecdsa_key.pub
├── [-rw-------]  ssh_host_ed25519_key
├── [-rw-r--r--]  ssh_host_ed25519_key.pub
├── [-rw-------]  ssh_host_rsa_key
├── [-rw-r--r--]  ssh_host_rsa_key.pub
# a bunch of other files
```

These files were just copied from a container's /etc/ssh (with ssh installed).
What matters most are the settings in sshd_config. The key files can be
regenerated.
The HostKey paths need to correspond to your home directory. The port needs to
be set to match ~/.ssh/config (more on port settings below), PermitRootLogin
should be set to yes, StrictModes to no, and UsePAM to no.

```
# CHANGE THIS IN sshd_config from avolkov to your username
HostKey /home/avolkov/mpisshconfig/ssh_host_rsa_key
HostKey /home/avolkov/mpisshconfig/ssh_host_dsa_key
HostKey /home/avolkov/mpisshconfig/ssh_host_ecdsa_key
HostKey /home/avolkov/mpisshconfig/ssh_host_ed25519_key
```

Again, the create_mpisshconfig.sh script sets up the sshd_config file with the
above modifications. Refer to it for reference and customizations.
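For reference, the relevant sshd_config settings described above would look roughly as follows (the port value here is only an example and must match the one chosen in ~/.ssh/config):

```
# excerpt of ~/mpisshconfig/sshd_config
Port 12345
PermitRootLogin yes
StrictModes no
UsePAM no
```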
Under the hood, srun_docker.sh launches sshd within the containers as:

```bash
/usr/sbin/sshd -p $sshdport -f ${sshconfigdir}/sshd_config
```

The multinode container orchestration relies on ssh communication between the
containers. One way to set up generic, user-oriented ssh authentication is via
the ssh config. Suppose the compute nodes on partition dgx-1v are called
dgx[01-04] and another partition hsw_v100 has nodes hsw[01-20] (use sinfo on
SLURM to view the partition and node names). One can set default user ssh
configs in the ~/.ssh/config file:
```
# FILE: ~/.ssh/config
# Generate ~/.ssh/id_rsa_mpi via:
#   ssh-keygen -f ${HOME}/.ssh/id_rsa_mpi -t rsa -b 4096 -C "your_email@example.com"
Host dgx* hsw*
    Port 12345
    PubKeyAuthentication yes
    StrictHostKeyChecking no
    # UserKnownHostsFile /dev/null
    UserKnownHostsFile ~/.ssh/known_hosts
    IdentityFile ~/.ssh/id_rsa_mpi
```

The user can change the port to anything they desire instead of 12345.
Assuming that ssh only works for allocated nodes, the port value should just
be set to something that does not conflict with running applications. One
should not have to worry about conflicting ports with other users unless the
nodes are set up in non-exclusive mode on SLURM. In non-exclusive setups,
users would need to somehow coordinate or ensure that their ports do not conflict.
The script srun_docker.sh has an option --sshdport that can be set to match
this port (or modify srun_docker.sh with a desired default value). This
port value should also be set in the sshd_config file as described previously.
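A minimal sketch of the key setup, assuming public-key authentication against the user's own authorized_keys (a passphrase-less key is used here for unattended ssh; adjust to the cluster's policy):

```bash
# Generate a dedicated key pair for the container-to-container ssh.
ssh-keygen -f "${HOME}/.ssh/id_rsa_mpi" -t rsa -b 4096 -N "" -C "your_email@example.com"

# Authorize the key; since $HOME is mounted into the containers, this applies there too.
cat "${HOME}/.ssh/id_rsa_mpi.pub" >> "${HOME}/.ssh/authorized_keys"
chmod 600 "${HOME}/.ssh/authorized_keys" "${HOME}/.ssh/config"
```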
Running containers as user requires setting a few additional options when launching a docker container. For example:

```bash
USEROPTS="-u $(id -u):$(id -g) -e HOME=$HOME -e USER=$USER -v $HOME:$HOME"
getent group > group
getent passwd > passwd
USERGROUPOPTS="-v $PWD/passwd:/etc/passwd:ro -v $PWD/group:/etc/group:ro"
docker run --rm -it $USEROPTS $USERGROUPOPTS <otheropts> <somecontainer-and-cmds>
```

Once inside a container launched as above, the user appears with their own id
instead of the typical root user. These settings are somewhat verbose, therefore
a wrapper script is used to launch a container as the user. The
srun_docker.sh script uses run_dock_asuser.sh to launch the docker service
sessions. Please download the script run_dock_asuser.sh from this
location:
https://github.com/avolkov1/helper_scripts_for_containers/blob/master/run_dock_asuser.sh
Place the script somewhere on the PATH. The recommended location is ~/bin/run_dock_asuser.sh.
The $HOME/bin directory is typically added to the user's PATH in
~/.bash_profile. If it is not, modify either ~/.bash_profile or ~/.bashrc to
append $HOME/bin to PATH: `PATH=$PATH:$HOME/bin`
The srun_docker.sh script should also be installed somewhere on the PATH. The
$HOME/bin directory is a good location for the srun_docker.sh script as well.
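A possible installation sketch (the raw download URL is derived from the github link above; the path to this repo's checkout is a placeholder):

```bash
mkdir -p "$HOME/bin"

# run_dock_asuser.sh from the helper_scripts_for_containers repo
wget -O "$HOME/bin/run_dock_asuser.sh" \
    https://raw.githubusercontent.com/avolkov1/helper_scripts_for_containers/master/run_dock_asuser.sh

# srun_docker.sh and srun_singularity.sh from a local clone of this repo
cp /path/to/this/repo/srun_docker.sh /path/to/this/repo/srun_singularity.sh "$HOME/bin/"
chmod +x "$HOME/bin/"*.sh

# Make sure ~/bin is on the PATH (append to ~/.bashrc or ~/.bash_profile if not).
echo 'export PATH=$PATH:$HOME/bin' >> "$HOME/.bashrc"
```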
## Building Containers with ssh for multinode capability

The ssh approach is used in these demos to enable containers to communicate across node boundaries within a cluster. This requires that the containers have ssh installed. The typical Dockerfile commands to do this are:
```Dockerfile
# some Dockerfile
# FROM ...
# setup your application/framework/library
FROM ubuntu:16.04

# Install OpenSSH for MPI to communicate between containers
RUN apt-get update && apt-get install -y --no-install-recommends \
        openssh-client openssh-server && \
    mkdir -p /var/run/sshd && \
    rm -rf /var/lib/apt/lists/*

# Allow OpenSSH to talk to containers without asking for confirmation
RUN cat /etc/ssh/ssh_config | grep -v StrictHostKeyChecking > /etc/ssh/ssh_config.new && \
    echo " StrictHostKeyChecking no" >> /etc/ssh/ssh_config.new && \
    mv /etc/ssh/ssh_config.new /etc/ssh/ssh_config

# ref: https://docs.docker.com/engine/examples/running_ssh_service/#build-an-eg_sshd-image
RUN sed -i 's/PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config

# SSH login fix. Otherwise user is kicked off after login
RUN sed 's@session\s*required\s*pam_loginuid.so@session optional pam_loginuid.so@g' -i /etc/pam.d/sshd
```

Further references for multinode containers and setup:
- Dockerize SSH Service - https://docs.docker.com/engine/examples/running_ssh_service/
- Mellanox Docker Support - https://community.mellanox.com/docs/DOC-2971
  Refer to the dockerfile example with the install command:
  `${MOFED_DIR}/mlnxofedinstall --user-space-only --without-fw-update --all -q`
## Running via Singularity Containers

Documentation about singularity can be found here:
https://www.sylabs.io/docs/
Where the examples below use docker containers, those same containers can be
converted via the docker2singularity utility. Refer to the docker2singularity
instructions here:
https://hub.docker.com/r/singularityware/docker2singularity/
Example for conversion:
```bash
# Convert the Tensorflow container
docker pull nvcr.io/nvidian/sae/avolkov:tf1.8.0py3_cuda9.0_cudnn7_nccl2.2.13_hvd_ompi3_ibverbs
docker run -v /var/run/docker.sock:/var/run/docker.sock \
    -v /cm/shared/singularity:/output \
    --privileged -t --rm \
    singularityware/docker2singularity:v2.6 \
    nvcr.io/nvidian/sae/avolkov:tf1.8.0py3_cuda9.0_cudnn7_nccl2.2.13_hvd_ompi3_ibverbs

# Convert the PyTorch container
docker pull nvcr.io/nvidian/sae/avolkov:pytorch_hvd_apex
docker run -v /var/run/docker.sock:/var/run/docker.sock \
    -v /cm/shared/singularity:/output \
    --privileged -t --rm \
    singularityware/docker2singularity:v2.6 nvcr.io/nvidian/sae/avolkov:pytorch_hvd_apex
```

It is also possible to set up and use a singularity registry, or just place the
singularity images on some shared filesystem.
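A quick sanity check of a converted image (the path and image name below match the singularity examples later in this document; whether `python` or `python3` is available depends on the container):

```bash
# Run on a GPU node; --nv exposes the host GPU driver inside the container.
singularity exec --nv /cm/shared/singularity/tf1.8.0py3.simg \
    python -c "import tensorflow as tf; print(tf.__version__)"
```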
## Tensorflow multinode using Containers

There are a variety of examples on the internet for setting up multinode docker
containerized Tensorflow workloads. Uber posted instructions for Horovod:
https://github.com/uber/horovod/blob/master/docs/docker.md#running-on-multiple-machines
The examples below demonstrate how to do this on a SLURM cluster. A variety of
Dockerfiles with Tensorflow that also install ssh and can be used for these
demos are posted here:
https://github.com/avolkov1/shared_dockerfiles/tree/master/tensorflow
The dockerfile Dockerfile.tf1.8.0py3_cuda9.0_cudnn7_nccl2.2.13_hvd_ompi3_ibverbs is used for the example below. Typically, when working with docker containers in a cluster environment, one needs to build the container and push it to a docker registry that the compute nodes can access. The example below pushes the container to a private registry space on NGC (NVIDIA GPU Cloud registry):
```bash
export TAG=tf1.8.0py3_cuda9.0_cudnn7_nccl2.2.13_hvd_ompi3_ibverbs
docker build \
    -t nvcr.io/nvidian/sae/avolkov:$TAG \
    -f Dockerfile.${TAG} \
    $(pwd)

docker push nvcr.io/nvidian/sae/avolkov:${TAG}
```

A sample work script hvd_mnist_example.sh
has been provided. Here is an example for running on a SLURM cluster:
```bash
salloc -N 2 -p dgx-1v  # 2 nodes. Change -N to however many nodes you would like.

# use [--<remain_args>] for passing additional parameters to the script
srun srun_docker.sh \
    --container=nvcr.io/nvidian/sae/avolkov:tf1.8.0py3_cuda9.0_cudnn7_nccl2.2.13_hvd_ompi3_ibverbs \
    --privileged \
    --script=./tensorflow_mnode/hvd_mnist_example.sh

# singularity is very similar. The docker container was converted to singularity via
# docker2singularity and stored in a cluster shared directory /cm/shared/singularity/
srun srun_singularity.sh \
    --container=/cm/shared/singularity/tf1.8.0py3.simg \
    --script=./tensorflow_mnode/hvd_mnist_example.sh
```

The script uses the Horovod tensorflow_mnist.py example that can be found here:
https://github.com/uber/horovod/blob/master/examples/tensorflow_mnist.py
The repo example tensorflow_mnist.py
has been slightly modified with a barrier to avoid downloading the mnist dataset
redundantly.
Note the options --container and --script. These are required.
For additional instructions and help refer to:

```bash
srun_docker.sh --help
```

Some other options to note are:
- `--slots_per_node` - When formulating the hostlist array, specify slots per
  node. Typically with multigpu jobs there is 1 slot per GPU, so slots per node
  is the number of GPUs per node. This is the default that is set
  automatically. If undersubscribing or oversubscribing GPUs, doing model
  parallelism, or for any other reason, specify slots_per_node as needed.
- `--datamnts` - Data directory(s) to mount into the container. Comma separated.
  Ex: "--datamnts=/datasets,/scratch" would map "/datasets:/datasets"
  and "/scratch:/scratch" in the container.
- `--dockopts` - Additional docker options not covered above. These are passed
  to the docker service session. Use quotes to keep the additional
  options together. Example:
  `--dockopts="--ipc=host -e MYVAR=SOMEVALUE -v /datasets:/data"`
  The "--ipc=host" can be used for MPS with nvidia-docker2. Any
  additional docker option that is not exposed above can be set through
  this option. In the example, "/datasets" is mapped to "/data" in the
  container instead of using "--datamnts".
- `--privileged` - This option is typically necessary for RDMA. Refer to
  `run_dock_asuser --help` for more information about this option. With
  some containers it seems to cause network issues, so it is disabled by default.
  Nvidia docker ignores NV_GPU and NVIDIA_VISIBLE_DEVICES when run with the
  privileged option. Use CUDA_VISIBLE_DEVICES to mask CUDA processes.
- `--<remain_args>` - Additional args to pass through to the script. These must
  not conflict with args for this launcher script, i.e. don't use sshdport
  for the script.
- `--script_help` - Pass --help to the script.

One can change which code to run, which container to use, what
directories/volumes to mount (the home path is automatically mounted), how many
slots to use (slots are typically mapped to GPUs), etc. In the job script, when
orchestrating mpirun, use the injected environment variables hostlist and np
for convenience:
```bash
mpirun -H $hostlist -np $np \
    # etc.
```

## PyTorch multinode using Containers

Running Horovod code with PyTorch is very similar to running it with Tensorflow. The dockerfile Dockerfile.pytorch_hvd_apex is used for the example below. Again, as in the Tensorflow case, the container should be built and pushed to a registry accessible by the compute nodes.
```bash
export TAG=pytorch_hvd_apex
docker build \
    -t nvcr.io/nvidian/sae/avolkov:$TAG \
    -f Dockerfile.${TAG} \
    $(pwd)

docker push nvcr.io/nvidian/sae/avolkov:${TAG}
```

A sample work script pytorch_hvd_mnist_example.sh
has been provided. Example for running on a SLURM cluster:
```bash
salloc -N 2 -p dgx-1v  # 2 nodes. Change -N to however many nodes you would like.

srun srun_docker.sh \
    --container=nvcr.io/nvidian/sae/avolkov:pytorch_hvd_apex \
    --privileged \
    --script=./pytorch_mnode/pytorch_hvd_mnist_example.sh

# using singularity
srun srun_singularity.sh \
    --container=/cm/shared/singularity/pytorch_hvd_apex.simg \
    --script=./pytorch_mnode/pytorch_hvd_mnist_example.sh
```

The script uses the Horovod pytorch_hvd_mnist.py code, based on the
pytorch_mnist.py example that can be found here:
https://github.com/uber/horovod/blob/master/examples/pytorch_mnist.py
The repo example pytorch_hvd_mnist.py
has been slightly modified with a barrier so as to not redundantly download data.
Two additional examples are provided:

- `pytorch_apex_mnist_example.sh`, `pytorch_apex_mnist.py` - The reference code is
  `main.py` taken from the examples here: https://github.com/NVIDIA/apex
  The `pytorch_apex_mnist.py` has been modified with a barrier to avoid the race
  condition on downloading the mnist datasets.
- `pytorch_dist_mnist_example.sh`, `pytorch_dist_mnist.py` - Same as the apex
  example above, but modified to use the PyTorch distributed framework (nccl
  backend) without Apex, for comparison.
These examples demonstrate the usage of mpirun to run non-MPI code. One could
have used pdsh instead to the same effect.
The idea is to illustrate that srun_docker.sh and srun_singularity.sh are
dynamic and versatile wrappers that enable one to run multinode containers in
various scenarios on SLURM. Refer to the pdsh example for apex (which is not MPI based):
pytorch_apex_mnist_example_pdsh.sh.
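As an illustration of the two launch styles (a sketch, not the exact commands used by the demo scripts; train.py and the node names are placeholders):

```bash
# mpirun used purely as a multinode process launcher for non-MPI code
mpirun -H $hostlist -np $np python train.py

# roughly equivalent launch with pdsh (one process per listed node shown here)
pdsh -w dgx01,dgx02 "python train.py"
```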
## Advanced usage of srun

The above were examples of orchestrating the srun_docker.sh script via srun.
Here is a list of sometimes-useful srun launch commands.
```bash
salloc -N 2 -p dgx-1v  # 2 nodes. Change -N to however many nodes you would like.

# run on just one node even though 2 nodes are allocated
srun -N 1 srun_docker.sh <additional parameters and options>

# assume nodes dgx01 and dgx02 are allocated. Run on dgx02
srun -N 1 --exclude=dgx01 srun_docker.sh --nodelist=dgx02 <additional parameters and options>

# Using a subset of GPUs without the privileged option
NV_GPU=2,3 srun srun_docker.sh \
    --slots_per_node=2 \
    --container=nvcr.io/nvidian/sae/avolkov:pytorch_hvd_apex \
    --script=./pytorch_mnode/pytorch_apex_mnist_example_pdsh.sh --epochs=5

# Using a subset of GPUs with the privileged option
# Tensorflow MPI Horovod approach
CUDA_VISIBLE_DEVICES=2,3 srun srun_docker.sh \
    --slots_per_node=2 --envlist=CUDA_VISIBLE_DEVICES \
    --privileged \
    --container=nvcr.io/nvidian/sae/avolkov:tf1.8.0py3_cuda9.0_cudnn7_nccl2.2.13_hvd_ompi3_ibverbs \
    --script=./tensorflow_mnode/hvd_mnist_example.sh

# PyTorch non-MPI approach
CUDA_VISIBLE_DEVICES=2,3 srun srun_docker.sh \
    --slots_per_node=2 --envlist=CUDA_VISIBLE_DEVICES \
    --privileged \
    --container=nvcr.io/nvidian/sae/avolkov:pytorch_hvd_apex \
    --script=./pytorch_mnode/pytorch_apex_mnist_example_pdsh.sh --epochs=5
```

The above srun variations would work with singularity as well: just use the
srun_singularity.sh script, specify a singularity image/container, and omit
the --privileged option (it is not needed). The NV_GPU environment variable is
only for nvidia-docker. If using a resource manager for GPU reservations,
such as gres under SLURM, singularity will adhere to the resources reserved,
which is not guaranteed for docker, hence the NV_GPU usage in
srun_docker.sh as a workaround.
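For example, a gres-based reservation might look like the following sketch (assuming the cluster has gres/gpu configured; otherwise use the NV_GPU / CUDA_VISIBLE_DEVICES approach shown above):

```bash
# Reserve 2 GPUs per node through SLURM gres; singularity jobs will see only those GPUs.
salloc -N 2 -p dgx-1v --gres=gpu:2
srun srun_singularity.sh \
    --slots_per_node=2 \
    --container=/cm/shared/singularity/pytorch_hvd_apex.simg \
    --script=./pytorch_mnode/pytorch_hvd_mnist_example.sh
```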
These examples are straightforward to convert to sbatch scripts. Example:

```bash
sbatch -N 2 -p dgx-1v \
    --output="slurm-%j-pytorch_hvd_mnist.out" \
    --wrap=./pytorch_mnode/sbatch_pytorch.sh

sbatch -N 2 -p dgx-1v \
    --output="slurm-%j-pytorch_hvd_mnist.out" \
    --wrap=./pytorch_mnode/sbatch_pytorch_singularity.sh
```

Refer to the scripts sbatch_pytorch.sh and sbatch_pytorch_singularity.sh
for sbatch example details.
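The wrapped scripts are essentially the interactive srun invocations placed into a shell script. A minimal sketch of what such a wrapper might contain (the actual sbatch_pytorch.sh in the repo may set additional options):

```bash
#!/bin/bash
# Hypothetical sketch of an sbatch wrapper; refer to the real sbatch_pytorch.sh for details.
srun srun_docker.sh \
    --container=nvcr.io/nvidian/sae/avolkov:pytorch_hvd_apex \
    --privileged \
    --script=./pytorch_mnode/pytorch_hvd_mnist_example.sh
```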