This SLURM plugin simplifies the integration of jobs running in your SLURM cluster by matching them with Juice Agents running outside the cluster.
- A SLURM user submits a job to the SLURM controller with a Juice pool and GPU count.
- SLURM ctld uses the prolog script to request GPU resources from the Juice Controller.
- On success, it stores the encrypted session details on the shared filesystem.
- On failure, it requeues the job to retry later.
- SLURM ctld then schedules the job to a SLURM compute node, which loads the encrypted Juice session details from disk, connects to the agent, and runs the workload.
- When the job completes, SLURM ctld releases the session with the epilog script.
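The prolog's success/requeue branch above can be sketched as a small shell function. This is a hypothetical sketch, not the shipped script: `"$1"` stands in for the real Juice invocation (the actual CLI subcommands are not documented here), and `"$2"` is the per-job session file on the shared filesystem.

```shell
#!/bin/bash
# Hypothetical sketch of the PrologSlurmctld decision described above.
reserve_gpus() {
    local juice_cmd="$1" session_file="$2"
    if $juice_cmd > "$session_file" 2>/dev/null; then
        return 0    # success: slurmctld schedules the job to a compute node
    fi
    rm -f "$session_file"
    return 1        # failure: a nonzero prolog exit lets slurmctld requeue the job
}
```

The key design point is the exit code: slurmctld treats a nonzero PrologSlurmctld exit as a failure, which is what drives the "requeue and retry later" behavior in the workflow above.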
The plugin consists of two parts: a SPANK plugin and a set of slurmctld hook scripts (Prolog and Epilog).

The SPANK plugin exposes Juice options to the SLURM user and sets up the required settings that are passed to the Prolog script for scheduling.

The Prolog script is responsible for assigning GPUs to the SLURM task. It takes the GPU request, communicates with the Juice Controller, and reserves the GPUs for the task. It then exports a data file that the SLURM task uses to run against the remote GPUs. If the required GPU resources cannot be found, the job is requeued and retried later.

The Epilog script releases the GPUs when the job is complete.
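The epilog's job is the mirror image of the prolog: hand the session back and remove the shared-state file. Again a hypothetical sketch, with `"$1"` standing in for the real Juice release invocation.

```shell
#!/bin/bash
# Hypothetical sketch of the EpilogSlurmctld step described above.
release_gpus() {
    local juice_cmd="$1" session_file="$2"
    $juice_cmd < "$session_file"    # hand the session back to the Juice Controller
    rm -f "$session_file"           # clean up so the session cannot be reused
}
```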
- Download the Spank plugin and Slurm hooks from this repository.
- Use `make` to build the plugin with the version of Slurm you intend to use.

```
cd spank-plugin
make
```

- Add the `juice.so` file to your SLURM nodes and configure it in `plugstack.conf` with two arguments: the path to your juice binary and a directory for shared state. These paths need to be accessible on all nodes in the SLURM cluster.

plugstack.conf:

```
juice.so /opt/juice/juice /opt/juice/.data
```
The Prolog script requires a Juice M2M Token to communicate with the Juice Controller. For more details on how to create this token see M2M Tokens.
Copy the 4 files in the slurm-hooks directory to your slurmctld node. Add your M2M Token to the juice-config.sh file, then update your slurmctld configuration to invoke the Prolog and Epilog scripts.
slurm.conf:

```
PrologSlurmctld=/hooks/juice-prolog.sh
EpilogSlurmctld=/hooks/juice-epilog.sh
```
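The hooks read the token from juice-config.sh. As a minimal sketch of what that file might contain — the variable name `JUICE_M2M_TOKEN` is an assumption, not the shipped file's actual contents; use whatever name the file in this repository defines:

```shell
# Hypothetical juice-config.sh contents; the variable name is an
# assumption — check the shipped file for the real one.
export JUICE_M2M_TOKEN="<your-m2m-token>"
```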
To run a SLURM job using a remote GPU, the Juice pool must be specified on the SLURM command line when running sbatch or srun.
Example:

```
srun --juice-pool=test hostname -A
```

By default Juice will assign a single GPU. To request additional GPUs, add the `--juice-gpu-count` parameter:

```
srun --juice-pool=test --juice-gpu-count=2 hostname -A
```

NOTE: You cannot use GPUs from multiple agents; there must be a Juice Agent in the organization that has at least the required number of GPUs.

By default Juice will allocate the entire GPU. To allocate a portion of GPU resources, specify the `--juice-vram` parameter:

```
srun --juice-pool=test --juice-vram=1 hostname -A
```

This allocates 1 GiB of VRAM and allows others to take additional slices of VRAM.
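Because the options are registered through SPANK, they can typically also be given as `#SBATCH` directives in a batch script rather than on the command line. A sketch — the pool name and workload here are placeholders:

```shell
#!/bin/bash
# Sketch of a batch script using the Juice options as #SBATCH
# directives; pool name "test" and the workload are placeholders.
#SBATCH --juice-pool=test
#SBATCH --juice-gpu-count=2

nvidia-smi    # runs against the remote GPUs provided by the Juice Agent
```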
