This SLURM plugin simplifies the integration of jobs running in your SLURM cluster by matching them with Juice Agents running outside the cluster.
- A SLURM user submits a job to the SLURM controller with a Juice pool and GPU count.
- SLURM ctld uses the prolog script to request GPU resources from the Juice Controller.
- On success, it stores the encrypted session details on the shared filesystem.
- On failure, it requeues the job to retry later.
- SLURM ctld then schedules the job to a SLURM compute node, which loads the encrypted Juice session details from disk, connects to the agent, and runs the workload.
- When the job completes, SLURM ctld releases the session with the epilog script.
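The prolog's success/requeue branch above can be sketched as a small shell function. This is a hypothetical sketch, not the shipped script: `"$1"` stands in for the real Juice invocation (the actual CLI subcommands are not documented here), and `"$2"` is the per-job session file on the shared filesystem.

```shell
#!/bin/bash
# Hypothetical sketch of the PrologSlurmctld decision described above.
reserve_gpus() {
    local juice_cmd="$1" session_file="$2"
    if $juice_cmd > "$session_file" 2>/dev/null; then
        return 0    # success: slurmctld schedules the job to a compute node
    fi
    rm -f "$session_file"
    return 1        # failure: a nonzero prolog exit lets slurmctld requeue the job
}
```

The key design point is the exit code: slurmctld treats a nonzero PrologSlurmctld exit as a failure, which is what drives the "requeue and retry later" behavior in the workflow above.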
The plugin consists of two parts: a SPANK plugin and a set of slurmctld hook scripts (Prolog and Epilog).

The SPANK plugin exposes Juice options to the SLURM user and sets up the required settings that are passed to the Prolog script for scheduling.

The Prolog script is responsible for assigning GPUs to the SLURM task. It takes the GPU request, communicates with the Juice Controller, and reserves the GPUs for the task. It then exports a data file that the SLURM task uses to run against the remote GPUs. If the required GPU resources cannot be found, the job is requeued and retried later.

The Epilog script releases the GPUs when the job is complete.
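The epilog's job is the mirror image of the prolog: hand the session back and remove the shared-state file. Again a hypothetical sketch, with `"$1"` standing in for the real Juice release invocation.

```shell
#!/bin/bash
# Hypothetical sketch of the EpilogSlurmctld step described above.
release_gpus() {
    local juice_cmd="$1" session_file="$2"
    $juice_cmd < "$session_file"    # hand the session back to the Juice Controller
    rm -f "$session_file"           # clean up so the session cannot be reused
}
```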
- Download the Spank plugin and Slurm hooks from this repository.
- Use `make` to build the plugin with the version of Slurm you intend to use.

```
cd spank-plugin
make
```

- Add the `juice.so` file to your SLURM nodes and configure it in `plugstack.conf` with two arguments: the path to your juice binary and a directory for shared state. These paths need to be accessible on all nodes in the SLURM cluster.

plugstack.conf:

```
juice.so /opt/juice/juice /opt/juice/.data
```
The Prolog script requires a Juice M2M Token to communicate with the Juice Controller. For more details on how to create this token see M2M Tokens.
Copy the 4 files in the slurm-hooks directory to your slurmctld node. Add your M2M Token to the juice-config.sh file, then update your slurmctld configuration to invoke the Prolog and Epilog scripts.
slurm.conf:

```
PrologSlurmctld=/hooks/juice-prolog.sh
EpilogSlurmctld=/hooks/juice-epilog.sh
```
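The hooks read the token from juice-config.sh. As a minimal sketch of what that file might contain — the variable name `JUICE_M2M_TOKEN` is an assumption, not the shipped file's actual contents; use whatever name the file in this repository defines:

```shell
# Hypothetical juice-config.sh contents; the variable name is an
# assumption — check the shipped file for the real one.
export JUICE_M2M_TOKEN="<your-m2m-token>"
```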
To run a SLURM job using a remote GPU, the Juice pool must be specified on the SLURM command line when running sbatch or srun.
Example:

```
srun --juice-pool=test hostname -A
```

By default Juice will assign a single GPU. To request additional GPUs, add the `--juice-gpu-count` parameter:

```
srun --juice-pool=test --juice-gpu-count=2 hostname -A
```

NOTE: You cannot use GPUs from multiple agents; there must be a Juice Agent in the organization that has at least the required number of GPUs.

By default Juice will allocate the entire GPU. To allocate a portion of GPU resources, specify the `--juice-vram` parameter:

```
srun --juice-pool=test --juice-vram=1 hostname -A
```

This allocates 1 GiB of VRAM and allows others to take additional slices of VRAM.
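Because the options are registered through SPANK, they can typically also be given as `#SBATCH` directives in a batch script rather than on the command line. A sketch — the pool name and workload here are placeholders:

```shell
#!/bin/bash
# Sketch of a batch script using the Juice options as #SBATCH
# directives; pool name "test" and the workload are placeholders.
#SBATCH --juice-pool=test
#SBATCH --juice-gpu-count=2

nvidia-smi    # runs against the remote GPUs provided by the Juice Agent
```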
