---
title: "Session 5: Job Scheduling and Submission"
subtitle: "Running Jobs on the Slurm Scheduler"
format: html
---
# Session content
## Session aims
By the end of this session, you will be able to:
- Understand what a job scheduler is and why it's essential for HPC systems
- Write and submit batch job scripts using Slurm
- Monitor, manage, and cancel running jobs effectively
- Request different types of compute resources (CPU, memory, GPUs, time)
- Use job arrays to efficiently run multiple similar tasks
- Apply best practices for job submission and resource allocation
[**View Interactive Slides: Job Scheduling and Submission**](scheduling-submission-slides.qmd){.btn .btn-primary target="_blank"}
## Background: What is a Job Scheduler?
High Performance Computing (HPC) systems are shared by many users, each submitting their own jobs — code they want to run using the cluster's compute power.
A **job scheduler** is the system that:
- Organizes when and where jobs run
- Allocates the requested resources (CPU cores, memory, GPUs)
- Ensures fair access to shared resources for all users
### Why Do We Need a Scheduler?
::: {.columns}
::: {.column width="50%"}
**Without a scheduler:**
- Manual coordination between users
- Resource conflicts
- Inefficient resource usage
- Chaos with thousands of CPUs
:::
::: {.column width="50%"}
**With a scheduler:**
- Fair resource allocation
- Automatic job management
- Efficient resource utilization
- Priority-based execution
:::
:::
## Slurm: The Scheduler on Aire
At Leeds, the **Slurm** scheduler (Simple Linux Utility for Resource Management) manages all jobs on the Aire cluster.
### Job Submission Workflow
1. **Write** a job script describing your requirements
2. **Submit** the job to Slurm with `sbatch`
3. **Queue** - Slurm places your job in a queue
4. **Execute** - When resources are available, Slurm starts your job
5. **Complete** - Job finishes and resources are freed
```{mermaid}
flowchart LR
A[Write Job Script] --> B[Submit with sbatch]
B --> C[Job in Queue]
C --> D[Resources Available?]
D -->|Yes| E[Job Runs]
D -->|No| C
E --> F[Job Completes]
```
## Basic Slurm Commands
### Essential Commands
| Command | Purpose | Example |
|---------|---------|---------|
| `sbatch` | Submit a job | `sbatch myjob.sh` |
| `squeue` | View job queue | `squeue -u $USER` |
| `scancel` | Cancel a job | `scancel 12345` |
| `sinfo` | View node information | `sinfo` |
| `sacct` | View job accounting | `sacct -j 12345` |
### Checking the Queue
```bash
squeue # Show all jobs
squeue -u $USER # Show only your jobs
squeue -u $USER --long # Detailed view of your jobs
```
Example output:
```
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
12345 test myjob.sh user123 R 0:05:23 1 node001
12346 test analyze user123 PD 0:00 2 (Resources)
```
**Job States:**
- `R` = Running
- `PD` = Pending (waiting for resources)
- `CG` = Completing
- `CD` = Completed
## Writing Job Scripts
A job script is a shell script with special Slurm directives that tell the scheduler what resources you need.
### Basic Job Script Template
```bash
#!/bin/bash
#SBATCH --job-name=myjob # Job name
#SBATCH --partition=test # Partition to use
#SBATCH --time=01:00:00 # Time limit (1 hour)
#SBATCH --nodes=1 # Number of nodes
#SBATCH --ntasks=1 # Number of tasks
#SBATCH --cpus-per-task=4 # CPUs per task
#SBATCH --mem=8G # Memory per node
#SBATCH --output=myjob_%j.out # Output file (%j = job ID)
#SBATCH --error=myjob_%j.err # Error file
# Load required modules
module load python/3.13.0
# Change to working directory
cd $SLURM_SUBMIT_DIR
# Run your program
echo "Job started at $(date)"
echo "Running on node: $(hostname)"
echo "Job ID: $SLURM_JOB_ID"
python my_script.py
echo "Job finished at $(date)"
```
### Key SBATCH Directives
| Directive | Purpose | Example |
|-----------|---------|---------|
| `--job-name` | Name for your job | `--job-name=analysis` |
| `--partition` | Queue to use | `--partition=standard` |
| `--time` | Maximum runtime | `--time=02:30:00` |
| `--nodes` | Number of nodes | `--nodes=2` |
| `--ntasks` | Number of tasks | `--ntasks=8` |
| `--cpus-per-task` | CPUs per task | `--cpus-per-task=4` |
| `--mem` | Memory per node | `--mem=16G` |
| `--output` | Output file | `--output=job_%j.out` |
## Submitting and Managing Jobs
### Submit a Job
```bash
sbatch myjob.sh
```
Output: `Submitted batch job 12345`
### Monitor Your Jobs
```bash
# Check if your job is running
squeue -u $USER
# Get detailed job information
scontrol show job 12345
# Check job history
sacct -j 12345
```
### Cancel Jobs
```bash
# Cancel specific job
scancel 12345
# Cancel all your jobs
scancel -u $USER
# Cancel jobs by name
scancel --name=myjob
```
## Resource Requests
### CPU and Memory
```bash
# Single core job
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G
# Multi-core job (8 cores, shared memory)
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
# Multi-task job (MPI)
#SBATCH --ntasks=16
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=2G
```
### Time Limits
```bash
# Accepted formats include MM:SS, HH:MM:SS, and D-HH:MM:SS
#SBATCH --time=30:00 # 30 minutes
#SBATCH --time=2:00:00 # 2 hours
#SBATCH --time=1-12:00:00 # 1 day, 12 hours
```
### GPU Resources
```bash
# Request 1 GPU
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
# Request 2 GPUs of a specific type (here, V100)
#SBATCH --partition=gpu
#SBATCH --gres=gpu:v100:2
```
### High Memory Jobs
```bash
# Request high memory node
#SBATCH --partition=himem
#SBATCH --mem=500G
```
## Common Job Patterns
### Serial Job (Single Core)
```bash
#!/bin/bash
#SBATCH --job-name=serial_job
#SBATCH --time=01:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G
module load python/3.13.0
python serial_script.py
```
### Parallel Job (Shared Memory)
```bash
#!/bin/bash
#SBATCH --job-name=parallel_job
#SBATCH --time=02:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
module load gcc/14.2.0
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./my_openmp_program
```
### MPI Job (Distributed Memory)
```bash
#!/bin/bash
#SBATCH --job-name=mpi_job
#SBATCH --time=04:00:00
#SBATCH --ntasks=16
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=4G
module load openmpi/4.1.4
mpirun ./my_mpi_program
```
## Job Arrays
Use job arrays to submit many similar jobs efficiently:
```bash
#!/bin/bash
#SBATCH --job-name=array_job
#SBATCH --time=01:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G
#SBATCH --array=1-10 # Submit jobs 1 through 10
#SBATCH --output=job_%A_%a.out # %A = array job ID, %a = task ID
# Process different input files
INPUT_FILE="input_${SLURM_ARRAY_TASK_ID}.txt"
OUTPUT_FILE="output_${SLURM_ARRAY_TASK_ID}.txt"
module load python/3.13.0
python process_file.py $INPUT_FILE $OUTPUT_FILE
```
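Slurm exports `SLURM_ARRAY_TASK_ID` into each array task's environment. Before submitting, you can sanity-check the filename mapping locally by setting the variable yourself (a quick dry run, not a Slurm feature):

```bash
# Simulate the array index locally; inside a real array job,
# Slurm sets SLURM_ARRAY_TASK_ID for you
for SLURM_ARRAY_TASK_ID in 1 2 3; do
    INPUT_FILE="input_${SLURM_ARRAY_TASK_ID}.txt"
    OUTPUT_FILE="output_${SLURM_ARRAY_TASK_ID}.txt"
    echo "Task ${SLURM_ARRAY_TASK_ID}: ${INPUT_FILE} -> ${OUTPUT_FILE}"
done
```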
## Best Practices
### Resource Estimation
::: {.callout-tip}
## Right-Sizing Your Jobs
- **Start small**: Test with minimal resources first
- **Monitor usage**: Use `sacct` to check actual resource usage
- **Don't over-request**: Only ask for what you need
- **Time limits**: Be realistic but add some buffer time
:::
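For example, after a test job finishes you can compare what you requested with what it actually used (the job ID `12345` is a placeholder; the field names are from the `sacct` man page, with `MaxRSS` giving peak memory and `Elapsed` versus `Timelimit` showing how much time headroom you had):

```bash
# Compare requested vs. actual usage for a completed job
sacct -j 12345 --format=JobID,JobName,Elapsed,Timelimit,MaxRSS,ReqMem,State
```

If `MaxRSS` is far below `ReqMem`, reduce `--mem` on the next submission; over-requesting makes your jobs harder to schedule.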
### File Management
```bash
# Use scratch space for temporary files
# (note: #SBATCH lines are not expanded by the shell, so environment
# variables like $SCRATCH won't work in --chdir; change directory
# in the script body instead)
cd $SCRATCH/myproject

# ... run your program here ...

# Copy results back to the submission directory
cp results.txt $SLURM_SUBMIT_DIR/
```
### Error Handling
```bash
# Add error checking to your scripts
set -e # Exit on any error
# Check if files exist before processing
if [ ! -f "input.txt" ]; then
    echo "Error: input.txt not found"
    exit 1
fi
```
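A slightly stronger preamble (a common bash pattern, not Slurm-specific) also catches unset variables and failures inside pipelines, and reports the line a script died on, which lands in the job's `.err` file:

```bash
#!/bin/bash
# -e: exit on error, -u: error on unset variables,
# -o pipefail: a pipeline fails if any stage fails
set -euo pipefail

# On any error, report the failing line number to stderr
# (this ends up in the job's .err file)
trap 'echo "Error on line $LINENO" >&2' ERR
```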
## Troubleshooting Common Issues
### Job Won't Start
**Symptoms**: Job stays in PD (pending) state
**Common causes**:
- Requesting too many resources
- Wrong partition name
- Resource limits exceeded
- System maintenance
**Solutions**:
- Check `squeue -u $USER` for reason codes
- Use `sinfo` to see available resources
- Reduce resource requests if appropriate
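The reason code appears in the last column of `squeue`; a custom format makes it easy to read (format specifiers here are from the `squeue` man page: `%i` job ID, `%j` name, `%T` state, `%r` reason):

```bash
# Show only your pending jobs, with the reason each one is waiting
squeue -u $USER --states=PD -o "%i %j %T %r"
```

Common reasons include `Resources` (waiting for nodes to free up), `Priority` (other jobs are ahead of you), and `PartitionTimeLimit` (your `--time` request exceeds the partition's maximum).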
### Job Fails Immediately
**Symptoms**: Job completes quickly with non-zero exit code
**Common causes**:
- Module not loaded
- Input files missing
- Insufficient memory
- Wrong file paths
**Solutions**:
- Check error files (`*.err`)
- Test script interactively first
- Verify all paths and dependencies
### Out of Memory Errors
**Symptoms**: Job killed due to memory usage
**Solutions**:
- Increase `--mem` or `--mem-per-cpu`
- Use memory profiling tools
- Consider algorithm optimizations
# Exercise
## Submit Your First Job
Create and submit a simple job script:
1. **Create the job script** (`first_job.sh`):
```bash
#!/bin/bash
#SBATCH --job-name=first_job
#SBATCH --time=00:05:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=1G
#SBATCH --output=first_job_%j.out
echo "Hello from $(hostname)!"
echo "Job ID: $SLURM_JOB_ID"
echo "Current time: $(date)"
sleep 60
echo "Job completed!"
```
2. **Submit the job**:
```bash
sbatch first_job.sh
```
3. **Monitor the job**:
```bash
squeue -u $USER
```
4. **Check the output** when it completes:
```bash
cat first_job_*.out
```
## Python Hello World Job
Now let's create a more practical job that loads a module and runs Python code:
1. **Create a Python script** (`hello_world.py`):
```bash
cat > hello_world.py << 'EOF'
#!/usr/bin/env python3
import sys
import datetime
print("Hello from Python on HPC!")
print(f"Python version: {sys.version}")
print(f"Script running at: {datetime.datetime.now()}")
print(f"Python executable: {sys.executable}")
# Do some simple computation
numbers = [1, 2, 3, 4, 5]
squared = [x**2 for x in numbers]
print(f"Original numbers: {numbers}")
print(f"Squared numbers: {squared}")
print(f"Sum of squares: {sum(squared)}")
print("Python job completed successfully!")
EOF
```
2. **Create the job script** (`python_job.sh`):
You can check what versions of Python are available:
```bash
module avail python
```
:::{.callout-tip}
Usually, you will want to load the Miniforge module and activate a Conda environment instead of using the base system Python. See our documentation on [dependency management](https://arcdocs.leeds.ac.uk/aire/usage/dependency_management.html).
:::
```bash
cat > python_job.sh << 'EOF'
#!/bin/bash
#SBATCH --job-name=python_hello
#SBATCH --time=00:02:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=1G
#SBATCH --output=python_job_%j.out
#SBATCH --error=python_job_%j.err
# Load the Python module
echo "Loading Python module..."
module load python/3.13.0
# Show which Python we're using
echo "Using Python: $(which python)"
echo "Python version: $(python --version)"
# Run our Python script
echo "Running Python script..."
python hello_world.py
echo "Job completed at $(date)"
EOF
```
3. **Submit the job**:
```bash
sbatch python_job.sh
```
4. **Monitor the job**:
```bash
squeue -u $USER
```
5. **Check the output** when it completes:
```bash
cat python_job_*.out
```
You should see output showing:
- The Python module being loaded
- Python version information
- The hello world message and computation results
- Confirmation that the job completed successfully
::: {.callout-tip}
## What This Exercise Demonstrates
- **Module loading**: How to load software in job scripts
- **Python execution**: Running Python code on compute nodes
- **Job monitoring**: Using output files to verify successful execution
- **Resource specification**: Appropriate resource requests for simple scripts
:::
---
# Summary
::: {.callout-note}
## Key Takeaways
- **Slurm manages all jobs** on Aire through a fair scheduling system
- **Job scripts define requirements** using `#SBATCH` directives
- **Right-size your requests** - don't over-request resources
- **Monitor your jobs** with `squeue` and `sacct`
- **Use appropriate partitions** for different types of work
- **Test interactively first** before submitting large jobs
:::
---
## Next Steps
Now you can submit and manage jobs on Aire! Let's move on to [Session 6: Best Practices and Troubleshooting](best-practices-troubleshooting.qmd) to learn how to optimize your HPC workflows.
## Additional Resources
- [Aire Job Submission Guide](https://arcdocs.leeds.ac.uk/aire/usage/jobsubmission/start.html)
- [Aire job types documentation](https://arcdocs.leeds.ac.uk/aire/usage/job_type.html#job-types)
- [Slurm Documentation](https://slurm.schedmd.com/)
- [Resource Planning Guide](https://arcdocs.leeds.ac.uk/aire/usage/jobsubmission/resource_planning.html)