rc-slides/hpc1.qmd at main · ARCTraining/rc-slides · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
---
format:
  clean-revealjs:
    self-contained: true
    navigation-mode: linear
    controls-layout: bottom-right
    controls: false
    footer: "[Research IT Website]({{< var rc.website >}}) | [Research IT Query]({{< var rc.servicedesk >}}) | [Courses Material]({{< var rc.material >}})"
code: HPC 1
name: Introduction to High Performance Computing
---

{{< include _title.qmd >}}
{{< include _team.qmd >}}

## Purpose of HPC1

- Introducing Research Computing and the HPCs at Leeds
- Hands on with Linux and ARC4
- Running code
- Batch and Interactive jobs
- Data management
- Joys of parallel jobs
- Advanced job submissions

## Useful Links

- ARC Website: {{< var rc.website >}}
- ARC Documentation {{< var rc.documentation >}}
- General queries: {{< var rc.servicedesk >}}

## Training

![](assets/img/hpc0/training.png)

## ARC/HPC Training Portfolio

{{< var rc.training >}}

![](assets/img/hpc0/portfolio.png)

## Introductions and Motivations

- Who are you and why are you here?
- What problems are you encountering with your computational work now?
- Why / how do you think HPC will help?

## Key Concepts

- High Performance Computing (HPC)
- High Throughput Computing (HTC)
- "Supercomputing"

## Applications

![](assets/img/hpc1/spectrum.png)

## Terminology

- **Node**: the physical machine/server. In current systems, a node would typically include one or more processors, as well as memory and other hardware.

- **Processor**: the central processing unit (CPU) inside the node, which contains one or more  cores.

- **Core**: Refers to the basic computation unit of the CPU. This is the unit that carries out the actual computations.

## Leeds facilities

- ARC3 brought into service in 2017
- ARC4 brought into service in 2019

## A Supercomputer isn’t...

![](assets/img/hpc1/knob.png)

## Single computer vs grid of computers

![](assets/img/hpc1/hpc-diagram.png)

## Serial and parallel programs

- Serial programs run on a single CPU core, solving one problem at a time.
- Parallel programs run across multiple CPU cores, splitting the workload between them and solving the problem faster.

## Serial Program

![](assets/img/hpc1/serial.png)

## Parallel Program

![](assets/img/hpc1/parallel.png)

## Amdahl's Law

![](assets/img/hpc1/AmdahlsLaw.svg)

::: aside
Daniels220 at English Wikipedia, CC BY-SA 3.0 <https://creativecommons.org/licenses/by-sa/3.0>, via Wikimedia Commons
:::

## Basic parallel machine

![](assets/img/hpc1/basic-parallel.png)

## Differences from Desktop computing?

- We don’t log on to compute nodes directly
  - submit jobs via a batch scheduling system
- Not a GUI-based environment
- System is shared with many other users
- Resources more tightly monitored and controlled
  - Memory
  - CPU usage (‘cores’)
  - Time

## Benefits of using HPC

- Speed
- Volume
- Cost
- Efficiency
- Convenience

## Parallel Paradigms

From a systems perspective:

- Shared memory parallelism
- Distributed memory parallelism

Unless you are writing your own codes, the software developer takes care of this.

## Basic HPC system layout

![](assets/img/hpc1/hpc-layout.png)

## Exercise 1.1

What do the following Linux commands do? How might they be used on the HPC service?

- `ls`
- `pwd`
- `mkdir`
- `cp`
- `wget`
- `rm`

## Exercise 1.2
On the HPC service, you have a ‘HOME’ directory of 10GB and can create a
directory on the /nobackup drive.

Using the man pages (or Google…) investigate how you could use the following
commands to manage your storage:

- `quota`
- `df`
- `du`

## Exercise 1.3
Linux systems include a number of file compression routines.

Find out which ones are available on the cluster and use them to create a
compressed archive of a directory and its contents.

## Exercise 1.4
How could you read a PDF file or an HTML document on the cluster?

## HPC at Leeds

ARC4 is the latest incarnation of central HPC at Leeds.

HPC currently comprised of two facilities called ARC3 and ARC4

All Faculties are shareholders and so it is important that all who can benefit
from the use of this facility do so.

## ARC3 {.smaller}

2 x login nodes: 24 cores and 128GB RAM

252 x standard compute nodes: 24 cores and 128GB RAM (=6048 cores); 100GB SSD

4 x High Memory (24 cores and 768GB nodes)

6 x P100 GPU nodes (24 core, 128GB and 4 x NVidia P100)

2 x K80 GPU nodes (24 core, 128GB and 2 x NVidia K80)

2 x Intel Xeon Phi (Knights Landing) vector processor nodes

836TB of high speed storage: /nobackup

## ARC4 {.smaller}

2 x login nodes: 40 cores and 192GB RAM

149 x standard compute nodes: 40 cores and 192GB RAM (=5960 cores); 100GB SSD

2 x High Memory (40 cores and 768GB nodes)

3 x V100 GPU nodes (40 core, 192GB and 4 x NVidia V100)

1.2PB of high speed storage: /nobackup

## Exercise 2

We're going to download some files to have a play with:

```{.bash code-line-numbers=false}
git clone https://github.com/ARCTraining/hpc1-files.git
```

## SGE

SGE is a sophisticated scheduler:

- Can define usage policies.
- Control maximum limits.
- Fair distribution of resources.
- Produces detailed usage accounting information.

## Queue tools

- `qsub`: submit a batch job
- `qrsh`: run an interactive job
- `qdel`: delete a job
- `qstat`: Details on queued or running jobs
- `qacct`: Details on previous completed jobs
- `qsched`: A hint as to when your job might run

## Submit some serial R jobs

We're going to submit a job from the `1_R` directory

## Scheduling notes{.smaller}

Our clusters adopt a “fair share” policy

- Jobs preferentially run based on current and previous usage from Faculty. Same applies when comparing users in same Faculty.
- The lower the usage, the higher the priority (and vice versa).
- “Backfilling” is used to fit smaller jobs in between the top priority jobs. All jobs have specified run time, and so the scheduler will run lower priority jobs if they will start and finish before the highest priority jobs are scheduled to start. Thus indicating a realistic runtime for a job, will make short jobs eligible to be backfilled, potentially shortening their wait-time.

## Submit a serial Python job

Can you now do the same for a Python job in the 2_Python directory?

To run this Python code you do not need any modules loaded and can run it with:

`python example1.py`

## Normal end of Part 1

Questions, recap

## Drives and Directories

HPC users have access to two storage areas:

- A HOME directory
- Space on /nobackup

## Home Directory

This is:

- Private to you
- Backed up
- Limited to 10GB storage (ARC)
- Shared between machines

## /nobackup

- Each HPC cluster has its own high speed storage service called /nobackup
- You need to make your own directory (using `mkdir`)
- Nothing is backed up
- Files will expire after 90 days not being used
- Need to set permissions to make files private on ARC3 but not on ARC4.

## Local storage

Each compute node has a small SSD.

- 1Gbyte allocated per job by default via `$TMPDIR`
- Typically much faster than other storage available
- Can be increased if required:

   ```{.bash code-line-numbers=false}
   #$ -l disk=10G
   ```
- Limits vary depending on node type, but at least 100G

[More on local storage](https://arcdocs.leeds.ac.uk/usage/scratch.html)

## Transferring files and data

- `scp` or `rsync` command line utility
- `wget` (to download from a remote server)
- `git` (version control)
- `smbclient` to copy from local M:/ and N:/ drives on campus
- Google Drive and OneDrive (via `rclone`)
- graphical programs like Cyberduck or Filezilla (or indeed MobaXterm)

## Module system

`module`

- `avail` - what software could I add
- `list` - show what is active
- `add|load` - enable software
- `rm|unload` - disable software
- `help` - show details of software
- `swap|switch` - swap modules

## Shared vs Distributed memory jobs

![](assets/img/hpc1/shared-vs-distributed.png)

## Submit some parallel jobs

Let's look at and compare a few submissions:

- serial
- threaded
- distributed

## GPUs

Three different types of GPU available

ARC3:

- 2xK80 (1 node)
- 4xP100 (6 nodes)

ARC4:

- 4xV100 (3 nodes)

Some extras in private queues

## Submitting a GPU job

ARC4:
```{.bash code-line-numbers=false}
#$ -l coproc_v100=1
```
ARC3:
```{.bash code-line-numbers=false}
#$ -l coproc_p100=1
```
or
```{.bash code-line-numbers=false}
#$ -l coproc_k80=1
```

Should not ask for memory or CPU cores

[More on GPGPU](https://arcdocs.leeds.ac.uk/usage/gpgpu.html)

## Large memory nodes

ARC4:
```{.bash code-line-numbers=false}
#$ -l node_type=40core-768G
```
ARC3:
```{.bash code-line-numbers=false}
#$ -l node_type=24core-768G
```

Also allows jobs to run for up to 96hrs

## Interactive jobs

General advice, don't use them unless you have to.

```{.bash code-line-numbers=false}
qrsh -l h_rt=0:15:0 -pty y bash -i
```

[More in interactive jobs](https://arcdocs.leeds.ac.uk/usage/interactive.html)

## Task arrays

When you want to run lots of similar jobs

```{.bash code-line-numbers=false}
# Run 100 jobs from 1-100
#$ -t 1-100
# Don't run more than two at a time
#$ -tc 2

if[ $SGE_TASK_ID == $SGE_TASK_FIRST ]; then
  echo I am the first job
fi

echo I am job $SGE_TASK_ID

if [ $SGE_TASK_ID == $SGE_TASK_LAST ]; then
  echo I am the last job
fi
```


[More on task arrays](https://arcdocs.leeds.ac.uk/usage/taskarrays.html)

## Restartable jobs

For when 48hrs isn't enough

At its simplest, just finish with a return code of 99 from the last line of your code, and the job will be rescheduled:

```{.bash code-line-numbers=false}
exit 99
```

[More on restartable jobs](https://arcdocs.leeds.ac.uk/usage/advanced.html#restartable-jobs)

## Questions and time for recap

- Something we've not covered that you'd like a look at
- Anything we have covered but you'd like to go over more
- A further look at [arcdocs](https://arcdocs.leeds.ac.uk)

{{< include _end.qmd >}}