19 changes: 11 additions & 8 deletions README.md
Original file line number Diff line number Diff line change
@@ -2,7 +2,8 @@
We supply a small microbenchmarking script for PyTorch training on ROCm.

To execute:
`python micro_benchmarking_pytorch.py --network <network name> [--batch-size <batch size> ] [--iterations <number of iterations>] [--fp16 <0 or 1> ] [--dataparallel|--distributed_dataparallel] [--device_ids <comma separated list (no spaces) of GPU indices (0-indexed) to run dataparallel/distributed_dataparallel api on>]`

`./micro_benchmarking_pytorch.py --network <network name> [--batch-size <batch size> ] [--iterations <number of iterations>] [--fp16 <0 or 1> ] [--dataparallel|--distributed_dataparallel] [--device_ids <comma separated list (no spaces) of GPU indices (0-indexed) to run dataparallel/distributed_dataparallel api on>]`

Possible network names include `alexnet`, `densenet121`, `inception_v3`, `resnet50`, `resnet101`, `SqueezeNet`, `vgg16`, etc.
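The number the benchmark reports is essentially throughput: iterations × batch size ÷ elapsed time. A minimal stdlib sketch of that calculation, with a toy step function standing in for one training iteration (the helper name and numbers are illustrative, not the script's actual code):

```python
import time

def measure_throughput(step, batch_size, iterations=20, warmup=5):
    # Warm up first so one-time setup costs don't skew the timing.
    for _ in range(warmup):
        step()
    start = time.perf_counter()
    for _ in range(iterations):
        step()
    elapsed = time.perf_counter() - start
    # Items processed per second across the timed iterations.
    return iterations * batch_size / elapsed

# Toy "step" standing in for one forward/backward pass.
ips = measure_throughput(lambda: sum(range(10000)), batch_size=64)
print(f"{ips:.1f} items/sec")
```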

@@ -14,17 +15,18 @@ For mGPU runs, `--distributed_dataparallel` with 1 GPU per process is recommende
E.g., for a 1-GPU resnet50 run:
```
python3 micro_benchmarking_pytorch.py --network resnet50
./micro_benchmarking_pytorch.py --network resnet50
```
for a 2-GPU run on a single node:
```
python3 micro_benchmarking_pytorch.py --device_ids=0 --network resnet50 --distributed_dataparallel --rank 0 --world-size 2 --dist-backend nccl --dist-url tcp://127.0.0.1:4332 &
python3 micro_benchmarking_pytorch.py --device_ids=1 --network resnet50 --distributed_dataparallel --rank 1 --world-size 2 --dist-backend nccl --dist-url tcp://127.0.0.1:4332 &
./micro_benchmarking_pytorch.py --device_ids=0 --network resnet50 --distributed_dataparallel --rank 0 --world-size 2 --dist-backend nccl --dist-url tcp://127.0.0.1:4332 &
./micro_benchmarking_pytorch.py --device_ids=1 --network resnet50 --distributed_dataparallel --rank 1 --world-size 2 --dist-backend nccl --dist-url tcp://127.0.0.1:4332 &
```
Specify any available port in the `dist-url`.
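The ranks rendezvous at the address given in `--dist-url`, so the port only has to be one that nothing else is using. A quick stdlib way to find a free one (an illustrative helper, not part of the script):

```python
import socket

def find_free_port():
    # Binding to port 0 lets the OS assign an unused port.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]

dist_url = f"tcp://127.0.0.1:{find_free_port()}"
print(dist_url)
```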

To run the DeepSpeed FlopsProfiler (requires `deepspeed.profiling.flops_profiler` to be importable):
`python micro_benchmarking_pytorch.py --network resnet50 --amp-opt-level=2 --batch-size=256 --iterations=20 --flops-prof-step 10`

`./micro_benchmarking_pytorch.py --network resnet50 --amp-opt-level=2 --batch-size=256 --iterations=20 --flops-prof-step 10`

## Performance tuning
If performance on a specific card and/or model is found to be lacking, typically some gains can be made by tuning MIOpen. For this, `export MIOPEN_FIND_ENFORCE=3` prior to running the model. This will take some time if untuned configurations are encountered and write to a local performance database. More information on this can be found in the [MIOpen documentation](https://rocmsoftwareplatform.github.io/MIOpen/doc/html/perfdatabase.html).
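The variable can also be set from Python before the first convolution runs; a minimal sketch (the variable name and value come from the MIOpen docs above, the rest is illustrative):

```python
import os

# Must be set before MIOpen executes its first convolution,
# otherwise the setting has no effect on this process.
os.environ["MIOPEN_FIND_ENFORCE"] = "3"  # search and update the local perf DB
print(os.environ["MIOPEN_FIND_ENFORCE"])
```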
@@ -53,14 +55,15 @@ Adding the `--compile` option opens up PyTorch 2.0 capabilities, which come with
With the required `--compile` option, these additional options are now available from the command line with the `--compileContext` flag. Here are a few examples:

```bash
python micro_benchmarking_pytorch.py --network resnet50 --compile # default run
./micro_benchmarking_pytorch.py --network resnet50 --compile # default run
```

```bash
python micro_benchmarking_pytorch.py --network resnet50 --compile --compileContext "{'mode': 'max-autotune', 'fullgraph': 'True'}"
./micro_benchmarking_pytorch.py --network resnet50 --compile --compileContext "{'mode': 'max-autotune', 'fullgraph': 'True'}"
```

```bash
python micro_benchmarking_pytorch.py --network resnet50 --compile --compileContext "{'options': {'static-memory': 'True', 'matmul-padding': 'True'}}"
./micro_benchmarking_pytorch.py --network resnet50 --compile --compileContext "{'options': {'static-memory': 'True', 'matmul-padding': 'True'}}"
```

Note: `mode` and `options` cannot be passed together.
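`--compileContext` takes a Python dict literal as a string; a hedged sketch of how such a string can be parsed and the mutual exclusion enforced (`ast.literal_eval` and the check below are illustrative, not necessarily what the script does internally):

```python
import ast

def parse_compile_context(raw):
    # Safely evaluate the dict literal without running arbitrary code.
    ctx = ast.literal_eval(raw)
    if "mode" in ctx and "options" in ctx:
        raise ValueError("pass either 'mode' or 'options', not both")
    return ctx

ctx = parse_compile_context("{'mode': 'max-autotune', 'fullgraph': 'True'}")
print(ctx["mode"])
```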
5 changes: 4 additions & 1 deletion TorchTensorOpsBench/README.md
@@ -1,11 +1,14 @@
To run the microbenchmark for an op:

```
python torch_tensor_ops_bench.py --op <op_name>
./torch_tensor_ops_bench.py --op <op_name>
```

The script also takes optional arguments:

```
--dtype [=fp32 | fp16 | bf16]
--device [=cuda | cpu]
--input-dim dims separated by '-', default "64-1024-1024"
--op-type [=None | binary(for a binary op)]
```
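The `--input-dim` string maps directly onto a tensor shape; a sketch of the parsing (the helper name is made up for illustration):

```python
def parse_input_dim(spec="64-1024-1024"):
    # "64-1024-1024" -> (64, 1024, 1024), usable as a tensor shape.
    return tuple(int(d) for d in spec.split("-"))

shape = parse_input_dim("64-1024-1024")
print(shape)
```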
25 changes: 11 additions & 14 deletions TorchTensorOpsBench/run.sh
@@ -1,18 +1,15 @@
#!/usr/bin/env bash
#!/bin/bash

# run model ops in fp32, fp16 and bf16
printf "########## Running model ops with fp32 type ###########\n"
python3 torch_tensor_ops_bench.py --run-model-ops --dtype fp32 |& tee model_ops_fp32.log
printf "\n########## Running model ops with fp16 type ###########\n"
python3 torch_tensor_ops_bench.py --run-model-ops --dtype fp16 |& tee model_ops_fp16.log
printf "\n########## Running model ops with bf16 type ###########\n"
python3 torch_tensor_ops_bench.py --run-model-ops --dtype bf16 |& tee model_ops_bf16.log
echo -e "########## Running model ops with fp32 type ###########"
./torch_tensor_ops_bench.py --run-model-ops --dtype fp32 |& tee model_ops_fp32.log
echo -e "\n########## Running model ops with fp16 type ###########"
./torch_tensor_ops_bench.py --run-model-ops --dtype fp16 |& tee model_ops_fp16.log
echo -e "\n########## Running model ops with bf16 type ###########"
./torch_tensor_ops_bench.py --run-model-ops --dtype bf16 |& tee model_ops_bf16.log

# run predefined ops with generic tensor size of 64-1024-1024
printf "\n########## Running pre-defined ops with fp32 type ###########\n"
python3 torch_tensor_ops_bench.py --run-predefined --dtype fp32 |& tee predefined_ops_fp32.log
printf "\n########## Running pre-defined ops with fp16 type ###########\n"
python3 torch_tensor_ops_bench.py --run-predefined --dtype fp16 |& tee predefined_ops_fp16.log

printf "Done\n"

echo -e "\n########## Running pre-defined ops with fp32 type ###########"
./torch_tensor_ops_bench.py --run-predefined --dtype fp32 |& tee predefined_ops_fp32.log
echo -e "\n########## Running pre-defined ops with fp16 type ###########"
./torch_tensor_ops_bench.py --run-predefined --dtype fp16 |& tee predefined_ops_fp16.log
2 changes: 2 additions & 0 deletions TorchTensorOpsBench/torch_tensor_ops_bench.py
100644 → 100755
@@ -1,3 +1,5 @@
#!/usr/bin/env python3

import torch
import torch.nn as nn

2 changes: 2 additions & 0 deletions micro_benchmarking_pytorch.py
100644 → 100755
@@ -1,3 +1,5 @@
#!/usr/bin/env python3

import torch
import torchvision
import random
2 changes: 2 additions & 0 deletions shufflenet.py
100644 → 100755
@@ -1,3 +1,5 @@
#!/usr/bin/env python3

import os
import sys
import torch
2 changes: 2 additions & 0 deletions shufflenet_v2.py
100644 → 100755
@@ -1,3 +1,5 @@
#!/usr/bin/env python3

import os
import sys
import torch
2 changes: 2 additions & 0 deletions xception.py
100644 → 100755
@@ -1,3 +1,5 @@
#!/usr/bin/env python3

import math
import torch.nn as nn
import torch.nn.functional as F