Apex and torchaudio micro benchmarking#21

Open
amd-sriram wants to merge 22 commits into master from apex_micro_benchmarking

Conversation


@amd-sriram amd-sriram commented Oct 13, 2025

Motivation

Create microbenchmarking code for apex and torchaudio.
It should load example models and calculate time per batch and throughput for these models.

Technical Details

The diagram below explains the main steps in the microbenchmark code:

[Figure: microbenchmark flow diagram]

The microbenchmark

  1. loads different models with no pretrained weights
  2. calculates time per batch (the average time for forward and/or backward passes over a batch of size batchsize, repeated iterations times) and throughput (images or samples processed per second):
    # warm up: run forward and backward passes first
    start_time = time.perf_counter()
    for _ in range(iterations):
        ...  # perform forward and/or backward pass
    end_time = time.perf_counter()
    time_per_batch = (end_time - start_time) / iterations
    throughput = batchsize / time_per_batch
  3. warms up (two forward and/or backward calls are made) before profiling using deepspeed.profiling.flops_profiler.FlopsProfiler
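The timing loop described above can be sketched as a self-contained harness. This is an illustrative version runnable without torch; `benchmark` and the dummy workload are stand-ins for the PR's actual code, not names taken from it:

```python
import time

def benchmark(run_batch, batch_size, iterations=20, warmup=2):
    """Time a per-batch workload; return (time per batch, throughput)."""
    # Warm up (e.g. two forward/backward calls) so one-time setup
    # costs do not pollute the measurement.
    for _ in range(warmup):
        run_batch()
    start = time.perf_counter()
    for _ in range(iterations):
        run_batch()
    end = time.perf_counter()
    time_per_batch = (end - start) / iterations
    throughput = batch_size / time_per_batch  # items processed per second
    return time_per_batch, throughput

# Dummy workload standing in for a model's forward/backward pass.
tpb, tput = benchmark(lambda: sum(i * i for i in range(10_000)), batch_size=64)
print(f"Time per mini-batch : {tpb:.6f}")
print(f"Throughput [img/sec] : {tput:.2f}")
```

In the real benchmark the workload would be the model's forward and/or backward pass on a fixed-size input batch.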

Apex microbenchmarking

Based on this example https://github.com/ROCm/apex/tree/master/examples/imagenet, micro_benchmarking_pytorch was adapted to apex:
https://github.com/ROCm/pytorch-micro-benchmarking/blob/apex_micro_benchmarking/micro_benchmarking_apex.py

| PyTorch code | Apex code |
| --- | --- |
| network = fp16util.network_to_half(network) | network = apex.fp16_utils.FP16Model(network) |
| torch.nn.parallel.DistributedDataParallel | apex.parallel.DistributedDataParallel |
| PyTorch group norm | model = apex.parallel.convert_syncbn_model(model) |
| amp.initialize() | apex.amp.initialize(..., keep_batchnorm_fp32, loss_scale) |
| torch.optim.SGD(...) | apex.optimizers.FusedSGD(...) |

Add the following arguments:

    parser.add_argument('--sync_bn', action='store_true', help='enabling apex sync BN.')
    parser.add_argument('--keep-batchnorm-fp32', type=str, default=None)
    parser.add_argument('--loss-scale', type=str, default=None)
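These flags can be exercised with a minimal stand-alone parser (the rest of the benchmark's arguments are omitted here). Note that argparse turns dashes in flag names into underscores on the namespace:

```python
import argparse

parser = argparse.ArgumentParser(description="Apex micro-benchmark flags")
parser.add_argument('--sync_bn', action='store_true', help='enabling apex sync BN.')
parser.add_argument('--keep-batchnorm-fp32', type=str, default=None)
parser.add_argument('--loss-scale', type=str, default=None)

# Simulate: python3 micro_benchmarking_apex.py --sync_bn --loss-scale 128.0
args = parser.parse_args(['--sync_bn', '--loss-scale', '128.0'])
print(args.sync_bn, args.keep_batchnorm_fp32, args.loss_scale)
```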

Use the same torchvision models used in the pytorch microbenchmarking:

  • classification (vgg, densenet, alexnet, resnet, etc.),
  • detection (mobilenet),
  • segmentation (mobilenet, fcn resnet, deeplab resnet),
  • vision transformers (vit, swin, newer efficientnet models).

Torchaudio

For torchaudio, we first survey the available models and their inputs, outputs, and losses; these need to be defined for each model type.

Taking the models from https://docs.pytorch.org/audio/stable/models.html and classifying them into audio tasks that use different inputs and outputs:

| Task | Models | Input | Loss function |
| --- | --- | --- | --- |
| ASR (automatic speech recognition) | conformer | acoustic features, lengths | CTC loss |
| | deepspeech | acoustic features | CTC loss |
| | emformer | acoustic features, lengths | CTC loss |
| | wav2letter | acoustic features | CTC loss |
| Source separation | conv tasnet base | waveform with 1 channel | Si-SDR loss |
| | hdemucs ... | waveform with 2 channels | L1 loss |
| Acoustic models | wav2vec2 ... | waveform | CTC loss |
| | hubert ... | waveform | CTC loss |
| | wavlm ... | waveform | CTC loss |
| Speech quality | squim objective base | 1 waveform | L1 output[0] + 2 * output[2] + 0.5 * output[3] + 2 * L1 signal estimation |
| | squim subjective base | 2 waveforms | L1 loss |
| Text to speech | wavernn | waveform, spectrogram | cross entropy |
| | tacotron2 | tokens, token lengths, mel spectrogram, spectrogram lengths | MSE loss |
| Speech representation | hubert_pretrained ... | waveform, labels | HuBERT loss |
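One way to encode this table is a small registry keyed by model name, which the benchmark can consult to build inputs and pick a loss. The structure and names below are illustrative (a subset of the table), not the PR's actual data structure:

```python
# Illustrative registry: model name -> (task, input description, loss name).
MODEL_SPECS = {
    "conformer":             ("ASR", "acoustic features, lengths", "ctc"),
    "deepspeech":            ("ASR", "acoustic features", "ctc"),
    "conv_tasnet_base":      ("source separation", "waveform, 1 channel", "si_sdr"),
    "wav2vec2_base":         ("acoustic model", "waveform", "ctc"),
    "squim_subjective_base": ("speech quality", "2 waveforms", "l1"),
    "wavernn":               ("TTS", "waveform, spectrogram", "cross_entropy"),
    "tacotron2":             ("TTS", "tokens, lengths, mel spectrogram", "mse"),
}

def loss_for(model_name):
    """Look up which loss the benchmark should attach to a model."""
    _task, _inputs, loss = MODEL_SPECS[model_name]
    return loss

print(loss_for("wav2vec2_base"))  # -> ctc
```

Keeping the per-model input and loss definitions in one place makes it easy to add a model: one new entry rather than a new branch in the benchmark loop.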

Test Plan

docker - registry-sc-harbor.amd.com/framework/compute-rocm-dkms-no-npi-hipclang:16972_ubuntu24.04_py3.12_pytorch_release-2.9_7e1940d4

Running the different models for apex and torchaudio microbenchmarks.

python3 micro_benchmarking_apex.py --network resnet50

python3 micro_benchmarking_audio.py --network wav2vec2_base

Test Result

Apex microbenchmark output

--------------------SUMMARY--------------------------
Microbenchmark for network : resnet50
Num devices: 1
Dtype: FP32
Mini batch size [img] : 64
Time per mini-batch : 0.042455744743347165
Throughput [img/sec] : 1507.4520630103616

Torchaudio microbenchmark output

--------------------SUMMARY--------------------------
Microbenchmark for network : wav2vec2_base
Num devices: 1
Dtype: FP32
Mini batch size [ waveform ] : 64
Time per mini-batch : 0.020818877220153808
Throughput [ waveform /sec] : 3074.133120783503
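As a sanity check, the reported throughput in both summaries is exactly batch size divided by time per mini-batch:

```python
# Throughput = batch_size / time_per_batch, using the summary values above.
apex_tput = 64 / 0.042455744743347165   # resnet50 run
audio_tput = 64 / 0.020818877220153808  # wav2vec2_base run
print(round(apex_tput, 2), round(audio_tput, 2))
```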

Submission Checklist

@amd-sriram amd-sriram self-assigned this Oct 13, 2025
@amd-sriram amd-sriram marked this pull request as ready for review February 9, 2026 14:33