Runtime analysis of tests in torchbench suite #108

@PaliC

In the vein of the work in #98, I ran an analysis of the runtimes of all of the tests in the torchbench suite. You can take a look at the results here:

https://gist.github.com/PaliC/b1f3469b91fe340ba40e583695c92fcb

To repro, check out PR #109, then run
python BackendBench/scripts/runtime_histogram.py

The numbers are relative to the runtime of torch.empty(0, device='cuda'), which we can treat as a rough proxy for what CUDA / torch overhead looks like.
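As a rough illustration of how such a relative number can be computed, here is a minimal sketch. It is not the code in runtime_histogram.py: the helper names `median_runtime` and `relative_runtime` are hypothetical, and it uses plain wall-clock timing. A real GPU measurement would also need device synchronization or a helper such as do_bench, which this sketch leaves out.

```python
import statistics
import time
from typing import Callable


def median_runtime(fn: Callable[[], object], repeats: int = 25,
                   clock: Callable[[], float] = time.perf_counter) -> float:
    """Median wall-clock duration of fn over `repeats` calls (hypothetical helper)."""
    samples = []
    for _ in range(repeats):
        start = clock()
        fn()
        samples.append(clock() - start)
    return statistics.median(samples)


def relative_runtime(op: Callable[[], object], baseline: Callable[[], object],
                     repeats: int = 25,
                     clock: Callable[[], float] = time.perf_counter) -> float:
    """Runtime of `op` expressed as a multiple of the runtime of `baseline`.

    In the setting of this issue, `baseline` would wrap
    torch.empty(0, device='cuda') and `op` would wrap the op under test.
    """
    return median_runtime(op, repeats, clock) / median_runtime(baseline, repeats, clock)
```

A value near 1.0x means the op costs about as much as the baseline, i.e. mostly overhead.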

Takeaways

There are three things to note from this.

  1. A good filter for a test is that it takes at least 1.4x the runtime of torch.empty(0, device='cuda'), which signifies that the op does real work. The synthetic op results are consistent with this as well.
  2. The big inputs script likely needs to be rewritten to use do_bench, or to just limit the size of the tensor to something like 5GB, which is massive but shouldn't cause ops like softmax to OOM.
  3. The most relevant one is what to do with the 12 ops whose runtime does not seem to scale with input size.

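The ops are listed below with their relative runtime and number of tests (n). The 13th entry, aten.log2.default at 1.44x, is above the cutoff and is not one of the 12.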

  1. aten.as_strided_.default                     0.99x (n=2)
  2. aten.split.Tensor                            0.99x (n=37)
  3. aten._unsafe_view.default                    0.99x (n=672)
  4. aten.new_empty_strided.default               1.00x (n=45)
  5. aten._sparse_coo_tensor_with_dims_and_tensors.default     1.00x (n=124)
  6. aten.split_with_sizes.default                1.00x (n=44)
  7. aten.unbind.int                              1.00x (n=49)
  8. aten.unsqueeze_.default                      1.00x (n=5)
  9. aten.logical_and_.default                    1.00x (n=1)
 10. aten.le.Scalar                               1.01x (n=6)
 11. aten.new_empty.default                       1.01x (n=15)
 12. aten.floor.default                           1.01x (n=1)
 13. aten.log2.default                            1.44x (n=2)

Note that when we run the same analysis on the synthetic tests (which use very large inputs) we get

  1. aten.as_strided_.default                     0.99x (n=2)
  2. aten.new_empty_strided.default               1.00x (n=10)
  3. aten.split.Tensor                            1.00x (n=10)
  4. aten.unsqueeze_.default                      1.00x (n=5)
  5. aten._sparse_coo_tensor_with_dims_and_tensors.default     1.00x (n=2)
  6. aten.new_empty.default                       1.00x (n=10)
  7. aten._unsafe_view.default                    1.01x (n=3)
  8. aten.unbind.int                              1.01x (n=10)
  9. aten.split_with_sizes.default                1.01x (n=3)
 10. aten.new_ones.default                        1.69x (n=10)
 11. aten.round.default                           1.82x (n=1)
 12. aten.floor.default                           1.84x (n=1)

This indicates that the following ops should either be tested for correctness only or be removed from the benchmark

aten.as_strided_.default
aten.split.Tensor
aten._unsafe_view.default
aten.new_empty_strided.default
aten._sparse_coo_tensor_with_dims_and_tensors.default
aten.split_with_sizes.default
aten.unbind.int
aten.unsqueeze_.default
aten.new_empty.default

For the following ops we should mix in synthetic benchmarks, as their runtime on synthetic inputs is significantly longer

aten.logical_and_.default  
aten.le.Scalar
aten.floor.default  
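The split into these two groups can be reproduced from the numbers in the two lists above. The sketch below is only an illustration: the `classify` helper is hypothetical, the 1.3x cutoff is the one proposed in the action items, and it assumes that an op missing from the synthetic list (which shows only the lowest entries) is not flat on synthetic inputs.

```python
# Relative runtimes (multiples of the torch.empty(0, device='cuda') baseline),
# copied from the torchbench list and the synthetic list above.
TORCHBENCH = {
    "aten.as_strided_.default": 0.99,
    "aten.split.Tensor": 0.99,
    "aten._unsafe_view.default": 0.99,
    "aten.new_empty_strided.default": 1.00,
    "aten._sparse_coo_tensor_with_dims_and_tensors.default": 1.00,
    "aten.split_with_sizes.default": 1.00,
    "aten.unbind.int": 1.00,
    "aten.unsqueeze_.default": 1.00,
    "aten.logical_and_.default": 1.00,
    "aten.le.Scalar": 1.01,
    "aten.new_empty.default": 1.01,
    "aten.floor.default": 1.01,
    "aten.log2.default": 1.44,
}
SYNTHETIC = {
    "aten.as_strided_.default": 0.99,
    "aten.new_empty_strided.default": 1.00,
    "aten.split.Tensor": 1.00,
    "aten.unsqueeze_.default": 1.00,
    "aten._sparse_coo_tensor_with_dims_and_tensors.default": 1.00,
    "aten.new_empty.default": 1.00,
    "aten._unsafe_view.default": 1.01,
    "aten.unbind.int": 1.01,
    "aten.split_with_sizes.default": 1.01,
    "aten.new_ones.default": 1.69,
    "aten.round.default": 1.82,
    "aten.floor.default": 1.84,
}
THRESHOLD = 1.3  # cutoff proposed in the action items


def classify(real, synthetic, threshold):
    """Split ops that look like no-ops on real inputs into two groups."""
    flat_real = {op for op, ratio in real.items() if ratio < threshold}
    # Assumption: an op absent from `synthetic` is treated as not flat there.
    flat_synth = {op for op, ratio in synthetic.items() if ratio < threshold}
    correctness_only = sorted(flat_real & flat_synth)  # flat on both input sets
    needs_synthetic = sorted(flat_real - flat_synth)   # only flat on real inputs
    return correctness_only, needs_synthetic


correctness_only, needs_synthetic = classify(TORCHBENCH, SYNTHETIC, THRESHOLD)
print("correctness only:", correctness_only)
print("mix in synthetic:", needs_synthetic)
```

With these inputs the first group has the nine ops listed above and the second group has aten.floor.default, aten.le.Scalar and aten.logical_and_.default.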

Action items

  • For the torchbench suite, split off performance tests from correctness tests. Performance tests in this case are a strict subset of correctness tests.
  • Treat tests that take less than 1.3x the runtime of the CUDA no-op as correctness-only tests. This number is chosen from the histogram analysis, as we have a bunch of tests in the 0.991x - 1.212x range and none in the 1.212x - 1.482x range (most tests are slower than that). Any cutoff inside that empty range, including the 1.4x mentioned above, classifies the current tests identically.
  • Mark the first list of ops as correctness-only tests.
  • Mix in synthetic benchmarks for the second list of ops.
  • Fix the big_inputs script to cap the size of inputs at 5GB to avoid OOMs.
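For the last item, one possible way to cap an input is to shrink its leading dimension until the tensor fits the budget. This is a sketch under assumptions, not the big_inputs script: `cap_shape` and `MAX_BYTES` are hypothetical names, "5GB" is read as 5 GiB, and shrinking only the leading dimension is just one choice among several.

```python
import math

MAX_BYTES = 5 * 1024**3  # "5GB" read as 5 GiB (assumption)


def cap_shape(shape, itemsize, max_bytes=MAX_BYTES):
    """Return `shape` with its leading dimension reduced to fit in max_bytes.

    `itemsize` is the size of one element in bytes (e.g. 4 for float32).
    """
    total_bytes = math.prod(shape) * itemsize
    if total_bytes <= max_bytes:
        return tuple(shape)  # already within budget, leave untouched
    # Bytes taken by one slice along the leading dimension.
    row_bytes = math.prod(shape[1:]) * itemsize
    rows = max_bytes // row_bytes
    if rows == 0:
        raise ValueError("a single leading-dimension slice exceeds the cap")
    return (rows, *shape[1:])


# A 100000 x 100000 float32 tensor (about 40 GB) is cut down to fit.
print(cap_shape((100_000, 100_000), itemsize=4))
```

Shapes that already fit are returned unchanged, so the cap only affects the very large synthetic inputs.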
