Description
In the vein of the work in #98, I ran an analysis of the runtimes of all of the tests in triton bench. You can take a look at it here:
https://gist.github.com/PaliC/b1f3469b91fe340ba40e583695c92fcb
To repro, check out this PR: #109
Then run python BackendBench/scripts/runtime_histogram.py
The numbers are relative to the runtime of torch.empty(0, device='cuda'), which we can treat as a rough proxy for CUDA / torch dispatch overhead.
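The actual measurement lives in runtime_histogram.py from PR #109; the following is just a simplified, CPU-only sketch of the idea of expressing a workload's runtime as a multiple of a no-op baseline (the names here are hypothetical and plain callables stand in for the real ops so no GPU is needed):

```python
import timeit

def relative_runtime(fn, baseline_fn, number=1000):
    """Return fn's runtime as a multiple of baseline_fn's runtime.

    In the real analysis the baseline is torch.empty(0, device='cuda');
    here we use arbitrary callables so the sketch runs anywhere.
    """
    base = timeit.timeit(baseline_fn, number=number)
    t = timeit.timeit(fn, number=number)
    return t / base

# Stand-in workloads (hypothetical, not the real aten ops):
noop = lambda: None
work = lambda: sum(range(1000))
ratio = relative_runtime(work, noop)
# A ratio near 1.0x suggests the "op" is dominated by call overhead,
# which is exactly the signal used to flag suspicious tests below.
```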
Takeaways
There are three things to note from this.
- The first is that a good filter for a test is whether it takes at least 1.4x the runtime of torch.empty(0, device='cuda'); crossing that threshold signifies that the op does indeed do something. The synthetic ops show the same pattern, which corroborates the threshold.
- The big inputs script likely needs to be rewritten to use do_bench, or to simply cap tensor sizes at something like 5GB (massive, but small enough that ops like softmax shouldn't OOM).
- The most relevant takeaway is deciding what to do with the 12 ops whose runtime does not seem to scale with input size.
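The size cap from the second takeaway could look roughly like this. This is a sketch with hypothetical helper names (not the existing big inputs script), computing tensor size purely from shape and element width so it runs without torch:

```python
import math

MAX_BYTES = 5 * 1024**3  # the 5GB cap suggested above

def fits_cap(shape, dtype_bytes):
    """True if a tensor of `shape` with `dtype_bytes` per element
    stays under the cap. Pure-Python sketch, no torch required."""
    return math.prod(shape) * dtype_bytes <= MAX_BYTES

def shrink_to_cap(shape, dtype_bytes):
    """Halve the leading dimension until the tensor fits the cap.
    Hypothetical helper, not part of the existing script."""
    shape = list(shape)
    while not fits_cap(shape, dtype_bytes) and shape[0] > 1:
        shape[0] //= 2
    return tuple(shape)

# A (2**16, 2**16) float32 tensor is 16GB; shrink it to fit:
capped = shrink_to_cap((2**16, 2**16), 4)
```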
The ops are
1. aten.as_strided_.default 0.99x (n=2)
2. aten.split.Tensor 0.99x (n=37)
3. aten._unsafe_view.default 0.99x (n=672)
4. aten.new_empty_strided.default 1.00x (n=45)
5. aten._sparse_coo_tensor_with_dims_and_tensors.default 1.00x (n=124)
6. aten.split_with_sizes.default 1.00x (n=44)
7. aten.unbind.int 1.00x (n=49)
8. aten.unsqueeze_.default 1.00x (n=5)
9. aten.logical_and_.default 1.00x (n=1)
10. aten.le.Scalar 1.01x (n=6)
11. aten.new_empty.default 1.01x (n=15)
12. aten.floor.default 1.01x (n=1)
13. aten.log2.default 1.44x (n=2)

Note that when we run similar tests for the synthetic tests (which are very, very large) we get
1. aten.as_strided_.default 0.99x (n=2)
2. aten.new_empty_strided.default 1.00x (n=10)
3. aten.split.Tensor 1.00x (n=10)
4. aten.unsqueeze_.default 1.00x (n=5)
5. aten._sparse_coo_tensor_with_dims_and_tensors.default 1.00x (n=2)
6. aten.new_empty.default 1.00x (n=10)
7. aten._unsafe_view.default 1.01x (n=3)
8. aten.unbind.int 1.01x (n=10)
9. aten.split_with_sizes.default 1.01x (n=3)
10. aten.new_ones.default 1.69x (n=10)
11. aten.round.default 1.82x (n=1)
12. aten.floor.default 1.84x (n=1)

This indicates that the following ops either need to be tested only for correctness or removed from the benchmark:
aten.as_strided_.default
aten.split.Tensor
aten._unsafe_view.default
aten.new_empty_strided.default
aten._sparse_coo_tensor_with_dims_and_tensors.default
aten.split_with_sizes.default
aten.unbind.int
aten.unsqueeze_.default
aten.new_empty.default

For the following ops we should mix in synthetic benchmarks, as those run significantly longer:
aten.logical_and_.default
aten.le.Scalar
aten.floor.default

Action items
- For the torchbench suite, split off performance and correctness tests. Performance tests in this case are a strict subset of correctness tests.
- Filter out tests that run in under 1.3x the runtime of the cuda no-op as correctness-only tests. This number is chosen off the histogram analysis, as we have a bunch of tests in the 0.991x - 1.212x range and none in the 1.212x - 1.482x range (most remaining tests are slower than that).
- Filter out the first list of ops as correctness-only tests
- Mix in synthetic benchmarks for the second list of ops
- Fix the big_inputs script to cap the size of inputs at 5GB to avoid OOMs
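The first three action items together amount to a classification rule per test. A minimal sketch of that rule (function and label names are hypothetical; the op lists and the 1.3x threshold come from the analysis above):

```python
# Ops that never exceed ~1.01x the no-op baseline in torchbench:
CORRECTNESS_ONLY_OPS = {
    "aten.as_strided_.default",
    "aten.split.Tensor",
    "aten._unsafe_view.default",
    "aten.new_empty_strided.default",
    "aten._sparse_coo_tensor_with_dims_and_tensors.default",
    "aten.split_with_sizes.default",
    "aten.unbind.int",
    "aten.unsqueeze_.default",
    "aten.new_empty.default",
}
# Ops whose synthetic (very large) inputs do scale, so we mix those in:
SYNTHETIC_MIXIN_OPS = {
    "aten.logical_and_.default",
    "aten.le.Scalar",
    "aten.floor.default",
}
THRESHOLD = 1.3  # relative to the cuda no-op, chosen in the histogram gap

def classify(op_name, relative_runtime):
    """Decide how a test should participate in the benchmark."""
    if op_name in SYNTHETIC_MIXIN_OPS:
        return "perf+synthetic"
    if op_name in CORRECTNESS_ONLY_OPS or relative_runtime < THRESHOLD:
        return "correctness-only"
    return "perf+correctness"
```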