Description
In the vein of the work in #98, I ran an analysis of the runtimes of all of the tests in triton bench. You can take a look at it here:
https://gist.github.com/PaliC/b1f3469b91fe340ba40e583695c92fcb
To repro, check out this PR: #109
Then run python BackendBench/scripts/runtime_histogram.py
The numbers are relative to the runtime of torch.empty(0, device='cuda'), which we can treat as a rough proxy for CUDA / torch dispatch overhead.
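The actual measurement lives in runtime_histogram.py from PR #109; the following is just a simplified, CPU-only sketch of the idea of expressing a workload's runtime as a multiple of a no-op baseline (the names here are hypothetical and plain callables stand in for the real ops so no GPU is needed):

```python
import timeit

def relative_runtime(fn, baseline_fn, number=1000):
    """Return fn's runtime as a multiple of baseline_fn's runtime.

    In the real analysis the baseline is torch.empty(0, device='cuda');
    here we use arbitrary callables so the sketch runs anywhere.
    """
    base = timeit.timeit(baseline_fn, number=number)
    t = timeit.timeit(fn, number=number)
    return t / base

# Stand-in workloads (hypothetical, not the real aten ops):
noop = lambda: None
work = lambda: sum(range(1000))
ratio = relative_runtime(work, noop)
# A ratio near 1.0x suggests the "op" is dominated by call overhead,
# which is exactly the signal used to flag suspicious tests below.
```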
Takeaways
There are three things to note from this.
- The first is that a good filter for a test is whether it takes at least 1.4x the runtime of torch.empty(0, device='cuda'); crossing that threshold signifies that the op does indeed do something. The synthetic ops show the same pattern, which corroborates the threshold.
- The big inputs script likely needs to be rewritten to use do_bench, or to simply cap tensor sizes at something like 5GB (massive, but small enough that ops like softmax shouldn't OOM).
- The most relevant takeaway is deciding what to do with the 12 ops whose runtime does not seem to scale with input size.
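The size cap from the second takeaway could look roughly like this. This is a sketch with hypothetical helper names (not the existing big inputs script), computing tensor size purely from shape and element width so it runs without torch:

```python
import math

MAX_BYTES = 5 * 1024**3  # the 5GB cap suggested above

def fits_cap(shape, dtype_bytes):
    """True if a tensor of `shape` with `dtype_bytes` per element
    stays under the cap. Pure-Python sketch, no torch required."""
    return math.prod(shape) * dtype_bytes <= MAX_BYTES

def shrink_to_cap(shape, dtype_bytes):
    """Halve the leading dimension until the tensor fits the cap.
    Hypothetical helper, not part of the existing script."""
    shape = list(shape)
    while not fits_cap(shape, dtype_bytes) and shape[0] > 1:
        shape[0] //= 2
    return tuple(shape)

# A (2**16, 2**16) float32 tensor is 16GB; shrink it to fit:
capped = shrink_to_cap((2**16, 2**16), 4)
```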
The ops are
1. aten.as_strided_.default 0.99x (n=2)
2. aten.split.Tensor 0.99x (n=37)
3. aten._unsafe_view.default 0.99x (n=672)
4. aten.new_empty_strided.default 1.00x (n=45)
5. aten._sparse_coo_tensor_with_dims_and_tensors.default 1.00x (n=124)
6. aten.split_with_sizes.default 1.00x (n=44)
7. aten.unbind.int 1.00x (n=49)
8. aten.unsqueeze_.default 1.00x (n=5)
9. aten.logical_and_.default 1.00x (n=1)
10. aten.le.Scalar 1.01x (n=6)
11. aten.new_empty.default 1.01x (n=15)
12. aten.floor.default 1.01x (n=1)
13. aten.log2.default 1.44x (n=2)

Note that when we run similar tests for the synthetic tests (which are very, very large) we get
1. aten.as_strided_.default 0.99x (n=2)
2. aten.new_empty_strided.default 1.00x (n=10)
3. aten.split.Tensor 1.00x (n=10)
4. aten.unsqueeze_.default 1.00x (n=5)
5. aten._sparse_coo_tensor_with_dims_and_tensors.default 1.00x (n=2)
6. aten.new_empty.default 1.00x (n=10)
7. aten._unsafe_view.default 1.01x (n=3)
8. aten.unbind.int 1.01x (n=10)
9. aten.split_with_sizes.default 1.01x (n=3)
10. aten.new_ones.default 1.69x (n=10)
11. aten.round.default 1.82x (n=1)
12. aten.floor.default 1.84x (n=1)

This indicates that the following ops either need to be tested only for correctness or removed from the benchmark:
aten.as_strided_.default
aten.split.Tensor
aten._unsafe_view.default
aten.new_empty_strided.default
aten._sparse_coo_tensor_with_dims_and_tensors.default
aten.split_with_sizes.default
aten.unbind.int
aten.unsqueeze_.default
aten.new_empty.default

For the following ops we should mix in synthetic benchmarks, as those run significantly longer:
aten.logical_and_.default
aten.le.Scalar
aten.floor.default

Action items
- For the torchbench suite, split off performance and correctness tests. Performance tests in this case are a strict subset of correctness tests.
- Filter out tests that run in under 1.3x the runtime of the cuda no-op as correctness-only tests. This number is chosen off the histogram analysis, as we have a bunch of tests in the 0.991x - 1.212x range and none in the 1.212x - 1.482x range (most remaining tests are slower than that).
- Filter out the first list of ops as correctness-only tests
- Mix in synthetic benchmarks for the second list of ops
- Fix the big_inputs script to cap the size of inputs at 5GB to avoid OOMs
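The first three action items together amount to a classification rule per test. A minimal sketch of that rule (function and label names are hypothetical; the op lists and the 1.3x threshold come from the analysis above):

```python
# Ops that never exceed ~1.01x the no-op baseline in torchbench:
CORRECTNESS_ONLY_OPS = {
    "aten.as_strided_.default",
    "aten.split.Tensor",
    "aten._unsafe_view.default",
    "aten.new_empty_strided.default",
    "aten._sparse_coo_tensor_with_dims_and_tensors.default",
    "aten.split_with_sizes.default",
    "aten.unbind.int",
    "aten.unsqueeze_.default",
    "aten.new_empty.default",
}
# Ops whose synthetic (very large) inputs do scale, so we mix those in:
SYNTHETIC_MIXIN_OPS = {
    "aten.logical_and_.default",
    "aten.le.Scalar",
    "aten.floor.default",
}
THRESHOLD = 1.3  # relative to the cuda no-op, chosen in the histogram gap

def classify(op_name, relative_runtime):
    """Decide how a test should participate in the benchmark."""
    if op_name in SYNTHETIC_MIXIN_OPS:
        return "perf+synthetic"
    if op_name in CORRECTNESS_ONLY_OPS or relative_runtime < THRESHOLD:
        return "correctness-only"
    return "perf+correctness"
```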