To keep LLMs from gaming our benchmark, we should make sure our inputs aren't easily gameable. (For example, if we use an op that computes the mean over a large tensor drawn from torch.randn(), the result will be ~0 every time, so a model could hardcode the answer instead of implementing the op.) We need to solve for this. Imo we can do one of three things:
- Add a lot of high-variance noise to the tensors
- Cycle through distributions with different properties (mean, variance, skew, support) so a random one is used during testing
- Compose distributions into compound ones (e.g. mixtures), so the output statistics aren't predictable from the op alone
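A rough sketch of options two and three combined; the helper name, distribution choices, and parameters here are all illustrative assumptions, not a spec:

```python
import torch

def random_test_tensor(shape, seed=None):
    """Draw a test tensor from a randomly chosen distribution so that
    op outputs (mean, sum, etc.) aren't trivially predictable.

    Hypothetical helper: the specific distributions and parameters
    below are placeholders.
    """
    gen = torch.Generator()
    if seed is not None:
        gen.manual_seed(seed)
    choice = torch.randint(0, 4, (1,), generator=gen).item()
    if choice == 0:
        # Shifted/scaled normal: nonzero mean defeats the "mean is ~0" shortcut
        return torch.randn(shape, generator=gen) * 3.0 + 5.0
    elif choice == 1:
        # Uniform over an asymmetric range
        return torch.rand(shape, generator=gen) * 10.0 - 2.0
    elif choice == 2:
        # Exponential: strictly positive with a heavy right tail
        return torch.empty(shape).exponential_(lambd=0.5, generator=gen)
    else:
        # Compound distribution: mixture of two normals with distinct means
        mask = torch.rand(shape, generator=gen) < 0.3
        a = torch.randn(shape, generator=gen) * 0.5 - 4.0
        b = torch.randn(shape, generator=gen) * 2.0 + 1.0
        return torch.where(mask, a, b)
```

The test harness would call this per test case with a fresh seed, so a model can't predict which distribution (and hence which summary statistics) it will face.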