WandB loss curves (e.g. [here](https://wandb.ai/awfidius/pure-transformer/runs/ehf0othc)) show a sawtooth form, correlated with batch ID. Batches are [randomized](https://github.com/awf/awf-jaxutils/blob/2590cc78a4ab017e0f6bcd1ccded1f63bbd9fc6a/dataset.py#L67), and the pattern persists even with [1-bit gradients](https://github.com/awf/functional-transformer/blob/780073081d65df06a5c0c31dc4f9d2c8285625a0/main.py#L178), so it's not Adam...
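A minimal sketch of the 1-bit-gradient diagnostic (with a hypothetical toy loss; the real training step lives in the linked `main.py`): replacing each gradient component with its sign throws away the magnitude information that Adam's moment estimates act on, so if the sawtooth survives this change, Adam's adaptivity can't be the cause.

```python
import jax
import jax.numpy as jnp

# Hypothetical stand-in loss for illustration only.
def loss(params, x):
    return jnp.sum((params * x) ** 2)

@jax.jit
def sign_sgd_step(params, x, lr=1e-3):
    # "1-bit gradients": keep only the sign of each gradient component,
    # discarding magnitude, and take a plain SGD step with it.
    g = jax.grad(loss)(params, x)
    return params - lr * jnp.sign(g)

params = jnp.ones(4)
params = sign_sgd_step(params, jnp.arange(4.0))
```

If the per-batch sawtooth in the loss curve is unchanged under this update rule, the optimizer's gradient statistics are ruled out and the periodicity must come from the data pipeline itself.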