I get an insanely high FLOP throughput (around 17 PFLOP/s) on my Mac M2 Ultra. Using MLX seems to produce reasonable numbers when I synchronize the GPU with mx.eval(output_matrix).
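For anyone hitting the same thing, here is a rough sketch of how the timing could be set up in MLX. MLX evaluates lazily, so a matmul that is never forced with mx.eval may not have run by the time the timer stops, which is one plausible way to end up with petaflop-scale numbers on hardware that should land somewhere in the tens of TFLOP/s. The function names, the matrix size, and the iteration count are my own placeholders, not anything from the original benchmark, and the 2*m*n*k FLOP count assumes a plain dense matmul.

```python
import time


def matmul_flops(m: int, n: int, k: int) -> int:
    """FLOPs for an (m x k) @ (k x n) product: one multiply and one add per term."""
    return 2 * m * n * k


def throughput_tflops(flops: float, seconds: float) -> float:
    """Convert a FLOP count and an elapsed wall time into TFLOP/s."""
    return flops / seconds / 1e12


def benchmark_mlx_matmul(n: int = 4096, iters: int = 10) -> float:
    """Time n x n float32 matmuls in MLX and return the measured TFLOP/s.

    Hypothetical helper: n and iters are illustrative defaults.
    """
    # Imported here so the arithmetic helpers above work without MLX installed.
    import mlx.core as mx

    a = mx.random.normal(shape=(n, n))
    b = mx.random.normal(shape=(n, n))
    mx.eval(a, b)      # materialize the inputs so their creation is not timed
    mx.eval(a @ b)     # warm-up run, also excluded from the timing

    start = time.perf_counter()
    for _ in range(iters):
        out = a @ b
        mx.eval(out)   # MLX is lazy: this forces the matmul to actually run
    elapsed = time.perf_counter() - start

    return throughput_tflops(iters * matmul_flops(n, n, n), elapsed)


# Example (needs a machine with MLX installed):
# print(f"{benchmark_mlx_matmul():.2f} TFLOP/s")
```

If the mx.eval(out) line inside the loop is removed, the loop mostly just builds a compute graph, so the elapsed time shrinks and the reported throughput inflates accordingly.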