ScaleRL needs FP32 on the inference side too

#16 takes the first step toward ScaleRL-like stability improvements by upcasting the logits to FP32 on the training backend. The inference backend still needs the same treatment; until then, the two backends compute token probabilities at different precisions.
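
To illustrate why the inference side matters, here is a minimal, standard-library-only sketch. It does not touch either backend: `to_bf16`, `to_fp32`, and `log_softmax` are hypothetical helpers, the logit values are made up, and BF16 is assumed as the lower-precision dtype. It compares the log-probability of the same token when the logits are kept in FP32 (as the trainer does after #16) versus rounded to BF16 (standing in for an inference backend that has not been upcast).

```python
import math
import struct


def to_fp32(x: float) -> float:
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x: float) -> float:
    """Round to the nearest bfloat16 value (round-to-nearest-even).

    bfloat16 is the top 16 bits of a float32, so we round the float32
    bit pattern and clear the low 16 bits. Only meant for ordinary
    finite values; NaN/inf handling is omitted in this sketch.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bias = 0x7FFF + ((bits >> 16) & 1)  # ties go to the even mantissa
    bits = (bits + bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]


# Made-up logits for a 4-token vocabulary (illustrative only).
raw_logits = [12.3456, 11.9876, 3.14159, -2.71828]

trainer_logits = [to_fp32(v) for v in raw_logits]    # FP32 path
inference_logits = [to_bf16(v) for v in raw_logits]  # BF16 path

token = 0  # the sampled token
lp_trainer = log_softmax(trainer_logits)[token]
lp_inference = log_softmax(inference_logits)[token]

# With identical weights this importance ratio should be exactly 1;
# any deviation here comes from the precision mismatch alone.
ratio = math.exp(lp_trainer - lp_inference)
print(f"trainer logprob:   {lp_trainer:.6f}")
print(f"inference logprob: {lp_inference:.6f}")
print(f"importance ratio:  {ratio:.6f}")
```

The ratio deviates from 1 even though both sides start from the same logits, which is the kind of trainer/inference mismatch that upcasting on both backends is meant to remove.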