I recently ran the evaluation on some of the latest models (since the paper was released), using `dom_reward` with GPT-4o as the reward text model, but I wasn't able to get performance anywhere near the original 4o results in Table 5. How were the Table 5 numbers obtained? Could you share the `evaluate.py` arguments and settings needed to reproduce them?
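
For reference, this is roughly the command I ran. The flag names here (`--reward_mode`, `--reward_model`, `--output_dir`) are only my reconstruction of the interface and may not match the actual `evaluate.py` arguments, so please correct anything that's off:

```bash
# Approximate invocation -- flag names are my guesses, not the confirmed evaluate.py interface
python evaluate.py \
    --reward_mode dom_reward \
    --reward_model gpt-4o \
    --output_dir results/latest_models
```
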