I recently ran the evaluation on some of the latest models (since the paper was released), using `dom_reward` with GPT-4o as the reward text model, but I wasn't able to get performance anywhere near the original 4o results in Table 5. How were the Table 5 numbers obtained? Could you share the `evaluate.py` arguments and settings needed to reproduce them?
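
For reference, this is roughly the command I ran. The flag names here (`--reward_mode`, `--reward_model`, `--output_dir`) are only my reconstruction of the interface and may not match the actual `evaluate.py` arguments, so please correct anything that's off:

```bash
# Approximate invocation -- flag names are my guesses, not the confirmed evaluate.py interface
python evaluate.py \
    --reward_mode dom_reward \
    --reward_model gpt-4o \
    --output_dir results/latest_models
```
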