Could you explain how to reproduce the benchmark results reported for DeepSeek-R1 best-of-10 and OpenAI o1 best-of-10? Thanks!