Hi, this is interesting work! Could you share more details about the configuration used to obtain the Tau2Bench results (e.g., settings, prompts, or domains evaluated)?
I tried reproducing the results on the airline, retail, and telecom domains, but the model always calls `transfer_to_human_agents` and never invokes `call_expert` to trigger the available expert models. As a result, I cannot reproduce the numbers in the paper, and the gap is large.
Could you help clarify what might be missing or misconfigured on my side? Alternatively, could you release the full traces?
Thank you!