Thank you in advance for your reply.
The SLURM script in evaluation/tau2-bench/run.py launches 4 vLLM instances of the orchestrator model. Could you explain why? For other benchmarks, such as HLE, there is only a single orchestrator model.