Hi guys,
When a task completes, I see a result.json like this in the eval/ folder:
{
"task_id": 0,
"task_name": "SystemBluetoothTurnOn",
"task_idx": 0,
"task_description": "Turn bluetooth on.",
"max_steps": 15,
"success": 0.0,
"agent_success": false,
"steps_taken": 1,
"execution_time": 12.689261,
"reasoning": false,
"final_thought": "To turn on Bluetooth, we first need to open the Settings app, but it is not currently open. Please navigate to the Settings app first.",
"logs": [],
"timestamp": "2025-06-26T16:18:20.153066",
"error": null,
"trajectory": [],
"trajectory_stats": {
"total_steps": 0,
"planning_steps": 0,
"execution_steps": 0
},
"device": "emulator-5554"
}
In this instance, the result.json correctly identifies that the task failed (I watched it, and can confirm).
However, I have noticed that result.json often marks the agent's run as a success even though the task actually failed. For example, when I ran the add contact task, 'success' is the float 0.0 (I imagine this is the reward you get back from the Android Env environment?), yet 'agent_success' is the boolean 'true':
{
"task_id": 0,
"task_name": "ContactsAddContact",
"task_idx": 0,
"task_description": "Create a new contact for Emilia Gonzalez. Their number is +14240925675.",
"max_steps": 18,
"success": 0.0,
"agent_success": true,
"steps_taken": 7,
"execution_time": 77.803622,
"reasoning": false,
"final_thought": "Successfully created new contact for Emilia Gonzalez",
"logs": [],
"timestamp": "2025-06-26T16:01:55.911568",
"error": null,
"trajectory": [],
"trajectory_stats": {
"total_steps": 0,
"planning_steps": 0,
"execution_steps": 0
},
"device": "emulator-5554"
}
Can I just please confirm that the 63% success rate you achieve on the AndroidWorld benchmark is based on the float 'success' value, and not the boolean 'agent_success' value?
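To make the discrepancy easier to check across many runs, here is a minimal sketch of how the two fields could be compared. It rests on several assumptions that are mine, not the framework's: that every run writes a JSON file somewhere under eval/, that a 'success' value of 1.0 or higher means the environment counted the task as solved, and that the helper name `summarize_results` is purely illustrative.

```python
import json
from pathlib import Path


def summarize_results(eval_dir):
    """Compare the float `success` with the boolean `agent_success` across runs.

    Hypothetical helper: the directory layout and the >= 1.0 threshold
    are assumptions, not something the framework documents.
    """
    results = []
    for path in sorted(Path(eval_dir).rglob("*.json")):
        with path.open() as f:
            data = json.load(f)
        # Skip JSON files that do not look like a result file.
        if not isinstance(data, dict):
            continue
        if "success" not in data or "agent_success" not in data:
            continue
        results.append(data)

    def env_ok(result):
        # Treat a missing/null reward as 0.0; assume 1.0 means solved.
        return float(result["success"] or 0.0) >= 1.0

    total = len(results)
    env_successes = sum(1 for r in results if env_ok(r))
    agent_successes = sum(1 for r in results if r["agent_success"])
    # Tasks where the agent's self-report disagrees with the environment reward.
    mismatches = [
        r.get("task_name", "<unknown>")
        for r in results
        if env_ok(r) != bool(r["agent_success"])
    ]
    return {
        "total": total,
        "env_success_rate": env_successes / total if total else 0.0,
        "agent_success_rate": agent_successes / total if total else 0.0,
        "mismatches": mismatches,
    }
```

Calling `summarize_results("eval")` on my runs would list ContactsAddContact under `mismatches`, since its 'success' is 0.0 while 'agent_success' is true. If the two rates differ a lot, it matters which one the reported benchmark number is based on.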
Also, would you mind publishing the code / config you used, so I can reproduce your results on the benchmark? I have been trying a few tasks here and there with your framework, but most of them are failing, so perhaps I am using the wrong settings or using it incorrectly!
Thanks for your work on this!