The ProcThor Agent is designed to navigate within simulated environments using a Vision Language Model (VLM). The system employs a structured interaction loop where the model receives visual observations and executes specific tool calls.
The interaction follows a strict template where the User acts as the environment interface, providing:
- The Goal (e.g., "Visit all rooms").
- A history of recent actions and observations.
- The current visual observation (RGB image).
The Assistant (the VLM) then analyzes the visual input and context to determine the next best action, and outputs a structured function call.
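For concreteness, here is a minimal sketch of one iteration of that loop against the OpenAI chat-completions API. The helper name `build_user_message`, the prompt layout, and the model string are illustrative assumptions, not the repository's actual implementation; the `tools` argument takes function schemas like the ones sketched after the tool list below.

```python
import base64
import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def build_user_message(goal: str, history: str, frame_png: bytes) -> dict:
    """Assemble the User turn: goal, recent action/observation history,
    and the current RGB frame encoded as a base64 data URL."""
    image_b64 = base64.b64encode(frame_png).decode()
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": f"Goal: {goal}\nHistory: {history}"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }


def next_action(goal: str, history: str, frame_png: bytes, tools: list):
    """One step of the loop: send the observation, return the tool call."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder; substitute the model you use
        messages=[build_user_message(goal, history, frame_png)],
        tools=tools,
        tool_choice="required",  # force a structured function call
    )
    call = resp.choices[0].message.tool_calls[0]
    return call.function.name, json.loads(call.function.arguments)
```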
The agent is equipped with a precise set of tools to manipulate its position and orientation. These tools are defined with specific arguments to ensure deterministic control over the agent's movement.
- Navigation: Moves the agent in cardinal directions (`Ahead`, `Back`, `Left`, `Right`) with configurable magnitude (0.1 to 1.0).
- Rotation: Rotates the view (`Left`, `Right`) by fixed degrees (15, 30, 45, 90).
- Done: Signals the completion of the task.
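Declared as OpenAI-style function schemas, the three tools might look like the sketch below. The tool names (`navigate`, `rotate`, `done`) and argument fields are assumptions inferred from the descriptions above, not the repository's canonical definitions.

```python
# Illustrative tool schemas; names and fields are assumptions, not the
# repository's canonical definitions.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "navigate",
            "description": "Move the agent relative to its current heading.",
            "parameters": {
                "type": "object",
                "properties": {
                    "direction": {"type": "string",
                                  "enum": ["Ahead", "Back", "Left", "Right"]},
                    "magnitude": {"type": "number",
                                  "minimum": 0.1, "maximum": 1.0},
                },
                "required": ["direction", "magnitude"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "rotate",
            "description": "Rotate the agent's view in place.",
            "parameters": {
                "type": "object",
                "properties": {
                    "direction": {"type": "string", "enum": ["Left", "Right"]},
                    "degrees": {"type": "integer", "enum": [15, 30, 45, 90]},
                },
                "required": ["direction", "degrees"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "done",
            "description": "Signal that the task is complete.",
            "parameters": {"type": "object", "properties": {}},
        },
    },
]
```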
- Python 3.12
- Run `pip install -r requirements.txt` to install dependencies.
- Run `cp .env.example .env` to create the environment file, then add your API key: `OPENAI_API_KEY=TOKEN`.
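To confirm the key is picked up correctly, a minimal check, assuming the scripts load `.env` via python-dotenv (the repository may wire this up differently):

```python
# Sanity check: load .env and confirm the key is visible.
import os

from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads .env from the current working directory
assert os.getenv("OPENAI_API_KEY"), "OPENAI_API_KEY is not set in .env"
```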
Run one of the entry-point scripts:

```bash
python scripts/interactive_wasd.py
python scripts/ai_agent.py
python scripts/ai_agent_chunked.py
```

Evaluate agent navigation performance on ProcTHOR environments:
```bash
python scripts/create_benchmark_dataset.py --num 50 --split test --output benchmark_dataset.jsonl
python scripts/run_benchmark.py --benchmark benchmark_dataset.jsonl --max_steps 70
```

Results are saved to `benchmark_results.jsonl`, and trajectory visualizations to `benchmark_visualizations/`.
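For a quick look at a finished run, the output file can be summarized in a few lines of Python. The field names `success` and `steps` are assumptions about the record schema, not documented fields:

```python
# Summarize benchmark_results.jsonl; field names are assumed, not documented.
import json

with open("benchmark_results.jsonl") as f:
    records = [json.loads(line) for line in f]

successes = [r for r in records if r.get("success")]
print(f"episodes: {len(records)}, success rate: {len(successes) / len(records):.1%}")
if successes:
    mean_steps = sum(r["steps"] for r in successes) / len(successes)
    print(f"mean steps over successful episodes: {mean_steps:.1f}")
```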
Analyze execution logs, calculate metrics (redundancy, blocked actions), and generate annotated videos of agent performance.
```bash
python scripts/analyze_benchmark_logs.py benchmark_results/<benchmark_run_directory>
```

This generates `result_detailed.json` and videos in `analysis_videos/` within the run directory.
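To make the two metric names concrete, here is one plausible reading of them, assuming each log record carries an `action` string and a `blocked` flag; the analyzer's actual schema and exact definitions may differ:

```python
# Illustrative metric definitions; record fields ("action", "blocked")
# and the file path are assumptions for this sketch.
import json

def redundancy(actions: list[str]) -> float:
    """Fraction of steps that repeat the immediately preceding action."""
    repeats = sum(a == b for a, b in zip(actions, actions[1:]))
    return repeats / max(len(actions) - 1, 1)

def blocked_rate(records: list[dict]) -> float:
    """Fraction of actions the environment rejected (e.g., hit a wall)."""
    return sum(bool(r.get("blocked")) for r in records) / max(len(records), 1)

with open("run_log.jsonl") as f:  # hypothetical per-run log file
    records = [json.loads(line) for line in f]

actions = [r["action"] for r in records]
print(f"redundancy: {redundancy(actions):.2f}, "
      f"blocked rate: {blocked_rate(records):.2f}")
```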