Tool-enhanced VLMs extend tool-enhanced LLMs by replacing the core language model with a vision-language model, enabling direct visual interaction. Unlike tool-enhanced LLMs, the planner here takes raw images as input, which reduces information loss and improves efficiency.
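The planner-tool loop described above can be sketched as follows. This is a minimal, hypothetical illustration (all function and tool names are assumptions, not from any listed paper): the VLM planner sees the raw image directly, may invoke visual tools whose outputs are fed back into its context, and finally emits an answer.

```python
# Minimal sketch of a tool-enhanced VLM planner loop (all names hypothetical).
from dataclasses import dataclass

@dataclass
class Step:
    tool: str
    args: dict
    result: str = ""

def run_planner(image, vlm, tools, max_steps=4):
    """Iteratively query the VLM on the raw image; dispatch any tool call it emits."""
    trace = []
    for _ in range(max_steps):
        action = vlm(image, trace)      # planner sees the raw image plus tool trace
        if action["type"] == "answer":
            return action["text"], trace
        step = Step(action["tool"], action["args"])
        step.result = tools[step.tool](image, **step.args)  # execute the visual tool
        trace.append(step)
    return "no answer", trace

# Toy stand-ins so the sketch is executable; a real system would use a VLM
# and actual visual tools (cropping, OCR, segmentation, ...).
def fake_vlm(image, trace):
    if not trace:
        return {"type": "tool", "tool": "ocr", "args": {}}
    return {"type": "answer", "text": f"The sign reads: {trace[-1].result}"}

answer, trace = run_planner("img.png", fake_vlm, {"ocr": lambda img: "STOP"})
```

Note that, unlike a tool-enhanced LLM pipeline, no captioner sits between the image and the planner; the image itself is the first argument at every step.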
📄 IoT
- Title: Image-of-Thought Prompting for Visual Reasoning Refinement in Multimodal Large Language Models
- Venue: arXiv 2024
📄 P2G
- Title: Plug-and-Play Grounding of Reasoning in Multimodal Large Language Models
- Venue: arXiv 2024
📄 LLaVA-Plus
- Title: LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
- Venue: ICLR 2024
- GitHub: Link
📄 VIREO
- Title: From the Least to the Most: Building a Plug-and-Play Visual Reasoner via Data Synthesis
- Venue: EMNLP 2024
- GitHub: Link
📄 OPENTHINKIMG
- Title: OPENTHINKIMG: Learning to Think with Images via Visual Tool Reinforcement Learning
- Venue: arXiv 2025
- GitHub: Link
📄 Visual Sketchpad
- Title: Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models
- Venue: NeurIPS 2024
- GitHub: Link
📄 VisionLLM v2
- Title: VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks
- Venue: NeurIPS 2024
- GitHub: Link
📄 VITRON
- Title: VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing
- Venue: NeurIPS 2024
- GitHub: Link
📄 Syn
- Title: Synthesize Step-by-Step: Tools, Templates and LLMs as Data Generators for Reasoning-Based Chart VQA
- Venue: CVPR 2024
📄 VTool-R1
- Title: VTool-R1: VLMs Learn to Think with Images via Reinforcement Learning on Multimodal Tool Use
- Venue: arXiv 2025
- GitHub: Link
📄 Self-Imagine
- Title: Self-Imagine: Effective Unimodal Reasoning with Multimodal Models using Self-Imagination
- Venue: arXiv 2024
- GitHub: Link
📄 Creative Agents
- Title: Creative Agents: Empowering Agents with Imagination for Creative Tasks
- Venue: UAI 2025
- GitHub: Link
📄 CoT-VLA
- Title: CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models
- Venue: CVPR 2025
- Website: Link
