Tool-enhanced VLMs extend tool-enhanced LLMs by replacing the core language model with a vision-language model, enabling direct visual interaction. Unlike tool-enhanced LLMs, the planner here takes raw images as input, which reduces information loss and improves efficiency.
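The planner-tool loop described above can be sketched as follows. This is a minimal, hypothetical illustration (all function and tool names are assumptions, not from any listed paper): the VLM planner sees the raw image directly, may invoke visual tools whose outputs are fed back into its context, and finally emits an answer.

```python
# Minimal sketch of a tool-enhanced VLM planner loop (all names hypothetical).
from dataclasses import dataclass

@dataclass
class Step:
    tool: str
    args: dict
    result: str = ""

def run_planner(image, vlm, tools, max_steps=4):
    """Iteratively query the VLM on the raw image; dispatch any tool call it emits."""
    trace = []
    for _ in range(max_steps):
        action = vlm(image, trace)      # planner sees the raw image plus tool trace
        if action["type"] == "answer":
            return action["text"], trace
        step = Step(action["tool"], action["args"])
        step.result = tools[step.tool](image, **step.args)  # execute the visual tool
        trace.append(step)
    return "no answer", trace

# Toy stand-ins so the sketch is executable; a real system would use a VLM
# and actual visual tools (cropping, OCR, segmentation, ...).
def fake_vlm(image, trace):
    if not trace:
        return {"type": "tool", "tool": "ocr", "args": {}}
    return {"type": "answer", "text": f"The sign reads: {trace[-1].result}"}

answer, trace = run_planner("img.png", fake_vlm, {"ocr": lambda img: "STOP"})
```

Note that, unlike a tool-enhanced LLM pipeline, no captioner sits between the image and the planner; the image itself is the first argument at every step.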
📄 IoT
- Title: Image-of-Thought Prompting for Visual Reasoning Refinement in Multimodal Large Language Models
- Venue: arXiv 2024
📄 P2G
- Title: Plug-and-Play Grounding of Reasoning in Multimodal Large Language Models
- Venue: arXiv 2024
📄 LLaVA-Plus
- Title: LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
- Venue: ICLR 2024
- GitHub: Link
📄 VIREO
- Title: From the Least to the Most: Building a Plug-and-Play Visual Reasoner via Data Synthesis
- Venue: EMNLP 2024
- GitHub: Link
📄 OPENTHINKIMG
- Title: OPENTHINKIMG: Learning to Think with Images via Visual Tool Reinforcement Learning
- Venue: arXiv 2025
- GitHub: Link
📄 Visual Sketchpad
- Title: Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models
- Venue: NeurIPS 2024
- GitHub: Link
📄 VisionLLM v2
- Title: VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks
- Venue: NeurIPS 2024
- GitHub: Link
📄 VITRON
- Title: VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing
- Venue: NeurIPS 2024
- GitHub: Link
📄 Syn
- Title: Synthesize Step-by-Step: Tools, Templates and LLMs as Data Generators for Reasoning-Based Chart VQA
- Venue: CVPR 2024
📄 VTool-R1
- Title: VTool-R1: VLMs Learn to Think with Images via Reinforcement Learning on Multimodal Tool Use
- Venue: arXiv 2025
- GitHub: Link
📄 Self-Imagine
- Title: Self-Imagine: Effective Unimodal Reasoning with Multimodal Models using Self-Imagination
- Venue: arXiv 2024
- GitHub: Link
📄 Creative Agents
- Title: Creative Agents: Empowering Agents with Imagination for Creative Tasks
- Venue: UAI 2025
- GitHub: Link
📄 CoT-VLA
- Title: CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models
- Venue: CVPR 2025
- Website: Link
