
Stage III: Tool-Enhanced Vision Language Models

Tool-enhanced VLMs extend tool-enhanced LLMs by replacing the language-only planner with a vision-language model, enabling direct visual interaction. Because the planner consumes raw images rather than textual descriptions of them, less visual information is lost and fewer conversion steps are needed.
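The planner-plus-tools loop described above can be sketched in a few lines. Everything here is hypothetical scaffolding for illustration (the `Image` stand-in, the stub tools, and the `plan_step` policy are not any listed system's API; in real systems such as LLaVA-Plus the planning decision is emitted by the VLM itself):

```python
# Minimal sketch of a tool-enhanced VLM planning loop.
# All names here are illustrative, not a real system's API.
from dataclasses import dataclass


@dataclass
class Image:
    """Stand-in for the raw pixel input the planner receives directly."""
    pixels: list


# --- Tool registry: vision tools the planner can invoke by name. ---
def detect_objects(image):
    # Placeholder detector; a real system would call an actual vision model.
    return ["cat", "laptop"]


def ocr(image):
    # Placeholder OCR tool.
    return "hello world"


TOOLS = {"detect": detect_objects, "ocr": ocr}


def plan_step(image, question, history):
    """Toy planner policy: gather evidence with a tool, then answer.

    In a tool-enhanced VLM this decision comes from the model itself,
    conditioned on the raw image, the question, and prior tool outputs.
    """
    if not history:
        return ("call", "detect")
    return ("answer", f"I see {', '.join(history[-1])}.")


def run_agent(image, question, max_steps=4):
    history = []
    for _ in range(max_steps):
        action, payload = plan_step(image, question, history)
        if action == "call":
            # Tools also operate on the raw image, not a text summary of it.
            history.append(TOOLS[payload](image))
        else:
            return payload
    return "no answer"
```

Calling `run_agent(Image([0]), "What is in the picture?")` runs one detect step and then answers from the tool output; the papers below differ mainly in how this policy is learned (prompting, supervised data synthesis, or reinforcement learning).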


📄 Image-of-Thought

  • Title: Image-of-Thought Prompting for Visual Reasoning Refinement in Multimodal Large Language Models
  • Venue: arXiv 2024

📄 P2G

  • Title: Plug-and-Play Grounding of Reasoning in Multimodal Large Language Models
  • Venue: arXiv 2024

📄 LLaVA-Plus

  • Title: LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
  • Venue: ICLR 2024
  • GitHub: Link

📄 VIREO

  • Title: From the Least to the Most: Building a Plug-and-Play Visual Reasoner via Data Synthesis
  • Venue: EMNLP 2024
  • GitHub: Link

VIREO Demo


📄 OPENTHINKIMG

  • Title: OPENTHINKIMG: Learning to Think with Images via Visual Tool Reinforcement Learning
  • Venue: arXiv 2025
  • GitHub: Link

OPENTHINKIMG Demo


📄 Visual Sketchpad

  • Title: Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models
  • Venue: NeurIPS 2024
  • GitHub: Link

SKETCHPAD Demo


📄 VisionLLM v2

  • Title: VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks
  • Venue: NeurIPS 2024
  • GitHub: Link

VisionLLM v2 Demo


📄 VITRON

  • Title: VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing
  • Venue: NeurIPS 2024
  • GitHub: Link

VITRON Demo


📄 Syn

  • Title: Synthesize Step-by-Step: Tools, Templates and LLMs as Data Generators for Reasoning-Based Chart VQA
  • Venue: CVPR 2024

📄 VTool-R1

  • Title: VTool-R1: VLMs Learn to Think with Images via Reinforcement Learning on Multimodal Tool Use
  • Venue: arXiv 2025
  • GitHub: Link

VTool-R1 Demo


📄 Self-Imagine

  • Title: Self-Imagine: Effective Unimodal Reasoning with Multimodal Models using Self-Imagination
  • Venue: arXiv 2024
  • GitHub: Link

📄 Creative Agents

  • Title: Creative Agents: Empowering Agents with Imagination for Creative Tasks
  • Venue: UAI 2025
  • GitHub: Link

Creative Agents Demo


📄 CoT-VLA

  • Title: CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models
  • Venue: CVPR 2025
  • Website: Link

CoT-VLA Demo