Stage II: Tool-Enhanced Large Language Models

This paradigm typically involves two key components: generating actions from the current state, and transitioning between states by executing those actions.
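
The two components above can be sketched as a small loop. This is only an illustrative sketch, not the method of any paper listed here: `State`, `toy_policy`, `TOOLS`, `step`, and `run` are hypothetical names, the policy is a hard-coded stand-in for an LLM, and the single "detect" tool returns a canned string instead of calling a vision model.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


# Hypothetical types for illustration; none of these names come from the listed papers.
@dataclass
class State:
    question: str
    observations: List[str] = field(default_factory=list)
    answer: Optional[str] = None


Action = Tuple[str, str]  # (tool name, tool argument)


def toy_policy(state: State) -> Action:
    """Stand-in for the LLM: generate the next action from the current state."""
    if not state.observations:
        return ("detect", state.question)
    # Once a tool result is available, finish with the latest observation.
    return ("finish", state.observations[-1])


# Assumed toolset: one fake tool that returns a canned string.
TOOLS: Dict[str, Callable[[str], str]] = {
    "detect": lambda query: f"found 2 objects for '{query}'",
}


def step(state: State, action: Action) -> State:
    """State transition: execute the action and return the next state."""
    name, arg = action
    if name == "finish":
        return State(state.question, state.observations, answer=arg)
    result = TOOLS[name](arg)
    return State(state.question, state.observations + [result])


def run(question: str, max_steps: int = 5) -> State:
    """Alternate action generation and state transition until an answer is set."""
    state = State(question)
    for _ in range(max_steps):
        if state.answer is not None:
            break
        state = step(state, toy_policy(state))
    return state


if __name__ == "__main__":
    print(run("how many cats?").answer)
```

In a real system the policy would be an LLM prompt or a generated program, and the tools would be vision models or APIs; the `max_steps` cap is an assumed safeguard against loops that never reach a final answer.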

📄 ViperGPT

Title: ViperGPT: Visual Inference via Python Execution for Reasoning
Venue: ICCV 2023
GitHub: Link

📄 Chameleon

Title: Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models
Venue: NeurIPS 2023
GitHub: Link

📄 VisProg

Title: Visual Programming: Compositional visual reasoning without training
Venue: CVPR 2023
GitHub: Link

📄 Visual ChatGPT

Title: Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models
Venue: arXiv 2023
GitHub: Link

📄 HuggingGPT

Title: HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face
Venue: NeurIPS 2023
GitHub: Link

📄 GPT4Tools

Title: GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction
Venue: NeurIPS 2023
GitHub: Link

📄 InternGPT

Title: InternGPT: Solving Vision-Centric Tasks by Interacting with ChatGPT Beyond Language
Venue: arXiv 2023
GitHub: Link

📄 VIoTGPT

Title: VIoTGPT: Learning to Schedule Vision Tools Towards Intelligent Video Internet of Things
Venue: AAAI 2025
GitHub: Link

📄 MM-REACT

Title: MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action
Venue: arXiv 2023
GitHub: Link

📄 VisRep

Title: Self-training large language models for improved visual program synthesis with visual reinforcement
Venue: CVPR 2024

📄 CRAFT

Title: CRAFT: Customizing LLMs by Creating and Retrieving from Specialized Toolsets
Venue: ICLR 2024
GitHub: Link

📄 CLOVA

Title: CLOVA: A Closed-LOop Visual Assistant with Tool Usage and Update
Venue: CVPR 2024
GitHub: Link

📄 HYDRA

Title: HYDRA: A Hyper Agent for Dynamic Compositional Visual Reasoning
Venue: ECCV 2024
GitHub: Link

📄 ContextualCoder

Title: ContextualCoder: Adaptive In-context Prompting for Programmatic Visual Question Answering
Venue: TMM 2025

📄 ViUniT

Title: Visual Unit Tests for More Robust Visual Programming
Venue: CVPR 2025
GitHub: Link

📄 SYNAPSE

Title: SYNAPSE: SYmbolic Neural-Aided Preference Synthesis Engine
Venue: AAAI 2025
GitHub: Link

📄 NAVER

Title: NAVER: A Neuro-Symbolic Compositional Automaton for Visual Grounding with Explicit Logic Reasoning
Venue: ICCV 2025
GitHub: Link

📄 DWIM

Title: DWIM: Towards Tool-aware Visual Reasoning via Discrepancy-aware Workflow Generation & Instruct-Masking Tuning
Venue: ICCV 2025
GitHub: Link
Website: Link

📄 LEFT

Title: What’s Left? Concept Grounding with Logic-Enhanced Foundation Models
Venue: NeurIPS 2023
GitHub: Link