[ReferIS]: Referring Image Segmentation
[GRES]: Generalized Referring Expression Segmentation
[ReasonIS]: Reasoning Image Segmentation
[ReasonInstIS]: Reasoning Instance Image Segmentation
[SiD]: Segmentation in Dialogue
[GCG]: Grounded Conversation Generation
[ReasonMIS]: Multi-image pixel-grounded Reasoning Segmentation
[MGSC]: Multi-Granularity Segmentation and Captioning
[ImgSemSeg]: Image Semantic Segmentation
[ImgInstSeg]: Image Instance Segmentation
[ImgPanSeg]: Image Panoptic Segmentation
[ImgInteractSeg]: Image Interactive Segmentation
[OVSeg]: Open-Vocabulary Segmentation
[VideoObjSeg]: Video Object Segmentation
[ReferVOS]: Referring Video Object Segmentation
[ReasonVOS]: Reasoning Video Object Segmentation
[VideoInteractSeg]: Video Interactive Segmentation
Note: only the tasks evaluated in each paper are listed here.
- Image Segmentation in Foundation Model Era: A Survey. arXiv'2024. [paper] | [project]
- Reasoning Segmentation for Images and Videos: A Survey. arXiv'2025. [paper]
- VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks. NeurIPS'2023. [paper] | [code]
- VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks. NeurIPS'2024. [paper] | [code]
- Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks. CVPR'2024. [paper] | [code]
- NExT-Chat: An LMM for Chat, Detection and Segmentation. ICML'2024. [paper] | [code] | [project]
- PaliGemma: A versatile 3B VLM for transfer. arXiv'2024. [paper] | [code]
- PaliGemma 2: A Family of Versatile VLMs for Transfer. arXiv'2024. [paper] | [code]
- REF-VLM: Triplet-Based Referring Paradigm for Unified Visual Decoding. arXiv'2025. [paper] | [code]
- Patch-as-Decodable-Token: Towards Unified Multi-Modal Vision Tasks in MLLMs. arXiv'2025. [paper] | [code]
- [ReferIS][ReasonIS] LISA: Reasoning Segmentation via Large Language Model. CVPR'2024. [paper] | [code]
- [ReferIS][ReasonIS][ReasonInstIS][SiD] LISA++: An Improved Baseline for Reasoning Segmentation with Large Language Model. arXiv'2023. [paper] | [code]
- [ReferIS][GRES] GSVA: Generalized Segmentation via Multimodal Large Language Models. CVPR'2024. [paper] | [code]
- [ReferIS][ReasonIS] PixelLM: Pixel Reasoning with Large Multimodal Model. CVPR'2024. [paper] | [code] | [project]
- [ReferIS][GCG] GLaMM: Pixel Grounding Large Multimodal Model. CVPR'2024. [paper] | [code] | [project]
- [ReferIS][GRES][ReasonIS] GROUNDHOG: Grounding Large Language Models to Holistic Segmentation. CVPR'2024. [paper] | [project]
- [ReasonIS] LLM-Seg: Bridging Image Segmentation and Large Language Model Reasoning. CVPRW'2024. [paper] | [code]
- [ReferIS][ReasonIS][GCG][ImgSemSeg][ImgInstSeg][ImgInteractSeg] OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding. NeurIPS'2024. [paper] | [code] | [project]
- [ReferIS][GRES][ImgSemSeg][ImgInstSeg][ImgInteractSeg][VideoObjSeg][OVSeg] PSALM: Pixelwise SegmentAtion with Large Multi-Modal Model. ECCV'2024. [paper] | [code]
- [ReferIS][GRES][ReasonIS] SAM4MLLM: Enhance Multi-Modal Large Language Model for Referring Expression Segmentation. ECCV'2024. [paper] | [code]
- [ReferIS][ReasonIS][ReasonVOS] VISA: Reasoning Video Object Segmentation via Large Language Models. ECCV'2024. [paper] | [code]
- [ImgSemSeg][ReferIS][GRES] LaSagnA: Language-based Segmentation Assistant for Complex Queries. arXiv'2024. [paper] | [code]
- [ReferIS] EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model. arXiv'2024. [paper] | [code]
- [ReferIS][GRES][ReasonIS][GCG][MGSC] Instruction-guided Multi-Granularity Segmentation and Captioning with Large Multimodal Model. AAAI'2025. [paper] | [code] | [project]
- [ReferIS][GRES] Text4Seg: Reimagining Image Segmentation as Text Generation. ICLR'2025. [paper] | [code] | [project]
- [ReferIS][ReasonIS][ReferVOS][ReasonVOS] InstructSeg: Unifying Instructed Visual Segmentation with Multi-modal Large Language Models. arXiv'2024. [paper] | [code]
- [ReasonMIS] PRIMA: Multi-Image Vision-Language Models for Reasoning Segmentation. arXiv'2024. [paper] | [project]
- [ReferIS][GCG][ReferVOS][VideoInteractSeg] Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos. CVPRW'2025. [paper] | [code] | [project]
- [ReferIS][GRES][ReasonIS] HiMTok: Learning Hierarchical Mask Tokens for Image Segmentation with Large Multimodal Model. ICCV'2025. [paper] | [code]
- [ReferIS] Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding. arXiv'2025. [paper] | [project]
- [ReasonIS][ReferIS] Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement. arXiv'2025. [paper] | [code]
- [ReasonIS][ReferIS] Think Before You Segment: High-Quality Reasoning Segmentation with GPT Chain of Thoughts. arXiv'2025. [paper] | [project]
- [ReasonIS][ReferIS] POPEN: Preference-Based Optimization and Ensemble for LVLM-Based Reasoning Segmentation. CVPR'2025. [paper] | [project]
- [ReasonIS][ReferIS] VisionReasoner: Unified Visual Perception and Reasoning via Reinforcement Learning. arXiv'2025. [paper] | [code]
- [ReferIS] Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model. CVPR'2024. [paper] | [project]
- [ReferIS] SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories. CVPR'2025. [paper] | [code] | [project]
- [ReferIS][ReasonIS] Your Large Vision-Language Model Only Needs A Few Attention Heads For Visual Grounding. CVPR'2025. [paper]
- [ReferIS][ReasonIS] SAM-R1: Leveraging SAM for Reward Feedback in Multimodal Segmentation via Reinforcement Learning. arXiv'2025. [paper]
- [ImgSemSeg][ImgInstSeg][ReferIS] ROSE: Revolutionizing Open-Set Dense Segmentation with Patch-Wise Perceptual Large Multimodal Model. arXiv'2025. [paper]
- [ReferIS][ReasonIS] PIXELTHINK: Towards Efficient Chain-of-Pixel Reasoning. arXiv'2025. [paper] | [code] | [project]
- [ReferIS][ImgSemSeg] LlamaSeg: Image Segmentation via Autoregressive Mask Generation. arXiv'2025. [paper]
- [ReferIS][OVSeg] ALTo: Adaptive-Length Tokenizer for Autoregressive Mask Generation. arXiv'2025. [paper] | [code]
- [ReferVOS][ReasonVOS] VIDEOMOLMO: Spatio-Temporal Grounding Meets Pointing. arXiv'2025. [paper] | [code]
- [ReasonIS][ReferIS] RSVP: Reasoning Segmentation via Visual Prompting and Multi-modal Chain-of-Thought. ACL'2025. [paper]
- [ReferVOS][ReasonVOS][ReasonIS][ReferIS] One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos. NeurIPS'2024. [paper] | [code]
- [ReferIS][GCG] GroundingFace: Fine-grained Face Understanding via Pixel Grounding Multimodal Large Language Model. CVPR'2025. [paper]
- [ReferIS][ReasonIS] Seg-R1: Segmentation Can Be Surprisingly Simple with Reinforcement Learning. arXiv'2025. [paper] | [project] | [code]
- [ReferIS][ReasonIS][GCG][GRES][ImgSemSeg][ImgInstSeg][ImgInteractSeg] X-SAM: From Segment Anything to Any Segmentation. arXiv'2025. [paper] | [code]
- [ReferIS][ReasonIS] LENS: Learning to Segment Anything with Unified Reinforced Reasoning. arXiv'2025. [paper] | [code]
- [ReferIS][ReasonIS][ReferVOS][ReasonVOS][ImgInteractSeg][VideoInteractSeg] UniPixel: Unified Object Referring and Segmentation for Pixel-Level Visual Reasoning. NeurIPS'2025. [paper] | [code]
- [ReferIS][ImgSemSeg][ImgInstSeg][ReasonIS] UFO: A Unified Approach to Fine-grained Visual Perception via Open-ended Language Interface. NeurIPS'2025. [paper] | [code]
- [ReferIS][ImgSemSeg][ImgInstSeg][OVSeg] Advancing Visual Large Language Model for Multi-granular Versatile Perception. ICCV'2025. [paper] | [code]
- [ReferVOS][ReferIS] MomentSeg: Moment-Centric Sampling for Enhanced Video Pixel Understanding. arXiv'2025. [paper] | [code]
- [ReferIS][ImgInteractSeg] ARGenSeg: Image Segmentation with Autoregressive Image Generation Model. NeurIPS'2025. [paper]
- [ReferIS][ReasonIS] Better, Stronger, Faster: Tackling the Trilemma in MLLM-based Segmentation with Simultaneous Textual Mask Prediction. arXiv'2025. [paper] | [code]
- [ReasonIS][ReferIS] MMR: A Large-scale Benchmark Dataset for Multi-target and Multi-granularity Reasoning Segmentation. ICLR'2025. [paper] | [code]
- [ReasonIS][ReferIS] Ground-V: Teaching VLMs to Ground Complex Instructions in Pixels. CVPR'2025. [paper]
- [ReferIS][GRES] GroundingSuite: Measuring Complex Multi-Granular Pixel Grounding. ICCV'2025. [paper] | [code]
- RSUniVLM: A Unified Vision Language Model for Remote Sensing via Granularity-oriented Mixture of Experts. arXiv'2024. [paper] | [code] | [project]
- GeoGround: A Unified Large Vision-Language Model for Remote Sensing Visual Grounding. arXiv'2024. [paper] | [code]
- GeoPix: Multi-Modal Large Language Model for Pixel-level Image Understanding in Remote Sensing. arXiv'2025. [paper]
- Falcon: A Remote Sensing Vision-Language Foundation Model. arXiv'2025. [paper] | [code]
- EarthMind: Towards Multi-Granular and Multi-Sensor Earth Observation with Large Multimodal Models. arXiv'2025. [paper] | [code]
- RemoteReasoner: Towards Unifying Geospatial Reasoning Workflow. AAAI'2026. [paper] | [code]
- [ReferIS][GCG] GeoPixel: Pixel Grounding Large Multimodal Model in Remote Sensing. ICML'2025. [paper] | [code] | [project]
- [ReferIS][ReasonIS] SegEarth-R1: Geospatial Pixel Reasoning via Large Language Model. arXiv'2025. [paper] | [code] | [project]
- [ReasonIS] LISAt: Language-Instructed Segmentation Assistant for Satellite Imagery. NeurIPS'2025. [paper] | [code] | [project]
- [ReferIS][GCG] GeoMag: A Vision-Language Model for Pixel-level Fine-Grained Remote Sensing Image Parsing. MM'2025. [paper]
- [ReferIS][ReasonIS] Geo-R1: Improving Few-Shot Geospatial Referring Expression Understanding with Reinforcement Fine-Tuning. arXiv'2025. [paper] | [code] | [project]
- [ReferIS][ReasonIS][ImgInteractSeg] UniGeoSeg: Towards Unified Open-World Segmentation for Geospatial Scenes. arXiv'2025. [paper] | [code]
- [ReferIS][ReasonIS] SegEarth-R2: Towards Comprehensive Language-guided Segmentation for Remote Sensing Images. arXiv'2025. [paper] | [code]
- DisasterM3: A Remote Sensing Vision-Language Dataset for Disaster Damage Assessment and Response. NeurIPS'2025. [paper]
- GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks. ICCV'2025. [paper] | [code]
If you spot an error or would like to add a paper, feel free to contribute by contacting me or by opening a pull request using the following Markdown format:
- Paper Name. Conference'Year. [[paper](link)] | [[code](link)] | ...