Awesome LLM/MLLM for Image Segmentation

Related Task

[ReferIS]: Referring Image Segmentation
[GRES]: Generalized Referring Expression Segmentation
[ReasonIS]: Reasoning Image Segmentation
[ReasonInstIS]: Reasoning Instance Image Segmentation
[SiD]: Segmentation in Dialogue
[GCG]: Grounded Conversation Generation
[ReasonMIS]: Multi-Image Pixel-Grounded Reasoning Segmentation
[MGSC]: Multi-Granularity Segmentation and Captioning
[ImgSemSeg]: Image Semantic Segmentation
[ImgInstSeg]: Image Instance Segmentation
[ImgPanSeg]: Image Panoptic Segmentation
[ImgInteractSeg]: Image Interactive Segmentation
[OVSeg]: Open-Vocabulary Segmentation
[VideoObjSeg]: Video Object Segmentation
[ReferVOS]: Referring Video Object Segmentation
[ReasonVOS]: Reasoning Video Object Segmentation
[VideoInteractSeg]: Video Interactive Segmentation

Note: Only the tasks evaluated in each paper are listed here.

Survey

  • Image Segmentation in Foundation Model Era: A Survey. arXiv'2024. [paper] | [project]
  • Reasoning Segmentation for Images and Videos: A Survey. arXiv'2025. [paper]

General

MLLM with Segmentation Capability

  • VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks. NeurIPS'2023. [paper] | [code]
  • VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks. NeurIPS'2024. [paper] | [code]
  • Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks. CVPR'2024. [paper] | [code]
  • NExT-Chat: An LMM for Chat, Detection and Segmentation. ICML'2024. [paper] | [code] | [project]
  • PaliGemma: A versatile 3B VLM for transfer. arXiv'2024. [paper] | [code]
  • PaliGemma 2: A Family of Versatile VLMs for Transfer. arXiv'2024. [paper] | [code]
  • REF-VLM: Triplet-Based Referring Paradigm for Unified Visual Decoding. arXiv'2025. [paper] | [code]
  • Patch-as-Decodable-Token: Towards Unified Multi-Modal Vision Tasks in MLLMs. arXiv'2025. [paper] | [code]

Segmentation Model with LLM/MLLM

  • [ReferIS][ReasonIS] LISA: Reasoning Segmentation via Large Language Model. CVPR'2024. [paper] | [code]
  • [ReferIS][ReasonIS][ReasonInstIS][SiD] LISA++: An Improved Baseline for Reasoning Segmentation with Large Language Model. arXiv'2023. [paper] | [code]
  • [ReferIS][GRES] GSVA: Generalized Segmentation via Multimodal Large Language Models. CVPR'2024. [paper] | [code]
  • [ReferIS][ReasonIS] PixelLM: Pixel Reasoning with Large Multimodal Model. CVPR'2024. [paper] | [code] | [project]
  • [ReferIS][GCG] GLaMM: Pixel Grounding Large Multimodal Model. CVPR'2024. [paper] | [code] | [project]
  • [ReferIS][GRES][ReasonIS] GROUNDHOG: Grounding Large Language Models to Holistic Segmentation. CVPR'2024. [paper] | [project]
  • [ReasonIS] LLM-Seg: Bridging Image Segmentation and Large Language Model Reasoning. CVPRW'2024. [paper] | [code]
  • [ReferIS][ReasonIS][GCG][ImgSemSeg][ImgInstSeg][ImgInteractSeg] OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding. NeurIPS'2024. [paper] | [code] | [project]
  • [ReferIS][GRES][ImgSemSeg][ImgInstSeg][ImgInteractSeg][VideoObjSeg][OVSeg] PSALM: Pixelwise SegmentAtion with Large Multi-Modal Model. ECCV'2024. [paper] | [code]
  • [ReferIS][GRES][ReasonIS] SAM4MLLM: Enhance Multi-Modal Large Language Model for Referring Expression Segmentation. ECCV'2024. [paper] | [code]
  • [ReferIS][ReasonIS][ReasonVOS] VISA: Reasoning Video Object Segmentation via Large Language Models. ECCV'2024. [paper] | [code]
  • [ImgSemSeg][ReferIS][GRES] LaSagnA: Language-based Segmentation Assistant for Complex Queries. arXiv'2024. [paper] | [code]
  • [ReferIS] EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model. arXiv'2024. [paper] | [code]
  • [ReferIS][GRES][ReasonIS][GCG][MGSC] Instruction-guided Multi-Granularity Segmentation and Captioning with Large Multimodal Model. AAAI'2025. [paper] | [code] | [project]
  • [ReferIS][GRES] Text4Seg: Reimagining Image Segmentation as Text Generation. ICLR'2025. [paper] | [code] | [project]
  • [ReferIS][ReasonIS][ReferVOS][ReasonVOS] InstructSeg: Unifying Instructed Visual Segmentation with Multi-modal Large Language Models. arXiv'2024. [paper] | [code]
  • [ReasonMIS] PRIMA: Multi-Image Vision-Language Models for Reasoning Segmentation. arXiv'2024. [paper] | [project]
  • [ReferIS][GCG][ReferVOS][VideoInteractSeg] Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos. CVPRW'2025. [paper] | [code] | [project]
  • [ReferIS][GRES][ReasonIS] HiMTok: Learning Hierarchical Mask Tokens for Image Segmentation with Large Multimodal Model. ICCV'2025. [paper] | [code]
  • [ReferIS] Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding. arXiv'2025. [paper] | [project]
  • [ReasonIS][ReferIS] Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement. arXiv'2025. [paper] | [code]
  • [ReasonIS][ReferIS] Think Before You Segment: High-Quality Reasoning Segmentation with GPT Chain of Thoughts. arXiv'2025. [paper] | [project]
  • [ReasonIS][ReferIS] POPEN: Preference-Based Optimization and Ensemble for LVLM-Based Reasoning Segmentation. CVPR'2025. [paper] | [project]
  • [ReasonIS][ReferIS] VisionReasoner: Unified Visual Perception and Reasoning via Reinforcement Learning. arXiv'2025. [paper] | [code]
  • [ReferIS] Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model. CVPR'2024. [paper] | [project]
  • [ReferIS] SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories. CVPR'2025. [paper] | [code] | [project]
  • [ReferIS][ReasonIS] Your Large Vision-Language Model Only Needs A Few Attention Heads For Visual Grounding. CVPR'2025. [paper]
  • [ReferIS][ReasonIS] SAM-R1: Leveraging SAM for Reward Feedback in Multimodal Segmentation via Reinforcement Learning. arXiv'2025. [paper]
  • [ImgSemSeg][ImgInstSeg][ReferIS] ROSE: Revolutionizing Open-Set Dense Segmentation with Patch-Wise Perceptual Large Multimodal Model. arXiv'2025. [paper]
  • [ReferIS][ReasonIS] PIXELTHINK: Towards Efficient Chain-of-Pixel Reasoning. arXiv'2025. [paper] | [code] | [project]
  • [ReferIS][ImgSemSeg] LlamaSeg: Image Segmentation via Autoregressive Mask Generation. arXiv'2025. [paper]
  • [ReferIS][OVSeg] ALTo: Adaptive-Length Tokenizer for Autoregressive Mask Generation. arXiv'2025. [paper] | [code]
  • [ReferVOS][ReasonVOS] VIDEOMOLMO: Spatio-Temporal Grounding Meets Pointing. arXiv'2025. [paper] | [code]
  • [ReasonIS][ReferIS] RSVP: Reasoning Segmentation via Visual Prompting and Multi-modal Chain-of-Thought. ACL'2025. [paper]
  • [ReferVOS][ReasonVOS][ReasonIS][ReferIS] One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos. NeurIPS'2024. [paper] | [code]
  • [ReferIS][GCG] GroundingFace: Fine-grained Face Understanding via Pixel Grounding Multimodal Large Language Model. CVPR'2025. [paper]
  • [ReferIS][ReasonIS] Seg-R1: Segmentation Can Be Surprisingly Simple with Reinforcement Learning. arXiv'2025. [paper] | [code] | [project]
  • [ReferIS][ReasonIS][GCG][GRES][ImgSemSeg][ImgInstSeg][ImgInteractSeg] X-SAM: From Segment Anything to Any Segmentation. arXiv'2025. [paper] | [code]
  • [ReferIS][ReasonIS] LENS: Learning to Segment Anything with Unified Reinforced Reasoning. arXiv'2025. [paper] | [code]
  • [ReferIS][ReasonIS][ReferVOS][ReasonVOS][ImgInteractSeg][VideoInteractSeg] UniPixel: Unified Object Referring and Segmentation for Pixel-Level Visual Reasoning. NeurIPS'2025. [paper] | [code]
  • [ReferIS][ImgSemSeg][ImgInstSeg][ReasonIS] UFO: A Unified Approach to Fine-grained Visual Perception via Open-ended Language Interface. NeurIPS'2025. [paper] | [code]
  • [ReferIS][ImgSemSeg][ImgInstSeg][OVSeg] Advancing Visual Large Language Model for Multi-granular Versatile Perception. ICCV'2025. [paper] | [code]
  • [ReferVOS][ReferIS] MomentSeg: Moment-Centric Sampling for Enhanced Video Pixel Understanding. arXiv'2025. [paper] | [code]
  • [ReferIS][ImgInteractSeg] ARGenSeg: Image Segmentation with Autoregressive Image Generation Model. NeurIPS'2025. [paper]
  • [ReferIS][ReasonIS] Better, Stronger, Faster: Tackling the Trilemma in MLLM-based Segmentation with Simultaneous Textual Mask Prediction. arXiv'2025. [paper] | [code]

Benchmark

  • [ReasonIS][ReferIS] MMR: A Large-scale Benchmark Dataset for Multi-target and Multi-granularity Reasoning Segmentation. ICLR'2025. [paper] | [code]
  • [ReasonIS][ReferIS] Ground-V: Teaching VLMs to Ground Complex Instructions in Pixels. CVPR'2025. [paper]
  • [ReferIS][GRES] GroundingSuite: Measuring Complex Multi-Granular Pixel Grounding. ICCV'2025. [paper] | [code]

Remote sensing

MLLM with Segmentation Capability

  • RSUniVLM: A Unified Vision Language Model for Remote Sensing via Granularity-oriented Mixture of Experts. arXiv'2024. [paper] | [code] | [project]
  • GeoGround: A Unified Large Vision-Language Model for Remote Sensing Visual Grounding. arXiv'2024. [paper] | [code]
  • GeoPix: Multi-Modal Large Language Model for Pixel-level Image Understanding in Remote Sensing. arXiv'2025. [paper]
  • Falcon: A Remote Sensing Vision-Language Foundation Model. arXiv'2025. [paper] | [code]
  • EarthMind: Towards Multi-Granular and Multi-Sensor Earth Observation with Large Multimodal Models. arXiv'2025. [paper] | [code]
  • RemoteReasoner: Towards Unifying Geospatial Reasoning Workflow. AAAI'2026. [paper] | [code]

Segmentation Model with LLM/MLLM

  • [ReferIS][GCG] GeoPixel: Pixel Grounding Large Multimodal Model in Remote Sensing. ICML'2025. [paper] | [code] | [project]
  • [ReferIS][ReasonIS] SegEarth-R1: Geospatial Pixel Reasoning via Large Language Model. arXiv'2025. [paper] | [code] | [project]
  • [ReasonIS] LISAt: Language-Instructed Segmentation Assistant for Satellite Imagery. NeurIPS'2025. [paper] | [code] | [project]
  • [ReferIS][GCG] GeoMag: A Vision-Language Model for Pixel-level Fine-Grained Remote Sensing Image Parsing. MM'2025. [paper]
  • [ReferIS][ReasonIS] Geo-R1: Improving Few-Shot Geospatial Referring Expression Understanding with Reinforcement Fine-Tuning. arXiv'2025. [paper] | [code] | [project]
  • [ReferIS][ReasonIS][ImgInteractSeg] UniGeoSeg: Towards Unified Open-World Segmentation for Geospatial Scenes. arXiv'2025. [paper] | [code]
  • [ReferIS][ReasonIS] SegEarth-R2: Towards Comprehensive Language-guided Segmentation for Remote Sensing Images. arXiv'2025. [paper] | [code]

Benchmark

  • DisasterM3: A Remote Sensing Vision-Language Dataset for Disaster Damage Assessment and Response. NeurIPS'2025. [paper]
  • GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks. ICCV'2025. [paper] | [code]

Contributing

If you find any errors or would like to add a paper, feel free to contribute to this list by contacting me or by opening a pull request using the following Markdown format:

- Paper Name. Conference'Year. [[paper](link)] | [[code](link)] | ...
