[ReferIS]: Referring Image Segmentation
[GRES]: Generalized Referring Expression Segmentation
[ReasonIS]: Reasoning Image Segmentation
[ReasonInstIS]: Reasoning Instance Image Segmentation
[SiD]: Segmentation in Dialogue
[GCG]: Grounded Conversation Generation
[ReasonMIS]: Multi-image pixel-grounded Reasoning Segmentation
[MGSC]: Multi-Granularity Segmentation and Captioning
[ImgSemSeg]: Image Semantic Segmentation
[ImgInstSeg]: Image Instance Segmentation
[ImgPanSeg]: Image Panoptic Segmentation
[ImgInteractSeg]: Image Interactive Segmentation
[OVSeg]: Open-Vocabulary Segmentation
[VideoObjSeg]: Video Object Segmentation
[ReferVOS]: Referring Video Object Segmentation
[ReasonVOS]: Reasoning Video Object Segmentation
[VideoInteractSeg]: Video Interactive Segmentation
Note: only the tasks evaluated in each paper are listed here.
- Image Segmentation in Foundation Model Era: A Survey. arXiv'2024. [paper] | [project]
- Reasoning Segmentation for Images and Videos: A Survey. arXiv'2025. [paper]
- VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks. NeurIPS'2023. [paper] | [code]
- VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks. NeurIPS'2024. [paper] | [code]
- Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks. CVPR'2024. [paper] | [code]
- NExT-Chat: An LMM for Chat, Detection and Segmentation. ICML'2024. [paper] | [code] | [project]
- PaliGemma: A versatile 3B VLM for transfer. arXiv'2024. [paper] | [code]
- PaliGemma 2: A Family of Versatile VLMs for Transfer. arXiv'2024. [paper] | [code]
- REF-VLM: Triplet-Based Referring Paradigm for Unified Visual Decoding. arXiv'2025. [paper] | [code]
- Patch-as-Decodable-Token: Towards Unified Multi-Modal Vision Tasks in MLLMs. arXiv'2025. [paper] | [code]
- [ReferIS][ReasonIS] LISA: Reasoning Segmentation via Large Language Model. CVPR'2024. [paper] | [code]
- [ReferIS][ReasonIS][ReasonInstIS][SiD] LISA++: An Improved Baseline for Reasoning Segmentation with Large Language Model. arXiv'2023. [paper] | [code]
- [ReferIS][GRES] GSVA: Generalized Segmentation via Multimodal Large Language Models. CVPR'2024. [paper] | [code]
- [ReferIS][ReasonIS] PixelLM: Pixel Reasoning with Large Multimodal Model. CVPR'2024. [paper] | [code] | [project]
- [ReferIS][GCG] GLaMM: Pixel Grounding Large Multimodal Model. CVPR'2024. [paper] | [code] | [project]
- [ReferIS][GRES][ReasonIS] GROUNDHOG: Grounding Large Language Models to Holistic Segmentation. CVPR'2024. [paper] | [project]
- [ReasonIS] LLM-Seg: Bridging Image Segmentation and Large Language Model Reasoning. CVPRW'2024. [paper] | [code]
- [ReferIS][ReasonIS][GCG][ImgSemSeg][ImgInstSeg][ImgInteractSeg] OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding. NeurIPS'2024. [paper] | [code] | [project]
- [ReferIS][GRES][ImgSemSeg][ImgInstSeg][ImgInteractSeg][VideoObjSeg][OVSeg] PSALM: Pixelwise SegmentAtion with Large Multi-Modal Model. ECCV'2024. [paper] | [code]
- [ReferIS][GRES][ReasonIS] SAM4MLLM: Enhance Multi-Modal Large Language Model for Referring Expression Segmentation. ECCV'2024. [paper] | [code]
- [ReferIS][ReasonIS][ReasonVOS] VISA: Reasoning Video Object Segmentation via Large Language Models. ECCV'2024. [paper] | [code]
- [ImgSemSeg][ReferIS][GRES] LaSagnA: Language-based Segmentation Assistant for Complex Queries. arXiv'2024. [paper] | [code]
- [ReferIS] EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model. arXiv'2024. [paper] | [code]
- [ReferIS][GRES][ReasonIS][GCG][MGSC] Instruction-guided Multi-Granularity Segmentation and Captioning with Large Multimodal Model. AAAI'2025. [paper] | [code] | [project]
- [ReferIS][GRES] Text4Seg: Reimagining Image Segmentation as Text Generation. ICLR'2025. [paper] | [code] | [project]
- [ReferIS][ReasonIS][ReferVOS][ReasonVOS] InstructSeg: Unifying Instructed Visual Segmentation with Multi-modal Large Language Models. arXiv'2024. [paper] | [code]
- [ReasonMIS] PRIMA: Multi-Image Vision-Language Models for Reasoning Segmentation. arXiv'2024. [paper] | [project]
- [ReferIS][GCG][ReferVOS][VideoInteractSeg] Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos. CVPRW'2025. [paper] | [code] | [project]
- [ReferIS][GRES][ReasonIS] HiMTok: Learning Hierarchical Mask Tokens for Image Segmentation with Large Multimodal Model. ICCV'2025. [paper] | [code]
- [ReferIS] Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding. arXiv'2025. [paper] | [project]
- [ReasonIS][ReferIS] Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement. arXiv'2025. [paper] | [code]
- [ReasonIS][ReferIS] Think Before You Segment: High-Quality Reasoning Segmentation with GPT Chain of Thoughts. arXiv'2025. [paper] | [project]
- [ReasonIS][ReferIS] POPEN: Preference-Based Optimization and Ensemble for LVLM-Based Reasoning Segmentation. CVPR'2025. [paper] | [project]
- [ReasonIS][ReferIS] VisionReasoner: Unified Visual Perception and Reasoning via Reinforcement Learning. arXiv'2025. [paper] | [code]
- [ReferIS] Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model. CVPR'2024. [paper] | [project]
- [ReferIS] SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories. CVPR'2025. [paper] | [code] | [project]
- [ReferIS][ReasonIS] Your Large Vision-Language Model Only Needs A Few Attention Heads For Visual Grounding. CVPR'2025. [paper]
- [ReferIS][ReasonIS] SAM-R1: Leveraging SAM for Reward Feedback in Multimodal Segmentation via Reinforcement Learning. arXiv'2025. [paper]
- [ImgSemSeg][ImgInstSeg][ReferIS] ROSE: Revolutionizing Open-Set Dense Segmentation with Patch-Wise Perceptual Large Multimodal Model. arXiv'2025. [paper]
- [ReferIS][ReasonIS] PIXELTHINK: Towards Efficient Chain-of-Pixel Reasoning. arXiv'2025. [paper] | [code] | [project]
- [ReferIS][ImgSemSeg] LlamaSeg: Image Segmentation via Autoregressive Mask Generation. arXiv'2025. [paper]
- [ReferIS][OVSeg] ALTo: Adaptive-Length Tokenizer for Autoregressive Mask Generation. arXiv'2025. [paper] | [code]
- [ReferVOS][ReasonVOS] VIDEOMOLMO: Spatio-Temporal Grounding Meets Pointing. arXiv'2025. [paper] | [code]
- [ReasonIS][ReferIS] RSVP: Reasoning Segmentation via Visual Prompting and Multi-modal Chain-of-Thought. ACL'2025. [paper]
- [ReferVOS][ReasonVOS][ReasonIS][ReferIS] One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos. NeurIPS'2024. [paper] | [code]
- [ReferIS][GCG] GroundingFace: Fine-grained Face Understanding via Pixel Grounding Multimodal Large Language Model. CVPR'2025. [paper]
- [ReferIS][ReasonIS] Seg-R1: Segmentation Can Be Surprisingly Simple with Reinforcement Learning. arXiv'2025. [paper] | [project] | [code]
- [ReferIS][ReasonIS][GCG][GRES][ImgSemSeg][ImgInstSeg][ImgInteractSeg] X-SAM: From Segment Anything to Any Segmentation. arXiv'2025. [paper] | [code]
- [ReferIS][ReasonIS] LENS: Learning to Segment Anything with Unified Reinforced Reasoning. arXiv'2025. [paper] | [code]
- [ReferIS][ReasonIS][ReferVOS][ReasonVOS][ImgInteractSeg][VideoInteractSeg] UniPixel: Unified Object Referring and Segmentation for Pixel-Level Visual Reasoning. NeurIPS'2025. [paper] | [code]
- [ReferIS][ImgSemSeg][ImgInstSeg][ReasonIS] UFO: A Unified Approach to Fine-grained Visual Perception via Open-ended Language Interface. NeurIPS'2025. [paper] | [code]
- [ReferIS][ImgSemSeg][ImgInstSeg][OVSeg] Advancing Visual Large Language Model for Multi-granular Versatile Perception. ICCV'2025. [paper] | [code]
- [ReferVOS][ReferIS] MomentSeg: Moment-Centric Sampling for Enhanced Video Pixel Understanding. arXiv'2025. [paper] | [code]
- [ReferIS][ImgInteractSeg] ARGenSeg: Image Segmentation with Autoregressive Image Generation Model. NeurIPS'2025. [paper]
- [ReferIS][ReasonIS] Better, Stronger, Faster: Tackling the Trilemma in MLLM-based Segmentation with Simultaneous Textual Mask Prediction. arXiv'2025. [paper] | [code]
- [ReasonIS][ReferIS] MMR: A Large-scale Benchmark Dataset for Multi-target and Multi-granularity Reasoning Segmentation. ICLR'2025. [paper] | [code]
- [ReasonIS][ReferIS] Ground-V: Teaching VLMs to Ground Complex Instructions in Pixels. CVPR'2025. [paper]
- [ReferIS][GRES] GroundingSuite: Measuring Complex Multi-Granular Pixel Grounding. ICCV'2025. [paper] | [code]
- RSUniVLM: A Unified Vision Language Model for Remote Sensing via Granularity-oriented Mixture of Experts. arXiv'2024. [paper] | [code] | [project]
- GeoGround: A Unified Large Vision-Language Model for Remote Sensing Visual Grounding. arXiv'2024. [paper] | [code]
- GeoPix: Multi-Modal Large Language Model for Pixel-level Image Understanding in Remote Sensing. arXiv'2025. [paper]
- Falcon: A Remote Sensing Vision-Language Foundation Model. arXiv'2025. [paper] | [code]
- EarthMind: Towards Multi-Granular and Multi-Sensor Earth Observation with Large Multimodal Models. arXiv'2025. [paper] | [code]
- RemoteReasoner: Towards Unifying Geospatial Reasoning Workflow. AAAI'2026. [paper] | [code]
- [ReferIS][GCG] GeoPixel: Pixel Grounding Large Multimodal Model in Remote Sensing. ICML'2025. [paper] | [code] | [project]
- [ReferIS][ReasonIS] SegEarth-R1: Geospatial Pixel Reasoning via Large Language Model. arXiv'2025. [paper] | [code] | [project]
- [ReasonIS] LISAt: Language-Instructed Segmentation Assistant for Satellite Imagery. NeurIPS'2025. [paper] | [code] | [project]
- [ReferIS][GCG] GeoMag: A Vision-Language Model for Pixel-level Fine-Grained Remote Sensing Image Parsing. MM'2025. [paper]
- [ReferIS][ReasonIS] Geo-R1: Improving Few-Shot Geospatial Referring Expression Understanding with Reinforcement Fine-Tuning. arXiv'2025. [paper] | [code] | [project]
- [ReferIS][ReasonIS][ImgInteractSeg] UniGeoSeg: Towards Unified Open-World Segmentation for Geospatial Scenes. arXiv'2025. [paper] | [code]
- [ReferIS][ReasonIS] SegEarth-R2: Towards Comprehensive Language-guided Segmentation for Remote Sensing Images. arXiv'2025. [paper] | [code]
- DisasterM3: A Remote Sensing Vision-Language Dataset for Disaster Damage Assessment and Response. NeurIPS'2025. [paper]
- GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks. ICCV'2025. [paper] | [code]
If you spot an error or would like to add a paper, feel free to contribute by contacting me or by opening a pull request using the following Markdown format:
- Paper Name. Conference'Year. [[paper](link)] | [[code](link)] | ...