This is the official PyTorch implementation of ObjEmbed. Our paper can be found here.
If you find our work helpful, please give us a star 🌟
Here is the Chinese version of this guide: 中文版指南.
Aligning objects with corresponding textual descriptions is a fundamental challenge and a realistic requirement in vision-language understanding. While recent multimodal embedding models excel at global image-text alignment, they often struggle with fine-grained alignment between image regions and specific phrases.
In this work, we present ObjEmbed, a novel MLLM embedding model that decomposes the input image into multiple regional embeddings, each corresponding to an individual object, along with global embeddings. It supports a wide range of visual understanding tasks like visual grounding, local image retrieval, and global image retrieval.
ObjEmbed enjoys three key properties:
- Object-Oriented Representation: It captures both semantic and spatial aspects of objects by generating two complementary embeddings for each region: an object embedding for semantic matching and an IoU embedding that predicts localization quality. The final object matching score combines semantic similarity with the predicted IoU, enabling more accurate retrieval.
- Versatility: It seamlessly handles both region-level and image-level tasks.
- Efficient Encoding: All objects in an image, along with the full image, are encoded in a single forward pass for high efficiency.
Superior performance on 18 diverse benchmarks demonstrates its strong semantic discrimination.
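As a concrete illustration of the object matching score described in the first property, the sketch below combines semantic similarity with the predicted IoU. The multiplicative combination and all names here are illustrative assumptions, not ObjEmbed's exact formulation:

```python
import torch
import torch.nn.functional as F

def object_matching_score(text_emb, obj_embs, pred_ious):
    """Illustrative combination of semantic similarity and predicted localization quality.

    text_emb:  (D,)   embedding of the query phrase
    obj_embs:  (N, D) object embeddings of the N proposals in an image
    pred_ious: (N,)   localization quality predicted from the IoU embeddings
    """
    sem_sim = F.cosine_similarity(text_emb.unsqueeze(0), obj_embs, dim=-1)  # (N,)
    # Assumption: down-weight proposals that match semantically but are poorly localized.
    return sem_sim * pred_ious

# Pick the best-matching proposal for a query (random tensors stand in for real model outputs).
scores = object_matching_score(torch.randn(256), torch.randn(10, 256), torch.rand(10))
best_proposal = scores.argmax().item()
```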
- [2026.2.3] Released the code and paper.
We use WeDetect-Base-Uni as the proposal network. You can download the checkpoint from Hugging Face:
pytorch==2.6.1+cu124
transformers==4.57.1
trl==0.17.0
accelerate==1.10.0
- Install the environment as follows.
pip install transformers==4.57.1 trl==0.17.0 accelerate==1.10.0 -i https://mirrors.cloud.tencent.com/pypi/simple
pip install pycocotools terminaltables jsonlines tabulate ddd-dataset torchmetrics lvis -i https://mirrors.cloud.tencent.com/pypi/simple
- Evaluating on LVIS requires `numpy<=1.24`.
# output the top1 prediction
python infer_objembed.py --objembed_checkpoint /PATH/TO/OBJEMBED --wedetect_uni_checkpoint /PATH/TO/WEDETECT_UNI --image assets/demo.jpg --query "The car's license plate in HAWAII" --task rec --visualize
python infer_objembed.py --objembed_checkpoint /PATH/TO/OBJEMBED --wedetect_uni_checkpoint /PATH/TO/WEDETECT_UNI --image image1.jpg image2.jpg image3.jpg --query "YOUR_QUERY" --task retrieval_by_image
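Conceptually, the retrieval-by-image task ranks the candidate images by the similarity between the query embedding and each image's global embedding. Below is a minimal, generic sketch of such ranking (random tensors stand in for real model outputs; this is not ObjEmbed's actual API, which `infer_objembed.py` wraps):

```python
import torch
import torch.nn.functional as F

def rank_images(query_emb, image_embs, top_k=3):
    """Rank candidate images by cosine similarity to the query embedding.

    query_emb:  (D,)   embedding of the text query
    image_embs: (M, D) global embeddings of the M candidate images
    """
    sims = F.cosine_similarity(query_emb.unsqueeze(0), image_embs, dim=-1)  # (M,)
    top = sims.topk(min(top_k, image_embs.size(0)))
    return top.indices.tolist(), top.values.tolist()

# Random placeholders for three candidate images and one query.
indices, scores = rank_images(torch.randn(256), torch.randn(3, 256))
print(indices, scores)
```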
cd eval_grounding
export PYTHONPATH=../
# coco / coco_o / lvis / FG-OVD / d3 / odinw13
torchrun --nproc-per-node=8 --nnodes=1 --node_rank=0 --master_addr="127.0.0.1" --master_port=29500 eval.py --checkpoint /PATH/TO/OBJEMBED --dataset coco --nms --task_specific_visual_prompt
# refcoco
torchrun --nproc-per-node=8 --nnodes=1 --node_rank=0 --master_addr="127.0.0.1" --master_port=29500 eval.py --checkpoint /PATH/TO/OBJEMBED --dataset refcoco --num_select 20 --task_specific_visual_prompt
- Please change the dataset paths in Lines 47-417 of `eval_grounding/eval.py`.
- For each dataset, users should first extract proposals for each image and save them as JSON files. You can use `generate_proposal.py` as example code; a hypothetical sketch of a proposal file is given below. We provide RefCOCO proposals here.
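The exact proposal schema is defined by `generate_proposal.py`; the snippet below is only a hypothetical sketch of dumping one image's detector outputs to JSON (the field names `image`, `boxes`, and `scores` are assumptions, so please follow the script for the real format):

```python
import json

# Hypothetical per-image proposal file; field names are assumptions,
# see generate_proposal.py for the schema ObjEmbed actually expects.
proposals = {
    "image": "demo.jpg",
    "boxes": [[48.0, 60.5, 320.0, 410.0], [12.0, 33.0, 96.0, 128.0]],  # xyxy boxes from the proposal network
    "scores": [0.92, 0.71],                                            # proposal confidences
}

with open("demo_proposals.json", "w") as f:
    json.dump(proposals, f)
```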
cd eval_retrieval
export PYTHONPATH=../
# sharegpt4v / dci / coco / coco_cn / d3 / flickr30k / flickr30k_cn
# sorce_1k / reircoco / ilias / ilias_i2i
torchrun --nproc-per-node=8 --nnodes=1 --node_rank=0 --master_addr="127.0.0.1" --master_port=29500 eval.py --checkpoint /PATH/TO/OBJEMBED --dataset sorce_1k
- Please change the dataset paths in Lines 19-90 of `eval_retrieval/eval.py`.
- For each dataset, users should first extract proposals for each image and save them as JSON files. You can use `generate_proposal.py` as example code. We provide RefCOCO proposals here.
- ObjEmbed builds on many outstanding open-source projects, including WeDetect, transformers, Qwen3-VL, and many others. We thank the authors of these projects for open-sourcing their assets!
If you find our work helpful for your research, please consider citing it.
@article{fu2026objembed,
title={ObjEmbed: Towards Universal Multimodal Object Embeddings},
author={Fu, Shenghao and Su, Yukun and Rao, Fengyun and LYU, Jing and Xie, Xiaohua and Zheng, Wei-Shi},
journal={arXiv preprint arXiv:2602.01753},
year={2026}
}
- Our models and code are under the Apache 2.0 License.