
WeDetect: Fast Open-Vocabulary Object Detection as Retrieval

This is the official PyTorch implementation of WeDetect. Our paper can be found here.

If you find our work helpful, please give us a star 🌟

Here is the Chinese version of this guide (中文版指南).

🔥 Updates

  • [2026.02.21] Our paper was accepted by CVPR 2026.
  • [2026.02.06] We released the WeDetect finetuning code.
  • [2026.02.03] We released ObjEmbed, the first MLLM-based object embedding model, built on WeDetect.
  • [2025.12.16] We released the inference code and paper.

👀 WeDetect Family Overview

Open-vocabulary object detection aims to detect arbitrary classes via text prompts. Methods without cross-modal fusion layers (non-fusion) offer faster inference by treating recognition as a retrieval problem, i.e., matching regions to text queries in a shared embedding space (a toy sketch of this matching step follows the list below). In this work, we fully explore this retrieval philosophy and demonstrate its unique advantages in efficiency and versatility through a model family named WeDetect:

  • State-of-the-art performance. WeDetect is a real-time detector with a dual-tower architecture. We show that, with well-curated data and full training, the non-fusion WeDetect surpasses fusion-based models and establishes a strong open-vocabulary foundation.
  • Fast backtracking of historical data. WeDetect-Uni is a universal proposal generator built on WeDetect. We freeze the entire detector and finetune only an objectness prompt to retrieve generic object proposals across categories. Importantly, the proposal embeddings are class-specific, enabling a new application, object retrieval, which supports retrieving objects from historical data.
  • Integration with LMMs for referring expression comprehension (REC). We further propose WeDetect-Ref, an LMM-based object classifier that handles complex referring expressions by retrieving the target object from the proposal list extracted by WeDetect-Uni. It discards next-token prediction and classifies objects in a single forward pass.

Together, the WeDetect family unifies detection, proposal generation, object retrieval, and REC under a coherent retrieval framework, achieving state-of-the-art performance across 15 benchmarks with high inference efficiency.
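
In spirit, the non-fusion recognition step reduces to a similarity lookup between two sets of embeddings. Below is a minimal sketch of that retrieval formulation, not the actual WeDetect code; the function name, the sigmoid scoring, and the logit scale are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def classify_regions(region_embeds: torch.Tensor,
                     text_embeds: torch.Tensor,
                     logit_scale: float = 100.0) -> torch.Tensor:
    """Score each region against each class prompt by cosine similarity.

    region_embeds: (num_regions, dim) visual embeddings from the detector.
    text_embeds:   (num_classes, dim) embeddings of the class-name prompts.
    Returns:       (num_regions, num_classes) per-class scores in [0, 1].
    """
    region_embeds = F.normalize(region_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    # Recognition as retrieval: a plain matrix product in the shared space,
    # with no cross-modal fusion layers between the two towers.
    return (logit_scale * region_embeds @ text_embeds.T).sigmoid()

# Toy usage with random tensors standing in for the two towers' outputs.
scores = classify_regions(torch.randn(100, 512), torch.randn(3, 512))
print(scores.shape)  # torch.Size([100, 3])
```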

📈 Experimental Results

📁 Model Zoo

📁 Results

🔧 Install

Our environment

pytorch==2.5.1+cu124
transformers==4.57.1
trl==0.17.0
accelerate==1.10.0
mmcv==2.1.0
mmdet==3.3.0
mmengine==0.10.7
  • MMCV series packages are not required for WeDetect-Ref users.
  • Install the environment as follows.
pip install transformers==4.57.1 trl==0.17.0 accelerate==1.10.0 -i https://mirrors.cloud.tencent.com/pypi/simple
pip install pycocotools terminaltables jsonlines tabulate lvis supervision==0.19.0 webdataset ddd-dataset albumentations -i https://mirrors.cloud.tencent.com/pypi/simple

# WeDetect-Ref users do not need to install the following packages
pip install openmim -i https://mirrors.cloud.tencent.com/pypi/simple
mim install mmcv==2.1.0
mim install mmdet==3.3.0
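
After installation, a quick import check can confirm that the pinned versions are in place. This is a convenience suggestion, not a script shipped with the repo; WeDetect-Ref users can drop the mmcv/mmdet/mmengine lines.

```python
# Sanity check that the pinned packages import cleanly.
import torch, transformers, mmcv, mmdet, mmengine

print("torch:", torch.__version__)                # expect 2.5.1+cu124
print("transformers:", transformers.__version__)  # expect 4.57.1
print("mmcv:", mmcv.__version__)                  # expect 2.1.0
print("mmdet:", mmdet.__version__)                # expect 3.3.0
print("mmengine:", mmengine.__version__)          # expect 0.10.7
```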

⭐ Demo

📁 WeDetect

python3 infer_wedetect.py --config config/wedetect_large.py --checkpoint checkpoints/wedetect_large.pth --image assets/demo.jpeg --text '鞋,床' --threshold 0.3
  • Note: WeDetect is a Chinese-language model, so class names must be provided in Chinese. Multiple categories can be detected simultaneously by separating class names with an ASCII (English) comma; apart from the Chinese class names themselves, every character in the command, including the quotation marks, should be ASCII. A small helper snippet follows below.
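
For clarity, here is how the --text argument can be assembled in Python; the class names are illustrative (鞋 = shoe, 床 = bed, 桌子 = table):

```python
# Build the --text argument for infer_wedetect.py from Chinese class names.
# WeDetect expects Chinese names joined by a plain ASCII comma.
class_names = ["鞋", "床", "桌子"]  # shoe, bed, table (illustrative)
prompt = ",".join(class_names)
print(prompt)  # 鞋,床,桌子
```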

πŸ“ WeDetect-Uni

# output the prediction higher than the threshold
python generate_proposal.py --wedetect_uni_checkpoint /PATH/TO/WEDETECT_UNI --image assets/demo.jpeg --visualize --score_thre 0.2

πŸ“ WeDetect-Ref

# output the top1 prediction
python infer_wedetect_ref.py --wedetect_ref_checkpoint /PATH/TO/WEDETECT_REF --wedetect_uni_checkpoint /PATH/TO/WEDETECT_UNI --image assets/demo.jpeg --query "a photo of trees and a river" --visualize

# output the prediction higher than the threshold
python infer_wedetect_ref.py --wedetect_ref_checkpoint /PATH/TO/WEDETECT_REF --wedetect_uni_checkpoint /PATH/TO/WEDETECT_UNI --image assets/demo.jpeg --query "a photo of trees and a river" --visualize --score_thre 0.3
  • WeDetect-Ref is a multilingual model. You can use either Chinese or English queries for testing, but only one query can be provided at a time.

📁 Evaluation

📁 WeDetect

# Evaluating WeDetect-Base on COCO
bash dist_test.sh config/wedetect_base.py /PATH/TO/WEDETECT 8
  • Please change the dataset path in the config.

📁 WeDetect-Uni

# Evaluating recall on COCO
cd eval_recall
torchrun --nproc-per-node=8 --nnodes=1 --node_rank=0 --master_addr="127.0.0.1" --master_port=29500 eval_recall.py --wedetect_uni_checkpoint wedetect_base_uni.pth --dataset coco
  • Please change the dataset path in Line 10 of eval_recall/eval_recall.py.
  • Dataset can be coco, lvis, or paco.
# Evaluating the object retrieval task on COCO

cd eval_retrieval

# extract embedding
torchrun --nproc-per-node=8 --nnodes=1 --node_rank=0 --master_addr="127.0.0.1" --master_port=29500 extract_embedding.py --model wedetect --wedetect_checkpoint wedetect_base.pth --wedetect_uni_checkpoint wedetect_base_uni.pth --dataset coco

# retrieval
python3 retrieval_metric.py --model wedetect --dataset coco --thre 0.2
  • Please change the dataset path in Line 1323 of eval_retrieval/extract_embedding.py and in Lines 61 and 82 of eval_retrieval/retrieval_metric.py.
  • Dataset can be coco or lvis.
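
Conceptually, the retrieval metric reduces to scoring the stored proposal embeddings against a query-class embedding and keeping matches above the threshold. The sketch below illustrates that idea under assumed file names and array shapes; extract_embedding.py defines the actual output format.

```python
import numpy as np

# Hypothetical file names; check extract_embedding.py for the real outputs.
proposals = np.load("proposal_embeddings.npy")  # (num_proposals, dim)
query = np.load("query_class_embedding.npy")    # (dim,)

# Cosine similarity between every stored proposal and the query class.
proposals = proposals / np.linalg.norm(proposals, axis=1, keepdims=True)
query = query / np.linalg.norm(query)
scores = proposals @ query

# Keep proposals above the threshold, mirroring --thre 0.2 above.
hits = np.nonzero(scores > 0.2)[0]
print(f"{len(hits)} proposals matched the query class")
```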

📁 WeDetect-Ref

  • Please refer to the folder wedetect_ref.

Finetune WeDetect on a Custom Dataset

  • Please organize your dataset in the COCO format and provide a Chinese class-name file similar to data/texts/coco_zh_class_texts.json (see the sketch after this list).
  • Below, we use COCO2017 as an example. We finetune WeDetect-Base with 8 GPUs (24 GB per GPU or less is fine), four images per device, for 12 epochs.
  • Mask Refine means refining each bounding box using its instance mask.
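
A minimal sketch of the expected inputs. The annotation skeleton below is the standard COCO detection format; the class-name file layout is an assumption based on the file name, so verify the exact schema against data/texts/coco_zh_class_texts.json.

```python
import json

# Standard COCO-format detection annotations (abbreviated skeleton).
coco_annotations = {
    "images": [{"id": 1, "file_name": "000001.jpg", "height": 480, "width": 640}],
    "annotations": [
        # bbox is [x, y, width, height]; iscrowd is 0 for ordinary instances
        {"id": 1, "image_id": 1, "category_id": 1,
         "bbox": [100, 120, 80, 60], "area": 4800, "iscrowd": 0},
    ],
    "categories": [{"id": 1, "name": "鞋"}, {"id": 2, "name": "床"}],
}
with open("my_coco_train.json", "w") as f:
    json.dump(coco_annotations, f, ensure_ascii=False)

# Hypothetical class-name file with Chinese names, one entry per category;
# check data/texts/coco_zh_class_texts.json for the real schema before use.
class_texts = [["鞋"], ["床"]]
with open("my_zh_class_texts.json", "w") as f:
    json.dump(class_texts, f, ensure_ascii=False)
```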

📁 Open-vocabulary finetuning

# original box annotations
bash dist_train.sh config/wedetect_base_coco_full_tuning_8xbs4_2e-5.py 8

# mask refine
bash dist_train.sh config/wedetect_base_coco_full_tuning_8xbs4_2e-5_mask_refine.py 8
  • In open-vocabulary finetuning, the text encoder is retained and will be updated during training.

📁 Closed-set finetuning

# Step 1: extract class embeddings
python3 generate_class_embedding.py --wedetect_checkpoint wedetect_base.pth --classname_file data/texts/coco_zh_class_texts.json

# Step 2: train the WeDetect vision encoder
bash dist_train.sh config/wedetect_base_coco_vision_encoder_8xbs4_2e-5.py 8
  • In closed-set finetuning, we discard the text encoder.
  • Users should first extract class-name embeddings to initialize the classifier. Running the Step 1 command saves an .npy file; please point the config to this file (a quick way to inspect it is sketched below).
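
A quick inspection of the saved embeddings before training. The file name and the (num_classes, embed_dim) layout are assumptions; use the output path that generate_class_embedding.py actually reports.

```python
import numpy as np

# File name is an assumption; substitute the path the script saved to.
embeds = np.load("coco_zh_class_embeddings.npy")
print(embeds.shape)  # expect (num_classes, embed_dim), e.g. 80 classes for COCO
```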
Finetuning results of WeDetect-Base on COCO:

Model                                        AP    AP50  AP75  APs   APm   APl
WeDetect-Base (zero-shot)                    52.1  69.4  57.0  34.8  57.1  69.2
WeDetect-Base (OV-finetuning)                55.7  73.3  60.8  38.0  61.1  72.8
WeDetect-Base (OV-finetuning, mask refine)   55.8  73.4  61.0  38.5  61.0  72.8
WeDetect-Base (CS-finetuning)                56.2  73.9  61.6  39.1  61.7  73.7

🙏 Acknowledgement

  • WeDetect builds on many outstanding open-source projects, including mmdetection, YOLO-World, transformers, Qwen3-VL, and many others. We thank the authors of these projects for open-sourcing their assets!

✒️ Citation

If you find our work helpful for your research, please consider citing our work.

@article{fu2025wedetect,
  title={WeDetect: Fast Open-Vocabulary Object Detection as Retrieval},
  author={Fu, Shenghao and Su, Yukun and Rao, Fengyun and LYU, Jing and Xie, Xiaohua and Zheng, Wei-Shi},
  journal={arXiv preprint arXiv:2512.12309},
  year={2025}
}

📜 License

  • Our models and code are released under the GPL-v3 license.
