This is the official PyTorch implementation of WeDetect. Our paper can be found here.
If you find our work helpful, please give us a star ⭐
Here is the Chinese version.
- [2026.02.21] Our paper was accepted by CVPR2026.
- [2026.02.06] We release the WeDetect finetuning code.
- [2026.02.03] We release the first MLLM-based object embedding model ObjEmbed based on WeDetect.
- [2025.12.16] We release the inference code and paper.
Open-vocabulary object detection aims to detect arbitrary classes via text prompts. Methods without cross-modal fusion layers (non-fusion) offer faster inference by treating recognition as a retrieval problem, i.e., matching regions to text queries in a shared embedding space. In this work, we fully explore this retrieval philosophy and demonstrate its unique advantages in efficiency and versatility through a model family named WeDetect:
- State-of-the-art performance. WeDetect is a real-time detector with a dual-tower architecture. We show that, with well-curated data and full training, the non-fusion WeDetect surpasses fusion-based models and establishes a strong open-vocabulary foundation.
- Fast backtrack of historical data. WeDetect-Uni is a universal proposal generator based on WeDetect. We freeze the entire detector and finetune only an objectness prompt to retrieve generic object proposals across categories. Importantly, the proposal embeddings remain class-specific, enabling a new application, object retrieval, which supports retrieving objects from historical data.
- Integration with LMMs for referring expression comprehension (REC). We further propose WeDetect-Ref, an LMM-based object classifier to handle complex referring expressions, which retrieves target objects from the proposal list extracted by WeDetect-Uni. It discards next-token prediction and classifies objects in a single forward pass.
Together, the WeDetect family unifies detection, proposal generation, object retrieval, and REC under a coherent retrieval framework, achieving state-of-the-art performance across 15 benchmarks with high inference efficiency.
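The retrieval formulation above can be summarized in a few lines. The sketch below is only a minimal illustration, not the actual WeDetect code; the tensor names, dimensions, and threshold are assumptions.

```python
import torch
import torch.nn.functional as F

def retrieve(region_embeds: torch.Tensor,  # (N, D) region/proposal embeddings
             text_embeds: torch.Tensor,    # (C, D) text embeddings (class names or queries)
             threshold: float = 0.3):
    """Match every region to every text query by cosine similarity in the shared space."""
    region_embeds = F.normalize(region_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    scores = region_embeds @ text_embeds.T         # (N, C) similarity matrix
    best_score, best_class = scores.max(dim=-1)    # top-1 text per region
    keep = best_score > threshold                  # drop low-confidence regions
    return best_class[keep], best_score[keep]
```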
- Please download the models and put them in `checkpoints`.
  - WeDetect
  - WeDetect-Uni
  - WeDetect-Ref
pytorch==2.5.1+cu124
transformers==4.57.1
trl==0.17.0
accelerate==1.10.0
mmcv==2.1.0
mmdet==3.3.0
mmengine==0.10.7
- MMCV series packages are not required for WeDetect-Ref users.
- Install the environment as follows.
pip install transformers==4.57.1 trl==0.17.0 accelerate==1.10.0 -i https://mirrors.cloud.tencent.com/pypi/simple
pip install pycocotools terminaltables jsonlines tabulate lvis supervision==0.19.0 webdataset ddd-dataset albumentations -i https://mirrors.cloud.tencent.com/pypi/simple
# WeDetect-Ref users do not need to install the following packages
pip install openmim -i https://mirrors.cloud.tencent.com/pypi/simple
mim install mmcv==2.1.0
mim install mmdet==3.3.0
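After installation, an optional version check (package names follow the list above) can confirm the environment:

```python
from importlib.metadata import version, PackageNotFoundError

for pkg in ["torch", "transformers", "trl", "accelerate", "mmcv", "mmdet", "mmengine"]:
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        # The mm* packages are only needed for the detection/finetuning pipelines,
        # not for WeDetect-Ref-only users.
        print(f"{pkg} not installed")
```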
python3 infer_wedetect.py --config config/wedetect_large.py --checkpoint checkpoints/wedetect_large.pth --image assets/demo.jpeg --text '树,河流' --threshold 0.3
- Note: WeDetect is a Chinese-language model, so please provide class names in Chinese. The model can detect multiple categories simultaneously; separate the class names with an English comma. Apart from the Chinese class names, every character in the command, including the quotation marks, should be English/ASCII.
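The comma-separated class prompt can also be built programmatically and passed to the same CLI. The snippet below is only a thin wrapper around the documented flags; the class names are placeholders.

```python
import subprocess

class_names = ["树", "河流"]        # Chinese class names, one per category
text_arg = ",".join(class_names)    # join with an English comma

subprocess.run([
    "python3", "infer_wedetect.py",
    "--config", "config/wedetect_large.py",
    "--checkpoint", "checkpoints/wedetect_large.pth",
    "--image", "assets/demo.jpeg",
    "--text", text_arg,
    "--threshold", "0.3",
], check=True)
```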
# output predictions with scores above the threshold
python generate_proposal.py --wedetect_uni_checkpoint /PATH/TO/WEDETECT_UNI --image assets/demo.jpeg --visualize --score_thre 0.2
# output the top-1 prediction
python infer_wedetect_ref.py --wedetect_ref_checkpoint /PATH/TO/WEDETECT_REF --wedetect_uni_checkpoint /PATH/TO/WEDETECT_UNI --image assets/demo.jpeg --query "a photo of trees and a river" --visualize
# output predictions with scores above the threshold
python infer_wedetect_ref.py --wedetect_ref_checkpoint /PATH/TO/WEDETECT_REF --wedetect_uni_checkpoint /PATH/TO/WEDETECT_UNI --image assets/demo.jpeg --query "a photo of trees and a river" --visualize --score_thre 0.3
- WeDetect-Ref is a multilingual model. You can use either Chinese or English queries for testing, but only one query can be provided at a time.
# Evaluating WeDetect-Base on COCO
bash dist_test.sh config/wedetect_base.py /PATH/TO/WEDETECT 8
- Please change the dataset path in the config.
# Evaluating recall on COCO
cd eval_recall
torchrun --nproc-per-node=8 --nnodes=1 --node_rank=0 --master_addr="127.0.0.1" --master_port=29500 eval_recall.py --wedetect_uni_checkpoint wedetect_base_uni.pth --dataset coco
- Please change the dataset path in Line 10 of `eval_recall/eval_recall.py`.
- Dataset can be `coco`, `lvis`, or `paco`.
# Evaluating the object retrieval task on COCO
cd eval_retrieval
# extract embedding
torchrun --nproc-per-node=8 --nnodes=1 --node_rank=0 --master_addr="127.0.0.1" --master_port=29500 extract_embedding.py --model wedetect --wedetect_checkpoint wedetect_base.pth --wedetect_uni_checkpoint wedetect_base_uni.pth --dataset coco
# retrieval
python3 retrieval_metric.py --model wedetect --dataset coco --thre 0.2
- Please change the dataset path in Line 1323 of `eval_retrieval/extract_embedding.py`, and in Lines 61 and 82 of `eval_retrieval/retrieval_metric.py`.
- Dataset can be `coco` or `lvis`.
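Conceptually, the retrieval step reduces to scoring stored proposal embeddings against a query embedding. The sketch below is hypothetical: the file names, the (num_proposals, embed_dim) layout, and the single-query setup are assumptions, not the actual format produced by `extract_embedding.py`.

```python
import numpy as np

proposal_embeds = np.load("proposal_embeddings.npy")   # (N, D) embeddings from the extraction step (assumed name)
query_embed = np.load("query_embedding.npy")           # (D,) one text/class embedding (assumed name)

# Cosine similarity between the query and every stored proposal
proposal_embeds = proposal_embeds / np.linalg.norm(proposal_embeds, axis=-1, keepdims=True)
query_embed = query_embed / np.linalg.norm(query_embed)
scores = proposal_embeds @ query_embed                 # (N,)

retrieved = np.nonzero(scores > 0.2)[0]                # mirrors --thre 0.2 above
print(f"{len(retrieved)} proposals retrieved above the threshold")
```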
- Please refer to the folder `wedetect_ref`.
- Please organize your dataset in the COCO format and provide a class-name file in Chinese, similar to `data/texts/coco_zh_class_texts.json` (see the sketch after this list).
- Below, we use COCO2017 as an example. We finetune WeDetect-Base with 8 GPUs (24 GB or less is enough), four images per device, and 12 epochs. `Mask Refine` means refining bounding boxes with masks.
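As referenced above, here is a hypothetical sketch of writing the Chinese class-name file. The exact schema should be copied from `data/texts/coco_zh_class_texts.json`; the list-of-lists layout below (one inner list per category, in dataset order) and the output path are only assumptions.

```python
import json

# One entry per category, in the same order as the dataset's category ids (assumed layout).
class_texts = [["人"], ["自行车"], ["汽车"]]

with open("data/texts/my_dataset_zh_class_texts.json", "w", encoding="utf-8") as f:
    json.dump(class_texts, f, ensure_ascii=False, indent=2)
```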
# original box annotations
bash dist_train.sh config/wedetect_base_coco_full_tuning_8xbs4_2e-5.py 8
# mask refine
bash dist_train.sh config/wedetect_base_coco_full_tuning_8xbs4_2e-5_mask_refine.py 8
- In open-vocabulary finetuning, the text encoder is retained and will be updated during training.
# Step 1: extract class embeddings
python3 generate_class_embedding.py --wedetect_checkpoint wedetect_base.pth --classname_file data/texts/coco_zh_class_texts.json
# Step 2: train the WeDetect vision encoder
bash dist_train.sh config/wedetect_base_coco_vision_encoder_8xbs4_2e-5.py 8
- In closed-set finetuning, we discard the text encoder.
- Users should first extract class-name embeddings to initialize the classifier. Running Step 1 above saves an `npy` file; please replace the file path in the config accordingly.
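A quick shape check of the saved class-embedding file can catch mismatches before training. The file name and the (num_classes, embed_dim) layout below are assumptions about the output of `generate_class_embedding.py`.

```python
import numpy as np

class_embeds = np.load("coco_zh_class_embeddings.npy")   # assumed output name
print(class_embeds.shape)   # expected: (num_classes, embed_dim), one row per class name
```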
| Model | AP | AP50 | AP75 | APs | APm | APl |
|---|---|---|---|---|---|---|
| WeDetect-Base (zero-shot) | 52.1 | 69.4 | 57.0 | 34.8 | 57.1 | 69.2 |
| WeDetect-Base (OV-finetuning) | 55.7 | 73.3 | 60.8 | 38.0 | 61.1 | 72.8 |
| WeDetect-Base (OV-finetuning mask refine) | 55.8 | 73.4 | 61.0 | 38.5 | 61.0 | 72.8 |
| WeDetect-Base (CS-finetuning) | 56.2 | 73.9 | 61.6 | 39.1 | 61.7 | 73.7 |
- WeDetect builds on many outstanding open-source projects, including mmdetection, YOLO-World, transformers, Qwen3-VL, and many others. We thank the authors of these projects for open-sourcing their assets!
If you find our work helpful for your research, please consider citing our work.
@article{fu2025wedetect,
title={WeDetect: Fast Open-Vocabulary Object Detection as Retrieval},
author={Fu, Shenghao and Su, Yukun and Rao, Fengyun and LYU, Jing and Xie, Xiaohua and Zheng, Wei-Shi},
journal={arXiv preprint arXiv:2512.12309},
year={2025}
}

- Our models and code are under the GPL-v3 Licence.





