Key Point: Beyond acting as assistants that enhance human productivity, agents offer a deeper value: they can establish workflows that serve as a flywheel, continuously sustaining high-value data assets across AI industries. Our paper demonstrates this potential with an application in the multimodal domain. If this repo helps you, please consider giving us a star!
Note: This repository is also an MMDetection-style codebase for Language-based Object Detection! Please feel free to use it for your own projects!
TL;DR: An agentic workflow with planning, tool use, and reflection steps that improves the alignment quality between language expressions and visual objects for LOD models.
This repository contains the official implementation of the following paper:
Re-Aligning Language to Visual Objects with an Agentic Workflow
Yuming Chen, Jiangyan Feng*, Haodong Zhang*, Lijun Gong, Feng Zhu, Rui Zhao, Qibin Hou#, Ming-Ming Cheng, Yibing Song#
(* denotes equal contribution. # denotes the corresponding author.)
ICLR 2025 Conference
- Table of Contents
- News
- Dependencies and Installation
- Real-LOD
- Real-Agent
- Real-Data
- Real-Model
- Citation
- License
- Contact
- Acknowledgement
News
Future work can be found in todo.md.
- Apr, 2025: The code of Real-LOD is publicly available!
- Apr, 2025: Real-Data is publicly available!
- Apr, 2025: The code of Real-Model is publicly available!
- Jan, 2025: Our paper is accepted by ICLR 2025!
Dependencies and Installation
We provide a simple script install.sh for installation, or refer to install.md for more details.
- Clone and enter the repo.

  git clone git@github.com:FishAndWasabi/Real-LOD.git
  cd Real-LOD

- Run install.sh.

  bash install.sh

- Activate your environment!

  conda activate Real-LOD
Real-LOD
The input data format of the Real-LOD workflow:
{
"image_path": "path/to/image",
"height": image_height,
"width": image_width,
"raw_expression": raw_expression,
"global_caption": global_caption,
"object_locations": {
"chosen_object": {"id":0, "category": category_name, "bbox": [x, y, w, h]},
"other_objects": [{"id":1, "category": category_name, "bbox": [x, y, w, h]},
{"id":2, "category": category_name, "bbox": [x, y, w, h]}]
}
}

- image_path: Path to the image file.
- height and width: The height and width of the image.
- raw_expression: The raw language expression to be re-aligned.
- global_caption: A global caption describing the whole image.
- object_locations: Objects in the image:
  - chosen_object: The target object of the raw expression, given by its id, category, and bounding box in [x, y, w, h] format.
  - other_objects: A list of the other objects in the image, each with an id, category, and bounding box in [x, y, w, h] format.
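The on-disk layout of the annotation file is not spelled out above, so as a rough guide here is a minimal Python sketch that builds one record in this format and appends it to a JSON Lines file; all field values, the output filename, and the JSONL layout are placeholder assumptions.

```python
import json

# Minimal sketch: one Real-LOD workflow input record with placeholder values,
# appended to a JSON Lines annotation file (the JSONL layout is an assumption).
record = {
    "image_path": "data/object365/images/train/example.jpg",  # placeholder path
    "height": 480,
    "width": 640,
    "raw_expression": "the dog lying on the sofa",                     # placeholder expression
    "global_caption": "A living room with a dog resting on a sofa.",   # placeholder caption
    "object_locations": {
        "chosen_object": {"id": 0, "category": "dog", "bbox": [100, 200, 150, 120]},  # [x, y, w, h]
        "other_objects": [
            {"id": 1, "category": "sofa", "bbox": [50, 180, 400, 250]},
            {"id": 2, "category": "pillow", "bbox": [300, 190, 80, 60]},
        ],
    },
}

with open("my_annotations.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")
```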
You can run the following script to start the shell demo:

python tools/run_real_lod.py ${ANNOTATION} ${CONFIG_FILE} [optional arguments]

You can run python tools/run_real_lod.py --help to get detailed information about this script.
Detailed arguments
positional arguments:
annotation Path to the input annotation file.
optional arguments:
-h, --help show this help message and exit
--configs CONFIGS Path to the configuration file for agent.
--max_cycles MAX_CYCLES
Maximum number of cycles for the workflow.
--debug Enable debug mode.
--save_dir SAVE_DIR Directory to save results.
Real-Agent
Coming Soon!
Real-Data
The dataset is uploaded to Hugging Face and Baidu Yun. Below are the details and corresponding data paths:
| Src | Scale | Img Num | Ins Num | Exp Num | File | Baidu Yun |
|---|---|---|---|---|---|---|
| O365 | Small | 8,513 | 64,528 | 1,974,504 | real-data-o365-small.jsonl | Link |
| O365 | Base | 68,104 | 416,537 | 13,628,900 | real-data-o365-base.jsonl | Link |
| O365 | Large | 574,883 | 3,390,718 | 112,061,648 | real-data-o365-large.jsonl | Link |
| OI | Small | 19,888 | 36,069 | 1,069,254 | real-data-openimage-small.jsonl | Link |
| OI | Base | 24,663 | 48,783 | 1,435,416 | real-data-openimage-base.jsonl | Link |
| OI | Large | 828,314 | 1,776,100 | 81,420,000 | real-data-openimage-large.jsonl | Link |
| LVIS | - | 94,171 | 99,815 | 3,078,400 | real-data-lvis.jsonl | Link |
You can access the dataset on Hugging Face using the following commands:
# Make sure git-lfs is installed (https://git-lfs.com)
git lfs install
# When prompted for a password, use an access token with write permissions.
# Generate one from your settings: https://huggingface.co/settings/tokens
git clone https://huggingface.co/datasets/fishandwasabi/Real-LOD-Data
cd Real-LOD-Data/real-data

Note: Only the annotation files are provided; the images should be obtained from their original datasets.
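If you prefer not to pull the whole repository with git-lfs, the huggingface_hub package offers a scripted alternative. The sketch below is a convenience, not part of the official tooling; it assumes huggingface_hub is installed (pip install huggingface_hub) and that the annotations live under the real-data/ folder shown above.

```python
from huggingface_hub import snapshot_download

# Download only the real-data annotation files from the dataset repository.
local_dir = snapshot_download(
    repo_id="fishandwasabi/Real-LOD-Data",
    repo_type="dataset",
    allow_patterns=["real-data/*"],  # restrict the download to the JSONL annotations
    local_dir="Real-LOD-Data",
)
print("Downloaded to:", local_dir)
```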
The dataset is structured in the following format:
{
"filename": "path/to/image",
"height": image_height,
"width": image_width,
"pairs": {
source_model: {
"bboxes": [
[x1, y1, x2, y2],
...
],
"category": category,
"relation": single/multi,
"positive_expressions": [
positive_expression_1,
positive_expression_2,
...
],
"negative_expressions": [
negative_expression_1,
negative_expression_2,
...
]
},
...
}
}

- filename: Path to the image file.
- height and width: The height and width of the image.
- pairs: Object/expression pairs in the image:
  - source_model: The source model used to generate expressions (e.g., vlm_short, vlm_long, or llm).
  - bboxes: A list of bounding boxes, each defined by [x1, y1, x2, y2].
  - category: The category of the object within the bounding box.
  - relation: Specifies whether the object is associated with a single expression or multiple expressions.
  - positive_expressions: A list of expressions that positively describe the object.
  - negative_expressions: A list of expressions that do not describe the object.
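For a quick look at the data, here is a minimal Python sketch that walks one of the real-data JSONL files (the path is a placeholder) and prints a per-image summary, following the pair structure documented above.

```python
import json

# Minimal sketch: summarize object/expression pairs in a Real-Data JSONL file.
path = "data/real-data/real-data-lvis.jsonl"  # placeholder; any real-data-*.jsonl works

with open(path, "r", encoding="utf-8") as f:
    for line in f:
        sample = json.loads(line)
        for source_model, pair in sample["pairs"].items():
            print(
                sample["filename"], source_model, pair["category"], pair["relation"],
                f'{len(pair["bboxes"])} boxes,',
                f'{len(pair["positive_expressions"])} positive,',
                f'{len(pair["negative_expressions"])} negative expressions',
            )
        break  # remove this break to scan the whole file
```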
Real-Model
You can run the following script to start the shell demo:

python demo/real-model_image_demo.py ${IMAGE_FILE} ${CONFIG_FILE} ${CHECKPOINT_FILE} --texts {TEXTS} [optional arguments]

You can run python demo/real-model_image_demo.py --help to get detailed information about this script.
Detailed arguments
positional arguments:
inputs Input image file or folder path.
model Config or checkpoint .pth file or the model name and alias defined in metafile. The model configuration file will try to read from .pth if the
parameter is a .pth weights file.
optional arguments:
-h, --help show this help message and exit
--weights WEIGHTS Checkpoint file
--out-dir OUT_DIR Output directory of images or prediction results.
--texts TEXTS text prompt, such as "bench . car .", "$: coco"
--device DEVICE Device used for inference
--pred-score-thr PRED_SCORE_THR
bbox score threshold
--batch-size BATCH_SIZE
Inference batch size.
--show Display the image in a popup window.
--no-save-vis Do not save detection vis results
--no-save-pred Do not save detection json results
--print-result Whether to print the results.
--palette {coco,voc,citys,random,none}
Color palette used for visualization
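Since this repo follows MMDetection conventions, inference can also be scripted instead of going through the demo CLI. The sketch below assumes Real-Model configs and checkpoints are compatible with MMDetection's DetInferencer (as the demo above suggests); the config, checkpoint, and image paths are placeholders.

```python
from mmdet.apis import DetInferencer

# Minimal sketch: run Real-Model on one image with a language prompt.
inferencer = DetInferencer(
    model="path/to/real-model_config.py",    # placeholder config
    weights="path/to/real-model_ckpt.pth",   # placeholder checkpoint
    device="cuda:0",
)

result = inferencer(
    "path/to/image.jpg",                     # placeholder image
    texts="the dog lying on the sofa",
    pred_score_thr=0.3,
    out_dir="outputs/",
)
print(list(result["predictions"][0].keys()))  # e.g. ['labels', 'scores', 'bboxes']
```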
You can run the following script to start the Gradio demo (the Gradio Space will be released as soon as possible):

python demo/real-model_gradio_demo.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]

You can run python demo/real-model_gradio_demo.py --help to get detailed information about this script.
Detailed arguments
positional arguments:
config Config file
checkpoint Checkpoint file
optional arguments:
-h, --help show this help message and exit
--device DEVICE Device used for inference
--server_name SERVER_NAME
Gradio server name (default: 0.0.0.0)
--server_port SERVER_PORT
Gradio server port (default: 7860)
--score_thre SCORE_THRE
Score threshold for inference (default: 0.3)
--share Enable sharing the Gradio app (default: False)
--debug Enable debug mode for Gradio (default: False)
The tree of training data:

├── data
│   ├── real-data
│   │   ├── real-data-o365-small.jsonl
│   │   ├── real-data-o365-base.jsonl
│   │   ├── real-data-o365-large.jsonl
│   │   ├── real-data-openimage-small.jsonl
│   │   ├── real-data-openimage-base.jsonl
│   │   ├── real-data-openimage-large.jsonl
│   │   └── real-data-lvis.jsonl
│   ├── object365
│   │   └── images
│   │       └── train
│   │           ├── xxx.jpg
│   │           └── ...
│   ├── openimage
│   │   └── train
│   │       ├── xxx.jpg
│   │       └── ...
│   └── coco
│       └── train2017
│           ├── xxx.jpg
│           └── ...

To obtain the images for the datasets mentioned above, please refer to the following tools and URLs:
- Object365: https://pan.baidu.com/s/1QiWm8hCJus3LstZkz6Mzdw?pwd=wmrx
- OpenImage: https://github.com/cvdfoundation/open-images-dataset
- COCO: http://images.cocodataset.org/zips/train2017.zip
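Before training, it can help to verify that the annotations and image folders sit where the tree above expects them; below is a minimal standard-library sketch (the checked paths mirror the tree and can be adjusted).

```python
import os

# Minimal sketch: sanity-check the expected training-data layout.
expected = [
    "data/real-data/real-data-lvis.jsonl",
    "data/object365/images/train",
    "data/openimage/train",
    "data/coco/train2017",
]

for path in expected:
    status = "ok" if os.path.exists(path) else "MISSING"
    print(f"{status:8s}{path}")
```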
To train Real-Model, run one of the following commands (single-GPU or distributed):

python tools/train_real_model.py ${CONFIG_FILE} [optional arguments]

CUDA_VISIBLE_DEVICES=${GPU_IDs} bash tools/dist_train_real_model.sh ${CONFIG_FILE} ${GPU_NUM} [optional arguments]

You can run python tools/train_real_model.py --help to get detailed information about this script.
Detailed arguments
positional arguments:
config train config file path
optional arguments:
-h, --help show this help message and exit
--work-dir WORK_DIR the dir to save logs and models
--amp enable automatic-mixed-precision training
--auto-scale-lr enable automatically scaling LR.
--resume [RESUME] If specify checkpoint path, resume from it, while if not specify, try to auto resume from the latest checkpoint in the work directory.
--cfg-options CFG_OPTIONS [CFG_OPTIONS ...]
override some settings in the used config, the key-value pair in xxx=yyy format will be merged into config file. If the value to be overwritten is a list, it should be like
key="[a,b]" or key=a,b It also allows nested list/tuple values, e.g. key="[(a,b),(c,d)]" Note that the quotation marks are necessary and that no white space is allowed.
--launcher {none,pytorch,slurm,mpi}
job launcher
--local_rank LOCAL_RANK, --local-rank LOCAL_RANK
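For reference, the --cfg-options overrides described above can also be reproduced programmatically with mmengine's Config. The sketch below assumes mmengine is installed; the config path and the override keys are placeholders chosen for illustration.

```python
from mmengine.config import Config

# Minimal sketch: load a training config and override nested fields,
# mirroring what `--cfg-options key=value` does on the command line.
cfg = Config.fromfile("path/to/train_config.py")  # placeholder config path
cfg.merge_from_dict({
    "train_dataloader.batch_size": 4,     # like --cfg-options train_dataloader.batch_size=4
    "optim_wrapper.optimizer.lr": 1e-4,   # like --cfg-options optim_wrapper.optimizer.lr=1e-4
})
print(cfg.train_dataloader.batch_size, cfg.optim_wrapper.optimizer.lr)
```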
The tree of evaluation data:

├── data
│   ├── d3
│   ├── OVDEval
│   ├── omnilabel_val_v0.1.3
│   ├── coco
│   ├── object365
│   └── openimagesv5

To obtain the evaluation datasets, please refer to the following tools and URLs:
- OmniLabel: https://www.omnilabel.org/dataset/download
- DOD: https://github.com/shikras/d-cube?tab=readme-ov-file#download
- OVDEval: https://huggingface.co/datasets/omlab/OVDEval
We provide the model checkpoints Real-Model_base and Real-Model_tiny on Hugging Face. You can access them with the following commands:
# Make sure git-lfs is installed (https://git-lfs.com)
git lfs install
# When prompted for a password, use an access token with write permissions.
# Generate one from your settings: https://huggingface.co/settings/tokens
git clone https://huggingface.co/datasets/fishandwasabi/Real-LOD-Data
cd Real-LOD-Data/real-model-ckpts

To evaluate Real-Model, run one of the following commands (single-GPU or distributed):

python tools/dist_test_real_model.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]

CUDA_VISIBLE_DEVICES=${GPU_IDs} bash tools/dist_test_real_model.sh ${CONFIG_FILE} ${CHECKPOINT_FILE} ${GPU_NUM} [optional arguments]

You can run python tools/dist_test_real_model.py --help to get detailed information about this script.
Detailed arguments
positional arguments:
config test config file path
checkpoint checkpoint file
optional arguments:
-h, --help show this help message and exit
--work-dir WORK_DIR the directory to save the file containing evaluation metrics
--out OUT dump predictions to a pickle file for offline evaluation
--show show prediction results
--show-dir SHOW_DIR directory where painted images will be saved. If specified, it will be automatically saved to the work_dir/timestamp/show_dir
--wait-time WAIT_TIME
the interval of show (s)
--cfg-options CFG_OPTIONS [CFG_OPTIONS ...]
override some settings in the used config, the key-value pair in xxx=yyy format will be merged into config file. If the value to be overwritten is a list, it should be like
key="[a,b]" or key=a,b It also allows nested list/tuple values, e.g. key="[(a,b),(c,d)]" Note that the quotation marks are necessary and that no white space is allowed.
--launcher {none,pytorch,slurm,mpi}
job launcher
--tta
--local_rank LOCAL_RANK, --local-rank LOCAL_RANK
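When predictions are dumped with --out, they land in a pickle file that can be inspected offline; here is a minimal sketch, assuming results.pkl is the path that was passed to --out.

```python
import pickle

# Minimal sketch: load predictions dumped by `--out results.pkl` for offline inspection.
with open("results.pkl", "rb") as f:  # placeholder path given to --out
    results = pickle.load(f)

print(type(results), len(results))  # typically a list with one entry per test image
```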
Citation
If you find our repo useful for your research, please cite us:
@inproceedings{chen2025realigning,
title={Re-Aligning Language to Visual Objects with an Agentic Workflow},
author={Yuming Chen and Jiangyan Feng and Haodong Zhang and Lijun GONG and Feng Zhu and Rui Zhao and Qibin Hou and Ming-Ming Cheng and Yibing Song},
booktitle={The Thirteenth International Conference on Learning Representations},
year={2025},
url={https://openreview.net/forum?id=MPJ4SMnScw}
}
There are also relevant citations for other outstanding works used in this repo:
@inproceedings{dang2024instructdet,
title={Instruct{DET}: Diversifying Referring Object Detection with Generalized Instructions},
author={Ronghao Dang and Jiangyan Feng and Haodong Zhang and Chongjian GE and Lin Song and Lijun GONG and Chengju Liu and Qijun Chen and Feng Zhu and Rui Zhao and Yibing Song},
booktitle={The Twelfth International Conference on Learning Representations},
year={2024},
url={https://openreview.net/forum?id=hss35aoQ1Y}
}
@article{mmdetection,
title = {{MMDetection}: Open MMLab Detection Toolbox and Benchmark},
author = {Chen, Kai and Wang, Jiaqi and Pang, Jiangmiao and Cao, Yuhang and
Xiong, Yu and Li, Xiaoxiao and Sun, Shuyang and Feng, Wansen and
Liu, Ziwei and Xu, Jiarui and Zhang, Zheng and Cheng, Dazhi and
Zhu, Chenchen and Cheng, Tianheng and Zhao, Qijie and Li, Buyu and
Lu, Xin and Zhu, Rui and Wu, Yue and Dai, Jifeng and Wang, Jingdong
and Shi, Jianping and Ouyang, Wanli and Loy, Chen Change and Lin, Dahua},
journal= {arXiv preprint arXiv:1906.07155},
year={2019}
}
License
This code is licensed under the Creative Commons Attribution-NonCommercial 4.0 International license for non-commercial use only. Please note that any commercial use of this code requires formal permission prior to use.
Contact
For technical questions, please contact chenyuming[AT]mail.nankai.edu.cn.
For commercial licensing, please contact cmm[AT]nankai.edu.cn.
Acknowledgement
This repository borrows heavily from mmdetection, grounding-dino, peft, transformers, and chatglm.
For images from COCO, Objects365 and OpenImage, please see and follow their terms of use: MSCOCO, Objects365, and OpenImage.
The README file is adapted from LED and LE3D.
We also thank all of our contributors.

