Key Point: Beyond acting as assistants that enhance human productivity, agents offer a deeper value: they can establish workflows that serve as a flywheel, continuously sustaining high-value data assets across AI industries. Our paper demonstrates this potential with an application in the multimodal domain. If this repo helps you, please consider giving us a star!
Note: This repository is also an MMDetection-style codebase for Language-based Object Detection! Please feel free to use it for your own projects!
TL;DR: An agentic workflow with planning, tool use, and reflection steps that improves the alignment quality between language expressions and visual objects for LOD models.
This repository contains the official implementation of the following paper:
Re-Aligning Language to Visual Objects with an Agentic Workflow
Yuming Chen, Jiangyan Feng*, Haodong Zhang*, Lijun Gong, Feng Zhu, Rui Zhao, Qibin Hou#, Ming-Ming Cheng, Yibing Song#
(* denotes equal contribution. # denotes the corresponding author.)
ICLR 2025 Conference
- Table of Contents
- News
- Dependencies and Installation
- Real-LOD
- Real-Agent
- Real-Data
- Real-Model
- Citation
- License
- Contact
- Acknowledgement
News
Future work can be found in todo.md.
- Apr, 2025: The code of Real-LOD is publicly available!
- Apr, 2025: Real-Data is publicly available!
- Apr, 2025: The code of Real-Model is publicly available!
- Jan, 2025: Our paper is accepted by ICLR 2025!
Dependencies and Installation
We provide a simple script install.sh for installation, or refer to install.md for more details.
- Clone and enter the repo.

  git clone git@github.com:FishAndWasabi/Real-LOD.git
  cd Real-LOD

- Run install.sh.

  bash install.sh

- Activate your environment!

  conda activate Real-LOD
Real-LOD
The input data format of the Real-LOD workflow:
{
"image_path": "path/to/image",
"height": image_height,
"width": image_width,
"raw_expression": raw_expression,
"global_caption": global_caption,
"object_locations": {
"chosen_object": {"id":0, "category": category_name, "bbox": [x, y, w, h]},
"other_objects": [{"id":1, "category": category_name, "bbox": [x, y, w, h]},
{"id":2, "category": category_name, "bbox": [x, y, w, h]}]
}
}

- image_path: Path to the image file.
- height and width: The height and width of the image.
- raw_expression: The raw language expression to be re-aligned.
- global_caption: A global caption describing the whole image.
- object_locations: Objects in the image:
  - chosen_object: The target object of the raw expression, given by its id, category, and bounding box in [x, y, w, h] format.
  - other_objects: A list of the other objects in the image, each with an id, category, and bounding box in [x, y, w, h] format.
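The on-disk layout of the annotation file is not spelled out above, so as a rough guide here is a minimal Python sketch that builds one record in this format and appends it to a JSON Lines file; all field values, the output filename, and the JSONL layout are placeholder assumptions.

```python
import json

# Minimal sketch: one Real-LOD workflow input record with placeholder values,
# appended to a JSON Lines annotation file (the JSONL layout is an assumption).
record = {
    "image_path": "data/object365/images/train/example.jpg",  # placeholder path
    "height": 480,
    "width": 640,
    "raw_expression": "the dog lying on the sofa",                     # placeholder expression
    "global_caption": "A living room with a dog resting on a sofa.",   # placeholder caption
    "object_locations": {
        "chosen_object": {"id": 0, "category": "dog", "bbox": [100, 200, 150, 120]},  # [x, y, w, h]
        "other_objects": [
            {"id": 1, "category": "sofa", "bbox": [50, 180, 400, 250]},
            {"id": 2, "category": "pillow", "bbox": [300, 190, 80, 60]},
        ],
    },
}

with open("my_annotations.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")
```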
You can run the following script to start the shell demo:

python tools/run_real_lod.py ${ANNOTATION} ${CONFIG_FILE} [optional arguments]

You can run python tools/run_real_lod.py --help to get detailed information about this script.
Detailed arguments
positional arguments:
annotation Path to the input annotation file.
optional arguments:
-h, --help show this help message and exit
--configs CONFIGS Path to the configuration file for agent.
--max_cycles MAX_CYCLES
Maximum number of cycles for the workflow.
--debug Enable debug mode.
--save_dir SAVE_DIR Directory to save results.
Real-Agent
Coming Soon!
Real-Data
The dataset is uploaded to Hugging Face and Baidu Yun. Below are the details and corresponding data paths:
| Src | Scale | Img Num | Ins Num | Exp Num | File | Baidu Yun |
|---|---|---|---|---|---|---|
| O365 | Small | 8,513 | 64,528 | 1,974,504 | real-data-o365-small.jsonl | Link |
| O365 | Base | 68,104 | 416,537 | 13,628,900 | real-data-o365-base.jsonl | Link |
| O365 | Large | 574,883 | 3,390,718 | 112,061,648 | real-data-o365-large.jsonl | Link |
| OI | Small | 19,888 | 36,069 | 1,069,254 | real-data-openimage-small.jsonl | Link |
| OI | Base | 24,663 | 48,783 | 1,435,416 | real-data-openimage-base.jsonl | Link |
| OI | Large | 828,314 | 1,776,100 | 81,420,000 | real-data-openimage-large.jsonl | Link |
| LVIS | - | 94,171 | 99,815 | 3,078,400 | real-data-lvis.jsonl | Link |
You can access the dataset on Hugging Face using the following commands:
# Make sure git-lfs is installed (https://git-lfs.com)
git lfs install
# When prompted for a password, use an access token with write permissions.
# Generate one from your settings: https://huggingface.co/settings/tokens
git clone https://huggingface.co/datasets/fishandwasabi/Real-LOD-Data
cd Real-LOD-Data/real-data

Note: Only the annotation files are provided; the images should be obtained from their original datasets.
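If you prefer not to pull the whole repository with git-lfs, the huggingface_hub package offers a scripted alternative. The sketch below is a convenience, not part of the official tooling; it assumes huggingface_hub is installed (pip install huggingface_hub) and that the annotations live under the real-data/ folder shown above.

```python
from huggingface_hub import snapshot_download

# Download only the real-data annotation files from the dataset repository.
local_dir = snapshot_download(
    repo_id="fishandwasabi/Real-LOD-Data",
    repo_type="dataset",
    allow_patterns=["real-data/*"],  # restrict the download to the JSONL annotations
    local_dir="Real-LOD-Data",
)
print("Downloaded to:", local_dir)
```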
The dataset is structured in the following format:
{
"filename": "path/to/image",
"height": image_height,
"width": image_width,
"pairs": {
source_model: {
"bboxes": [
[x1, y1, x2, y2],
...
],
"category": category,
"relation": single/multi,
"positive_expressions": [
positive_expression_1,
positive_expression_2,
...
],
"negative_expressions": [
negative_expression_1,
negative_expression_2,
...
]
},
...
}
}

- filename: Path to the image file.
- height and width: The height and width of the image.
- pairs: Object/expression pairs in the image:
  - source_model: The source model used to generate expressions (e.g., vlm_short, vlm_long, or llm).
  - bboxes: A list of bounding boxes, each defined by [x1, y1, x2, y2].
  - category: The category of the object within the bounding box.
  - relation: Specifies whether the object is associated with a single expression or multiple expressions.
  - positive_expressions: A list of expressions that positively describe the object.
  - negative_expressions: A list of expressions that do not describe the object.
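For a quick look at the data, here is a minimal Python sketch that walks one of the real-data JSONL files (the path is a placeholder) and prints a per-image summary, following the pair structure documented above.

```python
import json

# Minimal sketch: summarize object/expression pairs in a Real-Data JSONL file.
path = "data/real-data/real-data-lvis.jsonl"  # placeholder; any real-data-*.jsonl works

with open(path, "r", encoding="utf-8") as f:
    for line in f:
        sample = json.loads(line)
        for source_model, pair in sample["pairs"].items():
            print(
                sample["filename"], source_model, pair["category"], pair["relation"],
                f'{len(pair["bboxes"])} boxes,',
                f'{len(pair["positive_expressions"])} positive,',
                f'{len(pair["negative_expressions"])} negative expressions',
            )
        break  # remove this break to scan the whole file
```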
Real-Model
You can run the following script to start the shell demo:

python demo/real-model_image_demo.py ${IMAGE_FILE} ${CONFIG_FILE} ${CHECKPOINT_FILE} --texts {TEXTS} [optional arguments]

You can run python demo/real-model_image_demo.py --help to get detailed information about this script.
Detailed arguments
positional arguments:
inputs Input image file or folder path.
model Config or checkpoint .pth file or the model name and alias defined in metafile. The model configuration file will try to read from .pth if the
parameter is a .pth weights file.
optional arguments:
-h, --help show this help message and exit
--weights WEIGHTS Checkpoint file
--out-dir OUT_DIR Output directory of images or prediction results.
--texts TEXTS text prompt, such as "bench . car .", "$: coco"
--device DEVICE Device used for inference
--pred-score-thr PRED_SCORE_THR
bbox score threshold
--batch-size BATCH_SIZE
Inference batch size.
--show Display the image in a popup window.
--no-save-vis Do not save detection vis results
--no-save-pred Do not save detection json results
--print-result Whether to print the results.
--palette {coco,voc,citys,random,none}
Color palette used for visualization
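Since this repo follows MMDetection conventions, inference can also be scripted instead of going through the demo CLI. The sketch below assumes Real-Model configs and checkpoints are compatible with MMDetection's DetInferencer (as the demo above suggests); the config, checkpoint, and image paths are placeholders.

```python
from mmdet.apis import DetInferencer

# Minimal sketch: run Real-Model on one image with a language prompt.
inferencer = DetInferencer(
    model="path/to/real-model_config.py",    # placeholder config
    weights="path/to/real-model_ckpt.pth",   # placeholder checkpoint
    device="cuda:0",
)

result = inferencer(
    "path/to/image.jpg",                     # placeholder image
    texts="the dog lying on the sofa",
    pred_score_thr=0.3,
    out_dir="outputs/",
)
print(list(result["predictions"][0].keys()))  # e.g. ['labels', 'scores', 'bboxes']
```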
You can run the following script to start the Gradio demo (the Gradio Space will be released as soon as possible):

python demo/real-model_gradio_demo.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]

You can run python demo/real-model_gradio_demo.py --help to get detailed information about this script.
Detailed arguments
positional arguments:
config Config file
checkpoint Checkpoint file
optional arguments:
-h, --help show this help message and exit
--device DEVICE Device used for inference
--server_name SERVER_NAME
Gradio server name (default: 0.0.0.0)
--server_port SERVER_PORT
Gradio server port (default: 7860)
--score_thre SCORE_THRE
Score threshold for inference (default: 0.3)
--share Enable sharing the Gradio app (default: False)
--debug Enable debug mode for Gradio (default: False)
The tree of training data:

├── data
│   ├── real-data
│   │   ├── real-data-o365-small.jsonl
│   │   ├── real-data-o365-base.jsonl
│   │   ├── real-data-o365-large.jsonl
│   │   ├── real-data-openimage-small.jsonl
│   │   ├── real-data-openimage-base.jsonl
│   │   ├── real-data-openimage-large.jsonl
│   │   └── real-data-lvis.jsonl
│   ├── object365
│   │   └── images
│   │       └── train
│   │           ├── xxx.jpg
│   │           └── ...
│   ├── openimage
│   │   └── train
│   │       ├── xxx.jpg
│   │       └── ...
│   └── coco
│       └── train2017
│           ├── xxx.jpg
│           └── ...

To obtain the images for the datasets mentioned above, please refer to the following tools and URLs:
- Object365: https://pan.baidu.com/s/1QiWm8hCJus3LstZkz6Mzdw?pwd=wmrx
- OpenImage: https://github.com/cvdfoundation/open-images-dataset
- COCO: http://images.cocodataset.org/zips/train2017.zip
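Before training, it can help to verify that the annotations and image folders sit where the tree above expects them; below is a minimal standard-library sketch (the checked paths mirror the tree and can be adjusted).

```python
import os

# Minimal sketch: sanity-check the expected training-data layout.
expected = [
    "data/real-data/real-data-lvis.jsonl",
    "data/object365/images/train",
    "data/openimage/train",
    "data/coco/train2017",
]

for path in expected:
    status = "ok" if os.path.exists(path) else "MISSING"
    print(f"{status:8s}{path}")
```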
To train Real-Model, run one of the following commands (single-GPU or distributed):

python tools/train_real_model.py ${CONFIG_FILE} [optional arguments]

CUDA_VISIBLE_DEVICES=${GPU_IDs} bash tools/dist_train_real_model.sh ${CONFIG_FILE} ${GPU_NUM} [optional arguments]

You can run python tools/train_real_model.py --help to get detailed information about this script.
Detailed arguments
positional arguments:
config train config file path
optional arguments:
-h, --help show this help message and exit
--work-dir WORK_DIR the dir to save logs and models
--amp enable automatic-mixed-precision training
--auto-scale-lr enable automatically scaling LR.
--resume [RESUME] If specify checkpoint path, resume from it, while if not specify, try to auto resume from the latest checkpoint in the work directory.
--cfg-options CFG_OPTIONS [CFG_OPTIONS ...]
override some settings in the used config, the key-value pair in xxx=yyy format will be merged into config file. If the value to be overwritten is a list, it should be like
key="[a,b]" or key=a,b It also allows nested list/tuple values, e.g. key="[(a,b),(c,d)]" Note that the quotation marks are necessary and that no white space is allowed.
--launcher {none,pytorch,slurm,mpi}
job launcher
--local_rank LOCAL_RANK, --local-rank LOCAL_RANK
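For reference, the --cfg-options overrides described above can also be reproduced programmatically with mmengine's Config. The sketch below assumes mmengine is installed; the config path and the override keys are placeholders chosen for illustration.

```python
from mmengine.config import Config

# Minimal sketch: load a training config and override nested fields,
# mirroring what `--cfg-options key=value` does on the command line.
cfg = Config.fromfile("path/to/train_config.py")  # placeholder config path
cfg.merge_from_dict({
    "train_dataloader.batch_size": 4,     # like --cfg-options train_dataloader.batch_size=4
    "optim_wrapper.optimizer.lr": 1e-4,   # like --cfg-options optim_wrapper.optimizer.lr=1e-4
})
print(cfg.train_dataloader.batch_size, cfg.optim_wrapper.optimizer.lr)
```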
The tree of evaluation data:

├── data
│   ├── d3
│   ├── OVDEval
│   ├── omnilabel_val_v0.1.3
│   ├── coco
│   ├── object365
│   └── openimagesv5

To obtain the evaluation datasets, please refer to the following tools and URLs:
- OmniLabel: https://www.omnilabel.org/dataset/download
- DOD: https://github.com/shikras/d-cube?tab=readme-ov-file#download
- OVDEval: https://huggingface.co/datasets/omlab/OVDEval
We provide the model checkpoints Real-Model_base and Real-Model_tiny on Hugging Face. You can access them with the following commands:
# Make sure git-lfs is installed (https://git-lfs.com)
git lfs install
# When prompted for a password, use an access token with write permissions.
# Generate one from your settings: https://huggingface.co/settings/tokens
git clone https://huggingface.co/datasets/fishandwasabi/Real-LOD-Data
cd Real-LOD-Data/real-model-ckpts

To evaluate Real-Model, run one of the following commands (single-GPU or distributed):

python tools/dist_test_real_model.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]

CUDA_VISIBLE_DEVICES=${GPU_IDs} bash tools/dist_test_real_model.sh ${CONFIG_FILE} ${CHECKPOINT_FILE} ${GPU_NUM} [optional arguments]

You can run python tools/dist_test_real_model.py --help to get detailed information about this script.
Detailed arguments
positional arguments:
config test config file path
checkpoint checkpoint file
optional arguments:
-h, --help show this help message and exit
--work-dir WORK_DIR the directory to save the file containing evaluation metrics
--out OUT dump predictions to a pickle file for offline evaluation
--show show prediction results
--show-dir SHOW_DIR directory where painted images will be saved. If specified, it will be automatically saved to the work_dir/timestamp/show_dir
--wait-time WAIT_TIME
the interval of show (s)
--cfg-options CFG_OPTIONS [CFG_OPTIONS ...]
override some settings in the used config, the key-value pair in xxx=yyy format will be merged into config file. If the value to be overwritten is a list, it should be like
key="[a,b]" or key=a,b It also allows nested list/tuple values, e.g. key="[(a,b),(c,d)]" Note that the quotation marks are necessary and that no white space is allowed.
--launcher {none,pytorch,slurm,mpi}
job launcher
--tta
--local_rank LOCAL_RANK, --local-rank LOCAL_RANK
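When predictions are dumped with --out, they land in a pickle file that can be inspected offline; here is a minimal sketch, assuming results.pkl is the path that was passed to --out.

```python
import pickle

# Minimal sketch: load predictions dumped by `--out results.pkl` for offline inspection.
with open("results.pkl", "rb") as f:  # placeholder path given to --out
    results = pickle.load(f)

print(type(results), len(results))  # typically a list with one entry per test image
```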
Citation
If you find our repo useful for your research, please cite us:
@inproceedings{chen2025realigning,
title={Re-Aligning Language to Visual Objects with an Agentic Workflow},
author={Yuming Chen and Jiangyan Feng and Haodong Zhang and Lijun GONG and Feng Zhu and Rui Zhao and Qibin Hou and Ming-Ming Cheng and Yibing Song},
booktitle={The Thirteenth International Conference on Learning Representations},
year={2025},
url={https://openreview.net/forum?id=MPJ4SMnScw}
}
There are also relevant citations for other outstanding works used in this repo:
@inproceedings{dang2024instructdet,
title={Instruct{DET}: Diversifying Referring Object Detection with Generalized Instructions},
author={Ronghao Dang and Jiangyan Feng and Haodong Zhang and Chongjian GE and Lin Song and Lijun GONG and Chengju Liu and Qijun Chen and Feng Zhu and Rui Zhao and Yibing Song},
booktitle={The Twelfth International Conference on Learning Representations},
year={2024},
url={https://openreview.net/forum?id=hss35aoQ1Y}
}
@article{mmdetection,
title = {{MMDetection}: Open MMLab Detection Toolbox and Benchmark},
author = {Chen, Kai and Wang, Jiaqi and Pang, Jiangmiao and Cao, Yuhang and
Xiong, Yu and Li, Xiaoxiao and Sun, Shuyang and Feng, Wansen and
Liu, Ziwei and Xu, Jiarui and Zhang, Zheng and Cheng, Dazhi and
Zhu, Chenchen and Cheng, Tianheng and Zhao, Qijie and Li, Buyu and
Lu, Xin and Zhu, Rui and Wu, Yue and Dai, Jifeng and Wang, Jingdong
and Shi, Jianping and Ouyang, Wanli and Loy, Chen Change and Lin, Dahua},
journal= {arXiv preprint arXiv:1906.07155},
year={2019}
}
License
This code is licensed under the Creative Commons Attribution-NonCommercial 4.0 International license for non-commercial use only. Please note that any commercial use of this code requires formal permission prior to use.
Contact
For technical questions, please contact chenyuming[AT]mail.nankai.edu.cn.
For commercial licensing, please contact cmm[AT]nankai.edu.cn.
Acknowledgement
This repository borrows heavily from mmdetection, grounding-dino, peft, transformers, and chatglm.
For images from COCO, Objects365 and OpenImage, please see and follow their terms of use: MSCOCO, Objects365, and OpenImage.
The README file is adapted from LED and LE3D.
We also thank all of our contributors.

