ICLR2025_REALLOD_LOGO

Re-Aligning Language to Visual Objects with an Agentic Workflow

ICLR2025

arXiv Project Page Hugging Face Space Hugging Face Dataset Youtube Bilibili


📄 Table of Contents | ✨ ICLR Page | 🛠️ Install | 📖 Citation | 📜 License | ❓ FAQ

Key Point: Beyond serving as assistants that boost human productivity, agents offer a deeper value: they can establish workflows that act as a flywheel, continuously sustaining high-value data assets across AI industries. Our paper demonstrates this potential with an application in the multimodal domain. If this repo helps you, please consider giving us a 🌟!

Note: This repository is also an MMDetection-style codebase for Language-based Object Detection! Please feel free to use it for your own projects!

TL;DR: An agentic workflow with planning, tool use, and reflection steps that improves the alignment quality between language expressions and visual objects for LOD models.

This repository contains the official implementation of the following paper:

Re-Aligning Language to Visual Objects with an Agentic Workflow
Yuming Chen, Jiangyan Feng*, Haodong Zhang*, Lijun Gong, Feng Zhu, Rui Zhao, Qibin Hou#, Ming-Ming Cheng, Yibing Song#
(* denotes equal contribution. # denotes the corresponding author.)
ICLR 2025 Conference

📄 Table of Contents

✨ News 🔝

Future work can be found in todo.md.

  • Apr, 2025: The code of 🚀 Real-LOD is publicly available!
  • Apr, 2025: The 📕 Real-Data is publicly available!
  • Apr, 2025: The code of 🚂 Real-Model is publicly available!
  • Jan, 2025: 🔥 Our paper is accepted by ICLR 2025!

๐Ÿ› ๏ธ Dependencies and Installation ๐Ÿ”

We provide a simple script install.sh for installation, or refer to install.md for more details.

  1. Clone and enter the repo.

    git clone git@github.com:FishAndWasabi/Real-LOD.git
    cd Real-LOD
  2. Run install.sh.

    bash install.sh
  3. Activate your environment!

    conda activate Real-LOD
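
After activation, a quick way to confirm the core dependencies resolved correctly is to import them and print their versions. This is an optional sanity check, not part of the official setup; the exact package set is assumed from the acknowledgements below (MMDetection, Transformers, PEFT).

# sanity_check.py -- optional check of the Real-LOD conda environment.
# The package list is an assumption based on the acknowledgements
# (mmdetection, transformers, peft); adjust it to your install.
import importlib

for name in ("torch", "mmdet", "mmengine", "transformers", "peft"):
    try:
        module = importlib.import_module(name)
        print(f"{name:<12} {getattr(module, '__version__', 'unknown')}")
    except ImportError as err:
        print(f"{name:<12} MISSING ({err})")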

🚀 Real-LOD 🔝

Data Format

The input data format of the Real-LOD workflow is as follows:

{
  "image_path": "path/to/image",
  "height": image_height,
  "width": image_width,
  "raw_expression": raw_expression,
  "global_caption": global_caption,
  "object_locations": {
      "chosen_object": {"id":0, "category": category_name, "bbox": [x, y, w, h]},
      "other_objects": [{"id":1, "category": category_name, "bbox": [x, y, w, h]},
                        {"id":2, "category": category_name, "bbox": [x, y, w, h]}]
  }
}
  • image_path: Path to the image file.
  • height and width: The height and width of the image.
  • raw_expression: The raw language expression to be re-aligned by the workflow.
  • global_caption: A global caption describing the whole image.
  • object_locations: Locations of the objects in the image:
    • chosen_object: The object the expression refers to, given as an id, category, and bounding box in [x, y, w, h] format.
    • other_objects: A list of the remaining objects, each given as an id, category, and bounding box in [x, y, w, h] format.
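
A minimal sketch of assembling one such record and appending it to an annotation file is shown below. The file name and all concrete values are placeholders, and storing one JSON record per line (JSONL, as the released Real-Data files do) is an assumption, not part of the official tooling.

import json

# Hypothetical example record following the input format above;
# all paths, categories, and boxes are placeholder values.
record = {
    "image_path": "data/object365/images/train/xxx.jpg",
    "height": 480,
    "width": 640,
    "raw_expression": "the dog lying on the sofa",
    "global_caption": "A living room with a dog lying on a sofa.",
    "object_locations": {
        "chosen_object": {"id": 0, "category": "dog", "bbox": [120, 200, 90, 60]},
        "other_objects": [
            {"id": 1, "category": "sofa", "bbox": [80, 150, 300, 180]},
        ],
    },
}

# Append the record as one line of a JSONL annotation file (assumed layout).
with open("my_annotations.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")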

Run

You can run the following script to start the Real-LOD workflow:

python tools/run_real_lod.py ${ANNOTATION} ${CONFIG_FILE} [optional arguments]

You can run python tools/run_real_lod.py --help to get detailed information about this script.

Detailed arguments
positional arguments:
  annotation            Path to the input annotation file.

optional arguments:
  -h, --help            show this help message and exit
  --configs CONFIGS     Path to the configuration file for agent.
  --max_cycles MAX_CYCLES
                        Maximum number of cycles for the workflow.
  --debug               Enable debug mode.
  --save_dir SAVE_DIR   Directory to save results.
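
The workflow can also be launched programmatically. The sketch below simply shells out to tools/run_real_lod.py using the flags documented above; the annotation and config paths are placeholders, and the cycle budget is an arbitrary example value.

import subprocess
from pathlib import Path

# Placeholder inputs -- point these at your own annotation/config files.
annotations = ["anns/part_0.jsonl", "anns/part_1.jsonl"]
config = "configs/real_lod_agent.py"  # hypothetical config path
save_root = Path("work_dirs/real_lod_outputs")

for ann in annotations:
    out_dir = save_root / Path(ann).stem
    out_dir.mkdir(parents=True, exist_ok=True)
    # Flags follow the detailed argument list above.
    subprocess.run(
        ["python", "tools/run_real_lod.py", ann,
         "--configs", config,
         "--max_cycles", "3",
         "--save_dir", str(out_dir)],
        check=True,
    )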

Examples

ICLR2025_REALMODEL_EXAMPLES

🤖 Real-Agent 🔝

Coming Soon!

📕 Real-Data 🔝

Data Information

The dataset is uploaded to Hugging Face and Baidu Yun. Below are the detailed statistics and the corresponding data paths:

| Src  | Scale | Img Num | Ins Num   | Exp Num     | File                            | Baidu Yun |
|------|-------|---------|-----------|-------------|---------------------------------|-----------|
| O365 | Small | 8,513   | 64,528    | 1,974,504   | real-data-o365-small.jsonl      | Link      |
| O365 | Base  | 68,104  | 416,537   | 13,628,900  | real-data-o365-base.jsonl       | Link      |
| O365 | Large | 574,883 | 3,390,718 | 112,061,648 | real-data-o365-large.jsonl      | Link      |
| OI   | Small | 19,888  | 36,069    | 1,069,254   | real-data-openimage-small.jsonl | Link      |
| OI   | Base  | 24,663  | 48,783    | 1,435,416   | real-data-openimage-base.jsonl  | Link      |
| OI   | Large | 828,314 | 1,776,100 | 81,420,000  | real-data-openimage-large.jsonl | Link      |
| LVIS | -     | 94,171  | 99,815    | 3,078,400   | real-data-lvis.jsonl            | Link      |

You can access the dataset through Hugging Face using the following commands:

# Make sure git-lfs is installed (https://git-lfs.com)
git lfs install

# When prompted for a password, use an access token with write permissions.
# Generate one from your settings: https://huggingface.co/settings/tokens
git clone https://huggingface.co/datasets/fishandwasabi/Real-LOD-Data
cd Real-LOD-Data/real-data

Note: Only the annotation files are provided; the images should be obtained from their original source datasets.
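
If you prefer not to clone the full repository with Git LFS, the same annotation files can be fetched with the huggingface_hub Python client. The snippet below is an optional alternative, and the allow_patterns filter assumes the annotations live under the real-data/ folder shown above.

# Optional alternative to the git-lfs clone above.
# Requires: pip install huggingface_hub
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="fishandwasabi/Real-LOD-Data",
    repo_type="dataset",
    allow_patterns=["real-data/*"],  # assumes annotations sit under real-data/
)
print("Downloaded to:", local_dir)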

Data Format

The dataset is structured in the following format:

{
  "filename": "path/to/image",
  "height": image_height,
  "width": image_width,
  "pairs": {
    source_model: {
      "bboxes": [
        [x1, y1, x2, y2],
        ...
      ],
      "category": category,
      "relation": single/multi,
      "positive_expressions": [
        positive_expression_1,
        positive_expression_2,
        ...
      ],
      "negative_expressions": [
        negative_expression_1,
        negative_expression_2,
        ...
      ]
    },
    ...
  }
}
  • filename: Path to the image file.
  • height and width: The height and width of the image.
  • pairs: Object/expression pairs in the image:
    • source_model: The source model used to generate expressions (e.g., vlm_short, vlm_long, or llm).
    • bboxes: A list of bounding boxes, each defined by [x1, y1, x2, y2].
    • category: The category of the object within the bounding box.
    • relation: Specifies whether the object is associated with a single or multiple expressions.
    • positive_expressions: A list of expressions that positively describe the object.
    • negative_expressions: A list of expressions that do not describe the object.
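
For reference, a minimal sketch for loading one of the annotation files and iterating over its object/expression pairs might look like the following. It assumes one JSON record per line, as the .jsonl extension suggests; the file path is a placeholder and no official loader is implied.

import json

# Placeholder path -- any of the real-data-*.jsonl files listed above.
ann_file = "data/real-data/real-data-lvis.jsonl"

with open(ann_file, "r", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        # Each record groups expressions by the model that generated them.
        for source_model, pair in record["pairs"].items():
            n_pos = len(pair["positive_expressions"])
            n_neg = len(pair["negative_expressions"])
            print(f'{record["filename"]} [{source_model}] '
                  f'{pair["category"]}: {len(pair["bboxes"])} boxes, '
                  f'{n_pos} positive / {n_neg} negative expressions')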

🚂 Real-Model 🔝

Demo of Real-Model

1.1 Shell

You can run the following script to start the shell demo:

python demo/real-model_image_demo.py ${IMAGE_FILE} ${CONFIG_FILE} ${CHECKPOINT_FILE} --texts {TEXTS} [optional arguments]

You can run python demo/real-model_image_demo.py --help to get detailed information about this script.

Detailed arguments
positional arguments:
  inputs                Input image file or folder path.
  model                 Config or checkpoint .pth file or the model name and alias defined in metafile. The model configuration file will try to read from .pth if the
                        parameter is a .pth weights file.

optional arguments:
  -h, --help            show this help message and exit
  --weights WEIGHTS     Checkpoint file
  --out-dir OUT_DIR     Output directory of images or prediction results.
  --texts TEXTS         text prompt, such as "bench . car .", "$: coco"
  --device DEVICE       Device used for inference
  --pred-score-thr PRED_SCORE_THR
                        bbox score threshold
  --batch-size BATCH_SIZE
                        Inference batch size.
  --show                Display the image in a popup window.
  --no-save-vis         Do not save detection vis results
  --no-save-pred        Do not save detection json results
  --print-result        Whether to print the results.
  --palette {coco,voc,citys,random,none}
                        Color palette used for visualization
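
Since this codebase follows MMDetection, inference can presumably also be done programmatically. The sketch below assumes MMDetection's DetInferencer API (mmdet >= 3.1) works with Real-Model configs and checkpoints, which is not confirmed by this README; the config/checkpoint paths and the text prompt are placeholders.

# A sketch of programmatic inference, assuming Real-Model configs are
# compatible with MMDetection's DetInferencer (mmdet >= 3.1).
# Paths and the text prompt below are placeholders.
from mmdet.apis import DetInferencer

inferencer = DetInferencer(
    model="path/to/real-model_config.py",         # placeholder config
    weights="path/to/real-model_checkpoint.pth",  # placeholder checkpoint
    device="cuda:0",
)

results = inferencer(
    "demo/demo.jpg",                      # placeholder image
    texts="the dog lying on the sofa",    # language expression to ground
    pred_score_thr=0.3,
    out_dir="outputs/",
)
print(results["predictions"][0].keys())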

1.2 Gradio

You can run the following script to start the Gradio demo (the Gradio space will be released as soon as possible):

python demo/real-model_gradio_demo.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]

You can run python demo/real-model_gradio_demo.py --help to get detailed information about this script.

Detailed arguments
positional arguments:
  config                Config file
  checkpoint            Checkpoint file

optional arguments:
  -h, --help            show this help message and exit
  --device DEVICE       Device used for inference
  --server_name SERVER_NAME
                        Gradio server name (default: 0.0.0.0)
  --server_port SERVER_PORT
                        Gradio server port (default: 7860)
  --score_thre SCORE_THRE
                        Score threshold for inference (default: 0.3)
  --share               Enable sharing the Gradio app (default: False)
  --debug               Enable debug mode for Gradio (default: False)

Train

1.1 Data Preparation

The tree of training data:

├── data
│   ├── real-data
│   │   ├── real-data-o365-small.jsonl
│   │   ├── real-data-o365-base.jsonl
│   │   ├── real-data-o365-large.jsonl
│   │   ├── real-data-openimage-small.jsonl
│   │   ├── real-data-openimage-base.jsonl
│   │   ├── real-data-openimage-large.jsonl
│   │   └── real-data-lvis.jsonl
│   ├── object365
│   │   └── images
│   │       └── train
│   │           ├── xxx.jpg
│   │           └── ...
│   ├── openimage
│   │   └── train
│   │       ├── xxx.jpg
│   │       └── ...
│   └── coco
│       └── train2017
│           ├── xxx.jpg
│           └── ...
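
Before launching training, it can help to verify that the tree above is in place. The small check below only tests for the paths shown here; adjust it if you train on a subset of the data.

import os

# Paths taken directly from the expected tree above.
expected = [
    "data/real-data/real-data-o365-small.jsonl",
    "data/real-data/real-data-o365-base.jsonl",
    "data/real-data/real-data-o365-large.jsonl",
    "data/real-data/real-data-openimage-small.jsonl",
    "data/real-data/real-data-openimage-base.jsonl",
    "data/real-data/real-data-openimage-large.jsonl",
    "data/real-data/real-data-lvis.jsonl",
    "data/object365/images/train",
    "data/openimage/train",
    "data/coco/train2017",
]

missing = [p for p in expected if not os.path.exists(p)]
print("All expected data paths found." if not missing
      else "Missing paths:\n  " + "\n  ".join(missing))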

To obtain the images for the datasets mentioned, please refer to the following tools and URLs:

1.2 Training with single GPU

python tools/train_real_model.py ${CONFIG_FILE} [optional arguments]

1.3 Training with multi GPU

CUDA_VISIBLE_DEVICES=${GPU_IDs} bash tools/dist_train_real_model.sh ${CONFIG_FILE} ${GPU_NUM} [optional arguments]

You can run python tools/train_real_model.py --help to get detailed information about this script.

Detailed arguments
positional arguments:
  config                train config file path

optional arguments:
  -h, --help            show this help message and exit
  --work-dir WORK_DIR   the dir to save logs and models
  --amp                 enable automatic-mixed-precision training
  --auto-scale-lr       enable automatically scaling LR.
  --resume [RESUME]     If specify checkpoint path, resume from it, while if not specify, try to auto resume from the latest checkpoint in the work directory.
  --cfg-options CFG_OPTIONS [CFG_OPTIONS ...]
                        override some settings in the used config, the key-value pair in xxx=yyy format will be merged into config file. If the value to be overwritten is a list, it should be like
                        key="[a,b]" or key=a,b It also allows nested list/tuple values, e.g. key="[(a,b),(c,d)]" Note that the quotation marks are necessary and that no white space is allowed.
  --launcher {none,pytorch,slurm,mpi}
                        job launcher
  --local_rank LOCAL_RANK, --local-rank LOCAL_RANK

Evaluation

1.1 Data Preparation

The tree of evaluation data:

├── data
│   ├── d3
│   ├── OVDEval
│   ├── omnilabel_val_v0.1.3
│   ├── coco
│   ├── object365
│   └── openimagesv5

To obtain the evaluation datasets, please refer to the following tools and URLs:

1.2 Model Checkpoint

We provide the model checkpoints Real-Model_base and Real-Model_tiny on Hugging Face; you can access them with the following commands:

# Make sure git-lfs is installed (https://git-lfs.com)
git lfs install

# When prompted for a password, use an access token with write permissions.
# Generate one from your settings: https://huggingface.co/settings/tokens
git clone https://huggingface.co/datasets/fishandwasabi/Real-LOD-Data
cd Real-LOD-Data/real-model-ckpts

1.3 Evaluation with single GPU

python tools/dist_test_real_model.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]

1.4 Evaluation with multi GPU

CUDA_VISIBLE_DEVICES=${GPU_IDs} bash tools/dist_test_real_model.sh ${CONFIG_FILE} ${CHECKPOINT_FILE} ${GPU_NUM} [optional arguments]

You can run python tools/dist_test_real_model.py --help to get detailed information about this script.

Detailed arguments
positional arguments:
  config                test config file path
  checkpoint            checkpoint file

optional arguments:
  -h, --help            show this help message and exit
  --work-dir WORK_DIR   the directory to save the file containing evaluation metrics
  --out OUT             dump predictions to a pickle file for offline evaluation
  --show                show prediction results
  --show-dir SHOW_DIR   directory where painted images will be saved. If specified, it will be automatically saved to the work_dir/timestamp/show_dir
  --wait-time WAIT_TIME
                        the interval of show (s)
  --cfg-options CFG_OPTIONS [CFG_OPTIONS ...]
                        override some settings in the used config, the key-value pair in xxx=yyy format will be merged into config file. If the value to be overwritten is a list, it should be like
                        key="[a,b]" or key=a,b It also allows nested list/tuple values, e.g. key="[(a,b),(c,d)]" Note that the quotation marks are necessary and that no white space is allowed.
  --launcher {none,pytorch,slurm,mpi}
                        job launcher
  --tta
  --local_rank LOCAL_RANK, --local-rank LOCAL_RANK

Examples

ICLR2025_REALMODEL_EXAMPLES

📖 Citation 🔝

If you find our repo useful for your research, please cite us:

@inproceedings{chen2025realigning,
  title={Re-Aligning Language to Visual Objects with an Agentic Workflow},
  author={Yuming Chen and Jiangyan Feng and Haodong Zhang and Lijun Gong and Feng Zhu and Rui Zhao and Qibin Hou and Ming-Ming Cheng and Yibing Song},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025},
  url={https://openreview.net/forum?id=MPJ4SMnScw}
}

There are also relevant citations for other outstanding works used in this repo:

@inproceedings{dang2024instructdet,
  title={Instruct{DET}: Diversifying Referring Object Detection with Generalized Instructions},
  author={Ronghao Dang and Jiangyan Feng and Haodong Zhang and Chongjian Ge and Lin Song and Lijun Gong and Chengju Liu and Qijun Chen and Feng Zhu and Rui Zhao and Yibing Song},
  booktitle={The Twelfth International Conference on Learning Representations},
  year={2024},
  url={https://openreview.net/forum?id=hss35aoQ1Y}
}

@article{mmdetection,
  title   = {{MMDetection}: Open MMLab Detection Toolbox and Benchmark},
  author  = {Chen, Kai and Wang, Jiaqi and Pang, Jiangmiao and Cao, Yuhang and
             Xiong, Yu and Li, Xiaoxiao and Sun, Shuyang and Feng, Wansen and
             Liu, Ziwei and Xu, Jiarui and Zhang, Zheng and Cheng, Dazhi and
             Zhu, Chenchen and Cheng, Tianheng and Zhao, Qijie and Li, Buyu and
             Lu, Xin and Zhu, Rui and Wu, Yue and Dai, Jifeng and Wang, Jingdong
             and Shi, Jianping and Ouyang, Wanli and Loy, Chen Change and Lin, Dahua},
  journal= {arXiv preprint arXiv:1906.07155},
  year={2019}
}

📜 License 🔝

This code is licensed under the Creative Commons Attribution-NonCommercial 4.0 International license for non-commercial use only. Please note that any commercial use of this code requires formal permission prior to use.

📮 Contact 🔝

For technical questions, please contact chenyuming[AT]mail.nankai.edu.cn.

For commercial licensing, please contact cmm[AT]nankai.edu.cn.

๐Ÿค Acknowledgement ๐Ÿ”

This repository borrows heavily from mmdetection, grounding-dino, peft, transformers, and chatglm.

For images from COCO, Objects365 and OpenImage, please see and follow their terms of use: MSCOCO, Objects365, and OpenImage.

The README file is based on those of LED and LE3D.

We also thank all of our contributors.
