ReVision: Multimodal Instruction Rewriting with Tiny Vision Language Models

🛠️ Install

Clone this repository and navigate to the MobileVLM folder.

Install Package

```Shell
conda create -n vir python=3.10 -y
conda activate vir
pip install --upgrade pip
pip install -r requirements.txt
```

Dataset

The dataset is available at: https://huggingface.co/datasets/anonymoususerrevision/multimodal_query_rewrites
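
One way to fetch the dataset files locally is through the `huggingface_hub` client. The sketch below is a minimal example and not part of this repository: the helper name `download_dataset` and the target directory `./data/multimodal_query_rewrites` are assumptions, and `huggingface_hub` may need to be installed separately if `requirements.txt` does not already pull it in.

```python
from pathlib import Path

# Dataset repository id taken from the dataset URL in this README.
DATASET_REPO_ID = "anonymoususerrevision/multimodal_query_rewrites"


def download_dataset(local_dir: str = "./data/multimodal_query_rewrites") -> Path:
    """Download a snapshot of the dataset repository and return its local path.

    `local_dir` is a hypothetical location; point it wherever you keep data.
    """
    # Imported lazily so the module can be imported without the dependency.
    from huggingface_hub import snapshot_download

    path = snapshot_download(
        repo_id=DATASET_REPO_ID,
        repo_type="dataset",  # required: the default repo type is "model"
        local_dir=local_dir,
    )
    return Path(path)
```

Calling `download_dataset()` returns the directory that holds the downloaded files, which can then be referenced from the training scripts.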

Pretrain

This step is optional: a pretrained version of the model is already shared in the paper. If you still want to pretrain, first prepare a model with randomly initialized parameters:

```Shell
python prepare_model_for_pretraining.py --vision_model_name_or_path google/siglip-base-patch16-256 --text_model_name_or_path OuteAI/Lite-Mistral-150M-v2-Instruct --dest ./ReVision-250M-64-16-random
```

In pretrain.py, set os.environ["HF_DATASETS_CACHE"] to a local directory to use as the Hugging Face datasets cache. If you plan to push the pretrained model to the Hugging Face Hub, also change anonymoususerrevision/ReVision-250M-64-16 to your desired model identifier. It is strongly advised to read through the pretraining code and change variable values as needed. To change arguments and training hyperparameters, see args.py.
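
The cache setting described here can be sketched as follows. This is an illustrative snippet under stated assumptions, not an excerpt from pretrain.py: the helper `configure_hf_cache`, the directory `./hf_datasets_cache`, and the identifier `your-username/ReVision-250M-64-16` are placeholders.

```python
import os
from pathlib import Path


def configure_hf_cache(cache_dir: str) -> Path:
    """Create `cache_dir` if needed and point HF_DATASETS_CACHE at it."""
    path = Path(cache_dir).expanduser().resolve()
    path.mkdir(parents=True, exist_ok=True)
    # `datasets` reads this variable when it is first imported, so set it
    # before any `import datasets` statement runs.
    os.environ["HF_DATASETS_CACHE"] = str(path)
    return path


# Hypothetical values; replace both with your own.
CACHE_DIR = "./hf_datasets_cache"
HUB_MODEL_ID = "your-username/ReVision-250M-64-16"

cache_path = configure_hf_cache(CACHE_DIR)
print(f"Datasets cache: {cache_path}")
print(f"Hub model id:   {HUB_MODEL_ID}")
```

Setting the variable at the very top of the script, before other imports, avoids the cache silently falling back to the default location.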

To pretrain, run the following command:

```Shell
python pretrain.py
```

Fine Tune

As with pretraining above, edit the code to point to the appropriate dataset, then run python finetune.py (or one of the other variants provided).

Inference

Various inference scripts are provided as test_*.py and evaluate.py. To run baseline experiments with PaliGemma or Qwen, use the appropriate processors and conditional-generation classes from Hugging Face in evaluate.py instead of ReVisionProcessor and ReVisionForConditionalGeneration. The prompt format also needs to be changed in datautils.py (specifically for PaliGemma).
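
A rough sketch of swapping in a baseline is shown below. It is not the code in evaluate.py: the `BASELINES` registry and `load_baseline` helper are hypothetical, the class names are those shipped in recent `transformers` releases, and the checkpoint ids are example choices that this repository does not specify (some checkpoints may require accepting a license on the Hub).

```python
# Hypothetical registry: baseline name -> (processor class, model class, checkpoint).
# The class names exist in recent `transformers` releases; the checkpoints are
# examples only and are not taken from this repository.
BASELINES = {
    "paligemma": (
        "AutoProcessor",
        "PaliGemmaForConditionalGeneration",
        "google/paligemma-3b-mix-224",
    ),
    "qwen": (
        "AutoProcessor",
        "Qwen2VLForConditionalGeneration",
        "Qwen/Qwen2-VL-2B-Instruct",
    ),
}


def load_baseline(name: str):
    """Return (processor, model) for a baseline listed in BASELINES."""
    if name not in BASELINES:
        raise KeyError(f"Unknown baseline {name!r}; choose from {sorted(BASELINES)}")
    processor_cls, model_cls, checkpoint = BASELINES[name]
    # Imported lazily: transformers is heavy and must be installed separately.
    import transformers

    processor = getattr(transformers, processor_cls).from_pretrained(checkpoint)
    model = getattr(transformers, model_cls).from_pretrained(checkpoint)
    return processor, model
```

The returned pair would take the place of ReVisionProcessor and ReVisionForConditionalGeneration; the prompt construction in datautils.py still has to be adapted separately.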

Terms of Use

The code is released under the Apache License 2.0, but the data is not. The image portion of the dataset comes from existing resources. To help the research community reproduce our work, we uploaded images.zip; it must not be used for any other purpose. Use of these images must comply with the licenses attached to the respective image sources. The archive may be taken down at any time at the request of the original owner or owners of the referenced images.
