- [2025/10/20]🔥🔥🔥TRUST-VL-13B checkpoint and TRUST-Instruct dataset are now publicly available!
- [2025/09/06]🚀🚀🚀TRUST-VL is released. Check out the paper for more details.
Take your first steps with the TRUST-VL model.
- Clone this repository and install the package:
```bash
git clone https://github.com/YanZehong/TRUST-VL.git
cd TRUST-VL
conda create -n trustvl python=3.10 -y
conda activate trustvl
pip install --upgrade pip
pip install -e .
```
- (Optional) Install additional packages for training:
```bash
pip install -e ".[train]"
pip install flash-attn==2.6.3 --no-build-isolation  # --no-cache-dir
```

Please check out 🤗 Huggingface Models for public TRUST-VL checkpoints.
```bash
git lfs install
git clone https://huggingface.co/NUSryan/TRUST-VL-13b-task
```
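Alternatively, the checkpoint can be fetched with the `huggingface_hub` Python API. This is a minimal sketch; the target directory below is just an example, not a path required by the scripts:

```python
# Download the public TRUST-VL checkpoint without git-lfs.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="NUSryan/TRUST-VL-13b-task",
    local_dir="TRUST-VL-13b-task",  # example target directory (assumption)
)
```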
TRUST-VL training consists of three stages: In Stage 1, we begin by training the projection module for one epoch on 1.2 million image–text pairs (653K news samples from VisualNews and 558K samples from the LLaVA training corpus). This stage aligns the visual features with the language model. In Stage 2, we jointly train the LLM and the projection module for one epoch using 665K synthetic conversation samples from the LLaVA training corpus to improve the model’s ability to follow complex instructions. In Stage 3, we fine-tune the full model on 198K reasoning samples from TRUST-Instruct for three epochs to further enhance its misinformation-specific reasoning capabilities.
Similar to LLaVA, TRUST-VL is trained on 8 A100 GPUs with 80GB memory. To train on fewer GPUs, you can reduce `per_device_train_batch_size` and increase `gradient_accumulation_steps` accordingly. Always keep the global batch size the same: `per_device_train_batch_size` x `gradient_accumulation_steps` x `num_gpus`.
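For example, halving the number of GPUs can be compensated by doubling the accumulation steps. The sketch below just spells out this arithmetic; the batch sizes used are illustrative, not the defaults of the released scripts:

```python
# Keep the global batch size constant when changing the GPU count.
def global_batch_size(per_device_train_batch_size, gradient_accumulation_steps, num_gpus):
    return per_device_train_batch_size * gradient_accumulation_steps * num_gpus

# 16 x 1 on 8 GPUs and 16 x 2 on 4 GPUs both give a global batch size of 128.
assert global_batch_size(16, 1, 8) == global_batch_size(16, 2, 4) == 128
```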
Please download the 1211K pretraining subset we use in the paper here; it combines the 653K news samples from VisualNews with the 558K LAION-CC-SBU subset from the LLaVA training corpus.
Training script with DeepSpeed ZeRO-2: trust_vl_stage1.sh.
- `--mm_projector_type mlp2x_gelu`: the two-layer MLP vision-language connector.
- `--vision_tower openai/clip-vit-large-patch14-336`: CLIP ViT-L/14, 336px.
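For reference, `mlp2x_gelu` denotes a two-layer MLP with a GELU activation in between. A minimal sketch is shown below; the 1024-d CLIP ViT-L/14-336 feature size and the 5120-d hidden size of a 13B Vicuna-style LLM are assumptions based on the named backbones, not values read from this repository:

```python
# Sketch of the "mlp2x_gelu" vision-language connector (dimensions are assumptions).
import torch.nn as nn

clip_feature_dim = 1024  # CLIP ViT-L/14-336 patch feature size (assumed)
llm_hidden_dim = 5120    # hidden size of a 13B Vicuna-style LLM (assumed)

mm_projector = nn.Sequential(
    nn.Linear(clip_feature_dim, llm_hidden_dim),
    nn.GELU(),
    nn.Linear(llm_hidden_dim, llm_hidden_dim),
)
```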
Please download the annotation file of our final instruction-tuning data mixture, llava_v1_5_mix665k.json, and download the images from the constituent datasets:
- COCO: train2017
- GQA: images
- OCR-VQA: download script; we save all files as `.jpg` (see the conversion sketch after this list)
- TextVQA: train_val_images
- VisualGenome: part1, part2
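Because the OCR-VQA download script may save images in mixed formats while the annotations reference `.jpg` files, here is a minimal conversion sketch. The directory path follows the layout shown in the tree below and assumes the folder contains only image files:

```python
# Convert non-.jpg OCR-VQA images to .jpg so filenames match the annotations.
from pathlib import Path
from PIL import Image

ocr_vqa_dir = Path("./data/ocr_vqa/images")  # assumed location, matching the tree below
for img_path in ocr_vqa_dir.iterdir():
    if not img_path.is_file() or img_path.suffix.lower() == ".jpg":
        continue
    Image.open(img_path).convert("RGB").save(img_path.with_suffix(".jpg"), "JPEG")
```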
After downloading all of them, organize the data as follows in `./data`:
```
├── coco
│   └── train2017
├── gqa
│   └── images
├── ocr_vqa
│   └── images
├── textvqa
│   └── train_images
└── vg
    ├── VG_100K
    └── VG_100K_2
```
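A quick sanity check that the layout above is in place before launching Stage 2 (folder names are taken from the tree above):

```python
# Verify that the Stage-2 image folders exist under ./data.
from pathlib import Path

data_root = Path("./data")
expected = [
    "coco/train2017",
    "gqa/images",
    "ocr_vqa/images",
    "textvqa/train_images",
    "vg/VG_100K",
    "vg/VG_100K_2",
]
missing = [d for d in expected if not (data_root / d).is_dir()]
print("All image folders found." if not missing else f"Missing folders: {missing}")
```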
Training script with DeepSpeed ZeRO-3: trust_vl_stage2.sh.
Please download the annotation file of our final instruction-tuning data mixture, TRUST-Instruct_task198k.json, and download the images from the constituent datasets:
- VisualNews:
  - Request the VisualNews dataset here.
  - Place the files under the `./data` folder.
- NewsCLIPpings:
  - Git clone the `news_clippings` repository.
  - Run `./download.sh`. More details can be found here.
  - Download the already-collected evidence according to the instructions here.
- DGM4:
  - Download the DGM4 dataset through this link: DGM4.
- Factify2:
  - Download the Factify2 dataset according to the instructions here.
- MMFakeBench:
  - Strictly follow the data usage guidelines by filling in the Data Usage Protocol on Huggingface from MMFakeBench.
After downloading all of them, organize the data as follows in `./data`:
```
├── origin
│   ├── bbc
│   ├── guardian
│   ├── usa_today
│   ├── washington_post
│   └── data.json
├── DGM4
│   ├── manipulation
│   ├── metadata
│   └── origin
├── Factify2
│   ├── data
│   └── images-train
└── MMFakeBench
    ├── fake
    ├── real
    └── source
```
Training script with DeepSpeed ZeRO-3: trust_vl_stage3.sh.
In TRUST-VL, we evaluate models on a diverse set of 7 misinformation benchmarks.
```bash
# Single GPU inference.
CUDA_VISIBLE_DEVICES=0 bash scripts/eval/mmfakebench.sh
CUDA_VISIBLE_DEVICES=0 bash scripts/eval/ood.sh

# Multi-GPU inference.
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/eval/newsclippings.sh
```

Note: Please ensure that the corresponding image data for each evaluation dataset has been properly downloaded before running the evaluation.
If you find our paper and code useful in your research, please consider giving a star ⭐ and citation 📝 :)
```bibtex
@inproceedings{yan-etal-2025-trust,
    title = "{TRUST}-{VL}: An Explainable News Assistant for General Multimodal Misinformation Detection",
    author = "Yan, Zehong and
      Qi, Peng and
      Hsu, Wynne and
      Lee, Mong-Li",
    booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.emnlp-main.284/",
    pages = "5588--5604",
    ISBN = "979-8-89176-332-6",
}
```

We would like to thank LLaVA and Vicuna for their amazing work. We also appreciate the benchmarks: MMFakeBench, Factify2, DGM4, NewsCLIPpings, MOCHEG, Fakeddit, VERITE and VisualNews.
Usage and License Notices: This project utilizes certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of these original licenses, including but not limited to the OpenAI Terms of Use for the dataset and the specific licenses for base language models. This project does not impose any additional constraints beyond those stipulated in the original licenses. Furthermore, users are reminded to ensure that their use of the dataset and checkpoints is in compliance with all applicable laws and regulations.
