TRUST-VL Model Card

Model Details

TRUST-VL is a unified and explainable vision-language model for general multimodal misinformation detection. It incorporates a novel Question-Aware Visual Amplifier module, designed to extract task-specific visual features. To support training, we also construct TRUST-Instruct, a large-scale instruction dataset containing 198K samples featuring structured reasoning chains aligned with human fact-checking workflows. Extensive experiments on both in-domain and zero-shot benchmarks demonstrate that TRUST-VL achieves state-of-the-art performance, while also offering strong generalization and interpretability.
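
The exact design of the Question-Aware Visual Amplifier is described in the paper; as a rough illustration only, the sketch below shows one way such a module could condition visual features on a task question via cross-attention. All class names, dimensions, and layout choices here are assumptions for illustration, not the paper's implementation.

```python
# Conceptual sketch (PyTorch): question-conditioned re-weighting of visual features.
# Names, dimensions, and the attention layout are illustrative assumptions,
# not TRUST-VL's actual Question-Aware Visual Amplifier.
import torch
import torch.nn as nn

class QuestionAwareVisualAmplifier(nn.Module):
    def __init__(self, vis_dim: int = 1024, txt_dim: int = 4096, num_heads: int = 8):
        super().__init__()
        self.q_proj = nn.Linear(txt_dim, vis_dim)  # map question tokens into the visual space
        self.cross_attn = nn.MultiheadAttention(vis_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(vis_dim)

    def forward(self, visual_tokens: torch.Tensor, question_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens:   (batch, num_patches, vis_dim) from the vision encoder
        # question_tokens: (batch, num_q_tokens, txt_dim) embedding of the task question
        q = self.q_proj(question_tokens)
        # Each visual token attends over the question tokens, so the residual update
        # emphasizes image evidence relevant to the question being asked.
        attended, _ = self.cross_attn(query=visual_tokens, key=q, value=q)
        return self.norm(visual_tokens + attended)

# Shape check with dummy tensors
amplifier = QuestionAwareVisualAmplifier()
vis = torch.randn(2, 576, 1024)   # e.g. a 24x24 patch grid
txt = torch.randn(2, 32, 4096)    # question token embeddings
print(amplifier(vis, txt).shape)  # torch.Size([2, 576, 1024])
```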

TRUST-VL training consists of three stages:

Stage 1 (visual feature alignment): We train the projection module for one epoch on 1.2 million image–text pairs (653K news samples from VisualNews and 558K samples from the LLaVA training corpus). This stage aligns the visual features with the language model.

Stage 2 (instruction tuning): We jointly train the LLM and the projection module for one epoch on 665K synthetic conversation samples from the LLaVA training corpus to improve the model's ability to follow complex instructions.

Stage 3 (misinformation reasoning): We fine-tune the full model on 198K reasoning samples from TRUST-Instruct for three epochs to further enhance its misinformation-specific reasoning capabilities.
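
For reference, a minimal inference sketch is shown below. The repository id, Auto classes, and prompt format are assumptions based on the LLaVA-style pipeline described above; please check the files shipped with this checkpoint for the exact loading code.

```python
# Hedged usage sketch. The repo id, Auto classes, and prompt template are
# assumptions (TRUST-VL follows a LLaVA-style pipeline); adapt them to the
# configuration files actually released with this model.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

model_id = "TRUST-VL"  # placeholder: replace with the actual Hugging Face repo id

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # the released weights are stored in BF16
    device_map="auto",
    trust_remote_code=True,
)

image = Image.open("news_photo.jpg")
claim = "This photo shows flooding in Venice in 2024."
prompt = f"Claim: {claim}\nDoes the image support this claim? Explain step by step."

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=512)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```

Under the training setup above, the model is expected to return a structured reasoning chain together with a veracity judgment, in the style of TRUST-Instruct.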

Resources for More Information

Paper: https://arxiv.org/abs/2509.04448

Citation

If you find our paper and code useful in your research, please consider giving a star ⭐ and citation 📝 :)

@article{yan2025trustvl,
  title={{TRUST-VL}: An Explainable News Assistant for General Multimodal Misinformation Detection},
  author={Yan, Zehong and Qi, Peng and Hsu, Wynne and Lee, Mong Li},
  journal={arXiv preprint arXiv:2509.04448},
  year={2025}
}
Model size: 14B parameters (Safetensors; BF16 and I64 tensors)