---
license: apache-2.0
base_model:
- Qwen/Qwen2.5-Omni-7B
tags:
- audiovisual
- video
- captioner
---

# AVoCaDO: An AudioVisual Video Captioner Driven by Temporal Orchestration

## ✨ Overview

Audiovisual video captioning aims to generate semantically rich descriptions with temporal alignment between visual and auditory events, thereby benefiting both video understanding and generation. We introduce AVoCaDO, a powerful audiovisual video captioner driven by the temporal orchestration between audio and visual modalities. Experimental results demonstrate that AVoCaDO significantly outperforms existing open-source models across four audiovisual video captioning benchmarks, and also achieves competitive performance under visual-only settings.

## 🚀 Getting Started

Please refer to our [GitHub repository](https://github.com/AVoCaDO-Captioner/AVoCaDO) for more details.

## ✒️ Citation

If you find our work helpful for your research, please consider giving a star ⭐ and citing our paper. We appreciate your support!

```bibtex
@article{chen2025avocado,
  title={AVoCaDO: An Audiovisual Video Captioner Driven by Temporal Orchestration},
  author={Chen, Xinlong and Ding, Yue and Lin, Weihong and Hua, Jingyun and Yao, Linli and Shi, Yang and Li, Bozhou and Zhang, Yuanxing and Liu, Qiang and Wan, Pengfei and others},
  journal={arXiv preprint arXiv:2510.10395},
  year={2025}
}
```