Describe Anything • Collection • Multimodal Large Language Models for Detailed Localized Image and Video Captioning • 7 items • Updated 9 days ago • 61
DeH4R: A Decoupled and Hybrid Method for Road Network Graph Extraction • Paper • 2508.13669 • Published Aug 19, 2025 • 1
SIU3R: Simultaneous Scene Understanding and 3D Reconstruction Beyond Feature Alignment • Paper • 2507.02705 • Published Jul 3, 2025 • 2
ScaleCap: Inference-Time Scalable Image Captioning via Dual-Modality Debiasing • Paper • 2506.19848 • Published Jun 24, 2025 • 26
Qwen2.5-VL • Collection • Vision-language model series based on Qwen2.5 • 11 items • Updated 1 day ago • 550
Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding • Paper • 2504.10465 • Published Apr 14, 2025 • 27
OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding • Paper • 2406.19389 • Published Jun 27, 2024 • 54