MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations Paper • 2406.09401 • Published Jun 13, 2024
Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models Paper • 2505.17015 • Published May 22, 2025 • 9
MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence Paper • 2505.23764 • Published May 29, 2025 • 3
OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding Paper • 2507.07984 • Published Jul 10, 2025 • 42
VFlowOpt: A Token Pruning Framework for LMMs with Visual Information Flow-Guided Optimization Paper • 2508.05211 • Published Aug 7, 2025 • 1
G$^2$VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning Paper • 2511.21688 • Published 29 days ago • 8
Learning Video Generation for Robotic Manipulation with Collaborative Trajectory Control Paper • 2506.01943 • Published Jun 2, 2025 • 25
EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI Paper • 2312.16170 • Published Dec 26, 2023 • 1
Fine-Grained Cross-View Geo-Localization Using a Correlation-Aware Homography Estimator Paper • 2308.16906 • Published Aug 31, 2023 • 1
PointLLM: Empowering Large Language Models to Understand Point Clouds Paper • 2308.16911 • Published Aug 31, 2023 • 2
MV-JAR: Masked Voxel Jigsaw and Reconstruction for LiDAR-Based Self-Supervised Pre-Training Paper • 2303.13510 • Published Mar 23, 2023 • 1