Collections
Discover the best community collections!

Collections including paper arxiv:2507.07966

- EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
  Paper • 2402.04252 • Published • 29
- Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
  Paper • 2402.03749 • Published • 14
- ScreenAI: A Vision-Language Model for UI and Infographics Understanding
  Paper • 2402.04615 • Published • 44
- EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
  Paper • 2402.05008 • Published • 23

- Can Large Language Models Understand Context?
  Paper • 2402.00858 • Published • 23
- OLMo: Accelerating the Science of Language Models
  Paper • 2402.00838 • Published • 85
- Self-Rewarding Language Models
  Paper • 2401.10020 • Published • 151
- SemScore: Automated Evaluation of Instruction-Tuned LLMs based on Semantic Textual Similarity
  Paper • 2401.17072 • Published • 25

- Scaling RL to Long Videos
  Paper • 2507.07966 • Published • 159
- Group Sequence Policy Optimization
  Paper • 2507.18071 • Published • 316
- CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning
  Paper • 2507.14111 • Published • 23
- MaPPO: Maximum a Posteriori Preference Optimization with Prior Knowledge
  Paper • 2507.21183 • Published • 14

- vrgamedevgirl84/Wan14BT2VFusioniX
  Text-to-Video • Updated • 596
- TheStageAI/Elastic-mochi-1-preview
  Text-to-Video • Updated • 4 • 2
- nesaorg/animatediff-base
  Text-to-Video • Updated • 68
- 4Real-Video-V2: Fused View-Time Attention and Feedforward Reconstruction for 4D Scene Generation
  Paper • 2506.18839 • Published • 12

- Self-Forcing++: Towards Minute-Scale High-Quality Video Generation
  Paper • 2510.02283 • Published • 96
- Paper2Video: Automatic Video Generation from Scientific Papers
  Paper • 2510.05096 • Published • 118
- LongLive: Real-time Interactive Long Video Generation
  Paper • 2509.22622 • Published • 184
- HuMo: Human-Centric Video Generation via Collaborative Multi-Modal Conditioning
  Paper • 2509.08519 • Published • 128

- Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers
  Paper • 2506.23918 • Published • 89
- LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale
  Paper • 2504.16030 • Published • 36
- Time Blindness: Why Video-Language Models Can't See What Humans Can?
  Paper • 2505.24867 • Published • 80
- GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
  Paper • 2507.01006 • Published • 250

- Advancing Multimodal Reasoning: From Optimized Cold Start to Staged Reinforcement Learning
  Paper • 2506.04207 • Published • 48
- SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models
  Paper • 2504.11468 • Published • 30
- RLPR: Extrapolating RLVR to General Domains without Verifiers
  Paper • 2506.18254 • Published • 31
- Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning
  Paper • 2507.05255 • Published • 74