paper seminar_251001
Reconstruction Alignment Improves Unified Multimodal Models
Paper • 2509.07295 • Published • 40

F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions
Paper • 2509.06951 • Published • 32

UMO: Scaling Multi-Identity Consistency for Image Customization via Matching Reward
Paper • 2509.06818 • Published • 29

Interleaving Reasoning for Better Text-to-Image Generation
Paper • 2509.06945 • Published • 14

RewardDance: Reward Scaling in Visual Generation
Paper • 2509.08826 • Published • 73

Q-Sched: Pushing the Boundaries of Few-Step Diffusion Models with Quantization-Aware Scheduling
Paper • 2509.01624 • Published • 7

Directly Aligning the Full Diffusion Trajectory with Fine-Grained Human Preference
Paper • 2509.06942 • Published • 17

Understand Before You Generate: Self-Guided Training for Autoregressive Image Generation
Paper • 2509.15185 • Published • 29

LLM-I: LLMs are Naturally Interleaved Multimodal Creators
Paper • 2509.13642 • Published • 9

Image Tokenizer Needs Post-Training
Paper • 2509.12474 • Published • 8

InfGen: A Resolution-Agnostic Paradigm for Scalable Image Synthesis
Paper • 2509.10441 • Published • 30

HuMo: Human-Centric Video Generation via Collaborative Multi-Modal Conditioning
Paper • 2509.08519 • Published • 128

MOSAIC: Multi-Subject Personalized Generation via Correspondence-Aware Alignment and Disentanglement
Paper • 2509.01977 • Published • 12

GenCompositor: Generative Video Compositing with Diffusion Transformer
Paper • 2509.02460 • Published • 25

Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning
Paper • 2508.20751 • Published • 89

Mixture of Contexts for Long Video Generation
Paper • 2508.21058 • Published • 35

MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer
Paper • 2509.16197 • Published • 56

Lynx: Towards High-Fidelity Personalized Video Generation
Paper • 2509.15496 • Published • 12

OmniInsert: Mask-Free Video Insertion of Any Reference via Diffusion Transformer Models
Paper • 2509.17627 • Published • 66

Lavida-O: Elastic Large Masked Diffusion Models for Unified Multimodal Understanding and Generation
Paper • 2509.19244 • Published • 11

Hyper-Bagel: A Unified Acceleration Framework for Multimodal Understanding and Generation
Paper • 2509.18824 • Published • 22

VChain: Chain-of-Visual-Thought for Reasoning in Video Generation
Paper • 2510.05094 • Published • 37

Free Lunch Alignment of Text-to-Image Diffusion Models without Preference Image Pairs
Paper • 2509.25771 • Published • 10

Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation
Paper • 2510.01284 • Published • 34

Self-Forcing++: Towards Minute-Scale High-Quality Video Generation
Paper • 2510.02283 • Published • 96

UltraGen: High-Resolution Video Generation with Hierarchical Attention
Paper • 2510.18775 • Published • 17