BayesianVLA: Bayesian Decomposition of Vision Language Action Models via Latent Action Queries
Abstract
BayesianVLA addresses language-action grounding issues in robot manipulation by using Bayesian decomposition to prevent information collapse and improve out-of-distribution generalization.
Vision-Language-Action (VLA) models have shown promise in robot manipulation but often struggle to generalize to new instructions or complex multi-task scenarios. We identify a critical pathology in current training paradigms where goal-driven data collection creates a dataset bias. In such datasets, language instructions are highly predictable from visual observations alone, causing the conditional mutual information between instructions and actions to vanish, a phenomenon we term Information Collapse. Consequently, models degenerate into vision-only policies that ignore language constraints and fail in out-of-distribution (OOD) settings. To address this, we propose BayesianVLA, a novel framework that enforces instruction following via Bayesian decomposition. By introducing learnable Latent Action Queries, we construct a dual-branch architecture to estimate both a vision-only prior $p(a \mid v)$ and a language-conditioned posterior $\pi(a \mid v, \ell)$. We then optimize the policy to maximize the conditional Pointwise Mutual Information (PMI) between actions and instructions. This objective effectively penalizes the vision shortcut and rewards actions that explicitly explain the language command. Without requiring new data, BayesianVLA significantly improves generalization. Extensive experiments on SimplerEnv and RoboCasa demonstrate substantial gains, including an 11.3% improvement on the challenging OOD SimplerEnv benchmark, validating the ability of our approach to robustly ground language in action.
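Concretely (our reading of the abstract, not necessarily the paper's exact formulation), the conditional PMI between an observed action $a$ and instruction $\ell$ given observation $v$ reduces to a log-likelihood ratio between the two branches:

$$\operatorname{PMI}(a;\ell \mid v) \;=\; \log \frac{\pi(a \mid v, \ell)}{p(a \mid v)} \;=\; \log \pi(a \mid v, \ell) \;-\; \log p(a \mid v)$$

Maximizing this quantity rewards actions that the instruction explains better than vision alone, which is exactly what penalizes the vision shortcut.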
Community
Architecture
BayesianVLA is a novel framework designed to solve the Vision Shortcut problem in Vision-Language-Action (VLA) models.
In current VLA training, goal-driven datasets often make language instructions highly predictable from visual observations alone. This leads to Information Collapse, where the model ignores language and degenerates into a vision-only policy, failing miserably in out-of-distribution (OOD) scenarios.
BayesianVLA addresses this by:
- Bayesian Decomposition: Explicitly modeling a vision-only prior $p(a|v)$ and a language-conditioned posterior $\pi(a|v, \ell)$.
- LLR Optimization: Maximizing the Log-Likelihood Ratio (LLR) to penalize actions that rely solely on visual cues and reward actions that are truly grounded in language instructions (a minimal objective sketch follows below).
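A minimal PyTorch-style sketch of how such an LLR/PMI term could enter training, assuming `posterior_logp` and `prior_logp` are log-probabilities of the demonstrated action under the two branches; the weights and the combination with a behavior-cloning term are our assumptions, not the paper's released code:

```python
import torch

def pmi_objective(posterior_logp: torch.Tensor,
                  prior_logp: torch.Tensor,
                  bc_weight: float = 1.0,
                  pmi_weight: float = 0.1) -> torch.Tensor:
    """Behavior cloning on the posterior plus a conditional-PMI (log-likelihood-ratio) bonus.

    posterior_logp: log pi(a | v, l) of the demonstrated action (shape: [batch]).
    prior_logp:     log p(a | v) of the same action under the vision-only branch.
    The PMI term is positive when language explains the action better than vision alone,
    so minimizing the returned loss pushes the policy away from the vision shortcut.
    """
    bc_term = -posterior_logp.mean()                  # standard imitation / max-likelihood term
    pmi_term = -(posterior_logp - prior_logp).mean()  # maximize log pi(a|v,l) - log p(a|v)
    return bc_weight * bc_term + pmi_weight * pmi_term
```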
✨ Key Features
- Dual-Branch Architecture: Uses learnable Latent Action Queries to decouple vision-only and language-conditioned action distributions (see the architecture sketch after this list).
- Zero Extra Data: Achieves significant performance gains (e.g., +11.3% on SimplerEnv) using the exact same datasets as baselines.
- Preserves VLM Intelligence: Effectively regularizes the model to prevent the "catastrophic forgetting" of general multimodal reasoning capabilities common in standard VLA fine-tuning.
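For intuition, here is a minimal sketch of a dual-branch head with shared learnable latent action queries. Layer types, sizes, pooling, and the plain regression heads are our assumptions for illustration, not the official implementation:

```python
import torch
import torch.nn as nn

class DualBranchActionHead(nn.Module):
    """Dual-branch head with shared learnable latent action queries (illustrative only)."""

    def __init__(self, d_model: int = 512, num_queries: int = 8, action_dim: int = 7):
        super().__init__()
        # Learnable Latent Action Queries, shared by both branches.
        self.action_queries = nn.Parameter(torch.randn(num_queries, d_model) * 0.02)
        self.prior_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True), num_layers=2)
        self.posterior_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True), num_layers=2)
        self.prior_head = nn.Linear(d_model, action_dim)      # predicts a from v only
        self.posterior_head = nn.Linear(d_model, action_dim)  # predicts a from v and l

    def forward(self, vision_tokens: torch.Tensor, language_tokens: torch.Tensor):
        # vision_tokens: (B, Nv, d_model); language_tokens: (B, Nl, d_model)
        B = vision_tokens.size(0)
        queries = self.action_queries.unsqueeze(0).expand(B, -1, -1)
        # Prior branch: queries cross-attend to visual tokens only -> p(a | v).
        prior_feat = self.prior_decoder(queries, vision_tokens)
        # Posterior branch: queries attend to vision and language -> pi(a | v, l).
        posterior_feat = self.posterior_decoder(
            queries, torch.cat([vision_tokens, language_tokens], dim=1))
        # Pool the latent action queries and regress an action per branch.
        return (self.prior_head(prior_feat.mean(dim=1)),
                self.posterior_head(posterior_feat.mean(dim=1)))
```

In the full model each branch would parameterize an action distribution, and the resulting log-likelihoods feed the LLR/PMI objective above; the sketch regresses point actions only to stay short.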
arXiv explained breakdown of this paper: https://arxivexplained.com/papers/bayesianvla-bayesian-decomposition-of-vision-language-action-models-via-latent-action-queries
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Seeing to Act, Prompting to Specify: A Bayesian Factorization of Vision Language Action Policy (2025)
- TwinBrainVLA: Unleashing the Potential of Generalist VLMs for Embodied Tasks via Asymmetric Mixture-of-Transformers (2026)
- InternVLA-A1: Unifying Understanding, Generation and Action for Robotic Manipulation (2026)
- CLAP: Contrastive Latent Action Pretraining for Learning Vision-Language-Action Models from Human Videos (2026)
- LatBot: Distilling Universal Latent Actions for Vision-Language-Action Models (2025)
- Unifying Perception and Action: A Hybrid-Modality Pipeline with Implicit Visual Chain-of-Thought for Robotic Action Generation (2025)
- mimic-video: Video-Action Models for Generalizable Robot Control Beyond VLAs (2025)