CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models (Zhao et al.)
May 20, 2026
-vanilla VLAs lack temporal reasoning and planning because they mostly learn a direct image + text to action mapping, without intermediate reasoning
-visual CoT is data-efficient because
1.it already exists in robot datasets as intermediate frames
2.it does not require extra annotations
hybrid attention
1.causal attention for visual tokens
2.full attention for action sequence generation
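A minimal sketch of what such a hybrid mask could look like, assuming the sequence is laid out as a prefix (text + visual tokens) followed by a block of action tokens. The function name, the layout, and the boolean-matrix representation are assumptions made for illustration and are not taken from the paper's code.

```python
def hybrid_attention_mask(num_prefix_tokens, num_action_tokens):
    """Return mask[i][j] == True when query token i may attend to key token j.

    Assumed layout (hypothetical): [prefix tokens | action tokens].
    """
    total = num_prefix_tokens + num_action_tokens
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if i < num_prefix_tokens:
                # Causal part: a prefix token sees itself and earlier tokens only.
                mask[i][j] = j <= i
            else:
                # Full-attention part: an action token sees the whole prefix
                # and every action token, including later ones.
                mask[i][j] = True
    return mask


# 3 prefix tokens followed by 2 action tokens.
for row in hybrid_attention_mask(3, 2):
    print("".join("1" if allowed else "0" for allowed in row))
# 10000
# 11000
# 11100
# 11111
# 11111
```

The lower-right block of ones is what lets all action tokens be predicted jointly instead of one at a time.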
action-less video data - videos without robot action labels or control signals
RQ-VAE (residual-quantized variational autoencoder) - compresses continuous observations into a compact sequence of discrete tokens
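The "residual-quantized" part can be sketched in isolation: each stage picks the nearest codebook entry, subtracts it, and passes the leftover residual to the next stage. This is only the quantizer, under toy assumptions; the encoder and decoder networks of an actual RQ-VAE are omitted, and the codebooks, function names, and 2-D vectors below are made up for illustration.

```python
def nearest_code(vector, codebook):
    """Index of the codebook entry closest to `vector` (squared L2 distance)."""
    def sq_dist(code):
        return sum((v - c) ** 2 for v, c in zip(vector, code))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def residual_quantize(vector, codebooks):
    """Quantize in stages; each stage encodes what the earlier stages missed."""
    residual = list(vector)
    codes = []
    for codebook in codebooks:
        index = nearest_code(residual, codebook)
        codes.append(index)
        residual = [r - c for r, c in zip(residual, codebook[index])]
    return codes, residual


def reconstruct(codes, codebooks):
    """Sum the selected entry from every stage to approximate the input."""
    out = [0.0] * len(codebooks[0][0])
    for index, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[index])]
    return out


# Toy codebooks (hypothetical values): a coarse stage and a finer stage.
coarse = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fine = [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [0.25, 0.25]]

codes, leftover = residual_quantize([1.2, 0.3], [coarse, fine])
print(codes)                              # [1, 3]
print(reconstruct(codes, [coarse, fine]))  # [1.25, 0.25]
```

The discrete token sequence is the list of code indices, one per stage; adding stages shrinks the leftover residual without enlarging any single codebook.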
-each action is represented by 7 tokens, one per action dimension, with each dimension discretized independently from the original continuous action space
-for each dimension, take the 1st-99th percentile range of the training data and divide it uniformly into 256 bins for autoregressive action prediction
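The discretization above can be sketched as follows. This is an illustrative version only: the helper names, the linear-interpolation percentile, clipping of out-of-range values to the end bins, and decoding a token to its bin center are all assumptions, not details confirmed by the paper. It also assumes the 99th percentile is strictly larger than the 1st in every dimension.

```python
NUM_BINS = 256


def percentile(sorted_values, q):
    """Linear-interpolation percentile of an already sorted list, q in [0, 100]."""
    pos = (len(sorted_values) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def fit_bounds(actions):
    """Per-dimension (low, high) bounds from the 1st and 99th percentiles."""
    bounds = []
    for d in range(len(actions[0])):
        column = sorted(a[d] for a in actions)
        bounds.append((percentile(column, 1), percentile(column, 99)))
    return bounds


def discretize(action, bounds):
    """Map one continuous action to one token per dimension, each in [0, 255]."""
    tokens = []
    for value, (low, high) in zip(action, bounds):
        clipped = min(max(value, low), high)   # outliers fall into the end bins
        scaled = (clipped - low) / (high - low)
        tokens.append(min(int(scaled * NUM_BINS), NUM_BINS - 1))
    return tokens


def undiscretize(tokens, bounds):
    """Map tokens back to the center of their bins."""
    return [low + (t + 0.5) / NUM_BINS * (high - low)
            for t, (low, high) in zip(tokens, bounds)]


# Toy 7-dimensional dataset (made-up values): every dimension sweeps 0.00 .. 1.00.
dataset = [[i / 100] * 7 for i in range(101)]
bounds = fit_bounds(dataset)
tokens = discretize([0.3] * 7, bounds)
print(len(tokens))   # 7
```

Using percentiles instead of the min and max keeps a few extreme actions from stretching the range and wasting most of the 256 bins.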
Unified VLA vs. Modular VLA
Unified Model / End-to-End VLA
-RT-2 / OpenVLA-ish philosophy
-architecture
-vision + language + actions + maybe world modeling
-one transformer
-token or action prediction
Pros
-simple and scalable
-shared multimodal representations
-end-to-end learning
-benefits directly from LLM scaling
Cons
-harder to debug
-less controllable and interpretable
-harder to satisfy real-time control constraints
-one model has to learn everything
Modular Architecture
-Physical Intelligence (PI) / pi0-ish, more classical robotics philosophy
-architecture
-perception or VLM
-world model or subgoal planner
-action policy
-low-level controller
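A rough sketch of how the four stages might be chained, to make the interfaces between modules concrete. Everything here is hypothetical: the class, the stage signatures, and the toy stand-in functions are illustrative assumptions and do not describe any particular system.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ModularPolicy:
    perceive: Callable[[List[float], str], dict]      # perception or VLM
    plan_subgoal: Callable[[dict], List[float]]       # world model or subgoal planner
    act: Callable[[dict, List[float]], List[float]]   # action policy
    control: Callable[[List[float]], List[float]]     # low-level controller

    def step(self, image, instruction):
        # Each stage only sees what the previous stage hands over.
        scene = self.perceive(image, instruction)
        subgoal = self.plan_subgoal(scene)
        action = self.act(scene, subgoal)
        return self.control(action)


# Toy stand-ins so the sketch runs; real modules would be learned or hand-engineered.
policy = ModularPolicy(
    perceive=lambda image, text: {"position": image[:2], "instruction": text},
    plan_subgoal=lambda scene: [1.0, 1.0],
    act=lambda scene, goal: [g - p for g, p in zip(goal, scene["position"])],
    control=lambda action: [min(max(a, -0.1), 0.1) for a in action],  # clip per step
)

print(policy.step([0.0, 0.5], "move to the target"))  # [0.1, 0.1]
```

Because each stage can be swapped or tested alone, debugging is easier, but a mistake in `perceive` flows unchanged into every later stage.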
Pros
-specialized modules
-easier debugging and control
-better latency and safety handling
-stronger control integration
Cons
-more engineering complexity
-harder end-to-end optimization
-interface bottlenecks
-error propagation across modules