CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models (Zhao et al.)

May 20, 2026

-vanilla VLAs lack temporal reasoning and planning because they mostly learn a direct image + text to action mapping, without intermediate reasoning

-visual CoT is data-efficient because

1.it already exists in robot datasets as intermediate frames

2.it does not require extra annotations
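A minimal sketch of why no extra annotation is needed: a later frame of the same trajectory can serve as the visual subgoal. The function name `make_cot_samples`, the `horizon` parameter, and the dictionary keys are hypothetical, chosen for illustration and not taken from the paper.

```python
def make_cot_samples(frames, actions=None, horizon=2):
    """Slice one trajectory into (observation, subgoal, actions) samples.

    frames:  list of images (any objects), one per timestep
    actions: optional list of per-step actions; None for video without action labels
    horizon: how many steps ahead the subgoal frame is taken (assumed value)
    """
    samples = []
    for t in range(len(frames) - horizon):
        samples.append({
            "observation": frames[t],
            # The subgoal is simply a future frame already present in the data.
            "subgoal": frames[t + horizon],
            # Actions between the observation and the subgoal, if labeled.
            "actions": None if actions is None else actions[t:t + horizon],
        })
    return samples
```

With `actions=None` the same slicing still yields observation and subgoal pairs, so unlabeled video can feed the subgoal-prediction part of training.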

hybrid attention

1.causal attention for visual tokens

2.full attention for action sequence generation
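The two attention patterns can be combined in a single mask. This is a rough sketch, assuming the sequence is laid out as visual tokens followed by action tokens. The function name and the layout are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def hybrid_attention_mask(num_visual: int, num_action: int) -> np.ndarray:
    """Return a boolean mask where mask[i, j] is True if query i may attend to key j."""
    n = num_visual + num_action
    # Start from a causal (lower-triangular) mask over the whole sequence.
    mask = np.tril(np.ones((n, n), dtype=bool))
    # Let action tokens attend to every other action token (full attention).
    mask[num_visual:, num_visual:] = True
    return mask

mask = hybrid_attention_mask(num_visual=3, num_action=2)
print(mask.astype(int))
```

Visual tokens still cannot see later tokens, while the action block is fully connected and still sees the whole visual prefix.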

action-less video data - videos without robot action labels or control signals

RQ-VAE (residual-quantized variational autoencoder) - compresses continuous observations into a compact sequence of discrete tokens
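The "residual-quantized" part can be sketched as below: each level picks the nearest code for whatever the previous levels left unexplained. This assumes the codebooks are already learned (random arrays stand in for them here) and leaves out the encoder, decoder, and training. All names are hypothetical.

```python
import numpy as np

def residual_quantize(z, codebooks):
    """Quantize vector z with a stack of codebooks, one code index per level.

    z:         array of shape (D,)
    codebooks: list of arrays, each of shape (K, D)
    Returns (codes, approx) where approx is the sum of the selected codes.
    """
    residual = np.asarray(z, dtype=float).copy()
    approx = np.zeros_like(residual)
    codes = []
    for codebook in codebooks:
        # Nearest code to the current residual under squared L2 distance.
        dists = np.sum((codebook - residual) ** 2, axis=1)
        idx = int(np.argmin(dists))
        codes.append(idx)
        approx += codebook[idx]
        residual -= codebook[idx]
    return codes, approx

rng = np.random.default_rng(0)
books = [rng.normal(size=(16, 4)) * s for s in (1.0, 0.5, 0.25)]
codes, approx = residual_quantize(rng.normal(size=4), books)
```

The list `codes` is the compact discrete representation: one small integer per level instead of a continuous vector.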

-action tokens: each action a_i is represented by 7 tokens, with each dimension discretized independently from the original continuous action space

-for each dimension, take the 1st-99th percentile range and divide it into 256 uniform bins for autoregressive action prediction
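The binning scheme can be sketched as below. This assumes the percentiles are computed per dimension over a dataset of shape (N, 7) and that out-of-range values are clipped. The function names and the bin-center decoding are assumptions for illustration.

```python
import numpy as np

NUM_BINS = 256

def fit_bounds(actions):
    """Per-dimension 1st and 99th percentiles over actions of shape (N, D)."""
    low = np.percentile(actions, 1, axis=0)
    high = np.percentile(actions, 99, axis=0)
    return low, high

def discretize(action, low, high):
    """Map a continuous action of shape (D,) to D integer tokens in [0, 255]."""
    clipped = np.clip(action, low, high)
    span = np.maximum(high - low, 1e-8)  # guard against a constant dimension
    scaled = (clipped - low) / span      # now in [0, 1]
    return np.minimum((scaled * NUM_BINS).astype(np.int64), NUM_BINS - 1)

def undiscretize(tokens, low, high):
    """Map tokens back to the centers of their bins."""
    centers = (tokens + 0.5) / NUM_BINS
    return low + centers * (high - low)

rng = np.random.default_rng(0)
dataset = rng.normal(size=(1000, 7))      # stand-in for 7-D robot actions
low, high = fit_bounds(dataset)
tokens = discretize(dataset[0], low, high)  # 7 tokens, one per dimension
```

Using percentiles instead of the min and max keeps a few outliers from stretching the range and wasting bins.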

Unified VLA vs. Modular VLA

-Unified VLA: RT-2 / OpenVLA-ish philosophy

-architecture

-vision + language + actions + maybe world modeling

-one transformer

-token or action prediction

Pros

-simple and scalable

-shared multimodal representations

-end-to-end learning

-benefits directly from LLM scaling

Cons

-harder to debug

-less controllable and interpretable

-harder to satisfy real-time control constraints

-one model has to learn everything

-Modular VLA: PI / pi0-ish, more classical robotics philosophy

-architecture

-perception or VLM

-world model or subgoal planner

-action policy

-low-level controller

Pros

-specialized modules

-easier debugging and control

-better latency and safety handling

-stronger control integration

Cons

-more engineering complexity

-harder end-to-end optimization

-interface bottlenecks

-error propagation across modules