π0.7: a Steerable Model with Emergent Capabilities (Ai et al.)

April 26, 2026

instruction generalization - generalization to complex, unseen language references in unseen environments

cross-embodiment generalization - perform unseen dextrous tasks

compositional task generalization - compose known actions into unseen sequences

-prompt robot with subgoal images (generated) to condition policy
-VLAs are built on VLM backbones, trained on robot trajectories
-VLMs (e.g. Gemma 4B) use a vision encoder (e.g. SigLIP) trained with contrastive learning to ground images in text
-an image and its caption should be very close in the shared embedding space
-an image and unrelated text should be very different in embedding space
-each observation $o_t = [I_t^1, \dots, I_t^n, q_t]$ contains $n$ camera images and the joint configuration of the robot $q_t$
-trained to predict an action chunk: a sequence of dense vectors specifying robot motion (e.g. end-effector pitch, roll, yaw, joint angles)
-prompt/context is also provided for each training example (usually a language instruction)
-they are not RL'd because reward functions are unreliable in practice; instead they use MLE to learn from trajectories (imitation learning, behavior cloning, diffusion/flow-matching)
$\max_\theta \mathbb{E}_{\mathcal{D}}[\log \pi_\theta(a \mid o)]$
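
To make the MLE objective concrete, here is a minimal toy sketch of behavior cloning, not the paper's training setup. Everything in it is an assumption for illustration: a 1-D observation and action, a linear Gaussian policy $\pi_\theta(a \mid o) = \mathcal{N}(a;\, w o + b,\, \sigma^2)$ with fixed $\sigma$, and synthetic "expert" demos. With fixed $\sigma$, maximizing the log-likelihood is the same as minimizing squared error to the demonstrated actions.

```python
import math
import random

SIGMA = 0.1  # fixed policy std-dev (assumption for this toy example)

def log_prob(a, o, w, b):
    """log pi_theta(a | o) for a Gaussian policy with mean w*o + b."""
    mean = w * o + b
    return -0.5 * ((a - mean) / SIGMA) ** 2 - math.log(SIGMA * math.sqrt(2 * math.pi))

def avg_log_likelihood(data, w, b):
    """Sample estimate of E_D[log pi_theta(a | o)] over (o, a) pairs."""
    return sum(log_prob(a, o, w, b) for o, a in data) / len(data)

def fit(data, steps=200, lr=0.5):
    """Gradient ascent on the average log-likelihood.

    The true gradient carries a constant 1/SIGMA^2 factor; it is folded into lr.
    """
    w, b = 0.0, 0.0
    n = len(data)
    for _ in range(steps):
        grad_w = sum((a - (w * o + b)) * o for o, a in data) / n
        grad_b = sum((a - (w * o + b)) for o, a in data) / n
        w += lr * grad_w
        b += lr * grad_b
    return w, b

# Synthetic demonstrations (hypothetical): expert acts as a = 2*o + 0.5 + noise.
random.seed(0)
demos = []
for _ in range(200):
    o = random.uniform(-1.0, 1.0)
    a = 2.0 * o + 0.5 + random.gauss(0.0, 0.05)
    demos.append((o, a))

w, b = fit(demos)
print(f"w={w:.3f}, b={b:.3f}, avg log-lik={avg_log_likelihood(demos, w, b):.3f}")
```

The fitted `w` and `b` should land near the expert's slope and offset. Real VLAs replace the linear Gaussian with a large network and, for continuous action chunks, typically a diffusion or flow-matching head.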
-SigLIP + Gemma give visual/text understanding backbone
-policy - takes current observations, task instructions, memory, and metadata and outputs a subtask instruction
-world model - takes inputs and outputs subgoal images of the optimal next states
-these images help to bridge the gap between high-level task and subtask instructions
-e.g. if the subtask is "open the fridge door," the text alone doesn't say exactly how to grasp the handle
-subgoal images depict the desired near-future state, providing a richer specification
-multi-view subgoals $g_t = [G_t^1, \dots, G_t^n]$ for cameras $1, \dots, n$
-action expert - takes subgoal images and outputs an action chunk (actual motor commands)
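
The data flow between the three components above can be sketched as follows. This is a hedged illustration of the interfaces only: every class and function name here is hypothetical, the bodies are stubs rather than models, and two things are assumptions on my part, namely that the policy's output is a subtask string and that the action expert also sees the current observation alongside the subgoal images.

```python
from dataclasses import dataclass
from typing import List, Tuple

Image = List[List[float]]  # placeholder for one camera frame

@dataclass
class Observation:
    images: List[Image]   # I_t^1 .. I_t^n, one per camera
    joints: List[float]   # q_t, joint configuration

@dataclass
class Subgoal:
    images: List[Image]   # G_t^1 .. G_t^n, one per camera

ActionChunk = List[List[float]]  # sequence of dense action vectors

def high_level_policy(obs: Observation, task: str) -> str:
    """Stub: a real policy would pick the next subtask from obs, task, memory, metadata."""
    return f"next step of: {task}"

def world_model(obs: Observation, subtask: str) -> Subgoal:
    """Stub: a real world model would generate near-future images per camera.
    Here the current frames are copied so the multi-view shape is preserved."""
    return Subgoal(images=[[row[:] for row in img] for img in obs.images])

def action_expert(obs: Observation, goal: Subgoal, horizon: int = 4) -> ActionChunk:
    """Stub: a real expert would output motor commands that reach the subgoal.
    Here it just holds the current joint configuration for `horizon` steps."""
    return [list(obs.joints) for _ in range(horizon)]

def step(obs: Observation, task: str) -> Tuple[str, Subgoal, ActionChunk]:
    """One control step: task -> subtask -> subgoal images -> action chunk."""
    subtask = high_level_policy(obs, task)
    goal = world_model(obs, subtask)
    chunk = action_expert(obs, goal)
    return subtask, goal, chunk

# Two cameras with 1x1 dummy frames and a 3-joint robot (made-up values).
obs = Observation(images=[[[0.0]], [[0.0]]], joints=[0.1, 0.2, 0.3])
subtask, goal, chunk = step(obs, "put away the groceries")
print(subtask, len(goal.images), len(chunk))
```

The point of the sketch is the typing: the subgoal has one image per camera, matching the observation, and the action chunk is a fixed-horizon list of vectors in the robot's action space.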