π0.7: a Steerable Model with Emergent Capabilities (Ai et al.)
April 26, 2026
instruction generalization - generalization to complex, unseen language references in unseen environments
cross-embodiment generalization - perform dexterous tasks on robot embodiments where those tasks were unseen
compositional task generalization - compose known actions into unseen sequences
-prompt the robot with generated subgoal images to condition the policy
-VLAs are built on VLM backbones, trained on robot trajectories
-VLMs (e.g. Gemma 4B) rely on a vision encoder trained with contrastive learning to ground images in text
-an image and its caption should be very close in the shared embedding space
-an image and unrelated text should be far apart in the embedding space
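A minimal sketch of the contrastive idea above, written as a pairwise sigmoid loss in the style of SigLIP. The function names, the temperature and bias values, and the toy 2-D embeddings are assumptions made for illustration; this is not the training code of any particular model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def sigmoid_pair_loss(image_embs, text_embs, temperature=10.0, bias=0.0):
    """Pairwise sigmoid loss over a batch of image/text embeddings.

    Pair (i, i) is an image with its own caption (label +1);
    pair (i, j != i) is an image with unrelated text (label -1).
    temperature and bias are illustrative values, not tuned ones.
    """
    n = len(image_embs)
    total = 0.0
    for i in range(n):
        for j in range(n):
            label = 1.0 if i == j else -1.0
            logit = temperature * cosine(image_embs[i], text_embs[j]) + bias
            # -log(sigmoid(label * logit)) == log(1 + exp(-label * logit))
            total += math.log1p(math.exp(-label * logit))
    return total / n


# Toy batch: two images and their two captions in a 2-D embedding space.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_captions = [[1.0, 0.0], [0.0, 1.0]]   # caption i sits next to image i
swapped_captions = [[0.0, 1.0], [1.0, 0.0]]   # caption i sits next to the wrong image

loss_aligned = sigmoid_pair_loss(images, aligned_captions)
loss_swapped = sigmoid_pair_loss(images, swapped_captions)
```

The loss is lower when each caption lies near its own image and far from the other image, which is the behavior the two bullets above describe.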
-each observation contains camera images and the joint configuration of the robot
-trained to predict an action chunk = a sequence of dense vectors that specify the robot's motion (e.g. end-effector pitch, roll, yaw, joint angles)
-prompt/context is also provided for each training example (usually a language instruction)
-VLAs are typically not trained with RL because reward functions are unreliable in practice; instead they learn from demonstration trajectories via imitation learning (behavior cloning with MLE, or diffusion/flow-matching objectives)
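A rough sketch of what one training example and a flow-matching imitation loss could look like. The `TrainingExample` fields, the `predict_velocity` callable, and the interpolation convention x_t = (1 - t) * noise + t * action are all assumptions chosen for illustration; the notes do not specify the exact objective or data layout.

```python
import random
from dataclasses import dataclass


@dataclass
class TrainingExample:
    """One imitation-learning sample (field names are hypothetical)."""
    camera_images: list     # one image per camera (placeholders here)
    joint_positions: list   # joint configuration of the robot
    prompt: str             # usually a language instruction
    action_chunk: list      # sequence of action vectors (lists of floats)


def flow_matching_loss(predict_velocity, example, rng):
    """Mean squared error between predicted and target velocities.

    Assumed convention: x_t = (1 - t) * noise + t * action, so the
    regression target is the constant velocity (action - noise).
    """
    t = rng.random()  # interpolation time in [0, 1)
    noisy_chunk, target_chunk = [], []
    for action in example.action_chunk:
        noise = [rng.gauss(0.0, 1.0) for _ in action]
        noisy_chunk.append([(1 - t) * n + t * a for n, a in zip(noise, action)])
        target_chunk.append([a - n for n, a in zip(noise, action)])

    # The model sees the noised chunk, the time, and the conditioning example.
    predicted = predict_velocity(noisy_chunk, t, example)

    total, count = 0.0, 0
    for pred_row, target_row in zip(predicted, target_chunk):
        for p, g in zip(pred_row, target_row):
            total += (p - g) ** 2
            count += 1
    return total / count


example = TrainingExample(
    camera_images=[None, None],
    joint_positions=[0.0] * 7,
    prompt="open the fridge door",
    action_chunk=[[0.1, -0.2, 0.3], [0.0, 0.5, -0.1]],
)


def zero_model(noisy_chunk, t, ex):
    """Untrained stand-in that always predicts zero velocity."""
    return [[0.0 for _ in row] for row in noisy_chunk]


loss = flow_matching_loss(zero_model, example, random.Random(0))
```

No reward function appears anywhere: the only supervision is the demonstrated action chunk, which is the point of the bullet above.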
-SigLIP + Gemma provide the visual/text understanding backbone
-policy - takes current observations, task instructions, memory, and metadata and outputs a subtask instruction
-world model - takes inputs and outputs subgoal images of the optimal next states
-these images help to bridge the gap between high-level task and subtask instructions
-e.g. if the subtask is "open the fridge door," the text alone does not say exactly how to grasp the handle
-subgoal images depict the desired near-future state, providing a richer specification
-subgoals are multi-view: generated for multiple camera views
-action expert - takes subgoal images and outputs an action chunk (actual motor commands)
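The three components above can be read as one control step: policy, then world model, then action expert. The sketch below wires them together with stand-in callables. Every name and signature here is hypothetical, the policy's output is read as a subtask instruction, and passing the observation to the action expert and producing one subgoal per camera are assumptions rather than details taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Observation:
    camera_images: List[str]      # stand-ins for image arrays, one per camera
    joint_positions: List[float]  # joint configuration of the robot


def control_step(observation, task, memory, metadata,
                 policy, world_model, action_expert):
    """Run one policy -> world model -> action expert pass."""
    # 1. Policy: pick what to do next, expressed in language.
    subtask = policy(observation, task, memory, metadata)
    # 2. World model: imagine the desired near-future state for each camera.
    subgoal_images = world_model(observation, subtask)
    if len(subgoal_images) != len(observation.camera_images):
        raise ValueError("expected one subgoal image per camera view")
    # 3. Action expert: turn the subgoal images into motor commands.
    return action_expert(observation, subgoal_images)


# Stub components so the sketch runs end to end.
def stub_policy(observation, task, memory, metadata):
    return "open the fridge door"


def stub_world_model(observation, subtask):
    return [f"subgoal for {cam}: {subtask}" for cam in observation.camera_images]


def stub_action_expert(observation, subgoal_images):
    horizon = 4  # illustrative chunk length
    return [list(observation.joint_positions) for _ in range(horizon)]


obs = Observation(camera_images=["base_cam", "wrist_cam"],
                  joint_positions=[0.0] * 7)
action_chunk = control_step(obs, "get the milk", [], {},
                            stub_policy, stub_world_model, stub_action_expert)
```

Keeping the stages as separate callables mirrors the split in the notes: the subtask text says what to do, the subgoal images say what the result should look like, and only the last stage emits motor commands.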