π0.7: a Steerable Model with Emergent Capabilities (Ai et al.)

April 26, 2026

instruction generalization - generalization to complex, unseen language references in unseen environments

cross-embodiment generalization - perform unseen dexterous tasks

compositional task generalization - compose known actions into unseen sequences

-prompt robot with subgoal images (generated) to condition policy

-VLAs are built on VLM backbones, trained on robot trajectories

-VLMs (e.g. Gemma 4B) use vision encoders trained with contrastive learning to ground images in text

-an image and its caption should be very close in the shared embedding space

-an image and unrelated text should be very different in embedding space
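
A minimal sketch of the contrastive idea above, assuming a CLIP-style symmetric InfoNCE loss over a small batch. This is an illustration, not necessarily the loss used by any backbone named in these notes (SigLIP, for instance, uses a pairwise sigmoid variant). The function names, the temperature value, and the toy 2-D embeddings are all made up for the example.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def contrastive_loss(image_embs, text_embs, temperature=0.1):
    """Symmetric InfoNCE: index i in both lists is a matching image/caption pair."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] = scaled cosine similarity between image i and caption j
    logits = [[dot(imgs[i], txts[j]) / temperature for j in range(n)] for i in range(n)]

    def cross_entropy_diag(rows):
        # Cross-entropy where the correct class for row i is column i.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    transposed = [list(col) for col in zip(*logits)]
    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(transposed))


# Toy embeddings (assumed values): captions either line up with their images or are swapped.
aligned_images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[0.9, 0.1], [0.1, 0.9]]
shuffled_texts = [[0.1, 0.9], [0.9, 0.1]]
loss_aligned = contrastive_loss(aligned_images, aligned_texts)
loss_shuffled = contrastive_loss(aligned_images, shuffled_texts)
```

The loss is small when each image is closest to its own caption and large when the captions are swapped, which is the "close if matched, far if unrelated" behavior described above.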

-each observation $o_t$ contains $o_t = [I_t^1, \dots, I_t^n, q_t]$ with $n$ camera images and the joint configuration of the robot $q_t$

-trained to predict an action chunk = a sequence of dense vectors that specify robot movement (e.g. end-effector pitch, roll, yaw, joint angles)

-prompt/context is also provided for each training example (usually a language instruction)
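
One way to picture a single training example (observation, prompt, action chunk) as a data structure. This is a sketch under assumptions: the class names, the `make_example` helper, the nested-list image type, and the toy trajectory are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from typing import List

Image = List[List[float]]  # placeholder image: rows of pixel intensities


@dataclass
class Observation:
    images: List[Image]        # I_t^1 ... I_t^n, one image per camera
    joint_config: List[float]  # q_t


@dataclass
class TrainingExample:
    observation: Observation         # o_t
    prompt: str                      # language instruction / context
    action_chunk: List[List[float]]  # the next H actions, one dense vector each


def make_example(images, joints, actions, t, horizon, prompt):
    """Pair the observation at step t with the next `horizon` actions."""
    chunk = actions[t:t + horizon]
    if len(chunk) < horizon:
        raise ValueError("trajectory too short for requested chunk")
    obs = Observation(images=images[t], joint_config=joints[t])
    return TrainingExample(observation=obs, prompt=prompt, action_chunk=chunk)


# Toy trajectory (assumed sizes): 4 steps, 2 cameras with 1x1 images, 2 joints, 2-D actions.
images = [[[[0.1 * t]], [[0.2 * t]]] for t in range(4)]
joints = [[0.0, 0.5 * t] for t in range(4)]
actions = [[float(t), -float(t)] for t in range(4)]
example = make_example(images, joints, actions, t=1, horizon=3, prompt="open the fridge door")
```

Here the chunk length `horizon` plays the role of the number of future actions predicted at once; its value is an arbitrary choice for the example.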

-they are not RL'd because reward functions are unreliable in practice; instead they use MLE to learn from demonstrated trajectories (imitation learning, behavior cloning, diffusion/flow-matching)

\max_\theta \mathbb{E}_{\mathcal{D}}[\log \pi_\theta(a | o)]
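
A toy illustration of the objective above, assuming a 1-D linear policy with fixed-variance Gaussian noise, so that maximizing the log-likelihood reduces to least-squares regression on the demonstrations. This is plain behavior cloning for intuition only. Diffusion and flow-matching heads optimize different surrogate objectives, and every name and number here is made up.

```python
import math


def log_likelihood(w, dataset, sigma=1.0):
    """Average log pi_w(a | o) for a Gaussian policy with mean w * o."""
    total = 0.0
    for o, a in dataset:
        mean = w * o
        total += -0.5 * ((a - mean) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))
    return total / len(dataset)


def fit(dataset, lr=0.1, steps=200, sigma=1.0):
    """Gradient ascent on the average log-likelihood with respect to w."""
    w = 0.0
    for _ in range(steps):
        # d/dw of the average log-likelihood: mean((a - w * o) * o) / sigma^2
        grad = sum((a - w * o) * o for o, a in dataset) / (len(dataset) * sigma ** 2)
        w += lr * grad
    return w


# Demonstrations generated by a hypothetical expert that always acts with a = 2 * o.
demos = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w_hat = fit(demos)
```

No reward function appears anywhere: the only training signal is how probable the demonstrated actions are under the policy.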

-SigLIP + Gemma give visual/text understanding backbone

-policy - takes current observations, task instructions, memory, and metadata and outputs a subtask instruction

-world model - takes inputs and outputs subgoal images of the optimal next states

-these images help to bridge the gap between high-level task and subtask instructions

-e.g. if the subtask is "open the fridge door," we don't know exactly how to grasp the handle

-subgoal images depict the desired near-future state, providing a richer specification

-multi-view subgoals $g_t = [G_t^1, \dots, G_t^n]$ for cameras $1, \dots, n$

-action expert - takes subgoal images and outputs an action chunk (actual motor commands)
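
The three components in these notes can be wired together roughly as follows. This is only a data-flow sketch: every function body is a hypothetical placeholder (no learned model is involved), and the task string, the plan lookup, the camera count, and the chunk length are assumptions made for illustration.

```python
def high_level_policy(observation, task, memory):
    """Placeholder: choose the next subtask instruction for the task."""
    steps = {"put the milk away": ["open the fridge door", "place the milk inside"]}
    plan = steps.get(task, [task])
    # Use the number of remembered subtasks as a crude progress counter.
    return plan[min(len(memory), len(plan) - 1)]


def world_model(observation, subtask):
    """Placeholder: one subgoal image per camera, g_t = [G_t^1, ..., G_t^n]."""
    return [list(image) for image in observation["images"]]


def action_expert(observation, subgoals, horizon=4):
    """Placeholder: an action chunk of `horizon` vectors sized like q_t."""
    dim = len(observation["joint_config"])
    return [[0.0] * dim for _ in range(horizon)]


def control_step(observation, task, memory):
    """Policy -> world model -> action expert, in the order listed in the notes."""
    subtask = high_level_policy(observation, task, memory)
    subgoals = world_model(observation, subtask)
    chunk = action_expert(observation, subgoals)
    return subtask, subgoals, chunk


# Toy observation: 3 cameras (flattened 2-pixel images) and a 7-joint arm.
obs = {"images": [[0.0, 0.1], [0.2, 0.3], [0.4, 0.5]], "joint_config": [0.0] * 7}
subtask, subgoals, chunk = control_step(obs, "put the milk away", memory=[])
```

The point is the interface shape: the subtask text feeds the world model, which returns one subgoal image per camera, and the action expert turns those into a chunk of motor commands.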