EgoMimic: Scaling Imitation Learning via Egocentric Video (Kareer et al.)

April 26, 2026

-learn policies jointly from egocentric human/robot demonstration data, treating them as equally useful for training

-passive data collection: an ideal robot data system should let users generate sensorimotor behavior data without intending to do so

-human vs robot data has different reference frames, ranges of movement, etc.

-need to normalize the coordinate frames to be egocentric at time t so that we can generalize from both data sources

-human egocentric data is from 1st person POV, head can rotate, etc.

-robot data reference frame is usually fixed, with a different POV and action space

-need to align + normalize movement across human/robot data to mask irrelevant differences (e.g. geometry of hand vs. robot arm, range of motion)
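
The two normalization steps above can be sketched roughly as follows. This is a minimal illustration, not the paper's code. It assumes that per-step camera poses (`world_from_cam`) are available from some head-tracking source and that each hand position is observed in its own step's camera frame. All function and argument names are hypothetical. The per-source z-score is one plausible way to put human and robot actions on a comparable scale.

```python
import numpy as np

def future_points_in_current_frame(world_from_cam, points_cam, t, horizon):
    """Express points seen at steps t..t+horizon-1 in the camera frame at step t.

    world_from_cam: (T, 4, 4) camera pose per step (assumed input)
    points_cam:     (T, 3) hand/end-effector position in each step's own camera frame
    """
    cam_t_from_world = np.linalg.inv(world_from_cam[t])
    out = []
    for k in range(t, t + horizon):
        # camera-k frame -> world frame -> camera-t frame
        cam_t_from_cam_k = cam_t_from_world @ world_from_cam[k]
        p = cam_t_from_cam_k @ np.append(points_cam[k], 1.0)  # homogeneous coords
        out.append(p[:3])
    return np.stack(out)

def fit_normalizer(actions):
    """Per-dimension mean/std for ONE data source (human or robot)."""
    return actions.mean(axis=0), actions.std(axis=0) + 1e-8

def normalize(actions, stats):
    """Z-score actions with the statistics of their own data source."""
    mean, std = stats
    return (actions - mean) / std
```

Fitting separate statistics for the human and robot datasets means a large human reach and a smaller robot reach both map to roughly unit scale before they hit the shared policy.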

Joint Human-Robot Policy Learning Algorithm

f_{enc}(.)

- transformer encoder - latent visuomotor representation of frames

f^p(f_{enc}(.))

- pose decoder - similar to subgoal/world prediction in a VLA (vision-language-action) model, dictates desired future end-effector states/trajectories

f^q(f_{enc}(.))

- joint decoder - similar to the action expert/controller in a VLA model, dictates low-level robot control
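
A shape-level sketch of how the three pieces could fit together. The real f_{enc} is a transformer encoder; here a single random linear layer with a tanh stands in for it so the example stays small and runnable. All sizes are illustrative. The loss routing (pose loss on every sample, joint loss only on robot samples) is an assumption based on the fact that human demonstrations carry no robot joint commands, and is not taken from these notes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only (not from the paper).
FEAT, LATENT, POSE_DIM, JOINT_DIM = 32, 16, 3, 7

# Stand-in parameters; linear maps replace the real networks.
W_enc = rng.normal(scale=0.1, size=(FEAT, LATENT))
W_pose = rng.normal(scale=0.1, size=(LATENT, POSE_DIM))
W_joint = rng.normal(scale=0.1, size=(LATENT, JOINT_DIM))

def f_enc(x):
    """Shared encoder: observation features -> latent (transformer stand-in)."""
    return np.tanh(x @ W_enc)

def f_p(z):
    """Pose decoder: latent -> predicted end-effector pose."""
    return z @ W_pose

def f_q(z):
    """Joint decoder: latent -> predicted joint-space action."""
    return z @ W_joint

def combined_loss(x, pose_target, joint_target, is_robot):
    """MSE pose loss on all samples plus MSE joint loss on robot samples only.

    is_robot: boolean array of shape (batch,), True for robot demonstrations.
    """
    z = f_enc(x)
    pose_err = ((f_p(z) - pose_target) ** 2).mean(axis=1)
    joint_err = ((f_q(z) - joint_target) ** 2).mean(axis=1)
    pose_term = pose_err.mean()
    # Human rows are dropped here, so their joint targets never matter.
    joint_term = joint_err[is_robot].mean() if is_robot.any() else 0.0
    return pose_term + joint_term
```

Because both decoders read the same latent, human data still shapes the encoder through the pose term even though it never touches f^q.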