EgoMimic: Scaling Imitation Learning via Egocentric Video (Kareer et al.)

April 26, 2026

-learn policies jointly from egocentric human/robot demonstration data, treating them as equally useful for training

passive data collection - an ideal robot data system should allow users to generate sensorimotor behavior data without intending to do so

-human vs robot data has different reference frames, ranges of movement, etc.
-need to normalize the coordinate frames to be egocentric at time t so that we can generalize from both data sources
-human egocentric data is from 1st person POV, head can rotate, etc.
-robot data reference frame is usually fixed, with a different POV and different actions
-need to mask + normalize movement across human/robot data to hide irrelevant differences (e.g. geometry of hand vs. robot arm, range of motion, etc.)
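
A minimal sketch of the frame normalization described in the bullets above. It assumes the camera pose at time t is available as a 4x4 world-from-camera transform; the function names and the per-source z-score step are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def invert_pose(T):
    """Invert a 4x4 rigid transform [R t; 0 1] using R^T instead of a general inverse."""
    R, t = T[:3, :3], T[:3, 3]
    T_inv = np.eye(4)
    T_inv[:3, :3] = R.T
    T_inv[:3, 3] = -R.T @ t
    return T_inv

def to_egocentric(T_world_cam_t, points_world):
    """Express world-frame 3D points of shape (N, 3) in the camera frame at time t."""
    points_world = np.asarray(points_world, dtype=float)
    T_cam_world = invert_pose(T_world_cam_t)
    homog = np.hstack([points_world, np.ones((len(points_world), 1))])
    return (T_cam_world @ homog.T).T[:, :3]

def normalize_per_source(actions, eps=1e-8):
    """Z-score actions of shape (N, D) with statistics from their own data source
    (assumption: human and robot data are each normalized separately)."""
    actions = np.asarray(actions, dtype=float)
    mean = actions.mean(axis=0)
    std = actions.std(axis=0)
    return (actions - mean) / (std + eps), mean, std

# Hypothetical example: camera at world position (1, 0, 0), yawed 90 degrees about z.
theta = np.pi / 2
T_world_cam = np.eye(4)
T_world_cam[:3, :3] = [[np.cos(theta), -np.sin(theta), 0.0],
                       [np.sin(theta),  np.cos(theta), 0.0],
                       [0.0,            0.0,           1.0]]
T_world_cam[:3, 3] = [1.0, 0.0, 0.0]

future_hand_world = [[1.0, 1.0, 0.0]]  # a future hand position in the world frame
future_hand_ego = to_egocentric(T_world_cam, future_hand_world)  # approximately [[1, 0, 0]]
```

Expressing future hand or end-effector positions in the camera frame at time t removes the dependence on where the head (or a fixed robot camera) sits in the world, which is what lets both data sources share one target space.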

Joint Human-Robot Policy Learning Algorithm

-$f_{enc}(\cdot)$ - transformer encoder - latent visuomotor representation of frames
-$f^p(f_{enc}(\cdot))$ - pose decoder - similar to subgoal/world prediction in VLA, dictates desired future end-effector states/trajectories
-$f^q(f_{enc}(\cdot))$ - joint decoder - similar to action expert/controller in VLA, dictates low-level robot control
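
A toy sketch of the shared-encoder, two-decoder layout listed above. It is not the paper's architecture: the encoder here is a single dense layer rather than a transformer, all dimensions are made up, and the rule that human samples supervise only the pose head (since they carry no robot joint labels) is an assumption of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
OBS_DIM, LATENT_DIM, HORIZON = 32, 16, 10
POSE_DIM, JOINT_DIM = 3, 7

def make_layer(in_dim, out_dim):
    """Create a randomly initialized dense layer."""
    return {"W": rng.normal(0.0, 0.1, (in_dim, out_dim)), "b": np.zeros(out_dim)}

def dense(layer, x):
    """Apply a dense layer to a batch of row vectors."""
    return x @ layer["W"] + layer["b"]

f_enc = make_layer(OBS_DIM, LATENT_DIM)            # stand-in for the transformer encoder
f_p = make_layer(LATENT_DIM, HORIZON * POSE_DIM)   # pose decoder head
f_q = make_layer(LATENT_DIM, HORIZON * JOINT_DIM)  # joint decoder head

def forward(obs, source):
    """Run the shared encoder, then whichever heads the data source can supervise."""
    z = np.tanh(dense(f_enc, obs))
    out = {"pose": dense(f_p, z).reshape(-1, HORIZON, POSE_DIM)}
    if source == "robot":
        out["joint"] = dense(f_q, z).reshape(-1, HORIZON, JOINT_DIM)
    return out

def l1_loss(pred, target):
    """Sum mean-absolute errors over every head present in the prediction."""
    return sum(np.abs(pred[k] - target[k]).mean() for k in pred)

human_out = forward(np.zeros((4, OBS_DIM)), source="human")
robot_out = forward(np.zeros((4, OBS_DIM)), source="robot")
```

Both sources push gradients through the same encoder and pose head, while only robot data reaches the joint head; that sharing is the reason the egocentric target space needs to be aligned first.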