Temporal Difference Learning for Model Predictive Control (Hansen et al.)

March 24, 2026

-paper deals with learning tasks for fixed/seen environments

model-free methods - don't reason about possible next steps/states explicitly, encode environment dynamics/understanding into $v_{\pi}(s)$ and $q_{\pi}(s, a)$

-decisions based on experience, not decision-time simulation

model-based methods - learn an internal representation of the environment to reason through future steps at decision time

-simulation at decision-time
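
A toy sketch of the contrast between the two families above. The 3-state, 2-action problem and every number in it are made up for illustration; `q_table`, `next_state`, `reward`, and `v_table` are hypothetical stand-ins for learned quantities.

```python
import numpy as np

# Hypothetical learned action values q(s, a) for 3 states and 2 actions.
q_table = np.array([[0.2, 0.7],
                    [0.5, 0.1],
                    [0.0, 0.3]])

def act_model_free(state):
    # No lookahead: whatever is known about the dynamics is baked into q(s, a).
    return int(np.argmax(q_table[state]))

# Hypothetical learned model: deterministic next state and reward per (s, a).
next_state = np.array([[1, 2],
                       [0, 2],
                       [2, 2]])
reward = np.array([[0.0, 1.0],
                   [0.5, 0.0],
                   [0.0, 0.0]])
# Hypothetical state values used to score the simulated next state.
v_table = np.array([0.4, 0.6, 0.1])

def act_model_based(state, gamma=0.9):
    # Simulate each action one step ahead with the model, then score the outcome.
    scores = [reward[state, a] + gamma * v_table[next_state[state, a]]
              for a in range(2)]
    return int(np.argmax(scores))
```

The model-free agent only reads a table at decision time, while the model-based agent pays for a simulation step per candidate action.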

model predictive control

-long-horizon planning is expensive, optimize trajectory over short, finite time horizon
-yields local optima
-can be extended to approximate globally optimal solutions using a terminal value function that estimates discounted return beyond the planning horizon
-obtaining an accurate model/value function is challenging
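
A minimal sketch of short-horizon planning with a terminal value, assuming a scalar state and hand-written stand-ins for the learned components. `dynamics`, `reward`, `value`, and the random-candidate planner are illustrative assumptions, not the paper's planner.

```python
import numpy as np

def dynamics(s, a):
    # Hypothetical model: the action nudges a 1-D state.
    return s + 0.1 * a

def reward(s, a):
    # Hypothetical reward: stay close to the origin.
    return -s ** 2

def value(s):
    # Hypothetical terminal value: estimate of return beyond the horizon.
    return -10.0 * s ** 2

def score_sequence(state, actions, dynamics, reward, value, gamma=0.99):
    """Discounted reward over the horizon plus a discounted terminal value."""
    s = float(state)
    total, discount = 0.0, 1.0
    for a in actions:
        total += discount * reward(s, a)
        s = dynamics(s, a)
        discount *= gamma
    # Without this term the planner is blind to anything past the horizon.
    return total + discount * value(s)

def plan(state, candidates, dynamics, reward, value, gamma=0.99):
    """Score every candidate action sequence and return the best first action."""
    scores = [score_sequence(state, seq, dynamics, reward, value, gamma)
              for seq in candidates]
    best = int(np.argmax(scores))
    return candidates[best][0], scores[best]

# Usage: sample random candidate sequences (horizon 5) and execute only the
# first action of the best one, then replan at the next step.
rng = np.random.default_rng(0)
candidates = rng.uniform(-1.0, 1.0, size=(256, 5))
first_action, best_score = plan(1.0, candidates, dynamics, reward, value)
```

Only the first action is executed before replanning, which is what keeps the horizon short; the quality of the result then depends on how accurate `dynamics` and `value` are.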

Proposal: augment model-based planning with strengths of model-free learning

-learn latent dynamics model and terminal value function jointly

Contributions

1. Learn latent representation of the dynamics model purely from rewards, not state/video prediction
    1. current SOTA tries to learn the full next observation (e.g. a frame of pixels)
    2. inefficient because it tries to learn irrelevant things (e.g. shadows, textures, background, lighting)
    3. errors compound
    4. objective is misaligned with the task
        1. optimizing for pixel/state reconstruction accuracy, when what you actually care about is reward
2. Back-propagate gradients from the reward and TD-objective through multi-step rollouts of the model, alleviating error compounding
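
One way to write the second contribution down, as a simplified sketch rather than the paper's exact objective. All symbols are assumptions introduced here: $h_\theta$ is an encoder, $d_\theta$ the latent dynamics, $R_\theta$ the reward head, $Q_\theta$ the value head, $\bar{Q}$ a target network, $\pi$ a policy, $\lambda$ a per-step weight, and $c_1, c_2$ loss coefficients. The full objective may contain further terms.

$$
\mathcal{L}(\theta) = \sum_{i=t}^{t+H} \lambda^{\,i-t} \Big( c_1 \big\| R_\theta(z_i, a_i) - r_i \big\|_2^2 + c_2 \big\| Q_\theta(z_i, a_i) - \big( r_i + \gamma\, \bar{Q}(z_{i+1}, \pi(z_{i+1})) \big) \big\|_2^2 \Big)
$$

$$
z_t = h_\theta(s_t), \qquad z_{i+1} = d_\theta(z_i, a_i)
$$

Because each $z_{i+1}$ is produced by the model from its own previous prediction, the gradient of the later terms flows back through every earlier application of $d_\theta$, so the model is penalized directly for errors that accumulate over the rollout.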