Scaling World-Model Reinforcement Learning Through Diffusion Policy Optimization • Zhancun Mu

Abstract

World-model reinforcement learning can scale decision-making through learned dynamics, but long-horizon policy improvement is often limited by model bias and by a mismatch between search and value learning. Model-Based Diffusion Policy Optimization (MBDPO) addresses this by representing policy optimization as a diffusion process over searched trajectories in latent world models.

Rather than relying on a separate planner over the learned model, MBDPO refines a diffusion policy with an implicit energy function extracted from collected data. The paper evaluates the method across multi-task offline pretraining, online learning, and offline-to-online fine-tuning, with scaling studies showing consistent gains as model capacity increases.

Key Ideas

Diffusion policy optimization: Use diffusion policy representations to optimize trajectory distributions in latent world models.
Search-policy alignment: Reduce inconsistency between policy improvement and value learning by unifying search with policy optimization.
Scaling analysis: Study how world-model RL performance changes as dataset size and model capacity grow.
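
To make the idea of refining a trajectory under an energy function concrete, here is a minimal sketch. It is not the MBDPO algorithm: it swaps in annealed Langevin sampling for a learned reverse diffusion process, and a hand-written quadratic energy for the implicit energy function the paper extracts from data. The names `toy_energy_grad`, `refine_trajectory`, and `target`, along with all hyperparameters, are hypothetical and chosen only for illustration.

```python
import numpy as np


def toy_energy_grad(actions, target):
    """Gradient of the toy energy E(a) = 0.5 * ||a - target||^2.

    Assumption: a stand-in for a learned energy; low energy means
    the action sequence is close to `target`.
    """
    return actions - target


def refine_trajectory(target, n_steps=200, step_size=0.05, seed=0):
    """Refine a noisy action sequence by annealed Langevin updates."""
    rng = np.random.default_rng(seed)
    # Start from pure Gaussian noise, as a diffusion sampler would.
    actions = rng.standard_normal(target.shape)
    for k in range(n_steps):
        # Noise scale decays linearly to zero over the refinement steps.
        noise_scale = 1.0 - (k + 1) / n_steps
        noise = rng.standard_normal(target.shape)
        # Move downhill on the energy, plus annealed exploration noise.
        actions = (
            actions
            - step_size * toy_energy_grad(actions, target)
            + np.sqrt(2.0 * step_size) * noise_scale * noise
        )
    return actions


# Hypothetical setup: a horizon of 8 steps with 2-dimensional actions.
horizon, action_dim = 8, 2
target = np.ones((horizon, action_dim))
refined = refine_trajectory(target)
print("mean absolute error:", np.abs(refined - target).mean())
```

In this toy setting the refined sequence ends up near the low-energy region around `target`. A method such as MBDPO would instead operate on trajectories in a learned latent world model, with a trained diffusion policy in place of the hand-coded update.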

Citation

@article{cheng2026scaling,
  title={Scaling World-Model Reinforcement Learning Through Diffusion Policy Optimization},
  author={Cheng, Xiaoyuan and Yuan, Wenxuan and Mu, Zhancun and Zhang, Yuanzhao and Yang, Yiming and Wang, Hai and Sun, Zhuo and Liu, Che},
  journal={arXiv preprint arXiv:2605.26282},
  year={2026}
}
