-->
Shaofei Cai, Zihao Wang, Kewei Lian, Zhancun Mu, Xiaojian Ma, Anji Liu, Yitao Liang
CVPR 2025
We present ROCKET-1, a novel approach for mastering open-world interaction with visual-temporal context prompting in Minecraft environments. Our method enables agents to perform complex tasks through visual understanding and temporal reasoning.
Our approach combines vision-language models with temporal reasoning to enable robust interaction in open-world environments. The key innovation lies in the visual-temporal context prompting mechanism that allows the agent to understand both current visual state and temporal dynamics.
The ROCKET-1 framework consists of:
We evaluate ROCKET-1 on various Minecraft tasks and demonstrate significant improvements over baseline methods. Our approach achieves:
We plan to extend this approach to other open-world environments and explore applications in real-world robotics scenarios.
@inproceedings{cai2025rocket1,
title={ROCKET-1: Master Open-World Interaction with Visual-Temporal Context Prompting},
author={Cai, Shaofei and Wang, Zihao and Lian, Kewei and Mu, Zhancun and Ma, Xiaojian and Liu, Anji and Liang, Yitao},
booktitle={Computer Vision and Pattern Recognition (CVPR)},
year={2025}
}