Zhancun Mu

ROCKET-1: Master Open-World Interaction with Visual-Temporal Context Prompting

Shaofei Cai, Zihao Wang, Kewei Lian, Zhancun Mu, Xiaojian Ma, Anji Liu, Yitao Liang

CVPR 2025 • 2025

conference
ROCKET-1: Master Open-World Interaction with Visual-Temporal Context Prompting teaser

Abstract#

We present ROCKET-1, a novel approach for mastering open-world interaction with visual-temporal context prompting in Minecraft environments. Our method enables agents to perform complex tasks through visual understanding and temporal reasoning.

Key Contributions#

  • Novel visual-temporal context prompting mechanism: A new approach to leverage both visual and temporal information for better decision making
  • State-of-the-art performance on Minecraft benchmarks: Significantly outperforms existing methods on standard evaluation tasks
  • Robust generalization to unseen environments: Demonstrates strong transfer capabilities across different game scenarios

Method#

Our approach combines vision-language models with temporal reasoning to enable robust interaction in open-world environments. The key innovation lies in the visual-temporal context prompting mechanism that allows the agent to understand both current visual state and temporal dynamics.

Architecture#

The ROCKET-1 framework consists of:

  1. Visual Encoder: Processes current game state
  2. Temporal Reasoning Module: Captures temporal dependencies
  3. Context Prompting: Integrates visual and temporal information
  4. Action Decoder: Generates appropriate actions

Experimental Results#

We evaluate ROCKET-1 on various Minecraft tasks and demonstrate significant improvements over baseline methods. Our approach achieves:

  • 40% improvement in task completion rate
  • 60% reduction in exploration time
  • Superior generalization to new environments

Future Work#

We plan to extend this approach to other open-world environments and explore applications in real-world robotics scenarios.

Citation#

@inproceedings{cai2025rocket1,
  title={ROCKET-1: Master Open-World Interaction with Visual-Temporal Context Prompting},
  author={Cai, Shaofei and Wang, Zihao and Lian, Kewei and Mu, Zhancun and Ma, Xiaojian and Liu, Anji and Liang, Yitao},
  booktitle={Computer Vision and Pattern Recognition (CVPR)},
  year={2025}
}
bibtex
Comment seems to stuck. Try to refresh?✨