Abstract#
We present ROCKET-1, a novel approach for mastering open-world interaction with visual-temporal context prompting in Minecraft environments. Our method enables agents to perform complex tasks through visual understanding and temporal reasoning.
Key Contributions#
- Novel visual-temporal context prompting mechanism: A new approach to leverage both visual and temporal information for better decision making
- State-of-the-art performance on Minecraft benchmarks: Significantly outperforms existing methods on standard evaluation tasks
- Robust generalization to unseen environments: Demonstrates strong transfer capabilities across different game scenarios
Method#
Our approach combines vision-language models with temporal reasoning to enable robust interaction in open-world environments. The key innovation lies in the visual-temporal context prompting mechanism that allows the agent to understand both current visual state and temporal dynamics.
Architecture#
The ROCKET-1 framework consists of:
- Visual Encoder: Processes current game state
- Temporal Reasoning Module: Captures temporal dependencies
- Context Prompting: Integrates visual and temporal information
- Action Decoder: Generates appropriate actions
Experimental Results#
We evaluate ROCKET-1 on various Minecraft tasks and demonstrate significant improvements over baseline methods. Our approach achieves:
- 40% improvement in task completion rate
- 60% reduction in exploration time
- Superior generalization to new environments
Future Work#
We plan to extend this approach to other open-world environments and explore applications in real-world robotics scenarios.
Citation#
@inproceedings{cai2025rocket1,
title={ROCKET-1: Master Open-World Interaction with Visual-Temporal Context Prompting},
author={Cai, Shaofei and Wang, Zihao and Lian, Kewei and Mu, Zhancun and Ma, Xiaojian and Liu, Anji and Liang, Yitao},
booktitle={Computer Vision and Pattern Recognition (CVPR)},
year={2025}
}bibtex