Abstract
We present ROCKET-3, a scalable multi-task reinforcement learning framework for generalizable spatial intelligence in visuomotor agents. Built on large-scale automated task synthesis and distributed RL training infrastructure, ROCKET-3 enables agents to generalize to unseen worlds via cross-view goal specification.
Key Contributions
- Large-Scale Task Synthesis: Automated generation of over 100,000 Minecraft training tasks, supporting robust multi-task RL.
- Efficient Distributed RL Framework: Stable and scalable RL training for long-sequence visuomotor policies, using a memory-efficient, fragment-based storage system (sketched after this list).
- Cross-View Goal Specification: Unified task space that enables generalization across domains and supports curriculum learning.
- Powerful Generalization: A 4x improvement in interaction success rates, plus zero-shot transfer to DMLab, Unreal Engine, and real-world robots.
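The storage implementation is not published here; the following is a minimal sketch of one plausible fragment-based design, assuming "fragment-based" means long rollouts are sliced into fixed-length pieces that each carry the recurrent state at their first step, so fragments can be stored, evicted, and sampled independently of episode length. All names (`Fragment`, `FragmentStore`, `frag_len`) are illustrative, not from the paper.

```python
from dataclasses import dataclass
from typing import List
import random

import numpy as np

@dataclass
class Fragment:
    """A fixed-length slice of one rollout, storable independently."""
    obs: np.ndarray            # (frag_len, ...) observations
    actions: np.ndarray        # (frag_len, ...) actions
    rewards: np.ndarray        # (frag_len,) rewards
    initial_state: np.ndarray  # recurrent state at the fragment's first step

class FragmentStore:
    """Holds at most `capacity` fragments; the oldest are evicted first,
    so memory stays bounded no matter how long episodes run."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.fragments: List[Fragment] = []

    def add_rollout(self, obs, actions, rewards, states, frag_len: int):
        # Slice the rollout into non-overlapping fragments of frag_len steps.
        for start in range(0, len(rewards) - frag_len + 1, frag_len):
            self.fragments.append(Fragment(
                obs=obs[start:start + frag_len],
                actions=actions[start:start + frag_len],
                rewards=rewards[start:start + frag_len],
                initial_state=states[start],
            ))
        # Evict the oldest fragments once over capacity.
        overflow = len(self.fragments) - self.capacity
        if overflow > 0:
            del self.fragments[:overflow]

    def sample(self, batch_size: int) -> List[Fragment]:
        return random.sample(self.fragments, min(batch_size, len(self.fragments)))
```

Storing fragments rather than whole episodes is what keeps the footprint predictable for long-sequence policies: each stored unit has a fixed size, and the carried `initial_state` lets a recurrent policy resume mid-episode during training.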
Methodology
ROCKET-3 employs a “Foundation-to-Finesse” pipeline:
- Imitation Learning Pre-Training: Learning foundational skills from large-scale expert trajectories (a behavior-cloning sketch follows this list).
- RL Fine-Tuning: PPO-based RL with a KL-divergence constraint for stable policy refinement (see the loss sketch after this list).
- Cross-View Goal Specification (CVGS): A task representation that enables spatial reasoning and transfer across environments (an illustrative task record follows this list).
- Distributed Training: Parallelized data collection (72 Minecraft instances, ~1000 FPS) and memory-efficient storage.
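For the pre-training stage, a minimal behavior-cloning step might look like the sketch below, assuming a discrete action head where `policy(obs)` returns logits; ROCKET-3's actual Minecraft action space is richer, so this is only the shape of the idea.

```python
import torch.nn.functional as F

def bc_step(policy, optimizer, obs_batch, expert_actions):
    """One imitation step: maximize the log-likelihood of expert actions.

    Assumes a discrete action head; expert_actions is a LongTensor of
    action indices from the expert trajectories.
    """
    logits = policy(obs_batch)
    loss = F.cross_entropy(logits, expert_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

For RL fine-tuning, the paper states PPO with a KL-divergence constraint but does not give the exact form here; the sketch below assumes the constraint is implemented as a penalty toward the frozen imitation-pretrained policy, a common recipe for keeping fine-tuning stable. `clip_eps`, `kl_coef`, and the function name are illustrative.

```python
import torch

def ppo_kl_loss(logp_new, logp_old, logp_ref, advantages,
                clip_eps: float = 0.2, kl_coef: float = 0.1) -> torch.Tensor:
    """Clipped PPO surrogate plus a KL penalty toward a frozen reference policy.

    logp_new: log pi_theta(a|s) under the current policy (with gradients)
    logp_old: log-probs recorded at rollout time (no gradients)
    logp_ref: log-probs under the frozen pretrained policy (no gradients)
    """
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()

    # Sample-based estimate of KL(pi_theta || pi_ref); penalizing it keeps
    # the fine-tuned policy close to the imitation-pretrained one.
    kl_penalty = (logp_new - logp_ref).mean()

    return policy_loss + kl_coef * kl_penalty
```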
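The CVGS interface itself is not specified in this summary. As one illustration of what a cross-view goal could contain, the hypothetical record below pairs the agent's first-person frame with a goal specified in a camera view that generally differs from the agent's own; every field name here is an assumption, not the paper's schema.

```python
from dataclasses import dataclass

import numpy as np

@dataclass
class CrossViewGoal:
    """Hypothetical cross-view goal: the target is specified in a view
    that generally differs from the agent's own viewpoint."""
    goal_image: np.ndarray   # (H, W, 3) rendering of the target scene
    goal_mask: np.ndarray    # (H, W) binary mask marking the target object
    camera_pose: np.ndarray  # (4, 4) world-from-camera transform of the goal view

@dataclass
class TaskInstance:
    """One task in the unified task space, usable across Minecraft,
    DMLab, Unreal Engine, or a real robot."""
    agent_obs: np.ndarray    # (H, W, 3) agent's current first-person frame
    goal: CrossViewGoal
```

Because the goal is expressed purely in image space rather than with environment-specific state, the same task representation can in principle be emitted by any 3D environment, which is what makes the cross-domain transfer and curriculum claims plausible.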
Results
ROCKET-3 achieves state-of-the-art performance in Minecraft, with:
- 4x increase in average success rate (7% → 28%) across diverse tasks.
- Zero-shot generalization to unseen 3D environments, including simulated and real-world settings.
- Robust curriculum learning and spatial reasoning capabilities.
Future Work
We aim to extend RL training to more diverse 3D environments and action spaces, further bridging the sim-to-real gap for embodied AI and robotics.