Zihao Wang, Shaofei Cai, Zhancun Mu, Haowei Lin, Ceyao Zhang, Xuejie Liu, Qing Li, Anji Liu, Xiaojian Ma, Yitao Liang
NeurIPS 2024
OmniJARVIS presents a unified vision-language-action tokenization framework that enables open-world instruction-following agents. This work bridges the gap between language understanding and embodied action execution through a novel tokenization approach.
Following natural language instructions in open-world environments is a fundamental challenge in artificial intelligence. Current approaches often struggle to integrate the vision, language, and action modalities, leading to suboptimal performance in complex scenarios.
Our unified tokenization approach treats vision, language, and actions as tokens in a shared representation space, so a single autoregressive model can both understand multimodal context and generate actions.
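The shared token space described above can be illustrated with a toy sketch. Everything here is a hypothetical stand-in (the id ranges, tokenizer functions, and codebook sizes are assumptions for illustration, not the paper's actual implementation):

```python
# Illustrative sketch of a unified multimodal token stream.
# Hypothetical shared vocabulary: disjoint id ranges per modality.
TEXT_BASE, VISION_BASE, ACTION_BASE = 0, 10_000, 20_000

def tokenize_text(words):
    # Stand-in for a real text tokenizer: hash each word into the text range.
    return [TEXT_BASE + (hash(w) % 10_000) for w in words]

def tokenize_vision(patch_codes):
    # Stand-in for a visual tokenizer (e.g. a VQ codebook over image patches).
    return [VISION_BASE + c for c in patch_codes]

def tokenize_actions(behavior_codes):
    # Stand-in for discretized behavior/action tokens.
    return [ACTION_BASE + c for c in behavior_codes]

def build_sequence(instruction_words, patch_codes, behavior_codes):
    """Interleave all modalities into one autoregressive token stream."""
    return (tokenize_text(instruction_words)
            + tokenize_vision(patch_codes)
            + tokenize_actions(behavior_codes))

seq = build_sequence(["chop", "a", "tree"], [3, 17, 42], [5, 9])
# Every token now lives in one shared id space, so a single transformer
# can model the whole stream and emit action tokens as its output.
```

Because each modality occupies a disjoint id range in one vocabulary, no special cross-modal fusion machinery is needed at the sequence level; the model sees one homogeneous token stream.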
We evaluate OmniJARVIS on a range of open-world instruction-following benchmarks.
OmniJARVIS has potential applications across a broad range of embodied-agent settings.
@inproceedings{wang2024omnijarvis,
title={OmniJARVIS: Unified Vision-Language-Action Tokenization Enables Open-World Instruction Following Agents},
author={Wang, Zihao and Cai, Shaofei and Mu, Zhancun and Lin, Haowei and Zhang, Ceyao and Liu, Xuejie and Li, Qing and Liu, Anji and Ma, Xiaojian and Liang, Yitao},
booktitle={Advances in Neural Information Processing Systems},
year={2024}
}