
OpenVLA: An Open-Source Vision-Language-Action Model

Submitted by
Style Pass
2024-07-04 16:30:04

WidowX and Google robot videos show real "zero-shot" rollouts with the OpenVLA model; Franka Panda robot videos depict fine-tuned OpenVLA policies.

We train OpenVLA by fine-tuning a pretrained Prismatic-7B VLM. Our model consists of three key elements: (1) a fused visual encoder, combining a SigLIP and a DINOv2 backbone, that maps image inputs to a sequence of "image patch embeddings"; (2) a projector that maps the output embeddings of the visual encoder into the input space of a large language model; and (3) a Llama 2 7B language model backbone that predicts tokenized output actions. These tokens are decoded into continuous output actions that can be executed directly on the robot.
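The decoding step above relies on discretizing each continuous action dimension into a fixed number of bins, so that actions can be emitted as language-model tokens. Below is a minimal sketch of that idea, assuming 256 bins per dimension and a uniform range per dimension (the bin count matches the paper's setup; the uniform normalization range and function names here are illustrative assumptions, not the authors' exact implementation):

```python
import numpy as np

N_BINS = 256  # one discrete token per bin, per action dimension

def action_to_tokens(action, low=-1.0, high=1.0):
    """Discretize each continuous action dimension into a bin index in [0, N_BINS - 1].
    Assumes actions are normalized to [low, high]."""
    action = np.clip(np.asarray(action, dtype=float), low, high)
    return np.round((action - low) / (high - low) * (N_BINS - 1)).astype(int)

def tokens_to_action(ids, low=-1.0, high=1.0):
    """Decode discrete bin indices back to (approximate) continuous actions."""
    return low + np.asarray(ids, dtype=float) / (N_BINS - 1) * (high - low)

# Hypothetical 7-DoF end-effector delta (xyz, rotation, gripper):
delta = np.array([0.05, -0.3, 0.0, 0.2, 0.0, 0.1, 1.0])
ids = action_to_tokens(delta)
recovered = tokens_to_action(ids)
```

The round-trip error is bounded by half a bin width (here 2 / 255 / 2 ≈ 0.004), which is the resolution cost of representing actions as discrete tokens.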

To train OpenVLA, we curate a dataset of 970k robot manipulation trajectories from the Open X-Embodiment (OpenX) dataset, spanning a wide range of tasks, scenes, and robot embodiments. We train OpenVLA on a cluster of 64 A100 GPUs for 15 days. The trained model checkpoints can be downloaded from Hugging Face and used with a few lines of code.
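For illustration, loading a released checkpoint through the Hugging Face `transformers` interface might look like the sketch below. The model ID, prompt format, and `predict_action` / `unnorm_key` arguments follow the project's published usage, but treat the details (and hardware requirements) as assumptions to be checked against the official repository:

```python
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

# Download the pretrained OpenVLA checkpoint (≈7B parameters; needs a GPU).
processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b", torch_dtype=torch.bfloat16, trust_remote_code=True
).to("cuda")

# "frame.png" is a placeholder for the robot's current camera observation.
image = Image.open("frame.png")
prompt = "In: What action should the robot take to pick up the cup?\nOut:"

inputs = processor(prompt, image).to("cuda", dtype=torch.bfloat16)
# Predict a 7-DoF action, un-normalized with dataset statistics (here: BridgeData).
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
```

The returned action can then be passed to the robot controller for execution.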
