At Waymo, we have been at the forefront of AI and ML in autonomous driving for over 15 years, and we continue to advance research in the field. Today, we are sharing our latest research paper on an End-to-End Multimodal Model for Autonomous Driving (EMMA).
Powered by Gemini, a multimodal large language model developed by Google, EMMA employs a unified, end-to-end trained model to generate future trajectories for autonomous vehicles directly from sensor data. Trained and fine-tuned specifically for autonomous driving, EMMA leverages Gemini’s extensive world knowledge to better understand complex scenarios on the road.
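To make the end-to-end framing concrete, here is a minimal sketch of what such an interface could look like, assuming camera images and driving context go in as a multimodal prompt and future waypoints come back as text, the way a language model would emit them. EMMA's actual interfaces are not public; every name, field, and format below (DrivingRequest, format_prompt, parse_trajectory) is hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class DrivingRequest:
    """Inputs in the spirit of the post: sensor data plus driving context.
    All fields and formats here are invented for illustration."""
    camera_images: List[bytes]               # encoded surround-view frames
    ego_history: List[Tuple[float, float]]   # past (x, y) waypoints, ego frame
    router_command: str                      # e.g. "turn right at the intersection"

def format_prompt(req: DrivingRequest) -> str:
    """Serialize the non-image context as text, the natural input form for a
    multimodal LLM; the images would be passed to the model alongside it."""
    history = "; ".join(f"({x:.1f}, {y:.1f})" for x, y in req.ego_history)
    return (
        f"Past ego waypoints: {history}. "
        f"Instruction: {req.router_command}. "
        "Output future waypoints as (x, y) pairs, one per line."
    )

def parse_trajectory(text: str) -> List[Tuple[float, float]]:
    """Decode the model's text response back into numeric waypoints."""
    waypoints = []
    for line in text.strip().splitlines():
        x, y = line.strip("() ").split(",")
        waypoints.append((float(x), float(y)))
    return waypoints

# Example round trip with a stand-in model response:
req = DrivingRequest(camera_images=[],
                     ego_history=[(-2.0, 0.0), (-1.0, 0.0)],
                     router_command="continue straight")
prompt = format_prompt(req)               # sent to the model with the images
fake_response = "(1.0, 0.0)\n(2.1, 0.1)"  # what a text-based planner might emit
print(parse_trajectory(fake_response))    # [(1.0, 0.0), (2.1, 0.1)]
```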
Our research demonstrates how multimodal models, such as Gemini, can be applied to autonomous driving and explores the pros and cons of the pure end-to-end approach. It highlights the benefit of incorporating multimodal world knowledge, even when the model is fine-tuned for autonomous driving tasks that require strong spatial understanding and reasoning. Notably, EMMA demonstrates positive task transfer across several key autonomous driving tasks: training it jointly on planner trajectory prediction, object detection, and road graph understanding leads to improved performance compared to training individual models for each task. This suggests a promising avenue for future research, in which even more core autonomous driving tasks could be combined in a similar, scaled-up setup.
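One way to picture this joint-training setup is to cast every task into a shared prompt/target format and sample a weighted mixture of tasks into each batch, so that a single set of model weights receives gradients from all of them. The sketch below illustrates only that mixing idea; the task formats, examples, and weights are invented for illustration, not taken from the paper.

```python
import random
from typing import Dict, Iterator, List, Tuple

# Each task is cast as a (prompt, target) text pair so one sequence model
# can be trained on all of them. These examples and weights are invented.
TASK_EXAMPLES: Dict[str, List[Tuple[str, str]]] = {
    "trajectory": [("Predict future ego waypoints.", "(1.2, 0.0)\n(2.4, 0.1)")],
    "detection":  [("List visible objects as boxes.", "vehicle: (3.1, 0.5, 4.2, 1.8)")],
    "road_graph": [("Describe the lanes ahead.", "lane 0: straight; lane 1: right turn")],
}
TASK_WEIGHTS = {"trajectory": 0.5, "detection": 0.3, "road_graph": 0.2}

def mixed_batches(batch_size: int, seed: int = 0) -> Iterator[List[Tuple[str, str, str]]]:
    """Yield batches that mix tasks by weight; training one model on such
    batches is what allows gradients from one task to help the others."""
    rng = random.Random(seed)
    tasks, weights = zip(*TASK_WEIGHTS.items())
    while True:
        batch = []
        for _ in range(batch_size):
            task = rng.choices(tasks, weights=weights, k=1)[0]
            prompt, target = rng.choice(TASK_EXAMPLES[task])
            batch.append((task, prompt, target))
        yield batch

# A single mixed batch of four (task, prompt, target) triples:
print(next(mixed_batches(batch_size=4)))
```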
“EMMA is research that demonstrates the power and relevance of multimodal models for autonomous driving,” said Waymo VP and Head of Research Drago Anguelov. “We are excited to continue exploring how multimodal methods and components can contribute towards building an even more generalizable and adaptable driving stack.”