It's a big day in the lab – January 22, 2024. My student Kevin Black is giving a live demo of his robotic learning system to Hannah Fry, who is filming a documentary on superintelligence. Our robot is hardly superintelligent; it would be outwitted by any housecat. But we've hooked up speech-to-text, and it's showtime.
In a live robot demo, it’s hard to resist the temptation to challenge the machine with a new problem. Our guest takes off her watch – “put the watch on the towel.” The robot casts its arm around aimlessly. I’m good at keeping cool under pressure, but in my head I’m already scrambling for a way to steer the demo toward something more manageable. Kevin is quicker: “It’s good with colors – tell it that it’s the brown leather watch.” The trick works: to my amazement, the robot moves haltingly toward the watch, grasps it by the strap, and drops it gingerly on the towel. “The diffusion model is good with colors.” This model, which we nicknamed SuSIE, combines image synthesis via diffusion with robotic control: a diffusion model pretrained on Internet-scale language-driven image generation “imagines” a goal image for the robot based on the human command, and a learned policy then moves the arm to reach that goal. Glancing at the screen, I see the imagined subgoal: a small, deformed brown teddy bear on a cyan towel, with the gripper hovering triumphantly nearby.
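To make the division of labor concrete, here is a minimal sketch of how such a two-stage system could be wired together. The names here (ImageEditingDiffusionModel, GoalConditionedPolicy, the camera and robot handles, and the replanning interval) are hypothetical placeholders for illustration, not the actual SuSIE code; the point is only the control flow, in which the diffusion model periodically “imagines” a subgoal image from the command and the current camera view, and a goal-conditioned policy issues low-level actions to reach that imagined subgoal.

```python
import numpy as np


class ImageEditingDiffusionModel:
    """Hypothetical stand-in for a language-conditioned image diffusion model,
    pretrained on web-scale image-text data and fine-tuned on robot video."""

    def imagine_subgoal(self, image: np.ndarray, command: str) -> np.ndarray:
        # Conditioned on the current camera image and the language command,
        # generate a plausible image of the scene a few seconds in the future.
        raise NotImplementedError


class GoalConditionedPolicy:
    """Hypothetical stand-in for a policy trained on lab-collected robot data.
    Maps (current image, subgoal image) -> a low-level arm action."""

    def act(self, image: np.ndarray, subgoal: np.ndarray) -> np.ndarray:
        raise NotImplementedError


def run_command(command, camera, robot, diffusion, policy,
                replan_every=20, max_steps=200):
    """Two-stage control loop: imagine a subgoal image, then chase it."""
    subgoal = None
    for step in range(max_steps):
        image = camera.read()  # current RGB observation
        if subgoal is None or step % replan_every == 0:
            # Stage 1: the diffusion model "imagines" what the scene should
            # look like shortly from now if the command were being followed.
            subgoal = diffusion.imagine_subgoal(image, command)
        # Stage 2: the learned policy outputs an action (e.g., end-effector
        # deltas and a gripper command) that moves the scene toward the subgoal.
        action = policy.act(image, subgoal)
        robot.step(action)
```

The appeal of this split is that the diffusion model carries the semantic load – colors, object categories, what “on the towel” should look like – learned from web data, while the policy only has to solve the more limited problem of reaching a pictured goal, which is what the lab-collected robot data can teach.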
This robot is using a kind of robotic foundation model, trained on both Internet-scale image-language data and robotic data that we collected in the lab. The image-language data allows the robot to recognize visual features and to understand how language commands map onto desired outcomes. The robotic data allows it to figure out how movements of its arm affect the objects in the scene. Putting the two together gives us a robotic system that can understand language and take actions to fulfill user commands. These ingredients – web-scale training for semantic understanding and generalization, combined with robotic data for grounding that understanding in the physical world – represent a small-scale prototype of a general recipe for robotic foundation models. But this simple prototype also raises some important questions: how do we go from these kinds of small-scale laboratory demonstrations to truly capable robotic systems? And what data will these systems be trained on?