
This AI system learned to understand videos by watching YouTube


Humans understand events in the world contextually, performing what’s called multimodal reasoning across time to make inferences about the past, present, and future. Given text and an image that seem innocuous when considered separately — e.g., “Look how many people love you” and a picture of a barren desert — people recognize that these elements can take on hurtful connotations when paired or juxtaposed.

Even the best AI systems struggle in this area. But there’s been progress, most recently from a team at the Allen Institute for Artificial Intelligence and the University of Washington’s Paul G. Allen School of Computer Science & Engineering. In a preprint paper published this month, the researchers detail Multimodal Neural Script Knowledge Models (Merlot), a system that learns to match images in videos with words and even follow events globally over time by watching millions of YouTube videos with transcribed speech. It does all this in an unsupervised manner, meaning the videos haven’t been labeled or categorized — forcing the system to learn from the videos’ inherent structures.
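One way to picture the self-supervised setup is a contrastive objective that pairs each video frame with the transcript snippet spoken at the same moment. The sketch below is only an illustration of that general frame–transcript matching idea, not Merlot's actual training code; the encoder outputs, embedding dimension, and temperature value are assumptions made for the example.

```python
import torch
import torch.nn.functional as F

def contrastive_frame_text_loss(frame_embeds, text_embeds, temperature=0.05):
    """Symmetric contrastive loss: pull each frame toward its aligned
    transcript snippet, push it away from the other snippets in the batch.
    Inputs are (batch, dim) tensors from hypothetical frame/text encoders;
    no human labels are needed, only the pairing given by the video's timing."""
    # Normalize so the dot product is cosine similarity.
    frames = F.normalize(frame_embeds, dim=-1)
    texts = F.normalize(text_embeds, dim=-1)

    # Similarity of every frame against every transcript snippet in the batch.
    logits = frames @ texts.t() / temperature        # shape: (batch, batch)

    # The "correct" pairing lies on the diagonal: frame i goes with snippet i.
    targets = torch.arange(frames.size(0), device=frames.device)

    # Average the frame-to-text and text-to-frame cross-entropy terms.
    loss_f2t = F.cross_entropy(logits, targets)
    loss_t2f = F.cross_entropy(logits.t(), targets)
    return (loss_f2t + loss_t2f) / 2

# Toy usage with random tensors standing in for encoder outputs.
frames = torch.randn(8, 512)
snippets = torch.randn(8, 512)
print(contrastive_frame_text_loss(frames, snippets).item())
```

Because the supervision signal comes entirely from the videos' own alignment of speech and imagery, a model trained this way never needs manually labeled categories, which is what lets an approach like this scale to millions of YouTube videos.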

Our capacity for commonsense reasoning is shaped by how we experience causes and effects. Teaching machines this type of “script knowledge” is a significant challenge, in part because of the amount of data it requires. For example, even a single photo of people dining at a restaurant can imply a wealth of information, like the fact that the people had to agree where to go, meet up, and enter the restaurant before sitting down.
