When I started writing about AI in 2019, my first article was dedicated to what machine learning is and how it works. It explained the difference between weak and strong (or general) AI and mentioned that we have no idea how to create the latter.
Don’t get me wrong – multimodal learning wasn’t a new concept even in 2019. Back then, TechRepublic called it the “future of AI,” and ABI Research predicted that multimodal learning would be key for self-driving cars, robotics, consumer devices, and healthcare. But only today, thanks to powerful hardware and cloud technologies, can we realize the benefits of multimodal learning to their full extent.
Multimodal learning is a type of machine learning in which a model is trained to understand and work with multiple forms of input data, such as text, images, and audio.
These different types of data correspond to different modalities of the world – the ways in which it can be experienced. The world can be seen, heard, or described in words. For an ML model to perceive the world in all of its complexity, understanding different modalities is a useful skill.
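To make the idea concrete, here is a minimal sketch in PyTorch of what a multimodal model can look like: each modality gets its own encoder, and the resulting embeddings are fused into a single prediction. The class name, layer sizes, and inputs are all made up for illustration, not taken from any particular library or paper.

```python
import torch
import torch.nn as nn

class TinyMultimodalClassifier(nn.Module):
    """Toy model: separate encoders per modality, fused by concatenation."""

    def __init__(self, vocab_size=1000, embed_dim=64, num_classes=2):
        super().__init__()
        # Text branch: token embeddings averaged into one vector per example.
        self.text_embed = nn.EmbeddingBag(vocab_size, embed_dim)
        # Image branch: a small conv net producing a vector of the same size.
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(16, embed_dim),
        )
        # Fusion: concatenate both embeddings, then classify jointly.
        self.classifier = nn.Linear(embed_dim * 2, num_classes)

    def forward(self, token_ids, image):
        text_vec = self.text_embed(token_ids)        # (batch, embed_dim)
        image_vec = self.image_encoder(image)        # (batch, embed_dim)
        fused = torch.cat([text_vec, image_vec], dim=-1)
        return self.classifier(fused)

model = TinyMultimodalClassifier()
tokens = torch.randint(0, 1000, (4, 12))   # batch of 4 "sentences", 12 tokens each
images = torch.randn(4, 3, 32, 32)         # batch of 4 RGB 32x32 images
logits = model(tokens, images)             # one prediction per (text, image) pair
print(logits.shape)                        # torch.Size([4, 2])
```

Concatenation is just the simplest fusion strategy; real systems often use more sophisticated ones, such as cross-attention, but the core idea is the same: signals from different modalities are combined so the model can reason over them together.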