Bridging Images and Text - a Survey of VLMs

Submitted by Style Pass on 2024-09-18 10:00:14

Large Language Models, or LLMs, have been all the rage since the advent of ChatGPT in 2022. This is largely thanks to the success of the transformer architecture and the availability of terabytes of text data on the internet. Despite their fame, LLMs are fundamentally limited to working with text alone.

Vision Language Models, or VLMs, are AI models that use both images and text to perform tasks that fundamentally need both. With how good LLMs have become, building quality VLMs is the next logical step towards Artificial General Intelligence. In this article, let's understand the fundamentals of VLMs with a focus on how to build one. Throughout, we will cover the latest research papers and provide links to them.

A couple of disclaimers: VLMs work with text and images, and there is a class of models called Image Generators that do the following: