The multimodal language modeling space is far more undefined, ragged, and open to new ideas than that of language-only models. Where language-only modeling has a set of established tasks and behaviors that frontier labs hillclimb on (for example, OpenAI's o1 pushing new training approaches on the hardest reasoning problems), frontier labs and small labs alike are still trying to define what multimodal models should be used for. What does it mean for AI to see the world? Having a strong suite of open models is central to the field developing in a well-rounded and transparent manner, two conditions needed for beneficial outcomes.
Most multimodal language model research today uses late-fusion models, where the model is initialized from a pretrained language backbone and, usually, a pretrained image encoder (which is likely how GPT-4V was built). This is an expensive form of fine-tuning on top of a base language model, but the compute costs are still more accessible than most realize. Many other architectures exist, but late fusion has been popular due to its stability and predictability. Molmo and Llama 3.2 Vision are trained with this method.
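To make the architecture concrete, here is a minimal, illustrative PyTorch sketch of late fusion. Every module name and size below is a toy placeholder, not any lab's actual implementation: a stand-in vision encoder feeds a small projector that maps image features into the language backbone's embedding space, and the fused sequence runs through the language model.

```python
import torch
import torch.nn as nn


class LateFusionVLM(nn.Module):
    """Toy late-fusion model: image features are projected into the
    language model's embedding space and prepended to the text tokens."""

    def __init__(self, vocab_size=32000, d_model=512):
        super().__init__()
        # Stand-in for a pretrained vision encoder (e.g. a ViT); a single
        # linear layer over flattened 16x16 RGB patches keeps the sketch tiny.
        self.vision_encoder = nn.Linear(3 * 16 * 16, 768)
        # Projector (typically the first part to be trained) maps vision
        # features into the language model's embedding space.
        self.projector = nn.Sequential(
            nn.Linear(768, d_model), nn.GELU(), nn.Linear(d_model, d_model)
        )
        # Stand-in for the pretrained language backbone.
        self.token_emb = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.lm = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, image_patches, text_ids):
        # image_patches: (batch, n_patches, 3*16*16); text_ids: (batch, seq_len)
        img_tokens = self.projector(self.vision_encoder(image_patches))
        txt_tokens = self.token_emb(text_ids)
        # Late fusion: projected image tokens are simply prepended to the
        # text sequence and the combined sequence runs through the LM.
        seq = torch.cat([img_tokens, txt_tokens], dim=1)
        mask = nn.Transformer.generate_square_subsequent_mask(seq.size(1))
        return self.lm_head(self.lm(seq, mask=mask))


model = LateFusionVLM()
logits = model(torch.randn(2, 16, 3 * 16 * 16), torch.randint(0, 32000, (2, 8)))
print(logits.shape)  # torch.Size([2, 24, 32000])
```

In practice the backbone and encoder come pretrained and only the projector (and later the language model) is fine-tuned, which is why the compute bill looks like fine-tuning rather than pretraining.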
The promised gains from scaling data with early-fusion models, which are pretrained from scratch on multimodal datasets, haven't yet materialized. It may be that the benefits only become clear when tested on GPT-5-scale clusters.
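For contrast, here is a similarly hedged sketch of early fusion, under the common assumption that images are discretized into codebook tokens (stubbed with random ids here) that share one vocabulary and one sequence with text, so a single model is pretrained on the interleaved stream rather than bolted onto a finished language backbone.

```python
import torch
import torch.nn as nn

TEXT_VOCAB, IMAGE_CODES, D_MODEL = 32000, 8192, 512

# One shared vocabulary: text tokens and discrete image codes live in the
# same embedding table, and one transformer is pretrained over both.
embed = nn.Embedding(TEXT_VOCAB + IMAGE_CODES, D_MODEL)
layer = nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True)
backbone = nn.TransformerEncoder(layer, num_layers=2)
head = nn.Linear(D_MODEL, TEXT_VOCAB + IMAGE_CODES)

# A toy interleaved sample: some text ids followed by "image" ids of the
# kind a VQ-style image tokenizer would normally produce.
text_ids = torch.randint(0, TEXT_VOCAB, (2, 8))
image_ids = torch.randint(TEXT_VOCAB, TEXT_VOCAB + IMAGE_CODES, (2, 16))
stream = torch.cat([text_ids, image_ids], dim=1)

mask = nn.Transformer.generate_square_subsequent_mask(stream.size(1))
logits = head(backbone(embed(stream), mask=mask))
print(logits.shape)  # torch.Size([2, 24, 40192])
```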