NVLM: Open Frontier-Class Multimodal LLMs


We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks, rivaling the leading proprietary models (e.g., GPT-4o) and open-access models (e.g., Llama 3-V 405B and InternVL 2). Remarkably, after multimodal training, NVLM 1.0 shows improved accuracy on text-only tasks over its LLM backbone. We are open-sourcing the model weights and training code in Megatron-Core for the community.
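Since the weights are open-sourced, a minimal, hypothetical loading sketch with Hugging Face Transformers is shown below; the repository id `nvidia/NVLM-D-72B` and the `trust_remote_code` requirement are assumptions, not details stated in this post.

```python
# Hypothetical sketch: loading the released NVLM-D checkpoint with
# Hugging Face Transformers. The repo id and custom-code flag are assumptions.
import torch
from transformers import AutoTokenizer, AutoModel

MODEL_ID = "nvidia/NVLM-D-72B"  # assumed Hugging Face repository id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModel.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,   # reduced precision to fit the 72B weights
    low_cpu_mem_usage=True,
    device_map="auto",            # shard the model across available GPUs
    trust_remote_code=True,       # assumes the repo ships custom modeling code
).eval()

# Text-only prompt; the post notes the model stays strong on text-only tasks.
inputs = tokenizer("Solve step by step: what is 17 * 24?", return_tensors="pt")
```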

We compare NVLM 1.0 to leading proprietary and open-access multimodal LLMs in the table above. Note that the model weights for *Llama 3-V have not been released yet. The results demonstrate that NVLM 1.0 achieves performance on par with the leading models across both vision-language and text-only tasks. Specifically, our 72B model achieves the highest OCRBench and VQAv2 scores to date. NVLM outperforms or is on par with GPT-4o on all key benchmarks, including MathVista, OCRBench, ChartQA, and DocVQA, except MMMU. Importantly, we also compare each multimodal LLM to its backbone LLM on text-only tasks. Llama 3-V 70B and 405B show no degradation on text-only tasks, as their LLM backbones are frozen during multimodal training. The leading InternVL 2 model shows significant degradation on text-only benchmarks, including MMLU, GSM8K, MATH, and HumanEval. In contrast, our NVLM-1.0 72B model demonstrates significant improvements over its text backbone on text-only math and coding benchmarks, with average accuracy increasing by 4.3 points after multimodal training. These results show that the multimodal NVLM-1.0 72B model, which outperforms Gemini 1.5 Pro, is also highly compelling at text-only tasks (e.g., math, coding, reasoning).

The NVLM-1.0-D 72B model demonstrates strong instruction-following capability: it appropriately controls the length of its output based on the given instructions, and it can generate a very high-quality, detailed description of the provided image.
