This blog post introduces SmolVLM, a 2B VLM, SOTA for its memory footprint. SmolVLM is small, fast, memory-efficient, and fully open-source. All model

SmolVLM - small yet mighty Vision Language Model

submited by

Style Pass

2024-11-26 17:00:08

This blog post introduces SmolVLM, a 2B VLM, SOTA for its memory footprint. SmolVLM is small, fast, memory-efficient, and fully open-source. All model checkpoints, VLM datasets, training recipes and tools are released under the Apache 2.0 license.

This year has seen a boom in multimodal AI with many large vision language models released. The trends were to initially scale up compute, later scale up the data diversity by generating synthetic data with large models, and, recently, scale down to make these models more efficient. Small open models allow local deployment to browser or edge devices, cut inference costs, and enable user customization. Some notable examples of these models include PaliGemma 3B, moondream2, and Qwen2VL.

In this blog post, we introduce SmolVLM, a new family of 2B small vision language models that can be used commercially and deployed to smaller local setups, with completely open training pipelines.

We release three models: SmolVLM-Base, which can be used for downstream fine-tuning, SmolVLM-Synthetic, the fine-tuned variant on synthetic data, and SmolVLM Instruct, the fine-tuned instruction variant, which can be used out of the box for interactive end-user applications.