Hugging Face releases new multimodal models SmolVLM-256M and SmolVLM-500M

January 26 news: Hugging Face has released two new multimodal models, SmolVLM-256M and SmolVLM-500M, with SmolVLM-256M claimed to be the world's smallest vision language model (VLM). The models are primarily distilled from the 80B-parameter model the Hugging Face team trained last year, and the company says they strike a balance between performance and resource requirements.

Ready-to-use, easy deployment
Both SmolVLM-256M and SmolVLM-500M are "plug-and-play" and can be deployed directly via transformers, MLX, and ONNX. Specifically, both models use SigLIP as the image encoder and SmolLM2 as the language backbone. SmolVLM-256M is currently the smallest multimodal model, capable of accepting arbitrary interleaved sequences of images and text as input and generating text as output. Its capabilities include describing image content, generating captions for short videos, processing PDFs, and more.
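As a rough illustration of the transformers path, the minimal sketch below loads the smaller model and asks it to describe an image. It assumes the checkpoint is published on the Hugging Face Hub under the id "HuggingFaceTB/SmolVLM-256M-Instruct" and that the standard vision-to-sequence classes apply; the local file name "example.jpg" is likewise an assumption.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

# Assumed Hub id for the 256M instruct checkpoint.
model_id = "HuggingFaceTB/SmolVLM-256M-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# A single image plus a text instruction, formatted with the chat template.
image = Image.open("example.jpg")
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image."},
    ]},
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")

# Generate a short text description of the image.
generated = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```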

Lightweight and efficient, mobile platform-friendly
Hugging Face claims that SmolVLM-256M, thanks to its compact size, can easily run on mobile platforms, performing inference on a single image with less than 1GB of GPU memory. This makes the model well suited to resource-constrained applications.
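To check that memory figure on your own hardware, one hypothetical measurement, continuing from the sketch above and assuming a CUDA device is available, would be to track the peak allocation around a single generate call:

```python
import torch

# Assumes `model` and `inputs` from the previous sketch.
model = model.to("cuda")
inputs = {k: v.to("cuda") for k, v in inputs.items()}

# Reset the counter, run one single-image inference, then read the peak.
torch.cuda.reset_peak_memory_stats()
with torch.no_grad():
    model.generate(**inputs, max_new_tokens=64)

peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak GPU memory during single-image inference: {peak_gb:.2f} GB")
```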

High-performance model SmolVLM-500M
SmolVLM-500M is designed for scenarios that require higher performance, and Hugging Face says it is well suited to deployment in enterprise environments. SmolVLM-500M requires 1.23GB of GPU memory for inference on a single image. Although its memory footprint is larger than SmolVLM-256M's, its inference output is more precise.
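Under the same assumptions as the earlier sketch, switching to the larger variant would only require changing the checkpoint id; the Hub id below is an assumption, and the rest of the inference code stays the same.

```python
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq

# Assumed Hub id for the 500M instruct checkpoint; usage mirrors the 256M sketch.
model_id = "HuggingFaceTB/SmolVLM-500M-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id, torch_dtype=torch.bfloat16)
```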

Open-source license, easy for developers to use
Both models are released under the Apache 2.0 open-source license, and the research team provides example programs based on transformers and WebGPU. All models and their demos are publicly available for developers to download and use.

Summary
Hugging Face's new multimodal models SmolVLM-256M and SmolVLM-500M target lightweight and higher-performance scenarios, respectively. SmolVLM-256M, as the world's smallest vision language model, is suited to running on mobile platforms, while SmolVLM-500M offers higher inference accuracy for enterprise applications. The open-source license and example programs for both models make them easy for developers to adopt, promoting the wider application of multimodal models.