DeepSeek Releases the New Visual Multimodal Model Janus-Pro-7B, Surpassing Stable Diffusion and DALL-E 3

Janus-Pro-7B Open-Source Release
On January 28, DeepSeek announced the open-source release of its new visual multimodal model, Janus-Pro-7B. The model outperformed Stable Diffusion and OpenAI's DALL-E 3 on the GenEval and DPG-Bench text-to-image benchmarks.

Innovative Autoregressive Framework
Janus-Pro is an autoregressive framework that unifies multimodal understanding and generation. Unlike previous approaches, it addresses the limitations of earlier unified frameworks by decoupling visual encoding into separate pathways, one for understanding and one for generation, while a single unified transformer processes both. This decoupling alleviates the conflict between the representations needed for understanding and those needed for generation, and makes the framework more flexible.
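The decoupling idea can be illustrated with a minimal PyTorch sketch. Everything below (class names, dimensions, the patchify encoder, the codebook size, and the two-layer transformer) is a simplified stand-in for illustration, not Janus-Pro's actual implementation; the point is only that the two visual pathways are independent while the transformer backbone is shared:

```python
import torch
import torch.nn as nn

class DecoupledMultimodalModel(nn.Module):
    """Illustrative sketch: two independent visual pathways, one shared transformer."""
    def __init__(self, d_model=512, vocab_size=16384):
        super().__init__()
        # Pathway 1: a semantic patch encoder for understanding (stand-in for SigLIP-L).
        self.understanding_encoder = nn.Sequential(
            nn.Conv2d(3, d_model, kernel_size=16, stride=16),  # patchify
            nn.Flatten(2),                                     # (B, d, H/16 * W/16)
        )
        # Pathway 2: embeddings of discrete image codes for generation
        # (stand-in for a VQ tokenizer's codebook).
        self.generation_embedding = nn.Embedding(vocab_size, d_model)
        # A single unified transformer processes features from either pathway.
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)

    def understand(self, images):
        # images: (B, 3, H, W) -> patch features -> shared transformer
        feats = self.understanding_encoder(images).transpose(1, 2)  # (B, N, d)
        return self.transformer(feats)

    def generate(self, image_token_ids):
        # image_token_ids: (B, N) discrete codes -> embeddings -> shared transformer
        embs = self.generation_embedding(image_token_ids)
        return self.transformer(embs)

model = DecoupledMultimodalModel()
out_u = model.understand(torch.randn(1, 3, 384, 384))       # understanding pathway
out_g = model.generate(torch.randint(0, 16384, (1, 576)))   # generation pathway
print(out_u.shape, out_g.shape)  # both (1, 576, 512)
```

In Janus-Pro itself, the understanding pathway is a semantic encoder (SigLIP-L, per the details below) and the generation pathway uses discrete image tokens, but both feed the same language-model backbone.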

Performance Surpassing Traditional Models
Janus-Pro surpasses previous unified models and matches or exceeds the performance of task-specific models. With its simplicity, flexibility, and efficiency, Janus-Pro is a strong contender among next-generation unified multimodal models.

Unified Multimodal Large Language Model
Janus-Pro is a unified multimodal large language model (MLLM) that achieves more efficient processing by decoupling visual encoding for multimodal understanding from visual encoding for generation. Janus-Pro is built on the DeepSeek-LLM-1.5b-base and DeepSeek-LLM-7b-base models. For multimodal understanding tasks, Janus-Pro uses SigLIP-L as the visual encoder, supporting 384x384 pixel image input. For image generation tasks, it uses a separate discrete (VQ) image tokenizer with a downsampling rate of 16.
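As a quick sanity check on these numbers: a 16x downsampling rate on a 384x384 image yields a 24x24 latent grid, i.e. 576 discrete tokens per generated image (this assumes the generation tokenizer also operates at 384x384 resolution, which the article does not state explicitly):

```python
# Image-token arithmetic implied by the stated configuration (illustrative).
image_size = 384                    # input resolution (SigLIP-L side)
downsample = 16                     # tokenizer downsampling rate
grid = image_size // downsample     # 384 // 16 = 24 latent positions per side
tokens_per_image = grid * grid      # 24 * 24 = 576 discrete tokens per image
print(grid, tokens_per_image)       # 24 576
```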

Advanced Versions and Improvements
Janus-Pro is an advanced version of the earlier Janus model. Specifically, Janus-Pro incorporates an optimized training strategy, expanded training data, and a larger model scale. With these improvements, it makes significant progress in both multimodal understanding and text-to-image instruction following, while also improving the stability of text-to-image generation.

JanusFlow Architecture
According to the official description, JanusFlow introduces a minimalist architecture that integrates autoregressive language models with rectified flow, a state-of-the-art generative modeling method. A key finding is that rectified flow can be trained directly within the large-language-model framework without complex architectural adjustments. Extensive experiments show that JanusFlow achieves performance comparable to or better than specialized models in their respective domains, and significantly outperforms existing unified approaches on standard benchmarks. This work is a step toward more efficient and more general vision-language models.
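To make the rectified-flow idea concrete, here is a minimal, self-contained PyTorch sketch on a toy 2-D distribution. The network, dimensions, and hyperparameters are placeholders for illustration, not JanusFlow's implementation; they show only the core recipe: regress a velocity field toward the straight-line displacement between noise and data, then sample by integrating the learned ODE:

```python
import torch
import torch.nn as nn

# Minimal rectified-flow sketch (illustrative; not JanusFlow's actual code).
# A network v(x_t, t) is trained to predict the constant velocity (x1 - x0)
# along the straight path x_t = (1 - t) * x0 + t * x1, where x0 is Gaussian
# noise and x1 is a sample from the target ("data") distribution.
velocity_net = nn.Sequential(nn.Linear(2 + 1, 64), nn.SiLU(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(velocity_net.parameters(), lr=1e-3)

for step in range(2000):
    x1 = torch.randn(256, 2) * 0.1 + 2.0     # toy 2-D "data" distribution
    x0 = torch.randn(256, 2)                 # Gaussian noise
    t = torch.rand(256, 1)                   # random timestep in [0, 1]
    x_t = (1 - t) * x0 + t * x1              # point on the straight path
    target_v = x1 - x0                       # rectified-flow regression target
    pred_v = velocity_net(torch.cat([x_t, t], dim=1))
    loss = ((pred_v - target_v) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Sampling: integrate the learned ODE from noise with a few Euler steps.
with torch.no_grad():
    x = torch.randn(16, 2)
    for i in range(10):
        t = torch.full((16, 1), i / 10)
        x = x + velocity_net(torch.cat([x, t], dim=1)) / 10
print(x.mean(dim=0))  # should land near (2.0, 2.0), the toy data mean
```

The appeal of this formulation, and the reason it folds cleanly into an autoregressive LLM framework, is that training reduces to a plain regression loss with no adversarial training or complex noise schedules.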

Conclusion
DeepSeek's open-source Janus-Pro-7B model excels in multimodal understanding and generation tasks, surpassing Stable Diffusion and DALL-E 3 thanks to its autoregressive framework and decoupled visual encoding. The release not only showcases DeepSeek's technical leadership but also gives developers a powerful open tool, advancing the broader multimodal ecosystem.

Available at: GitHub and HuggingFace