A 3B-active-parameter native unified multimodal model for image and video understanding, generation, and editing.
Lumina-DiMOO: An Omni Diffusion Large Language Model for Multi-Modal Generation and Understanding
CVPR 2025 🔥 Official implementation of "TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation".
Janus is a multimodal large language model that unifies visual understanding and image generation in a single neural network, serving as both an image-analysis model and a text-to-image generator. It uses a unified transformer backbone and a multimodal latent space to bridge text and visual data. Decoupled visual encoding and cross-modal tokenization separate the paths for discriminative understanding and generative tasks, with images represented as grids of discrete codes. The projec