LLaDA is a masked diffusion language model and conditional text generator. It generates text by iteratively refining masked tokens through a diffusion process rather than predicting the next token in a sequence. The project functions as a vision-language diffusion model, converting visual inputs into text responses. It also serves as a preference optimization framework that uses log-likelihood estimation and evidence lower bounds to tune model responses. The system supports multi-round conversational AI and text sequence evaluation. It integrates vision-language embedding for cross-modal con
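As a rough illustration of the iterative refinement described for LLaDA above, the toy sketch below starts from a fully masked sequence and, at each step, commits the most confident predictions while leaving the rest masked for later steps. Everything in it is an assumption made for illustration: `toy_predictor` is a hypothetical random stand-in for a trained mask predictor, the confidence-based commit schedule is one plausible choice rather than LLaDA's confirmed sampler, and no LLaDA code or API is used.

```python
import random

MASK = "<mask>"  # hypothetical mask symbol, for illustration only


def toy_predictor(tokens, position, rng):
    """Hypothetical stand-in for a trained mask predictor.

    Returns a (token, confidence) pair for one masked position.
    A real model would condition on `tokens`; this toy ignores them.
    """
    vocab = ["the", "cat", "sat", "on", "mat"]
    return rng.choice(vocab), rng.random()


def iterative_unmask(length, steps, seed=0):
    """Fill a fully masked sequence over at most `steps` refinement steps."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    for step in range(steps):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        # Propose a token and a confidence for every masked position.
        proposals = {i: toy_predictor(tokens, i, rng) for i in masked}
        # Spread the remaining masked positions evenly over the remaining
        # steps (ceiling division), so the last step commits everything left.
        remaining_steps = steps - step
        k = -(-len(masked) // remaining_steps)
        # Commit only the k most confident proposals; the rest stay masked.
        keep = sorted(masked, key=lambda i: proposals[i][1], reverse=True)[:k]
        for i in keep:
            tokens[i] = proposals[i][0]
    return tokens


if __name__ == "__main__":
    print(iterative_unmask(length=8, steps=4))
```

Unlike left-to-right decoding, positions are filled in an order set by confidence rather than by index, which is the contrast with next-token prediction that the description draws.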
OmniGen is a unified image generation model: a single multimodal diffusion framework that processes text, images, and vision tasks. It treats diverse vision operations as image synthesis problems solved with shared model weights, removing the need for external adapter modules. The system supports subject-driven image generation, which preserves the identity of objects from reference photos, and multi-reference image synthesis. It also operates as an instruction-based image editor, modifying visual content through natural languag
Lumina-DiMOO: An Omni Diffusion Large Language Model for Multi-Modal Generation and Understanding
This repository is the official implementation of FUDOKI: Discrete Flow-based Unified Understanding and Generation via Kinetic-Optimal Velocities.