# lucidrains/dalle2-pytorch

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/lucidrains-dalle2-pytorch).**

11,310 stars · 1,080 forks · Python · MIT

## Links

- GitHub: https://github.com/lucidrains/DALLE2-pytorch
- awesome-repositories: https://awesome-repositories.com/repository/lucidrains-dalle2-pytorch.md

## Topics

`artificial-intelligence` `deep-learning` `text-to-image`

## Description

This repository is a PyTorch implementation of DALL-E 2, a text-to-image model that synthesizes high-fidelity images from natural language descriptions. A diffusion decoder turns latent embeddings into pixels through an iterative denoising process.

The system employs a two-stage latent mapping process, using a CLIP-based latent prior to map text embeddings to image embeddings before decoding them into pixels. It features a cascading diffusion decoder that produces high-resolution imagery by passing low-resolution outputs through a sequence of models at increasing scales.
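
The data flow described above can be sketched in plain Python. This is a toy illustration of the control flow only, not the library's API: `toy_prior`, `toy_base_decoder`, `toy_upsampler`, and `generate` are hypothetical names, and each learned diffusion model is replaced by a trivial arithmetic stand-in.

```python
from typing import List, Sequence

Embedding = List[float]
Image = List[List[float]]  # grayscale image, row-major


def toy_prior(text_embed: Embedding) -> Embedding:
    """Stand-in for the latent prior: text embedding -> image embedding."""
    # Assumption: a fixed affine map. A real prior is a learned model.
    return [0.5 * x + 0.1 for x in text_embed]


def toy_base_decoder(image_embed: Embedding, size: int) -> Image:
    """Stand-in for the lowest-resolution decoder stage."""
    mean = sum(image_embed) / len(image_embed)
    return [[mean for _ in range(size)] for _ in range(size)]


def toy_upsampler(image: Image, factor: int) -> Image:
    """Stand-in for one super-resolution stage (nearest-neighbour upsampling)."""
    return [
        [px for px in row for _ in range(factor)]
        for row in image
        for _ in range(factor)
    ]


def generate(text_embed: Embedding, base_size: int = 4,
             factors: Sequence[int] = (2, 2)) -> Image:
    """Stage 1: prior. Stage 2: base decoder followed by the cascade."""
    image_embed = toy_prior(text_embed)
    image = toy_base_decoder(image_embed, base_size)
    for factor in factors:
        # Each cascade stage consumes the previous stage's output.
        image = toy_upsampler(image, factor)
    return image


image = generate([1.0, 3.0])
print(len(image), len(image[0]))
```

With the default arguments the image grows from 4x4 to 8x8 to 16x16, mirroring how each cascade stage receives the previous stage's lower-resolution output.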

The project covers a broad range of generative capabilities, including image inpainting and super-resolution for localized editing and detail enhancement. It provides tools for multimodal embedding creation, contrastive language-image pre-training, and latent space compression to improve generation efficiency.
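
As a rough illustration of mask-based inpainting, the sketch below merges an original image with generated content using a boolean mask. The function name `composite` and the mask convention (`True` means keep the original pixel) are assumptions made for this example, not taken from the repository. Diffusion samplers commonly apply a merge like this at each denoising step; only the merge itself is shown here.

```python
from typing import List

Image = List[List[float]]
Mask = List[List[bool]]


def composite(original: Image, generated: Image, keep: Mask) -> Image:
    """Keep original pixels where `keep` is True; use generated pixels elsewhere."""
    if not (len(original) == len(generated) == len(keep)):
        raise ValueError("image and mask heights must match")
    merged: Image = []
    for o_row, g_row, k_row in zip(original, generated, keep):
        if not (len(o_row) == len(g_row) == len(k_row)):
            raise ValueError("image and mask widths must match")
        merged.append([o if k else g for o, g, k in zip(o_row, g_row, k_row)])
    return merged


original = [[1.0, 1.0], [1.0, 1.0]]
generated = [[9.0, 9.0], [9.0, 9.0]]
keep = [[True, False], [False, True]]
print(composite(original, generated, keep))
```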

Training infrastructure includes support for distributed multi-GPU training, mixed-precision training with gradient accumulation, and GPU memory management for training large models on limited hardware.
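
Gradient accumulation, which the tags below list among the training optimizations, can be illustrated without any framework. The sketch assumes a one-parameter model `y = w * x` with mean squared error and equally sized micro-batches; `grad_mse` and `train_step_accumulated` are hypothetical names for this example and do not come from the repository.

```python
from typing import Sequence, Tuple

Batch = Sequence[Tuple[float, float]]  # (x, y) pairs


def grad_mse(w: float, batch: Batch) -> float:
    """Gradient of the mean squared error of y = w * x over one batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def train_step_accumulated(w: float, micro_batches: Sequence[Batch],
                           lr: float) -> float:
    """One optimizer step assembled from several equally sized micro-batches."""
    accumulated = 0.0
    for batch in micro_batches:
        # Dividing by the number of micro-batches makes the sum equal to
        # the gradient of the full batch, so only one small batch needs
        # to be resident at a time.
        accumulated += grad_mse(w, batch) / len(micro_batches)
    return w - lr * accumulated


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
full_step = 0.0 - 0.01 * grad_mse(0.0, data)
accumulated_step = train_step_accumulated(0.0, [data[:2], data[2:]], lr=0.01)
print(full_step, accumulated_step)
```

Both updates land on the same weight, which is the point of the technique: a larger effective batch size at the memory cost of a smaller one.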

## Tags

### Artificial Intelligence & ML

- [Image Diffusion Models](https://awesome-repositories.com/f/artificial-intelligence-ml/computer-vision-systems/image-diffusion-models.md) — Implements a generative model that creates high-fidelity images through an iterative denoising process.
- [Diffusion Model Training](https://awesome-repositories.com/f/artificial-intelligence-ml/generative-ai-resources/diffusion-visual-models/generative-ai-models/diffusion-models/diffusion-model-training.md) — Implements a training pipeline for the diffusion prior that maps text embeddings to image embeddings. ([source](https://github.com/lucidrains/dalle2-pytorch#readme))
- [Cascading Decoders](https://awesome-repositories.com/f/artificial-intelligence-ml/generative-ai-resources/diffusion-visual-models/generative-ai-models/diffusion-models/diffusion-model-training/cascading-decoders.md) — Implements a cascading diffusion decoder to produce high-resolution imagery by passing outputs through multiple models at increasing scales.
- [Staged Training Pipelines](https://awesome-repositories.com/f/artificial-intelligence-ml/generative-ai-resources/diffusion-visual-models/generative-ai-models/diffusion-models/diffusion-model-training/staged-training-pipelines.md) — Implements staged training for individual components of the generation pipeline to improve efficiency and resolution scaling. ([source](https://github.com/lucidrains/dalle2-pytorch#readme))
- [Cross-Modal Latent Mappings](https://awesome-repositories.com/f/artificial-intelligence-ml/generative-ai-resources/diffusion-visual-models/generative-ai-models/latent-space-generative-models/latent-space-projections/image-to-latent-projections/generative-latent-mappings/cross-modal-latent-mappings.md) — Implements a two-stage process that predicts an image embedding from a text embedding before decoding to pixels.
- [Text-to-Image Latent Mappings](https://awesome-repositories.com/f/artificial-intelligence-ml/generative-ai-resources/diffusion-visual-models/generative-ai-models/latent-space-generative-models/latent-space-projections/image-to-latent-projections/generative-latent-mappings/text-to-image-latent-mappings.md) — Uses a CLIP-based latent prior to map text embeddings to image embeddings to guide the generation process.
- [Multimodal Embeddings](https://awesome-repositories.com/f/artificial-intelligence-ml/generative-ai-resources/diffusion-visual-models/generative-ai-models/latent-space-generative-models/shared-latent-spaces/multimodal-embeddings.md) — Aligns text and image representations into a shared embedding space to enable cross-modal retrieval and generation.
- [Latent Diffusion Models](https://awesome-repositories.com/f/artificial-intelligence-ml/generative-ai-resources/diffusion-visual-models/generative-models/latent-diffusion-models.md) — Optimizes image generation by performing the diffusion process within a compressed latent space. ([source](https://github.com/lucidrains/dalle2-pytorch#readme))
- [Prior Networks](https://awesome-repositories.com/f/artificial-intelligence-ml/generative-ai-resources/diffusion-visual-models/generative-models/latent-diffusion-models/prior-networks.md) — Employs a CLIP-based latent prior to map text embeddings to image embeddings before decoding them into pixels.
- [Text-to-Image Synthesis](https://awesome-repositories.com/f/artificial-intelligence-ml/generative-ai-resources/generative-ai/text-to-image-synthesis.md) — Synthesizes high-fidelity images based on text embeddings using a decoder architecture. ([source](https://github.com/lucidrains/dalle2-pytorch#readme))
- [Visual Embedding Predictions](https://awesome-repositories.com/f/artificial-intelligence-ml/generative-ai-resources/generative-ai/text-to-image-synthesis/visual-embedding-predictions.md) — Generates a visual embedding from a text embedding using a prior network to bridge language and imagery. ([source](https://github.com/lucidrains/dalle2-pytorch#readme))
- [High-Resolution Synthesis](https://awesome-repositories.com/f/artificial-intelligence-ml/image-super-resolution-models/high-resolution-synthesis.md) — Generates high-fidelity images by passing data through a sequence of networks at increasing resolutions. ([source](https://github.com/lucidrains/dalle2-pytorch#readme))
- [Cascaded Image Decoders](https://awesome-repositories.com/f/artificial-intelligence-ml/inference-scaling/resolution-scaling/resolution-independent-inference/cascaded-resolution-inference/cascaded-image-decoders.md) — Implements a cascading diffusion decoder to produce high-resolution imagery through a sequence of models at increasing scales.
- [Cascaded Upscaling Models](https://awesome-repositories.com/f/artificial-intelligence-ml/neural-network-architectures/u-net-architectures/cascaded-upscaling-models.md) — Implements a denoising process that increases image detail by chaining multiple U-Net stages for progressive upscaling.
- [Generative Decoder Training](https://awesome-repositories.com/f/artificial-intelligence-ml/neural-network-architectures/u-net-architectures/generative-decoder-training.md) — Trains a U-Net image decoder to synthesize high-resolution images from image embeddings. ([source](https://github.com/lucidrains/dalle2-pytorch#readme))
- [Iterative Denoising Pipelines](https://awesome-repositories.com/f/artificial-intelligence-ml/neural-network-architectures/u-net-architectures/iterative-denoising-pipelines.md) — Uses a U-Net architecture with symmetrical encoder-decoder paths to iteratively remove noise from image data.
- [Image Embeddings](https://awesome-repositories.com/f/artificial-intelligence-ml/text-embeddings/image-embeddings.md) — Generates mathematical representations of images and text to enable the translation of text prompts into visual embeddings. ([source](https://github.com/lucidrains/dalle2-pytorch#readme))
- [Text-to-Image Implementations](https://awesome-repositories.com/f/artificial-intelligence-ml/text-to-image-implementations.md) — Provides a full PyTorch implementation of a text-to-image model for synthesizing high-fidelity images.
- [Classifier-Free Guidance](https://awesome-repositories.com/f/artificial-intelligence-ml/classifier-free-guidance.md) — Provides a mechanism to modulate the influence of text prompts by interpolating between conditioned and unconditioned predictions.
- [CLIP Embedding APIs](https://awesome-repositories.com/f/artificial-intelligence-ml/clip-embedding-apis.md) — Transforms raw images and text into latent embeddings to accelerate the training of a prior network. ([source](https://github.com/lucidrains/dalle2-pytorch#readme))
- [Contrastive Learning Models](https://awesome-repositories.com/f/artificial-intelligence-ml/contrastive-learning-models.md) — Learns a shared latent space for text and images using a contrastive architecture to enable cross-modal retrieval. ([source](https://github.com/lucidrains/dalle2-pytorch#readme))
- [Pretrained Model Integrations](https://awesome-repositories.com/f/artificial-intelligence-ml/custom-model-training/pretrained-model-integrations.md) — Integrates pretrained CLIP embedding models into the generation pipeline to provide standardized data representations. ([source](https://github.com/lucidrains/dalle2-pytorch#readme))
- [Decoder Training Optimizations](https://awesome-repositories.com/f/artificial-intelligence-ml/decoder-training-optimizations.md) — Optimizes the decoder's ability to generate images through a trainer that manages learning rates and weight decay. ([source](https://github.com/lucidrains/dalle2-pytorch#readme))
- [Distributed GPU Training](https://awesome-repositories.com/f/artificial-intelligence-ml/distributed-gpu-training.md) — Synchronizes model weights and gradients across multiple GPU accelerators for large-scale dataset training.
- [Multi-GPU Training Distributions](https://awesome-repositories.com/f/artificial-intelligence-ml/generative-ai-resources/diffusion-visual-models/generative-ai-models/diffusion-models/diffusion-model-training/multi-gpu-training-distributions.md) — Distributes the training of diffusion priors and decoders across multiple GPU clusters to handle large datasets.
- [Latent Space Generative Models](https://awesome-repositories.com/f/artificial-intelligence-ml/generative-ai-resources/diffusion-visual-models/generative-ai-models/latent-space-generative-models.md) — Reduces computation time and memory by performing image diffusion in a compressed latent space. ([source](https://github.com/lucidrains/dalle2-pytorch#readme))
- [Image Inpainting](https://awesome-repositories.com/f/artificial-intelligence-ml/generative-ai-resources/diffusion-visual-models/generative-ai-pipelines/text-to-image-generators/image-inpainting.md) — Fills or replaces masked regions of an image by combining target imagery with boolean masks for localized synthesis.
- [Conditional Image Generation](https://awesome-repositories.com/f/artificial-intelligence-ml/image-generation-models/conditional-image-generation.md) — Allows adjustment of text conditioning strength to control the influence of prompts on the final image. ([source](https://github.com/lucidrains/dalle2-pytorch#readme))
- [Image Super Resolution Models](https://awesome-repositories.com/f/artificial-intelligence-ml/image-super-resolution-models.md) — Increases the resolution of generated imagery through a cascading diffusion process using multiple neural networks. ([source](https://github.com/lucidrains/dalle2-pytorch#readme))
- [Contrastive Pre-training](https://awesome-repositories.com/f/artificial-intelligence-ml/large-scale-model-training/vision-transformer-pre-training/contrastive-pre-training.md) — Learns shared embedding spaces for text and images to establish mathematical relationships between visual data and descriptions. ([source](https://github.com/lucidrains/dalle2-pytorch#readme))
- [Training Backend Optimizers](https://awesome-repositories.com/f/artificial-intelligence-ml/model-optimization/training-efficiency/training-backend-optimizers.md) — Optimizes hardware usage during prior and decoder training through mixed precision and gradient accumulation. ([source](https://github.com/lucidrains/dalle2-pytorch#readme))
- [Exponential Moving Average Weight Updates](https://awesome-repositories.com/f/artificial-intelligence-ml/model-weight-reconstruction/weight-smoothing/exponential-moving-average-weight-updates.md) — Maintains an exponential moving average (EMA) of the model weights to stabilize training and improve the quality of generated images. ([source](https://github.com/lucidrains/dalle2-pytorch#readme))
- [Training Memory Management](https://awesome-repositories.com/f/artificial-intelligence-ml/training-memory-management.md) — Manages GPU memory by loading and offloading network stages to enable training of large models on limited hardware. ([source](https://github.com/lucidrains/dalle2-pytorch#readme))
- [Component Stability Controls](https://awesome-repositories.com/f/artificial-intelligence-ml/training-stability-techniques/component-stability-controls.md) — Provides control over optimizers and moving averages for the decoder and prior to maintain training stability. ([source](https://github.com/lucidrains/dalle2-pytorch#readme))
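
Two of the mechanisms listed above, classifier-free guidance and exponential moving average weights, reduce to short update rules. The sketch below writes both out on plain lists of floats. The function names, the `scale` and `beta` parameters, and the default `beta = 0.99` are assumptions for illustration and are not the repository's API.

```python
from typing import List


def classifier_free_guidance(cond: List[float], uncond: List[float],
                             scale: float) -> List[float]:
    """Blend conditioned and unconditioned predictions.

    scale = 0 returns the unconditioned prediction, scale = 1 the
    conditioned one, and scale > 1 extrapolates past the conditioned
    prediction to strengthen the prompt's influence.
    """
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def ema_update(ema_weights: List[float], weights: List[float],
               beta: float = 0.99) -> List[float]:
    """Move the averaged weights a fraction (1 - beta) toward the live weights."""
    return [beta * e + (1.0 - beta) * w for e, w in zip(ema_weights, weights)]


cond, uncond = [1.0, 2.0], [0.0, 1.0]
print(classifier_free_guidance(cond, uncond, scale=2.0))

ema = [0.0]
for _ in range(3):
    ema = ema_update(ema, [1.0], beta=0.9)
print(ema)
```

A larger `beta` makes the averaged weights change more slowly, which smooths out noise from individual training steps.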

### Graphics & Multimedia

- [Cross-Modal Retrieval Alignment](https://awesome-repositories.com/f/graphics-multimedia/media-processing-analysis/audio-processing-systems/audio-processing/text-to-speech-engines/text-to-speech-engines/cross-modal-retrieval-alignment.md) — Aligns text and image representations into a shared mathematical space to enable cross-modal retrieval and generation.

### Part of an Awesome List

- [Computer Vision](https://awesome-repositories.com/f/awesome-lists/ai/computer-vision.md) — Implementation of the DALL-E 2 image generation model.
- [Text to Image](https://awesome-repositories.com/f/awesome-lists/more/text-to-image.md) — Listed in the “Text to Image” section of The Incredible PyTorch awesome list.