30 open-source projects similar to kohya-ss/sd-scripts, ranked by how many features they have in common. Compare stars, activity, and what each one does to find the best sd-scripts alternative.
kohya_ss is a graphical user interface and workbench for fine-tuning diffusion models, specifically designed for Stable Diffusion. It provides a suite of tools for training generative AI models, including specialized interfaces for creating Low-Rank Adaptation weights and training ControlNet spatial control networks. The project distinguishes itself through integrated VRAM usage optimization and hardware acceleration, featuring specific support for Intel GPUs via XPU-accelerated libraries. It implements parameter-efficient training methods and memory-saving techniques like gradient checkpoint
Diffusers is a PyTorch-based library and generative AI framework used to build, train, and deploy diffusion pipelines for producing multi-modal media. It provides a suite of tools for generating images, video, and audio from natural language descriptions, as well as specialized systems for text-to-image generation. The project differentiates itself through a modular architecture that separates noise schedulers, pretrained model blocks, and pipeline compositions. This structure allows for the construction of custom generation workflows and the ability to swap individual components of the diffu
ComfyUI-nunchaku is a 4-bit diffusion inference engine and a set of nodes for running low-precision quantized diffusion models within ComfyUI visual workflows. It provides a backend that reduces memory overhead and increases generation speed for transformer models. The project includes specialized tools for identity-preserving generation and an image-to-image guidance toolkit that uses depth maps and reference images. It also features a multimodal visual question answering implementation and a utility for merging multiple quantized model files into single unified files. The engine covers a b
This project is a toolkit for fine-tuning and managing text-to-image diffusion models. It focuses on low-rank adaptation to create small, portable weight files that customize model styles and behaviors without modifying the entire base model. The project provides specialized utilities for model distillation using singular value decomposition to extract adapters from fully trained models, as well as tools for blending and merging multiple adapters through weight interpolation. It includes capabilities for subject inversion and pivotal tuning to increase the visual fidelity of specific identiti
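As a rough illustration of what blending adapters through weight interpolation can look like, the sketch below merges two low-rank adapters in plain Python. It is not taken from the project: the function names (`lora_delta`, `blend_deltas`, `apply_delta`), the `scale` parameter, and the choice to interpolate the effective weight updates rather than the low-rank factors are all assumptions made for the example.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product, adequate for tiny illustrative matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_delta(up: Matrix, down: Matrix, scale: float = 1.0) -> Matrix:
    """Effective weight update of one adapter: scale * (up @ down)."""
    return [[scale * v for v in row] for row in matmul(up, down)]


def blend_deltas(delta_a: Matrix, delta_b: Matrix, alpha: float) -> Matrix:
    """Linear interpolation: alpha * delta_a + (1 - alpha) * delta_b."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [
        [alpha * x + (1.0 - alpha) * y for x, y in zip(row_a, row_b)]
        for row_a, row_b in zip(delta_a, delta_b)
    ]


def apply_delta(base: Matrix, delta: Matrix) -> Matrix:
    """Add the blended update to the (hypothetical) frozen base weights."""
    return [[w + d for w, d in zip(rw, rd)] for rw, rd in zip(base, delta)]


# Two hypothetical rank-1 adapters for a 2x2 layer.
delta_a = lora_delta(up=[[1.0], [0.0]], down=[[2.0, 0.0]])
delta_b = lora_delta(up=[[0.0], [1.0]], down=[[0.0, 4.0]])

blended = blend_deltas(delta_a, delta_b, alpha=0.5)
merged = apply_delta([[1.0, 0.0], [0.0, 1.0]], blended)
```

Interpolating the products `up @ down` rather than the factors themselves keeps the result a true weighted average of the two updates. The trade-off is that the blended update may have a higher rank than either input.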
This project is a DreamBooth implementation designed to personalize Stable Diffusion models. It serves as an AI image personalization tool and model tuner that enables the creation of unique subject identifiers to generate consistent, personalized images. The system focuses on subject-driven image synthesis by fine-tuning pre-trained diffusion models on small, custom datasets. This allows the model to recognize specific people, objects, or artistic styles and place those learned subjects into diverse contexts via text-to-image conditioning. The implementation includes a diffusion model optim
mmagic is a multimodal training pipeline and framework for generative AI, focusing on visual synthesis and restoration. It provides the infrastructure to build and train models for tasks such as text-to-image and text-to-video generation, 3D-aware content synthesis, and high-fidelity image translation using diffusion models and generative adversarial networks. The project distinguishes itself through specialized capabilities for generative model personalization, including techniques for fine-tuning subjects and styles. It also supports advanced visual manipulations such as latent space interp
SD.Next is an all-in-one web interface and multi-backend inference engine for generating, editing, and processing images and videos using diffusion models. It functions as a comprehensive tool for diffusion model management and an automated image processing pipeline for bulk operations. The project is distinguished by its hardware-backend abstraction layer, which provides automatic detection and acceleration for NVIDIA CUDA, AMD ROCm, Intel OpenVINO, and DirectML. It features a headless generative API and a programmatic command interface, allowing users to trigger tasks via REST API or CLI wi
Kolors is a generative model implementation for synthesizing photorealistic images from natural language descriptions and visual references. It utilizes a latent diffusion model framework to produce high-fidelity imagery, operating within a compressed latent space to improve generation efficiency and quality. The system functions as a multilingual image generator, interpreting text prompts in multiple languages to produce semantically accurate visual outputs. It includes a custom model training pipeline that uses low-rank adaptation to teach the model specific subjects or artistic styles from
DiffusionBee is a Stable Diffusion desktop client for macOS that functions as an AI image generator and editor. It allows for the local generation of images from text prompts and the management of diffusion models without requiring external cloud services or technical setup. The application includes a local diffusion model manager for importing and switching between custom trained model files to achieve specific artistic styles. It also features a system for tracking generation history and uploading assets to a public gallery. The software covers several image synthesis and manipulation work
This project is an educational course and collection of training materials focused on generative diffusion models. It provides a curriculum and practical guides for training, fine-tuning, and deploying models capable of synthesizing images, audio, and video. The material covers specific implementation strategies including noise-based synthesis, iterative refinement, and latent space compression. It provides instruction on guiding generative outputs through conditional synthesis and prompt adherence optimization, as well as techniques for image inpainting and text-based editing. The project i
Sana is a framework for high-resolution image and video synthesis based on a linear diffusion transformer. It provides a toolkit for the training, fine-tuning, and execution of text-to-image and text-to-video models, as well as a video generative world model capable of simulating physical environments with precise spatial control. The project is distinguished by its use of linear complexity layers to handle high resolutions and its support for long-form, minute-length video generation in real time. It implements a two-stage inference paradigm that separates structural generation from visual t
Facechain is a generative AI toolchain and portrait generator designed to create personalized synthetic identities and consistent digital portraits. It provides a pipeline for training and refining diffusion models to produce subject-driven image synthesis from reference photos. The project focuses on digital twin generation, enabling the creation of a personalized model from a single image to maintain identity consistency across various poses and artistic styles. It utilizes identity fusion and similarity sorting to balance facial accuracy with stylized visual effects. The toolkit covers a
ai-toolkit is a diffusion model training toolkit designed for fine-tuning image and video generation models. It functions as a containerized model trainer and GPU training job manager, providing the infrastructure to orchestrate dependencies and manage training processes on remote GPU hardware. The system utilizes low-rank adaptation techniques, including LoRA and LoKr weight optimization, to reduce the hardware requirements for model training. It distinguishes itself through a web-based training controller that allows for the monitoring and modification of hyperparameters, secured by token-b
OmniGen is a unified image generation model and diffusion framework that processes text, images, and vision tasks through a single system. It functions as a multimodal diffusion framework that treats diverse vision operations as unified image synthesis problems using shared model weights, removing the need for external adapter modules. The system supports subject-driven image generation to preserve the identity of objects from reference photos and allows for multi-reference image synthesis. It also operates as an instruction-based image editor, modifying visual content through natural languag
HunyuanDiT is a bilingual text-to-image generative model and diffusion transformer image generator. It uses a latent diffusion system to synthesize high-resolution images from text prompts, with a specific focus on understanding and generating content from both Chinese and English language descriptions. The project features a multi-resolution transformer architecture and a bilingual embedding space to map different scripts into a shared semantic area. It supports iterative multi-turn image refinement, which translates conversational dialogue into updated prompts to progressively modify visual
Dream Textures is a Stable Diffusion integration for Blender that provides tools for text-to-image generation, depth projection, and node-based processing within a 3D environment. It functions as an AI texture generator capable of producing image textures and concept art from text prompts and scene renders. The system features a depth-to-image projection tool that maps generated imagery onto 3D models using depth data for spatial alignment. It also includes a node-based AI image processor for creating procedural visual effects and a dedicated toolset for AI-assisted inpainting and outpainting
This project is an integrated software framework designed to facilitate generative image synthesis and high-performance model inference on Intel processor and graphics hardware. It provides a specialized inference engine that executes latent diffusion models to transform natural language descriptions into visual outputs. The library distinguishes itself by leveraging the OpenVINO toolkit to optimize machine learning models for specific Intel hardware architectures. By utilizing kernel-level hardware acceleration and static graph optimization, the framework improves execution throughput and re
tensorrtx is a computer vision inference engine and model implementation library designed for graphics processor acceleration. It provides a framework for optimizing deep learning models through a GPU inference optimizer, a deep learning model converter for transforming weights from frameworks like TensorFlow and PyTorch, and a custom plugin library to implement operations not natively supported by the TensorRT API. The project distinguishes itself through a comprehensive collection of pre-defined network implementations, ranging from various YOLO versions and DETR transformers for object det
Flux is a diffusion model inference engine designed for text-to-image generation and image-to-image manipulation. It provides a system for executing open-weight models to transform natural language descriptions into visual imagery or to modify existing images. The project distinguishes itself through a flow-matching framework for image generation and a structural image controller. This controller allows for guided synthesis by using depth maps and Canny edge detection to constrain the geometry and composition of the output. The toolkit covers a broad range of image editing capabilities, incl
MiniCPM is a collection of small language models designed for local, on-device deployment in resource-constrained environments. The project focuses on running dense Transformer models on consumer hardware, including GPUs, CPUs, and Apple Silicon, without requiring custom code forks. The project distinguishes itself through heavy optimization for edge hardware, utilizing quantized weight compression in GGUF and MLX formats to reduce memory overhead. It implements advanced inference techniques such as speculative sampling and radix-tree prefix caching to accelerate generation speed and throughp
This project is a comprehensive instructional resource and course for building neural networks using PyTorch. It covers the fundamental building blocks of deep learning, including tensor manipulation, automatic differentiation, and the construction of modular neural network components. The repository serves as a technical guide for several specialized domains. It provides implementation details for computer vision tasks such as image classification, object detection, and semantic segmentation, as well as natural language processing workflows involving transformers, recurrent networks, and gen
SGLang is a high-performance inference engine and serving system designed for large language and multimodal models. It provides a programmable interface for orchestrating complex generation workflows, enabling developers to coordinate multi-turn dialogues, tool invocations, and reasoning chains through a domain-specific language. The platform is built to support production-scale deployments, offering an OpenAI-compatible API that allows for integration with existing application ecosystems. The system distinguishes itself through a disaggregated architecture that separates compute-intensive pr
Intel XPU LLM Acceleration Library is a toolkit designed to accelerate large language model inference and finetuning on Intel CPUs, GPUs, and NPUs. It provides a distributed inference engine for scaling models across multiple accelerators, a multimodal model runtime for vision and speech tasks, and a low-bit model quantization tool for converting weights into INT4, FP8, and GGUF formats. The project features a parameter-efficient finetuning framework that enables model adaptation using QLoRA and DPO on Intel hardware. It distinguishes itself by providing specialized optimizations for Intel XP
mmocr is a PyTorch-based optical character recognition framework designed for training and deploying text detection, recognition, and key information extraction models. It serves as a comprehensive toolbox for scene text detection and recognition, providing specialized libraries for locating text regions and converting visual text into machine-encoded strings. The project distinguishes itself through a research framework for key information extraction and advanced text spotting capabilities. These include point-based spotting using transformers and the use of parameterized Bezier curves to id
This is a framework for training and sampling diffusion models to generate high-fidelity images, video, and 4D assets. It provides a modular environment for managing generative AI training pipelines, including the handling of datasets, noise sampling, and loss weighting to stabilize the creation of synthetic content. The project features a modular model configuration system that uses YAML-based assembly to define network submodules and conditioners. It also includes a dedicated toolset for AI image watermarking, allowing for the embedding and detection of invisible markers to verify the origi
This project is a framework for running Stable Diffusion image generation models on Apple Silicon using Core ML hardware acceleration. It provides a local generative AI pipeline for producing images from text prompts using Swift and Python without relying on external cloud APIs. The system includes a model converter to transform deep learning checkpoints into Core ML formats and a model optimizer to quantize weights and activations. It features a ControlNet integration layer to guide image generation using external signals such as edge and depth maps. Capabilities cover text-to-image generat
This repository is a collection of node-based pipeline configurations, examples, and templates for generating AI media. It provides a workflow library and a curated gallery of blueprints designed for creating images, videos, and 3D assets using diffusion models. The project specifically offers a set of pre-configured node graphs for implementing advanced image generation and refinement techniques, with a focus on Stable Diffusion workflows. These examples demonstrate how to interconnect processing nodes to define complex generative logic without writing code. The available templates cover a
IF is a text-to-image diffusion system that translates natural language descriptions into visual imagery. The project provides a generative pipeline for creating images, an inpainting tool for modifying specific image sections, and a super-resolution upscaler to increase pixel density and clarity. The system includes a concept fine-tuning framework that allows for the teaching of new visual concepts by updating a small set of parameters. It also supports image style transfer to apply the aesthetic characteristics of a reference image to a new output.
ComfyUI-GGUF is a memory optimizer and model loader for ComfyUI that enables the execution of large transformer-based generative models using quantized weights. It provides a system for loading GGUF formatted weights within a node-based diffusion interface to reduce GPU memory consumption. The project includes a quantization tool for converting standard model checkpoints into compressed binary formats and a tensor fixer to restore missing keys and correct architectures in binary model files. These utilities ensure that compressed models remain functional during inference on hardware with limi
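To make the idea of compressing weights into a quantized form more concrete, here is a minimal sketch of block-wise 8-bit quantization in plain Python. It is not the project's code and does not reproduce the GGUF file layout. The block size of 32, the symmetric absolute-maximum scaling, and all function names are illustrative assumptions.

```python
from typing import List, Tuple

BLOCK_SIZE = 32  # assumed block length, chosen for illustration

Block = Tuple[float, List[int]]  # (scale, signed 8-bit codes)


def quantize_block(values: List[float]) -> Block:
    """Map one block of floats to a scale plus integers in [-127, 127]."""
    amax = max((abs(v) for v in values), default=0.0)
    scale = amax / 127.0
    if scale == 0.0:
        # An all-zero block needs no scaling.
        return 0.0, [0] * len(values)
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return scale, codes


def quantize(weights: List[float], block_size: int = BLOCK_SIZE) -> List[Block]:
    """Split a flat weight list into blocks and quantize each independently."""
    return [
        quantize_block(weights[i:i + block_size])
        for i in range(0, len(weights), block_size)
    ]


def dequantize(blocks: List[Block]) -> List[float]:
    """Reconstruct approximate floats from the stored scales and codes."""
    restored: List[float] = []
    for scale, codes in blocks:
        restored.extend(scale * c for c in codes)
    return restored


# Hypothetical weights: 40 values, so the last block is shorter than 32.
weights = [i / 10.0 - 2.0 for i in range(40)]
blocks = quantize(weights)
restored = dequantize(blocks)
```

Each block stores one float scale and one small integer per weight, which is where the memory saving comes from. The reconstruction error per value is bounded by about half of that block's scale.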
Sygil-webui is a web interface for Stable Diffusion models, providing a creative suite for text-to-image and text-to-video synthesis. It functions as an image generation tool and a latent diffusion image editor, allowing users to create visuals and video sequences from textual descriptions. The project includes a dedicated model training interface for creating custom textual inversion embeddings, which introduce specific new concepts or styles into the diffusion models. It also features specialized tools for generative image editing, including mask-based inpainting, image-to