MMagic

MMagic is a multimodal training framework for generative AI, focused on visual synthesis and restoration. It provides the infrastructure to build and train models for text-to-image and text-to-video generation, 3D-aware content synthesis, and high-fidelity image translation, using diffusion models and generative adversarial networks (GANs).

The project distinguishes itself through specialized capabilities for generative model personalization, including techniques for fine-tuning subjects and styles. It also supports advanced visual manipulations such as latent space interpolation, point-based image editing, and stable animation generation to reduce flickering in video sequences.

The framework covers a broad range of image processing and restoration domains, including denoising, super-resolution, inpainting, and foreground matting. Its architectural surface includes a registry-based component system for custom model and loss definitions, as well as comprehensive data pipelines for multimodal asset loading and augmentation.
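The registry-based component system can be pictured with a small sketch. The Registry class, the LOSSES registry, and the L1Loss class below are hypothetical names written in plain Python for illustration; they are not MMagic's actual API, only an assumption about how a name-to-class lookup driven by configuration might work.

```python
class Registry:
    """Toy registry mapping string names to classes (illustrative only)."""

    def __init__(self, name):
        self.name = name
        self._modules = {}

    def register_module(self, cls=None, *, name=None):
        """Register a class, usable as @reg.register_module()."""
        def _register(target):
            key = name or target.__name__
            if key in self._modules:
                raise KeyError(f"{key} is already registered in {self.name}")
            self._modules[key] = target
            return target
        return _register(cls) if cls is not None else _register

    def build(self, cfg):
        """Instantiate the class named by cfg['type'] with the other keys."""
        args = dict(cfg)  # copy so the caller's config is not mutated
        type_name = args.pop("type")
        if type_name not in self._modules:
            raise KeyError(f"{type_name} is not registered in {self.name}")
        return self._modules[type_name](**args)


LOSSES = Registry("loss")  # hypothetical registry for loss modules


@LOSSES.register_module()
class L1Loss:
    """Mean absolute error over two equal-length sequences of floats."""

    def __init__(self, loss_weight=1.0):
        self.loss_weight = loss_weight

    def __call__(self, pred, target):
        diffs = [abs(p - t) for p, t in zip(pred, target)]
        return self.loss_weight * sum(diffs) / len(diffs)


# A config dictionary selects the component by name.
loss_fn = LOSSES.build({"type": "L1Loss", "loss_weight": 0.5})
value = loss_fn([1.0, 2.0, 3.0], [1.0, 1.0, 5.0])
```

The point of the design is that the configuration names the component through its type key, so swapping a loss or a model block means editing configuration rather than code.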

Training and deployment are supported through distributed data-parallel execution, mixed-precision optimization, and conversion of trained models to inference backends for hardware accelerators.
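Mixed-precision training is commonly paired with loss scaling, so that small half-precision gradients do not underflow. The sketch below shows the general idea of dynamic loss scaling in plain Python. DynamicLossScaler and train_step are hypothetical names, the gradients are made-up numbers, and nothing here is taken from MMagic's own wrapper.

```python
import math


class DynamicLossScaler:
    """Toy dynamic loss scaler (illustrative; not MMagic's implementation)."""

    def __init__(self, init_scale=2.0 ** 16, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._good_steps = 0

    def unscale_and_check(self, scaled_grads):
        """Return unscaled gradients, or None if any value overflowed."""
        if not all(math.isfinite(g) for g in scaled_grads):
            return None
        return [g / self.scale for g in scaled_grads]

    def update(self, found_overflow):
        """Shrink the scale after an overflow, grow it after a clean streak."""
        if found_overflow:
            self.scale *= self.backoff_factor
            self._good_steps = 0
        else:
            self._good_steps += 1
            if self._good_steps >= self.growth_interval:
                self.scale *= self.growth_factor
                self._good_steps = 0


def train_step(scaler, weight, scaled_grad, lr=0.1):
    """Apply one SGD update, skipping it when the gradient overflowed."""
    grads = scaler.unscale_and_check([scaled_grad])
    scaler.update(found_overflow=grads is None)
    if grads is None:
        return weight  # skip the parameter update for this step
    return weight - lr * grads[0]


scaler = DynamicLossScaler(init_scale=1024.0, growth_interval=2)
weight = 1.0
true_grad = 0.5  # pretend d(loss)/d(weight) is constant

weight = train_step(scaler, weight, true_grad * scaler.scale)  # normal step
weight = train_step(scaler, weight, float("inf"))              # overflow
scale_after_overflow = scaler.scale
weight = train_step(scaler, weight, true_grad * scaler.scale)  # clean step
weight = train_step(scaler, weight, true_grad * scaler.scale)  # scale grows
```

Skipping the update on overflow keeps a single bad step from corrupting the weights, at the cost of one lost iteration.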

Features

Image and Video Restoration Suites - Implements a comprehensive framework for improving visual fidelity through upscaling, denoising, and deblurring.
Data Processing Pipelines - Provides composable data processing pipelines for loading, rescaling, and augmenting multimodal assets for machine learning.
Image Inpainting - Restores corrupted image regions by synthesizing new pixels that blend with the surrounding context.
Data-Parallel Training - Implements distributed-data-parallel training to scale generative model workloads across multiple GPUs and compute nodes.
Distributed Training - Implements distributed data parallel patterns to execute training for dynamic architectures across multiple compute nodes.
Resolution Upscalers - Increases the resolution of images by synthesizing high-frequency details for classical and real-world scenarios.
Text-to-Image Generators - Generates high-resolution imagery from natural language text descriptions using diffusion-based models.
Text-to-Video Generators - Synthesizes animated video sequences from text prompts using personalized diffusion models.
Generative Model Fine-Tuning - Provides a modular framework for fine-tuning generative models using techniques like DreamBooth and Textual Inversion.
Generative Model Training Tools - Provides a modular framework for training and fine-tuning generative architectures on custom datasets.
Multi-Node Training Scaling - Supports scaling distributed training across multiple compute nodes and GPUs using IP-based communication or Slurm.
Image Data Preprocessing - Provides essential preprocessing utilities to normalize pixel values and apply padding to image tensors for model inputs.
Image Editing - Provides tools for visual content modification, including inpainting, colorization, and foreground matting.
Generative Objective Functions - Computes adversarial, perceptual, and pixel-wise objective functions to optimize generative AI models.
Mixed Precision Training - Includes a mixed-precision training wrapper that optimizes memory and speed by using half-precision tensors and gradient scaling.
Distributed Training - Distributes complex generative model training workloads across multiple GPUs or compute nodes.
Model Training and Inference Engines - Provides a unified implementation for forward passes, loss calculations, and training/validation/testing cycles.
Multimodal Data Preprocessing Utilities - Provides utilities to transform and batch raw multimodal datasets before moving them to target hardware.
Model Architectures - Creates custom neural network structures by inheriting from base module classes and registering them.
Multimodal Content Generation - Synthesizes high-resolution visual content from mixed-modality prompts using diffusion models and GANs.
Multimodal Training - Builds data processing workflows and distributed training pipelines specifically for multimodal generative AI.
Multimodal Tensor Formatting - Converts processed data dictionaries into packed tensors for efficient model forward method execution.
Data Standardization - Unifies disparate data and metadata into a standardized interface to simplify information flow between multimodal models.
Batched Data Loading - Stacks multiple multimodal data samples into single batches using tensor-like operations for efficient training.
Data Pipelines - Constructs composable data pipelines that transform raw images through sequential loading, rescaling, and cropping.
Vision - Loads image data and optional annotations required for high-fidelity restoration and inpainting tasks.
Multimodal Data Loading - Implements the ingestion of images and video frames from files into structured samples for multimodal training.
Image Denoising - Removes noise and artifacts from both grayscale and color images to improve visual quality.
Dataset Composition - Implements workflows to merge foreground objects with backgrounds and generate annotation files for matting datasets.
Restoration Quality Metrics - Calculates quantitative fidelity metrics including PSNR, SSIM, and NIQE to evaluate image and video restoration quality.
AI Foreground Isolation - Separates foreground objects from backgrounds by estimating precise alpha mattes for transparency.
Video Restoration Tools - Provides specialized directory organization for low-quality video frames to train restoration models.
Mask and Trimap Synthesis - Generates binary or soft masks and trimaps from alpha mattes for foreground matting tasks.
Image Blur Removal - Fixes blurring caused by out-of-focus elements to sharpen edges and restore visual structures.
Execution Hooks - Allows attaching custom operations to training loops for managing tasks like checkpointing and logging.
Preprocessing Optimizations - Generates downsampled images and LMDB databases from video sequences to accelerate training data access.
Denoising Model Trainers - Organizes image restoration and denoising data into structured directories to facilitate model training and evaluation.
Distributed Model Execution - Executes model testing and inference across single or multiple GPUs to reduce overall evaluation time.
Encoder-Decoder Architectures - Constructs neural network architectures using modular blocks like ResNet and gated convolutions.
GAN Implementations - Implements data preparation pipelines for unconditional GANs, including tensor conversion and image flipping.
GAN Training Loops - Manages independent optimizers and alternating schedules for generators and discriminators.
Inference Acceleration - Accelerates diffusion model sampling by merging redundant tokens in the vision transformer.
Image-to-Image Translation - Implements algorithms to transform images from one visual domain to another based on target styles or prompts.
Generative Distribution Assessments - Implements generative quality evaluation using Fréchet Inception Distance (FID) and Inception Score to compare real and synthetic datasets.
Generative Diversity Measurements - Analyzes the variety and smoothness of generative outputs using Perceptual Path Length and MS-SSIM.
Noise-to-Image Generation - Samples new images from random noise using both unconditional and conditional generative models.
Generative Model Accuracy Metrics - Quantifies model performance and accuracy using generative-specific metrics like FID and Precision & Recall.
Automatic Precision Casting - Optimizes memory and training speed by automatically casting the forward process to half-precision.
Video Sequence Preprocessing - Performs temporal mirroring and frame reversal to prepare video sequences for generative model training.
Conditional Image Generation - Produces synthetic images guided by discrete labels and specific input conditions.
Perspective Simulation - Produces high-resolution images that simulate 3D perspectives by interpolating camera positions and style codes.
Structural Conditioning - Adds conditional constraints to the generation process to dictate specific structural and visual layouts.
Unconditional Generation - Creates realistic images from random noise using various Generative Adversarial Network architectures.
Dataset Preparation - Organizes raw image data into directory structures to enable training for blind image super-resolution.
Inference Model Deployment - Transforms trained generative models into optimized formats like ONNX and TensorRT for hardware accelerators.
Learning Rate Schedulers - Dynamically adjusts learning rates and decay factors based on performance metrics to improve convergence.
Loss Function Implementations - Defines new loss modules by wrapping functional implementations in classes and registering them via configuration.
Hybrid Loss and Architecture Integration - Implements the functional merging of primary and auxiliary loss modules to improve training stability.
Low-Rank Adaptation - Integrates LoRA layers into modules to enable efficient parameter fine-tuning of generative models.
Model Complexity Analysis - Evaluates the computational cost of neural networks by calculating parameter counts and floating-point operations (FLOPs).
Model Output Visualizers - Generates and saves sample images or GIFs during training to visualize model progress.
Model Training Optimizers - Manages parameter updates, gradient zeroing, and backward passes through a specialized optimization wrapper.
Training Resumption - Restores model states from specific checkpoints to continue training from previous iterations.
Weight Smoothing - Provides exponential moving average weight smoothing to improve convergence and stability during model training.
Parameter Weight Smoothing - Implements exponential moving average weight maintenance to ensure training stability and better convergence.
Multi-Dataset Performance Benchmarking - Calculates multiple quality metrics across several datasets simultaneously to evaluate model performance.
NPU Accelerators - Supports training execution on Huawei Ascend NPU hardware using single or multi-device configurations.
Submodule Optimizers - Constructs separate optimizers for different model submodules to enable independent learning rates.
Parameter Optimization Strategies - Manages hyper-parameter adjustments via defined optimizers, gradient clipping, and mixed precision strategies.
Ground Truth Comparisons - Generates side-by-side visual comparisons of input images, ground truth, and model predictions.
Gradient Accumulation Strategies - Simulates larger batch sizes by aggregating gradients over multiple iterations before parameter updates.
Image Augmentations - Implements random geometric and color transformations to increase dataset variety and model robustness.
Restoration Dataset Preparation - Organizes raw image pairs into structured directory formats for training image and video restoration models.
Training Loop Schedulers - Defines optimization processes, learning rate schedules, and iteration limits for model training loops.
Unpaired Image Translation - Implements data loading logic to sample random image pairs from separate domains for unsupervised translation.
Dataset Preparations - Organizes image collections into folder structures required for unpaired image-to-image translation training.
Vision Data Loaders - Provides configurable vision data loaders to manage dataset sampling, batch sizes, and worker counts.
3D and Spatial Synthesis - Creates 3D-aware generative visuals to produce spatial representations from 2D inputs.
Training Data Pipelines - Provides pipelines that load, normalize, and format multimodal data for training on GPU hardware.
Data Transformation Registrations - Supports registering user-defined data transformation functions into a pipeline registry for modular processing sequences.
Paired Image Dataset Preparation - Organizes image pairs into specific directory structures to facilitate supervised image-to-image translation.
Paired Image Loaders - Loads concatenated image pairs from directories for use in image-to-image translation models.
LMDB Dataset Converters - Converts raw image datasets into LMDB memory-mapped databases to accelerate I/O performance during high-throughput training.
Image Dataset Format Converters - Provides preprocessing scripts to crop, resize, and reformat raw image data for training and testing.
Multi-Backend Inference Executions - Executes generated models across various backends or via SDKs to produce visual synthesis results.
AI Image Masking - Creates bounding boxes and irregular masks to isolate specific image regions for processing.
Temporal Frame Interpolation - Generates intermediate frames between existing ones to increase video playback smoothness and frame rate.
Temporal Consistency Optimization - Produces consistent video by applying multi-frame rendering with diffusion models to reduce flickering.
Image Enhancement Tools - Improves image clarity and detail by removing artifacts and applying adaptive lookup tables.
Image Noise Reduction - Cleans digital images by eliminating Gaussian noise to improve overall visual clarity.
Degradation Simulation Pipelines - Simulates real-world image quality loss by applying random blur and compression to create restoration training pairs.
Pixel Value Normalization - Standardizes image inputs by adjusting pixel values using mean and standard deviation constants.
JPEG Artifact Reduction - Eliminates visual distortions caused by JPEG compression in grayscale and color images.
Alpha Matting - Provides alpha matting capabilities to separate foreground objects from backgrounds with smooth transparency transitions.
Matting Accuracy Metrics - Measures alpha matte prediction accuracy using specialized metrics such as SAD and MattingMSE.
High-Fidelity Synthesis - Generates natural, high-resolution images from conditional inputs using generative adversarial networks.
Rain Streak Removal - Eliminates rain streaks from images to restore original clarity and visibility.
Real-World Noise Suppression - Suppresses complex noise found in authentic digital photographs to improve image quality.
Video Frame Loading - Loads sequences of video frames and annotations for super-resolution and frame interpolation tasks.
Video Upscaling Pipelines - Increases video resolution by reconstructing high-frequency details using temporal alignment.
Registry-Based Extensibility - Implements a registry-based system allowing custom data transforms and model modules to be loaded via configuration strings.
Model Architecture Configurations - Specifies network architectures and loss functions using configuration files to instantiate models without code changes.
ML Experiment Logging - Records training scalars, images, and configuration data to TensorBoard and WandB for real-time monitoring.
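Of the restoration quality metrics named in the list above, PSNR is the simplest to state: it is 10 * log10(MAX^2 / MSE), where MSE is the mean squared error between the reference and the restored image. The sketch below computes it over flat lists of pixel values. The function name psnr and the max_value parameter are assumptions made for this example; toolbox implementations typically add steps such as border cropping or colour-space conversion that are omitted here.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB for two equal-length pixel lists."""
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images: no error to measure
    return 10.0 * math.log10(max_value ** 2 / mse)


# Made-up 8-bit pixel values, for illustration only.
reference_pixels = [0, 255, 128, 64]
restored_pixels = [0, 250, 130, 60]
score = psnr(reference_pixels, restored_pixels)
```

Higher values indicate a restored image closer to the reference, and identical inputs yield infinity.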

XPixelGroup/BasicSR

8,297 View on GitHub

BasicSR is a PyTorch-based image restoration toolbox and framework designed for training and deploying deep learning models to upscale, denoise, and deblur images and videos. It serves as a comprehensive system for image super-resolution and video quality restoration, providing the necessary infrastructure to recover fine visual details and increase pixel density. The project distinguishes itself through specialized toolkits for facial image enhancement and high-fidelity face synthesis, as well as a dedicated video quality restoration suite that utilizes deformable convolutions and generative

hiyouga/EasyR1

5,034 View on GitHub

EasyR1 is a distributed model training system and reinforcement learning framework for large language and vision-language models. It functions as a multimodal trainer and an implementation of a Proximal Policy Optimization pipeline designed to refine the reasoning and perception capabilities of models that process both text and images. The system specializes in distributing reinforcement learning workloads across multiple compute nodes to manage high memory requirements. It optimizes hardware utilization through padding-free training and fine-tuning to fit large models onto available graphics

zhaochenyang20/Awesome-ML-SYS-Tutorial

5,371 View on GitHub

This project provides a comprehensive technical guide and framework for engineering large-scale machine learning systems. It covers the full lifecycle of model development, focusing on the infrastructure and computational principles required to build, train, and serve generative AI models across distributed GPU clusters. The repository distinguishes itself by offering deep-dive tutorials and implementation strategies for complex system challenges. It emphasizes high-performance architectural primitives, such as collective communication orchestration, distributed tensor sharding, and static gr

Sygil-Dev/sygil-webui

7,879 View on GitHub

Sygil-webui is a web interface for Stable Diffusion latent diffusion models, providing a creative suite for text-to-image and text-to-video synthesis. It functions as an image generation tool and a latent diffusion image editor, allowing users to create visuals and video sequences from textual descriptions. The project includes a dedicated model training interface for creating custom textual inversion embeddings, which introduces specific new concepts or styles into the diffusion models. It also features specialized tools for generative image editing, including mask-based inpainting, image-to

open-mmlab/mmagic