StoryDiffusion

StoryDiffusion is a generative AI system for producing images and videos with consistent characters. It uses a pluggable cross-attention module to inject shared character representations into pretrained diffusion models, which keeps a character's visual identity stable across multiple images and scenes without retraining the base model.
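To make the injection idea concrete, here is a minimal sketch in plain Python, not the project's actual module. It assumes the shared character representation is a small set of token vectors that are appended to each image's keys and values before attention. The names `attend_with_shared_tokens`, `shared`, `scene_a`, and `scene_b` are hypothetical, and the learned query/key/value projections of a real attention layer are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]
        )
    return outputs


def attend_with_shared_tokens(image_tokens, shared_tokens):
    """Hypothetical injection step (an assumption, not the project's API):
    every image attends to its own tokens plus the same shared character tokens."""
    context = image_tokens + shared_tokens
    return attend(image_tokens, context, context)


# Toy data: two scenes with different tokens see the same shared character tokens.
shared = [[1.0, 0.0], [0.0, 1.0]]
scene_a = [[0.5, 0.5], [0.2, 0.8]]
scene_b = [[0.9, 0.1]]

out_a = attend_with_shared_tokens(scene_a, shared)
out_b = attend_with_shared_tokens(scene_b, shared)
```

Because the base model's weights are untouched and only the attention context changes, a step like this can in principle be added to a pretrained model without retraining, which is the property the description above emphasizes.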

The project features a video generation pipeline that produces temporally coherent sequences from text prompts or condition images. Its motion interpolator operates in the compressed latent space of a variational autoencoder to predict intermediate frames and semantic motion, which enables long-range video generation and larger motion transitions.
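The sketch below illustrates the general shape of predicting intermediate frames in a latent space. It is an assumption-laden stand-in: plain linear interpolation replaces the learned motion predictor, and latents are flat lists of floats rather than encoder outputs. The function names `interpolate_latents` and `build_sequence` are hypothetical.

```python
def interpolate_latents(start, end, num_intermediate):
    """Return `num_intermediate` latents evenly spaced strictly between two latents.

    Linear interpolation is a placeholder for a learned predictor (assumption).
    """
    if len(start) != len(end):
        raise ValueError("latents must have the same length")
    if num_intermediate < 0:
        raise ValueError("num_intermediate must be non-negative")
    frames = []
    for step in range(1, num_intermediate + 1):
        t = step / (num_intermediate + 1)
        frames.append([(1 - t) * s + t * e for s, e in zip(start, end)])
    return frames


def build_sequence(keyframe_latents, num_intermediate):
    """Chain interpolation across consecutive keyframes to form a longer sequence."""
    if not keyframe_latents:
        return []
    sequence = [list(keyframe_latents[0])]
    for current, following in zip(keyframe_latents, keyframe_latents[1:]):
        sequence.extend(interpolate_latents(current, following, num_intermediate))
        sequence.append(list(following))
    return sequence


# Toy keyframes standing in for encoded condition images.
keyframes = [[0.0, 0.0], [4.0, 8.0], [8.0, 0.0]]
sequence = build_sequence(keyframes, num_intermediate=3)
```

In a real pipeline the keyframe latents would come from the autoencoder's encoder and each latent in the sequence would be decoded back to a frame. Working on compressed latents rather than pixels keeps each interpolation step small.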

The system includes capabilities for AI comic creation and a text-to-video pipeline. To support hardware accessibility, it implements precision-reduced model serving and low-memory inference to run the full generation pipeline on consumer GPUs.
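As a rough illustration of why reduced precision helps on consumer GPUs, the following sketch estimates the memory needed for model weights at different numeric precisions. The parameter count, the 8 GiB card, and the `overhead_fraction` reserve for activations are illustrative assumptions, not measurements of StoryDiffusion.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gib(num_params, dtype):
    """Memory for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / (1024 ** 3)


def fits_in_vram(num_params, dtype, vram_gib, overhead_fraction=0.3):
    """Check whether the weights fit after reserving part of VRAM.

    `overhead_fraction` is a hypothetical reserve for activations and buffers.
    """
    budget = vram_gib * (1 - overhead_fraction)
    return weight_memory_gib(num_params, dtype) <= budget


# Illustrative model size (assumption): 2.6 billion parameters on an 8 GiB GPU.
example_params = 2_600_000_000
for dtype in BYTES_PER_PARAM:
    size = weight_memory_gib(example_params, dtype)
    print(f"{dtype}: {size:.2f} GiB, fits in 8 GiB: {fits_in_vram(example_params, dtype, 8)}")
```

Halving the bytes per parameter halves the weight footprint. Activation memory depends on batch size and resolution, so this estimate is only a lower bound on what a full pipeline needs.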

A local web dashboard provides an interactive demo for content creation.

Features

Text-to-Video Generators - Combines consistent character generation with motion prediction to synthesize high-quality, temporally coherent videos from text prompts.

Attention Layer Injectors - Provides a mechanism for injecting external control signals into the attention layers of a diffusion model to enforce identity.

Latent Motion Prediction - Forecasts intermediate frames between condition images by operating in a compressed latent space.

Cross-Frame Attention Layers - Utilizes specialized attention layers to maintain visual character consistency across sequential frames.

Video Diffusion Models - Uses a latent diffusion model to produce temporally coherent video sequences from text prompts.

Noise-to-Image Generation - Generates high-quality visuals by reversing the noise process via iterative denoising.

Image-Conditioned Video Generators - Creates videos by analyzing provided keyframe images and predicting the motion between them.

Image-to-Video Generation - Synthesizes motion sequences using keyframe images and text prompts as guidance.

Long-form Generation - Produces extended video sequences by predicting motion between a series of condition images in a compressed semantic space.

Visual Identity Consistency - Maintains consistent characters and visual identities across multiple generated images.

Video and Motion Synthesis - Analyzes motion between condition images in a compressed semantic space to enable large video transitions.

Latent Frame Interpolators - Generates intermediate video frames by interpolating semantic data within a variational autoencoder.

Generative Character Consistency - Implements methods to maintain visual continuity of character identities across multiple AI-generated scenes.

Visual Character Consistency - Ensures characters remain visually stable across different prompts and scenes using specialized attention.

Semantic Motion Interpolations - Predicts intermediate video frames by interpolating semantic representations within a compressed variational autoencoder space.

Memory-Constrained Inference - Implements techniques to run large generative models within the memory constraints of consumer GPUs.

Mixed-Precision Quantization - Reduces GPU memory footprint by converting model weights to lower numerical precision.

Consumer GPU Optimizations - Enables full generation pipelines to run on consumer GPUs by reducing batch size and model precision.

AI Comic Generation - Generates series of visually consistent images to tell stories through an interactive interface.
