30 open-source projects similar to vqassessment/dover, ranked by how many features they have in common. Compare stars, activity and what each one does to find the best DOVER alternative.
PySceneDetect is a suite of tools for identifying cuts and transitions in video files using content, threshold, and histogram detection algorithms. It functions as a scene detector, frame extractor, statistics analyzer, metadata exporter, and video scene splitter. The project identifies scene boundaries and can divide video files into smaller clips using external processing tools. It allows for the extraction of representative image frames from detected changes and the export of scene lists into industry-standard formats such as EDL, FCP, HTML, OTIO, and CSV. The toolset includes capabilitie
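As a rough illustration of the threshold-style cut detection described above, the following is a minimal sketch and not PySceneDetect's actual API. The function names (`mean_abs_diff`, `detect_cuts`, `split_scenes`), the flat-list frame representation, and the default threshold are all assumptions made for this example.

```python
from typing import List, Sequence, Tuple


def mean_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Average absolute per-pixel difference between two equally sized frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def detect_cuts(frames: List[Sequence[float]], threshold: float = 30.0) -> List[int]:
    """Return the indices of frames that start a new scene.

    A cut is reported at frame i when the difference between frame i-1 and
    frame i exceeds the (hypothetical) threshold.
    """
    cuts = []
    for i in range(1, len(frames)):
        if mean_abs_diff(frames[i - 1], frames[i]) > threshold:
            cuts.append(i)
    return cuts


def split_scenes(num_frames: int, cuts: List[int]) -> List[Tuple[int, int]]:
    """Turn cut indices into half-open (start, end) frame ranges."""
    bounds = [0, *cuts, num_frames]
    return [(bounds[i], bounds[i + 1]) for i in range(len(bounds) - 1)]


# Three dark frames followed by two bright frames: one cut at index 3.
frames = [[10, 10, 10, 10]] * 3 + [[200, 200, 200, 200]] * 2
cuts = detect_cuts(frames)
scenes = split_scenes(len(frames), cuts)
```

Real detectors typically work on decoded video frames and compare colour or histogram statistics rather than raw intensities, but the thresholding step has the same shape.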
A unified inference and post-training framework for accelerated video generation.
Open-Sora is a video generation framework designed to produce cinematic sequences from text prompts and images. It functions as a generative system that transforms written descriptions or reference images into video content featuring realistic textures and lighting. The project includes a dedicated prompt engineering tool that uses large language models to expand simple user inputs into detailed descriptions. It also features a motion controller for adjusting movement intensity in generated sequences and evaluating motion levels in existing video files. The framework incorporates text-to-vid
Diffusers is a PyTorch-based library and generative AI framework used to build, train, and deploy diffusion pipelines for producing multi-modal media. It provides a suite of tools for generating images, video, and audio from natural language descriptions, as well as specialized systems for text-to-image generation. The project differentiates itself through a modular architecture that separates noise schedulers, pretrained model blocks, and pipeline compositions. This structure allows for the construction of custom generation workflows and the ability to swap individual components of the diffu
FramePack is a neural video synthesis engine and generation framework designed to produce long, temporally consistent video sequences. It functions as a diffusion model optimizer, providing a suite of techniques to manage the computational demands of high-parameter video models while maintaining visual stability during extended generation tasks. The system distinguishes itself through a hierarchical approach to frame prediction, which plans distant anchor frames before filling in intermediate content to prevent cumulative temporal drift. By utilizing constant-length context compression and to
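The anchor-first idea mentioned above (plan distant frames, then fill in the frames between them) can be sketched as a simple ordering problem. This is only an illustration under assumptions: `generation_order` and its `stride` parameter are hypothetical names, and the sketch says nothing about how FramePack actually schedules or compresses frames.

```python
from typing import List


def generation_order(num_frames: int, stride: int) -> List[int]:
    """Return frame indices in an anchor-first order.

    Anchors are every `stride`-th frame plus the final frame; the remaining
    frames are filled in afterwards in ascending order.
    """
    if num_frames < 1 or stride < 1:
        raise ValueError("num_frames and stride must be positive")
    anchors = list(range(0, num_frames, stride))
    if anchors[-1] != num_frames - 1:
        anchors.append(num_frames - 1)  # always pin the last frame
    anchor_set = set(anchors)
    fill = [i for i in range(num_frames) if i not in anchor_set]
    return anchors + fill


order = generation_order(9, 4)
```

Because every intermediate frame is generated between two anchors that are already fixed, errors have less room to accumulate than in a strictly left-to-right pass.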
DiffSynth-Studio is a comprehensive platform for the lifecycle management of generative diffusion models, providing a unified environment for inference, fine-tuning, and training. It utilizes a modular pipeline architecture and a standardized abstraction layer to support consistent workflows across diverse model configurations for image and video generation. The platform distinguishes itself through a memory-optimized inference engine that dynamically manages resources to facilitate high-resolution generation on constrained hardware. It also integrates specialized training capabilities, inclu
2026.06.01: Released LongLive-RAG, a general retrieval-augmented framework for long video generation. 2026.05.30: LongLive2.0 now supports I2V AR teacher-forcing training and I2V DMD distillation for Wan2.2-TI2V-5B. 2026.05.25: Optimized the NVFP4 inference path with fused Triton…
Open-Sora-Plan is a text-to-video framework and distributed video training system. It utilizes a diffusion transformer architecture and large language model components to transform written descriptions or image prompts into high-quality video sequences. The system features a distributed infrastructure designed for large-scale video training and inference. It employs sequence parallelism to split high-resolution or long-duration video samples across multiple GPUs and uses a sparse attention mechanism to increase processing speed. The project includes capabilities for both text-to-video and im
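To make the sequence-parallelism idea above concrete, here is a minimal sketch of splitting one long token sequence into contiguous shards, one per GPU. The function name `shard_sequence` and the even-split policy are assumptions for illustration; the project's real partitioning and communication scheme is not described in the text above.

```python
from typing import List, Tuple


def shard_sequence(seq_len: int, world_size: int) -> List[Tuple[int, int]]:
    """Split `seq_len` tokens into `world_size` contiguous half-open ranges.

    Shard sizes differ by at most one token; earlier ranks receive the
    extra tokens when the length does not divide evenly.
    """
    base, rem = divmod(seq_len, world_size)
    shards = []
    start = 0
    for rank in range(world_size):
        size = base + (1 if rank < rem else 0)
        shards.append((start, start + size))
        start += size
    return shards


# Ten tokens over four devices.
shards = shard_sequence(10, 4)
```

Each device then holds only its slice of the sequence, so activation memory per GPU shrinks roughly in proportion to the number of devices, at the cost of extra communication during attention.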
MAGI-1: Autoregressive Video Generation at Scale
SkyReels-V2 is a video generation system that creates, extends, and refines video clips from text descriptions, images, or both. It operates as a diffusion-based video generation model that can produce videos of any duration by denoising frames sequentially, with each new frame conditioned on the ones that came before it. The system supports generating videos from scratch using text prompts, starting from a single image and producing subsequent frames, or constraining both the first and last frames to match user-provided images. What distinguishes SkyReels-V2 is its combination of infinite-le
A state-of-the-art open-source image editing model that aims for performance comparable to closed-source models such as GPT-4o and Gemini 2 Flash.
A pipeline parallel training script for diffusion models.
HunyuanVideo-1.5 is a video generation foundation model and text-to-video diffusion framework. It utilizes a latent video diffusion model and a spatio-temporal transformer architecture to generate high-definition video sequences from text descriptions and images. The project enables cinematic camera control for directing pans and tilts and provides image-to-video animation capabilities. It supports visual style adaptation through low-rank adaptation tuning and uses a language model for prompt refinement to improve visual alignment. The model covers high-resolution video upscaling via a super
TurboDiffusion is a video diffusion inference engine and generator designed to create high-resolution videos from text prompts and images. It provides a runtime environment for executing optimized diffusion model checkpoints with a focus on reducing latency and GPU memory usage. The project features a specialized training framework for aligning sparse-linear attention models with pretrained full-attention models. This system includes capabilities for sparse attention parameter merging and sparse-linear model alignment to reduce computational costs during inference while maintaining output qua
CogVideo is a generative video framework that uses diffusion models and transformer-based architectures to synthesize high-resolution video clips. It functions as both a text-to-video and image-to-video generator, converting textual descriptions or static images into temporal visual sequences. The system integrates large language model capabilities to expand short user prompts into detailed descriptions for better visual alignment. It supports the animation of static images through latent seeding and provides the ability to extend the length of existing video sequences. The project includes
Wan2.1 is a generative video synthesis framework that provides foundation models for creating high-fidelity video sequences and static images from descriptive text prompts. The system utilizes a unified architecture trained on both static and dynamic datasets, allowing it to function as a comprehensive tool for visual media creation. The framework distinguishes itself through a transformer-based temporal modeling approach that ensures structural coherence and consistent motion across video frames. It supports multi-resolution latent scaling, enabling the generation of content in various aspec