These open-source tools automate video production by transforming text prompts and static images into dynamic animations.
CogVideo is a generative video framework that uses diffusion models and transformer-based architectures to synthesize high-resolution video clips. It functions as both a text-to-video and image-to-video generator, converting textual descriptions or static images into temporal visual sequences. The system integrates large language model capabilities to expand short user prompts into detailed descriptions for better visual alignment. It supports the animation of static images through latent seeding and provides the ability to extend the length of existing video sequences. The project includes
CogVideo is a generative video framework that uses diffusion models and transformers for both text-to-video and image-to-video generation, with support for prompt expansion, temporal sequence extension, and model fine-tuning—making it a comprehensive match for this search.
CogVideo is a video generation framework and large language model architecture designed for synthesizing high-resolution video clips from natural language descriptions and images. It functions as a text-to-video and image-to-video generator, while also providing a model for video captioning to analyze visual content into descriptive text summaries. The system supports animating static images into motion sequences and transforming series of images into video based on prompts. It includes capabilities for extending the length of generated video clips to create longer sequences of motion. The f
CogVideo is a video generation framework built on diffusion models that takes both text and image inputs to produce video clips, with support for fine-tuning and for extending clip length, which is the kind of tool this search is after.
DiffSynth-Studio is a comprehensive platform for the lifecycle management of generative diffusion models, providing a unified environment for inference, fine-tuning, and training. It utilizes a modular pipeline architecture and a standardized abstraction layer to support consistent workflows across diverse model configurations for image and video generation. The platform distinguishes itself through a memory-optimized inference engine that dynamically manages resources to facilitate high-resolution generation on constrained hardware. It also integrates specialized training capabilities, inclu
DiffSynth-Studio is a comprehensive platform for managing generative diffusion models that supports both image and video generation, inference, and fine-tuning, directly matching the need for an open-source tool to create video content from text descriptions and images.
StoryDiffusion is a generative AI system designed for consistent character image and video generation. It utilizes a pluggable cross-attention module to inject shared character representations into pretrained diffusion models, allowing for visual identity stability across multiple images and scenes without retraining the base model. The project features a video generation pipeline that produces temporally coherent sequences from text prompts or condition images. It employs a latent space motion interpolator to predict intermediate frames and semantic motion, enabling long-range video generati
StoryDiffusion is a generative AI system that produces temporally coherent video sequences from both text prompts and condition images using diffusion models and latent motion interpolation, directly matching your need for an open-source text-to-video and image-to-video generation tool.
VideoCrafter is a latent diffusion model for AI video synthesis. It works as both a text-to-video and an image-to-video system, synthesizing high-quality video sequences from descriptive text prompts or static images. The model uses a diffusion-based neural network to turn these inputs into animated content while keeping the generated sequences visually consistent and temporally coherent. This makes it suitable for creating custom video clips and for animating static images into fluid motion.
VideoCrafter is a latent diffusion model that generates video directly from text prompts and static images and produces temporally coherent sequences, which makes it the kind of AI video generation tool you are looking for. It may not cover every advanced feature, such as fine-tuning or explicit motion control.
Diffusers is a PyTorch-based library and generative AI framework used to build, train, and deploy diffusion pipelines for producing multi-modal media. It provides a suite of tools for generating images, video, and audio from natural language descriptions, as well as specialized systems for text-to-image generation. The project differentiates itself through a modular architecture that separates noise schedulers, pretrained model blocks, and pipeline compositions. This structure allows for the construction of custom generation workflows and the ability to swap individual components of the diffu
Diffusers is a comprehensive PyTorch library for building and deploying diffusion models, with dedicated pipelines for text-to-video and image-to-video generation, support for fine-tuning and custom training, and a modular cross-platform design that matches your need for an AI video generation framework.
SkyReels-V2 is a video generation system that creates, extends, and refines video clips from text descriptions, images, or both. It operates as a diffusion-based video generation model that can produce videos of any duration by denoising frames sequentially, with each new frame conditioned on the ones that came before it. The system supports generating videos from scratch using text prompts, starting from a single image and producing subsequent frames, or constraining both the first and last frames to match user-provided images. What distinguishes SkyReels-V2 is its combination of infinite-le
SkyReels-V2 is a diffusion-based video generation system that creates, extends, and refines videos from text descriptions, images, or both, with controllable motion and duration — exactly the core capability of AI-driven video generation you are looking for.
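The sequential scheme described for SkyReels-V2, where each new frame is denoised while conditioned on the frames before it, can be illustrated with a toy rollout. This is a minimal sketch and not SkyReels-V2's code: `toy_denoise` is a hypothetical stand-in for a learned denoiser, frames are short lists of floats rather than latents, and the names `generate_video`, `context`, and `steps` are assumptions made for illustration.

```python
import random
from typing import List

Frame = List[float]


def toy_denoise(noisy: Frame, history: List[Frame]) -> Frame:
    """Hypothetical denoiser: move the noisy frame halfway toward the latest history frame."""
    # With no history yet, pull toward an all-zero frame instead.
    target = history[-1] if history else [0.0] * len(noisy)
    return [x + 0.5 * (t - x) for x, t in zip(noisy, target)]


def generate_video(num_frames: int, frame_size: int = 4, steps: int = 8,
                   context: int = 2, seed: int = 0) -> List[Frame]:
    """Generate frames one at a time, each conditioned on up to `context` earlier frames."""
    rng = random.Random(seed)
    frames: List[Frame] = []
    for _ in range(num_frames):
        # Every new frame starts as Gaussian noise.
        frame = [rng.gauss(0.0, 1.0) for _ in range(frame_size)]
        # Sliding window over frames that are already finished.
        history = frames[-context:] if context > 0 else []
        for _ in range(steps):
            frame = toy_denoise(frame, history)
        # The finished frame becomes conditioning for the next one.
        frames.append(frame)
    return frames
```

In this sketch the clip length is just the loop count, so a longer video only means more iterations over the same fixed-size window of previous frames.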
Open-Sora-Plan is a text-to-video framework and distributed video training system. It utilizes a diffusion transformer architecture and large language model components to transform written descriptions or image prompts into high-quality video sequences. The system features a distributed infrastructure designed for large-scale video training and inference. It employs sequence parallelism to split high-resolution or long-duration video samples across multiple GPUs and uses a sparse attention mechanism to increase processing speed. The project includes capabilities for both text-to-video and im
Open-Sora-Plan is a diffusion-transformer framework that generates high-quality videos from both text descriptions and image prompts, with built-in support for distributed training and high-resolution output, directly matching the need for AI video generation from text and images.
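The sequence-parallel idea mentioned for Open-Sora-Plan, splitting one long video sample across several GPUs, starts with partitioning the token sequence. The sketch below shows only that partitioning step in plain Python. It is not Open-Sora-Plan's implementation, the names `shard_sequence` and `gather_sequence` are hypothetical, and a real system would also need communication between devices so that attention can still see the whole sequence.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def shard_sequence(tokens: Sequence[T], world_size: int) -> List[List[T]]:
    """Split a sequence into `world_size` contiguous shards whose sizes differ by at most one."""
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    base, extra = divmod(len(tokens), world_size)
    shards: List[List[T]] = []
    start = 0
    for rank in range(world_size):
        # The first `extra` ranks take one additional token each.
        size = base + (1 if rank < extra else 0)
        shards.append(list(tokens[start:start + size]))
        start += size
    return shards


def gather_sequence(shards: List[List[T]]) -> List[T]:
    """Concatenate shards back into the original order."""
    return [token for shard in shards for token in shard]
```

For example, ten frame tokens on four hypothetical ranks give shard sizes of 3, 3, 2, and 2, and gathering the shards restores the original order.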
AnimateDiff is a latent diffusion video generator and text-to-video diffusion framework. It converts existing text-to-image diffusion models into animation generators by applying specialized motion modules, allowing for the creation of video sequences without modifying the original base model. The project provides an image-to-video animation framework that uses sparse RGB images, sketches, or structural keyframe constraints to guide generation. It further distinguishes itself with a motion adapter system that injects cinematic camera movements, such as zooming, panning, and tilting, into anim
AnimateDiff is a latent diffusion framework that generates video from both text and images, with controllable motion adapters and camera effects—directly addressing the core request, though explicit fine-tuning support is not prominent.
HunyuanVideo-1.5 is a video generation foundation model and text-to-video diffusion framework. It utilizes a latent video diffusion model and a spatio-temporal transformer architecture to generate high-definition video sequences from text descriptions and images. The project enables cinematic camera control for directing pans and tilts and provides image-to-video animation capabilities. It supports visual style adaptation through low-rank adaptation tuning and uses a language model for prompt refinement to improve visual alignment. The model covers high-resolution video upscaling via a super
HunyuanVideo-1.5 is a full video generation foundation model and diffusion framework that directly supports both text-to-video and image-to-video generation, offers controllable camera motion and fine-tuning via LoRA, and provides high-resolution upscaling—covering all the key capabilities you asked for in a video generation tool.
ComfyUI is a node-based generative AI orchestration engine designed for constructing, testing, and executing complex image and video synthesis pipelines. By utilizing a directed acyclic graph execution model, the platform allows users to build reproducible workflows through modular, interconnected processing blocks without requiring manual code implementation. It serves as both a local environment for high-performance model inference and a production-ready server for deploying generative capabilities. The platform distinguishes itself through its focus on workflow portability and extensibilit
ComfyUI is a node-based generative AI orchestration engine that directly supports text-to-video and image-to-video synthesis using diffusion models, offers controllable workflows for motion and duration, and provides a cross-platform CLI/API, making it a comprehensive framework for the video generation tasks you're after.
Open-Sora is a video generation framework designed to produce cinematic sequences from text prompts and images. It functions as a generative system that transforms written descriptions or reference images into video content featuring realistic textures and lighting. The project includes a dedicated prompt engineering tool that uses large language models to expand simple user inputs into detailed descriptions. It also features a motion controller for adjusting movement intensity in generated sequences and evaluating motion levels in existing video files. The framework incorporates text-to-vid
Open-Sora is a diffusion-based video generation framework that transforms text prompts and images into cinematic video sequences, with built-in motion control and prompt expansion, which is the kind of tool this search is after. The provided description does not confirm model fine-tuning support or output format details.
Sygil-webui is a web interface for Stable Diffusion latent diffusion models, providing a creative suite for text-to-image and text-to-video synthesis. It functions as an image generation tool and a latent diffusion image editor, allowing users to create visuals and video sequences from textual descriptions. The project includes a dedicated model training interface for creating custom textual inversion embeddings, which introduces specific new concepts or styles into the diffusion models. It also features specialized tools for generative image editing, including mask-based inpainting, image-to
Sygil-webui is a Stable Diffusion web interface that supports text-to-video generation and includes model training, fitting the search for an open-source AI video generation tool—though its image-to-video capabilities are less prominent and it lacks explicit mention of motion control or multiple export formats.
imagen-pytorch is a PyTorch-based implementation of diffusion models for synthesizing photorealistic images and video. It provides a framework for text-to-image and text-to-video generation, as well as unconditional image synthesis. The system utilizes a cascading diffusion pipeline to produce high-resolution imagery by passing low-resolution outputs through a sequence of super-resolution models. It also includes capabilities for image inpainting, allowing the reconstruction of masked or missing regions of visual media guided by surrounding context and text prompts. The project includes tools for diff
lucidrains/imagen-pytorch is a PyTorch framework for text-to-video generation using cascading diffusion models, with training and fine-tuning support, fitting the core intent—though it does not explicitly handle image-to-video generation from input images.
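The cascading pipeline described above, where a low-resolution output is passed through a sequence of super-resolution models, has a simple data flow that can be sketched without any model at all. In the sketch below, `upsample_2x` is a hypothetical nearest-neighbour stand-in for a learned super-resolution stage and `run_cascade` is an illustrative name. Neither is part of imagen-pytorch's API.

```python
from typing import Callable, List

Image = List[List[float]]


def upsample_2x(image: Image) -> Image:
    """Nearest-neighbour 2x upsampling, standing in for a learned super-resolution model."""
    out: Image = []
    for row in image:
        # Repeat every pixel twice horizontally, then emit the row twice vertically.
        wide = [pixel for pixel in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def run_cascade(base: Image, stages: List[Callable[[Image], Image]]) -> Image:
    """Feed the base output through each stage in order, as a cascade does."""
    image = base
    for stage in stages:
        image = stage(image)
    return image
```

With a 2x2 base image and two such stages, the result is 8x8. In a real cascade each stage would be a separate diffusion model conditioned on the lower-resolution input.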
imaginAIry is a system for generating and refining images and videos using diffusion models. It operates as a web-based server that triggers generation requests through standard API calls, allowing for the creation of visuals and video sequences from text prompts or existing files. The project provides a suite for AI image editing and upscaling, enabling the modification of visuals through natural language instructions and super-resolution tools to increase detail and image size. The system includes capabilities for structural image control using depth maps, edge maps, and body poses to main
imaginAIry is a diffusion-based system that can generate video from text prompts and existing images via its API, making it a valid tool for AI video generation. Its focus is more on image editing and upscaling, however, and it lacks explicit controls for motion and duration or for model fine-tuning.
This is a framework for training and sampling diffusion models to generate high-fidelity images, video, and 4D assets. It provides a modular environment for managing generative AI training pipelines, including the handling of datasets, noise sampling, and loss weighting to stabilize the creation of synthetic content. The project features a modular model configuration system that uses YAML-based assembly to define network submodules and conditioners. It also includes a dedicated toolset for AI image watermarking, allowing for the embedding and detection of invisible markers to verify the origi
This repository is a diffusion model training and sampling framework that explicitly supports video generation alongside images and 4D assets, which makes it relevant to text-to-video and image-to-video tasks, though its scope is broader than video alone.
Sana is a framework for high-resolution image and video synthesis based on a linear diffusion transformer. It provides a toolkit for the training, fine-tuning, and execution of text-to-image and text-to-video models, as well as a video generative world model capable of simulating physical environments with precise spatial control. The project is distinguished by its use of linear complexity layers to handle high resolutions and its support for long-form, minute-length video generation in real time. It implements a two-stage inference paradigm that separates structural generation from visual t
Sana is a framework for high-resolution text-to-video synthesis using a linear diffusion transformer, with built-in training, fine-tuning, and spatial control over motion, which fits the core need for AI video generation; however, explicit image-to-video generation is not highlighted in the evidence.
Pixelle-Video is a text-to-video automation platform and generation engine that converts text topics into complete videos with synchronized narration, images, and music. It functions as a modular system for producing short-form content, utilizing large language models to automate script composition, visual asset generation, and voiceover production. The platform features a node-based workflow orchestrator that allows the composition of custom generation pipelines by linking different AI models. It includes a dynamic video layout designer that uses HTML templates to define aspect ratios and vi
Pixelle-Video is a text-to-video automation platform that orchestrates AI models, likely including diffusion models, to generate videos from text, with node-based workflows and video layout design. It fits the search for an open-source video generation tool, though it focuses on pipeline orchestration rather than a single model and may not include model fine-tuning.
Genkit is an open-source framework for building AI-powered applications. It provides a unified interface for connecting to hundreds of generative AI models from multiple providers, enabling text, image, audio, and video generation through a single API. The framework structures multi-step AI interactions—including chat, retrieval-augmented generation, tool use, and agentic workflows—as composable, traceable flows with built-in streaming and state management. The framework distinguishes itself through a comprehensive developer toolkit that includes a command-line interface and a local developer
Genkit is a general-purpose AI application framework that can orchestrate video generation from text and images via multiple model providers, but its focus is on composable AI workflows rather than being a dedicated video generation tool, so it fits the category in a broader, less specialized way.
ComfyUI is a modular generative AI workflow orchestrator and node-based GUI for designing and executing complex diffusion model pipelines. It functions as both a visual interface for building generative logic graphs and a programmable backend API that exposes diffusion model operations for external integration. The system distinguishes itself through a graph-based execution model that supports differential workflow execution, re-running only modified nodes to reduce computation. It features dynamic model offloading to manage memory between system RAM and GPU VRAM and utilizes metadata-embedde
ComfyUI is a node-based workflow orchestrator that can execute diffusion model pipelines for text-to-video and image-to-video generation, supporting controllable parameters and offering a graphical interface and API, which fits your request for a flexible, cross-platform AI video generation framework; however, its built-in model fine-tuning capabilities are limited and primarily rely on external custom nodes.
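The differential execution described for ComfyUI, re-running only the nodes affected by a change, can be illustrated with a small cached graph. This is a minimal sketch under stated assumptions, not ComfyUI's engine or API: the `Graph` class, its methods, and the demo node names are all hypothetical, and cycle detection is left out.

```python
import hashlib
from typing import Any, Callable, Dict, List, Sequence, Tuple


class Graph:
    """Toy node graph that caches results and re-runs only nodes whose inputs changed."""

    def __init__(self) -> None:
        self.nodes: Dict[str, Tuple[Callable[..., Any], Dict[str, Any], Tuple[str, ...]]] = {}
        self.cache: Dict[str, Tuple[str, Any]] = {}
        self.executed: List[str] = []  # log of nodes that actually ran

    def add(self, name: str, fn: Callable[..., Any],
            params: Dict[str, Any] = None, inputs: Sequence[str] = ()) -> None:
        self.nodes[name] = (fn, dict(params or {}), tuple(inputs))

    def set_param(self, name: str, key: str, value: Any) -> None:
        self.nodes[name][1][key] = value

    def run(self, name: str) -> Tuple[str, Any]:
        """Return (signature, value) for a node, reusing the cache when nothing changed."""
        fn, params, inputs = self.nodes[name]
        upstream = [self.run(dep) for dep in inputs]
        # A node's signature covers its own parameters and every upstream signature,
        # so a change anywhere upstream invalidates everything downstream of it.
        raw = repr((name, sorted(params.items()), [sig for sig, _ in upstream]))
        signature = hashlib.sha256(raw.encode()).hexdigest()
        cached = self.cache.get(name)
        if cached is not None and cached[0] == signature:
            return cached
        value = fn(*[val for _, val in upstream], **params)
        self.executed.append(name)
        self.cache[name] = (signature, value)
        return self.cache[name]


def build_demo_graph() -> Graph:
    """Four hypothetical nodes loosely shaped like a generation workflow."""
    graph = Graph()
    graph.add("prompt", lambda text: text.upper(), {"text": "a cat"})
    graph.add("latent", lambda seed: seed * 2, {"seed": 21})
    graph.add("sample", lambda p, l, steps: f"{p}|{l}|{steps}", {"steps": 20},
              inputs=("prompt", "latent"))
    graph.add("decode", lambda s: f"video({s})", inputs=("sample",))
    return graph
```

Running `decode` on a fresh demo graph executes all four nodes. After `set_param("sample", "steps", 30)`, a second run executes only `sample` and `decode`, because the `prompt` and `latent` signatures are unchanged.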
Open-Higgsfield-AI is a generative AI content studio and visual workflow orchestrator. It provides a unified interface for creating photorealistic images and videos, utilizing a node-based editor to chain multiple image, video, and audio models into automated content pipelines. The system functions as an AI video animation tool and local GPU inference engine, allowing users to run generative models on local hardware or remote servers. It includes specialized capabilities for audio-driven lip synchronization and cinematic camera controls to adjust virtual lens and focal settings. The platform
Open-Higgsfield-AI is a generative AI content studio that orchestrates text-to-video and image-to-video generation through a node-based pipeline, fitting the search for an open-source AI video creation tool. It covers the core generation features but does not explicitly include model fine-tuning, so it is a solid rather than comprehensive match.