SkyReels-V2 is a video generation system that creates, extends, and refines video clips from text descriptions, images, or both. It is a diffusion-based model that can produce videos of arbitrary duration by denoising frames sequentially, with each new frame conditioned on the frames that precede it. It supports three generation modes: creating a video from a text prompt alone, continuing from a single starting image, or constraining both the first and last frames to match user-provided images.
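As a rough illustration of how the three modes relate to the inputs a user supplies, the sketch below picks a mode from a request. The names `GenerationRequest`, `select_mode`, and the mode strings are hypothetical and chosen for this example; they are not taken from the SkyReels-V2 codebase.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GenerationRequest:
    """Hypothetical request object; field names are assumptions for this sketch."""
    prompt: str
    first_frame: Optional[str] = None  # path to an image fixing the first frame
    last_frame: Optional[str] = None   # path to an image fixing the last frame


def select_mode(request: GenerationRequest) -> str:
    """Choose a generation mode from the inputs that are present."""
    if request.first_frame and request.last_frame:
        return "first_last_frame"
    if request.first_frame:
        return "image_to_video"
    if request.last_frame:
        # The overview describes no mode that fixes only the last frame.
        raise ValueError("last_frame requires first_frame in this sketch")
    return "text_to_video"
```

For example, `select_mode(GenerationRequest("a cat on a beach", first_frame="cat.png"))` returns `"image_to_video"` under these assumptions.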
What distinguishes SkyReels-V2 is its combination of infinite-length video generation, frame-level control, and motion quality refinement through reinforcement learning. Because each frame's tokens are denoised at an independent noise level, the system can keep extending a video indefinitely, continuing footage seamlessly beyond typical length limits. It also applies direct preference optimization (DPO) on preference pairs, steering the model toward physically plausible, large-motion sequences and improving temporal coherence and motion quality. Two auxiliary models support the pipeline: a prompt expansion language model rewrites brief text descriptions into more detailed prompts, and a vision-language captioning model produces detailed textual descriptions of video content, including shot types and camera movements.
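To make "independent noise levels per frame" concrete, the following minimal sketch builds a staggered schedule in which each frame starts denoising a few iterations after the previous one, so earlier frames are always at least as clean as later ones. The function name, the `lag` parameter, and the linear noise scale are assumptions made for illustration; the schedule SkyReels-V2 actually uses may differ.

```python
def staggered_noise_levels(num_frames: int, num_steps: int, lag: int) -> list[list[float]]:
    """Return one row per iteration; row[f] is the noise level of frame f.

    Noise runs from 1.0 (pure noise) down to 0.0 (fully denoised).
    Frame f begins denoising `lag` iterations after frame f - 1.
    This is an illustrative schedule, not the project's implementation.
    """
    if num_frames < 1 or num_steps < 1 or lag < 0:
        raise ValueError("num_frames and num_steps must be >= 1, lag >= 0")

    total_iterations = num_steps + lag * (num_frames - 1)
    schedule = []
    for iteration in range(1, total_iterations + 1):
        row = []
        for frame in range(num_frames):
            # Steps this frame has completed so far, clamped to [0, num_steps].
            steps_done = min(max(iteration - frame * lag, 0), num_steps)
            row.append(1.0 - steps_done / num_steps)
        schedule.append(row)
    return schedule


if __name__ == "__main__":
    for row in staggered_noise_levels(num_frames=3, num_steps=4, lag=2):
        print(row)
```

Setting `lag=0` collapses the schedule to the usual synchronous case where all frames share one noise level. A positive `lag` lets the leading frames finish first, so they can serve as clean context while later frames are still being denoised.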
The system includes multi-GPU pipeline parallelism, which distributes frame batches across GPUs to reduce end-to-end inference time for large outputs. It also supports video extension: new frames are appended to an existing clip by conditioning on its final frames, so the continuation stays seamless.
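Video extension by conditioning on the final frames can be pictured as a sliding window over the frame index. The sketch below only plans the windows and generates nothing. The function name and the `chunk_size` and `overlap` parameters are hypothetical, introduced here to show the bookkeeping.

```python
def plan_extension_chunks(
    existing_frames: int, new_frames: int, chunk_size: int, overlap: int
) -> list[tuple[int, int, int]]:
    """Plan windows for appending `new_frames` frames to an existing clip.

    Each tuple is (context_start, generate_start, generate_end):
    frames [context_start, generate_start) already exist and act as
    conditioning; frames [generate_start, generate_end) are newly generated.
    Illustrative only; parameter names are assumptions.
    """
    if not 0 < overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 < overlap < chunk_size")
    if overlap > existing_frames:
        raise ValueError("clip is shorter than the requested overlap")

    target = existing_frames + new_frames
    total = existing_frames
    chunks = []
    while total < target:
        context_start = total - overlap
        generate_end = min(context_start + chunk_size, target)
        chunks.append((context_start, total, generate_end))
        total = generate_end  # newly generated frames become context next time
    return chunks


if __name__ == "__main__":
    print(plan_extension_chunks(existing_frames=20, new_frames=25, chunk_size=15, overlap=5))
```

Each window reuses the last `overlap` frames as context and generates `chunk_size - overlap` new frames, except possibly the final window, which is truncated at the target length.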