In the field of AI painting, Composer from Alibaba and ControlNet from Stanford, built on Stable Diffusion, have driven the theoretical development of controllable image generation. The industry's exploration of controllable video generation, however, is still largely blank.
Compared with image generation, controllable video generation is more complex: beyond controlling the spatial content of a video, it must also control the temporal dimension. Against this backdrop, research teams from Alibaba and Ant Group made an early attempt and proposed VideoComposer, which achieves video controllability in both time and space through a compositional generation paradigm.
Some time ago, Alibaba quietly open-sourced its text-to-video model on the ModelScope (MoDa) community and Hugging Face, and it unexpectedly attracted widespread attention from developers at home and abroad. A video generated by the model even drew a reply from Musk himself, and the model stayed among the most popular models on the ModelScope community for many days in a row, with tens of thousands of international visits per day.
Text-to-Video on Twitter
As the research team's latest result, VideoComposer has once again drawn widespread attention from the international community.
VideoComposer on Twitter

In fact, controllability has become a higher benchmark for visual content creation. Significant progress has been made in customized image generation, but three major challenges remain in the field of video generation.
Prior to this, Composer, proposed by Alibaba, had shown that compositionality is extremely helpful for improving the controllability of image generation. VideoComposer builds on the same compositional generation paradigm, improving the flexibility of video generation while addressing these three challenges. Specifically, a video is decomposed into three kinds of guiding conditions, namely textual conditions, spatial conditions, and video-specific temporal conditions, and a Video LDM (Video Latent Diffusion Model) is trained on top of them. In particular, VideoComposer uses efficient motion vectors as an important explicit temporal condition to learn the motion patterns of videos, and designs a simple yet effective spatio-temporal condition encoder (STC-encoder) to ensure the spatio-temporal continuity of condition-driven videos. At inference time, different conditions can be freely combined to control the video content.
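To make the explicit temporal condition concrete, the sketch below (not the paper's code) approximates a motion condition for a short clip. VideoComposer uses motion vectors from the compressed video stream; dense optical flow from OpenCV is substituted here as a stand-in, and the function name `motion_condition` is illustrative.

```python
# Rough sketch: approximate an explicit motion condition from raw frames.
import cv2
import numpy as np

def motion_condition(frames):
    """frames: list of HxWx3 uint8 RGB frames -> (F-1, H, W, 2) motion field."""
    grays = [cv2.cvtColor(f, cv2.COLOR_RGB2GRAY) for f in frames]
    flows = []
    for prev, nxt in zip(grays[:-1], grays[1:]):
        # Farneback dense optical flow between consecutive frames.
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        flows.append(flow)          # (H, W, 2): per-pixel (dx, dy) displacement
    return np.stack(flows, axis=0)  # explicit temporal condition for the Video LDM
```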
Experimental results show that VideoComposer can flexibly control the spatial and temporal patterns of videos, for example generating specific videos from a single image or a hand-drawn sketch, and even controlling a target's motion style with a few simple hand-drawn strokes. The study directly tested VideoComposer on 9 different classic tasks and achieved satisfactory results on all of them, demonstrating the versatility of VideoComposer.
Figure (a-c): VideoComposer can generate videos that satisfy the text, spatial, and temporal conditions, or any subset of them; (d): VideoComposer can generate a Van Gogh-style video from just two strokes, while satisfying the expected motion pattern (red stroke) and shape pattern (white stroke).
## Video LDM

Latent space. Video LDM first introduces a pre-trained encoder $\mathcal{E}$ to map the input video $x$ into a latent representation $z = \mathcal{E}(x)$. A pre-trained decoder $\mathcal{D}$ then maps the latent back to pixel space, $\tilde{x} = \mathcal{D}(z)$.

Diffusion model. To learn the distribution of real video content, the diffusion model learns to gradually denoise normally distributed noise back into real visual content, a process that simulates a reversible Markov chain of length $T = 1000$. To carry out this reversible process in the latent space, Video LDM injects Gaussian noise into the latent $z$ to obtain a noised latent $z_t$, and a denoising function $\epsilon_\theta(z_t, c, t)$ conditioned on $c$ is trained with the standard noise-prediction objective $\mathbb{E}_{z,\,\epsilon\sim\mathcal{N}(0,1),\,t}\big[\lVert \epsilon - \epsilon_\theta(z_t, c, t)\rVert_2^2\big]$.
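To make the noise injection and the denoising objective concrete, here is a minimal PyTorch-style sketch of a standard latent-diffusion training step; `encoder`, `denoiser`, and `alphas_cumprod` are placeholder callables/tensors, not VideoComposer's released modules.

```python
import torch
import torch.nn.functional as F

T = 1000  # length of the diffusion Markov chain

def ldm_training_step(encoder, denoiser, video, conditions, alphas_cumprod):
    """One noise-prediction step: corrupt the latent, then predict the injected noise."""
    z = encoder(video)                                    # map the video into latent space
    t = torch.randint(0, T, (z.shape[0],), device=z.device)
    eps = torch.randn_like(z)                             # Gaussian noise
    a_t = alphas_cumprod[t].view(-1, *([1] * (z.dim() - 1)))
    z_t = a_t.sqrt() * z + (1.0 - a_t).sqrt() * eps       # noised latent z_t
    eps_pred = denoiser(z_t, conditions, t)               # epsilon_theta(z_t, c, t)
    return F.mse_loss(eps_pred, eps)                      # || eps - eps_theta ||_2^2
```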
To fully exploit the local spatial inductive bias and the sequential temporal inductive bias during denoising, VideoComposer instantiates $\epsilon_\theta$ as a 3D UNet that uses both temporal convolution operators and a cross-attention mechanism.
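As a rough illustration of how a 3D UNet stage can combine spatial convolution, temporal convolution, and cross-attention over condition embeddings, consider the block below; the layer sizes and exact layout are assumptions, not the released architecture.

```python
import torch
import torch.nn as nn

class SpatioTemporalBlock(nn.Module):
    """Illustrative 3D-UNet stage: spatial conv -> temporal conv -> cross-attention."""

    def __init__(self, channels: int, cond_dim: int, heads: int = 8):
        super().__init__()
        # Factorized 3D convolutions: (1,3,3) acts within frames, (3,1,1) across frames.
        self.spatial = nn.Conv3d(channels, channels, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        self.temporal = nn.Conv3d(channels, channels, kernel_size=(3, 1, 1), padding=(1, 0, 0))
        self.norm = nn.LayerNorm(channels)
        # Cross-attention from latent tokens (queries) to condition tokens (keys/values).
        self.attn = nn.MultiheadAttention(channels, heads, kdim=cond_dim, vdim=cond_dim,
                                          batch_first=True)

    def forward(self, z: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # z: (B, C, F, H, W) noised latent; cond: (B, L, cond_dim) text/style tokens
        z = z + self.spatial(z)                  # local spatial inductive bias
        z = z + self.temporal(z)                 # temporal inductive bias across frames
        b, c, f, h, w = z.shape
        tokens = z.permute(0, 2, 3, 4, 1).reshape(b, f * h * w, c)
        attn_out, _ = self.attn(self.norm(tokens), cond, cond)
        tokens = tokens + attn_out               # inject condition guidance
        return tokens.reshape(b, f, h, w, c).permute(0, 4, 1, 2, 3)
```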
## VideoComposer
Composable conditions. VideoComposer decomposes a video into three different types of conditions, namely textual conditions, spatial conditions, and, crucially, temporal conditions, which together determine the spatial and temporal patterns in the video. Because VideoComposer is a general composable video generation framework, more customized conditions can be incorporated into it depending on the downstream application, rather than being limited to a fixed set.
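As a schematic of this composability (the grouping and field names below are illustrative, not VideoComposer's released API), any subset of the conditions can be supplied at inference time:

```python
from dataclasses import dataclass, field
from typing import Any, Optional

@dataclass
class VideoConditions:
    """Illustrative grouping of the three condition types; any field may be left empty."""
    text: Optional[str] = None                    # textual condition
    spatial: dict = field(default_factory=dict)   # e.g. single image, sketch, style
    temporal: dict = field(default_factory=dict)  # e.g. motion vectors

    def active(self) -> dict:
        """Return only the conditions the caller actually supplied."""
        chosen: dict[str, Any] = {}
        if self.text is not None:
            chosen["text"] = self.text
        chosen.update(self.spatial)
        chosen.update(self.temporal)
        return chosen

# Example: text plus a hand-drawn motion field, with no spatial condition.
conds = VideoConditions(text="a boat on a lake, Van Gogh style",
                        temporal={"motion": "user_stroke_motion_field"})
print(conds.active())
```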
Spatio-temporal condition encoder. Sequential conditions contain rich and complex spatio-temporal dependencies, which makes controllable guidance challenging. To enhance the temporal awareness of the input conditions, the study designs a spatio-temporal condition encoder (STC-encoder) to incorporate spatio-temporal relationships. Specifically, a lightweight spatial structure consisting of two 2D convolutions and an average pooling layer is first applied to extract local spatial information, and the resulting condition sequence is then fed into a temporal Transformer layer for temporal modeling. In this way, the STC-encoder facilitates the explicit embedding of temporal cues and provides a unified conditional-embedding entry point for diverse inputs, thereby enhancing inter-frame consistency. In addition, the study repeats spatial conditions given as a single image or a single sketch along the temporal dimension to align them with the temporal conditions, which facilitates the condition embedding process.
After being processed by the STC-encoder, the resulting condition sequences have the same spatial shape as the noised latent $z_t$ and are fused by element-wise addition. Finally, the merged condition sequence is concatenated with $z_t$ along the channel dimension as the control signal. For text and style conditions, a cross-attention mechanism is used to inject the text and style guidance.
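Below is a minimal sketch of the STC-encoder idea (a light spatial CNN followed by a temporal Transformer layer); the channel sizes, pooling, and token layout are simplifications, not the released implementation.

```python
import torch
import torch.nn as nn

class STCEncoder(nn.Module):
    """Sketch of an STC-encoder: lightweight spatial CNN + temporal Transformer layer."""

    def __init__(self, in_ch: int, dim: int, heads: int = 8, layers: int = 1):
        super().__init__()
        self.spatial = nn.Sequential(             # lightweight spatial structure
            nn.Conv2d(in_ch, dim, 3, padding=1), nn.SiLU(),
            nn.Conv2d(dim, dim, 3, padding=1), nn.SiLU(),
            nn.AvgPool2d(2),                       # pooled local spatial information
        )
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, cond: torch.Tensor) -> torch.Tensor:
        # cond: (B, F, C, H, W) condition sequence (a single image/sketch is repeated over F)
        b, f, c, h, w = cond.shape
        feat = self.spatial(cond.reshape(b * f, c, h, w))        # per-frame spatial features
        _, d, h2, w2 = feat.shape
        # Temporal modeling: attend over the F frames at each spatial location.
        tokens = feat.reshape(b, f, d, h2 * w2).permute(0, 3, 1, 2).reshape(b * h2 * w2, f, d)
        tokens = self.temporal(tokens)
        out = tokens.reshape(b, h2 * w2, f, d).permute(0, 2, 3, 1).reshape(b, f, d, h2, w2)
        return out   # later fused with other encoded conditions by element-wise addition
```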
## Training and inference
Two-stage training strategy. Although VideoComposer can be initialized from a pre-trained image LDM, which alleviates the training difficulty to some extent, it is hard for the model to simultaneously acquire the ability to perceive temporal dynamics and the ability to generate from multiple conditions, which increases the difficulty of training compositional video generation. Therefore, the study adopts a two-stage optimization strategy: in the first stage, the model is equipped with basic temporal modeling ability through text-to-video (T2V) training; in the second stage, VideoComposer is optimized through compositional training to achieve better performance.
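In loop form, the two-stage schedule reads roughly as below; `ldm_step`, the data loaders, and the condition dictionaries are placeholders rather than the actual training code.

```python
def train_two_stage(model, t2v_loader, composed_loader, optimizer, ldm_step):
    # Stage 1: text-to-video pre-training equips the model with temporal modeling.
    for video, text in t2v_loader:
        loss = ldm_step(model, video, {"text": text})
        optimizer.zero_grad(); loss.backward(); optimizer.step()

    # Stage 2: compositional training on randomly combined condition subsets.
    for video, conditions in composed_loader:
        loss = ldm_step(model, video, conditions)   # text / spatial / temporal subsets
        optimizer.zero_grad(); loss.backward(); optimizer.step()
```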
Inference. During inference, DDIM sampling is used to improve inference efficiency, and classifier-free guidance is adopted to ensure that the generated results satisfy the specified conditions. The generation process can be formalized as:

$$\hat{\epsilon}_\theta(z_t, c, t) = \epsilon_\theta(z_t, c_1, t) + \omega\big(\epsilon_\theta(z_t, c_2, t) - \epsilon_\theta(z_t, c_1, t)\big)$$
where $\omega$ is the guidance scale and $c_1$ and $c_2$ are two sets of conditions. This guidance mechanism, determined by the two condition sets, gives the model more flexible control through the guidance strength.
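The rule above translates directly into code; in the small sketch below, `denoiser` stands in for $\epsilon_\theta$ and is an assumed placeholder.

```python
def guided_eps(denoiser, z_t, t, c1, c2, omega):
    """Classifier-free guidance with two condition sets: push the prediction from c1 toward c2."""
    eps_c1 = denoiser(z_t, c1, t)   # prediction under the weaker / partial condition set c1
    eps_c2 = denoiser(z_t, c2, t)   # prediction under the fuller condition set c2
    return eps_c1 + omega * (eps_c2 - eps_c1)
```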
In the experimental exploration, the study demonstrates that VideoComposer, as a unified model, provides a universal generation framework, and verifies its capabilities on 9 classic tasks.
Part of the results are shown below: image-to-video generation (Figure 4), video inpainting (Figure 5), sketch-to-video generation (Figure 6), hand-drawn motion control of video (Figure 8), and motion transfer (Figure A12) all reflect the advantages of controllable video generation.
## Team Introduction
Public information shows that Alibaba's research on visual foundation models focuses mainly on large visual representation models, large visual generative models, and their downstream applications. The team has published more than 60 CCF-A papers in related fields and won more than 10 international championships in industry competitions. Works such as the controllable image generation method Composer, the image-text pre-training methods RA-CLIP and RLEG, the untrimmed long-video self-supervised learning methods HiCo/HiCo++, and the talking-face generation method LipFormer all come from this team.