Fudan University and Huawei Noah's Ark Lab propose the VidRD framework for iterative high-quality video generation

Researchers from Fudan University and Huawei's Noah's Ark Lab have proposed VidRD (Reuse and Diffuse), an iterative approach to generating high-quality videos built on an image latent diffusion model (LDM). The approach targets both the quality and the length of generated videos, aiming at high-quality, controllable generation of long video sequences. It effectively reduces jitter between generated frames, has clear research and practical value, and is a useful contribution to the currently very active AIGC community.

Latent diffusion models (LDMs) are generative models built on denoising autoencoders: starting from randomly initialized latents, they produce high-quality samples by gradually removing noise. However, because of computational and memory limits during both training and inference, a single LDM can usually generate only a very limited number of video frames. Although existing work tries to use a separate prediction model to generate more frames, this incurs additional training cost and introduces frame-level jitter.
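As a rough illustration of this denoising process, the sketch below shows a generic DDIM-style reverse loop over latent codes; the noise-prediction network and noise schedule are placeholders, not any specific model's implementation.

```python
# Minimal sketch of what "gradually removing noise" means for an LDM:
# a generic DDIM-style reverse loop over latent codes. The predictor
# eps_model and the schedule alphas_cumprod are placeholders.
import torch

def sample_latents(eps_model, alphas_cumprod, shape):
    """eps_model(x_t, t) predicts the noise that was added at step t."""
    x = torch.randn(shape)  # start from pure Gaussian noise
    T = len(alphas_cumprod)
    for t in reversed(range(T)):
        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[t - 1] if t > 0 else torch.tensor(1.0)
        eps = eps_model(x, t)                                  # predicted noise
        x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()         # estimated clean latent
        x = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps     # deterministic DDIM step
    return x  # decode with the VAE to obtain pixels
```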

In this paper, inspired by the remarkable success of latent diffusion models (LDMs) in image synthesis, a framework called "Reuse and Diffuse", abbreviated VidRD, is proposed. The framework can continue generating frames beyond the small number already produced by the LDM, iteratively yielding longer, higher-quality, and more diverse video content. VidRD loads a pre-trained image LDM for efficient training and uses a U-Net augmented with temporal information for denoising.



  • Paper title: Reuse and Diffuse: Iterative Denoising for Text-to-Video Generation
  • Paper address: https://arxiv.org/abs/2309.03549
  • Project homepage: https://anonymous0x233.github.io/ReuseAndDiffuse/

The main contributions of this article are as follows:

  1. To generate smoother videos, this article proposes an iterative "text-to-video" generation method based on a temporally-aware LDM. In each iteration, the method reuses the latent features of the frames already generated and follows the previous diffusion process, so that additional video frames can be generated iteratively.
  2. This article designs a set of data processing methods for constructing high-quality "text-video" datasets. For existing action recognition datasets, it uses multi-modal large language models to produce text descriptions for the videos; for image data, it uses random scaling and translation to generate additional video training samples.
  3. On the UCF-101 dataset, this article reports the FVD and IS evaluation metrics as well as visualization results. Both the quantitative and qualitative results show that, compared with existing methods, the VidRD model achieves better results.

Method introduction


Figure 1. Schematic diagram of the VidRD video generation framework proposed in this article

This article argues that using a pre-trained image LDM as the starting point for training an LDM for high-quality video synthesis is an efficient and sensible choice, a view further supported by work such as [1, 2]. In this context, the carefully designed model in this article is built on the pre-trained Stable Diffusion model and fully inherits its strengths, including a variational autoencoder (VAE) for accurate latent representations and a powerful denoising U-Net. Figure 1 shows the overall architecture of the model in a clear, intuitive way.
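As a rough idea of what "starting from a pre-trained image LDM" can look like in practice, the sketch below loads Stable Diffusion components with the Hugging Face diffusers library; the checkpoint name and the freezing choices are illustrative assumptions, not the authors' exact setup.

```python
# Illustrative sketch: loading pre-trained Stable Diffusion components as the
# starting point for a video LDM. Checkpoint name and freezing strategy are
# assumptions for illustration, not the VidRD training code.
import torch
from diffusers import AutoencoderKL, UNet2DConditionModel
from transformers import CLIPTextModel, CLIPTokenizer

base = "runwayml/stable-diffusion-v1-5"  # assumed base checkpoint

vae = AutoencoderKL.from_pretrained(base, subfolder="vae")           # latent encoder/decoder
unet = UNet2DConditionModel.from_pretrained(base, subfolder="unet")  # spatial denoising U-Net
text_encoder = CLIPTextModel.from_pretrained(base, subfolder="text_encoder")
tokenizer = CLIPTokenizer.from_pretrained(base, subfolder="tokenizer")

# Spatial weights are reused as-is; newly added temporal layers (see the
# sketch further below) would be trained from random initialization.
for module in (vae, text_encoder):
    module.requires_grad_(False)
```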

A notable feature of the model design in this article is the full reuse of pre-trained weights. Specifically, most network layers, including the VAE components and the upsampling and downsampling layers of the U-Net, are initialized with pre-trained Stable Diffusion weights. This strategy not only significantly speeds up training, but also gives the model good stability and reliability from the start. The model can then iteratively generate additional frames from an initial clip containing a small number of frames, by reusing the original latent features and mimicking the previous diffusion process. In addition, for the autoencoder that converts between pixel space and latent space, temporal network layers are injected into its decoder and fine-tuned to improve temporal consistency.
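The following schematic sketch illustrates the "reuse and diffuse" idea described above: each round re-noises latents reused from the tail of the previous clip and denoises them together with fresh noise for the new frames. All helper names (encode_text, denoise, add_noise) are placeholders rather than the paper's actual API.

```python
# Schematic sketch of iterative "reuse and diffuse" generation. The helpers
# encode_text, denoise, and add_noise are placeholders, not the paper's API.
import torch

def generate_long_video(prompt, num_rounds, frames_per_clip, reuse_frames,
                        denoise, encode_text, add_noise, latent_shape):
    text_emb = encode_text(prompt)
    clips = []

    # First clip: denoise from pure Gaussian noise.
    latents = torch.randn(frames_per_clip, *latent_shape)
    latents = denoise(latents, text_emb)
    clips.append(latents)

    for _ in range(num_rounds - 1):
        # Reuse the latent features of the last few generated frames...
        reused = latents[-reuse_frames:]
        # ...re-noise them so they follow the same diffusion trajectory,
        # and append fresh noise for the frames still to be generated.
        noisy_reused = add_noise(reused)
        fresh = torch.randn(frames_per_clip - reuse_frames, *latent_shape)
        latents = denoise(torch.cat([noisy_reused, fresh], dim=0), text_emb)
        clips.append(latents[reuse_frames:])  # keep only the newly generated frames

    return torch.cat(clips, dim=0)  # decode with the VAE afterwards
```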

To ensure continuity between video frames, this article adds 3D Temp-Conv and Temp-Attn layers to the model. The Temp-Conv layer follows the 3D ResNet structure and performs 3D convolutions to capture spatial and temporal correlations, modeling the dynamics and continuity of the video sequence. The Temp-Attn structure is similar to self-attention and is used to analyze and understand the relationships between frames in the video sequence, allowing the model to accurately align motion information across frames. These parameters are randomly initialized at training time and are designed to give the model an understanding and encoding of temporal structure. In addition, the data input is adapted accordingly to fit the model structure.
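A minimal PyTorch sketch of temporal layers of this kind is shown below; the tensor layout and residual wiring are illustrative assumptions, not the exact VidRD implementation.

```python
# Minimal sketch of temporal layers of the kind described above. Tensor layout
# and residual wiring are assumptions for illustration only.
import torch
import torch.nn as nn

class TempConv(nn.Module):
    """Temporal 3D convolution applied along the frame axis."""
    def __init__(self, channels, kernel_t=3):
        super().__init__()
        self.conv = nn.Conv3d(channels, channels,
                              kernel_size=(kernel_t, 1, 1),
                              padding=(kernel_t // 2, 0, 0))

    def forward(self, x):  # x: (batch, channels, frames, height, width)
        return x + self.conv(x)  # residual: stays close to the image model at init

class TempAttn(nn.Module):
    """Self-attention over the frame axis, applied per spatial location."""
    def __init__(self, channels, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, x):  # x: (batch, channels, frames, height, width)
        b, c, t, h, w = x.shape
        tokens = x.permute(0, 3, 4, 2, 1).reshape(b * h * w, t, c)
        out, _ = self.attn(tokens, tokens, tokens)
        out = out.reshape(b, h, w, t, c).permute(0, 4, 3, 1, 2)
        return x + out
```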


Figure 2. The high-quality “text-video” training data set construction method proposed in this article

To train the VidRD model, this article proposes a method for constructing a large-scale "text-video" training dataset, as shown in Figure 2. The method can handle both "text-image" data and video data that lacks text descriptions. In addition, to achieve high-quality video generation, this article also attempts to remove watermarks from the training data.

Although high-quality video description datasets are relatively scarce, a large number of video classification datasets exist. These datasets contain rich video content, with each video accompanied by a classification label. For example, Moments-In-Time, Kinetics-700, and VideoLT are three representative large-scale video classification datasets. Kinetics-700 covers 700 human action categories and contains over 600,000 video clips; Moments-In-Time includes 339 action categories with more than one million clips in total; VideoLT contains 1,004 categories and 250,000 long, unedited videos.

To make full use of existing video data, this article annotates these videos automatically and in more detail, using multi-modal large language models such as BLIP-2 and MiniGPT-4. For key frames in a video, combined with the video's original classification label, it designs a number of prompts and generates annotations through model question answering. This not only enriches the semantic information of the video data, but also gives existing videos without detailed descriptions more comprehensive and fine-grained captions, producing richer video labels that help the VidRD model train more effectively.
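As an illustration, the sketch below captions a key frame with BLIP-2 through Hugging Face transformers, conditioning on the video's original class label; the model choice and prompt wording are assumptions, not the authors' exact pipeline.

```python
# Hedged sketch: captioning a video key frame with BLIP-2, conditioned on the
# video's original classification label. Model and prompt are illustrative.
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to("cuda")

def caption_key_frame(frame: Image.Image, class_label: str) -> str:
    # Combine the original action label with a question-style prompt so the
    # model produces a richer free-form description of the frame.
    prompt = f"Question: describe this scene of '{class_label}' in detail. Answer:"
    inputs = processor(images=frame, text=prompt, return_tensors="pt").to(
        "cuda", torch.float16)
    out = model.generate(**inputs, max_new_tokens=50)
    return processor.batch_decode(out, skip_special_tokens=True)[0].strip()
```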

In addition, for the abundant existing image data, this article also designs a method to convert images into video format for training. The specific operation is to pan and zoom over different positions of an image at different speeds, giving each image a unique dynamic presentation and simulating the effect of a moving camera capturing still objects in real life. In this way, existing image data can be effectively used for video training.
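A simple sketch of this image-to-video conversion is shown below: a crop window pans and zooms across the still image over time. The crop schedule and parameters are illustrative assumptions.

```python
# Simple sketch: turning a still image into a pseudo-video clip by panning and
# zooming a crop window across frames. Schedule and ranges are illustrative.
import random
from PIL import Image

def image_to_clip(image: Image.Image, num_frames=16, out_size=(256, 256)):
    w, h = image.size
    # Random start/end zoom factors and pan anchors, interpolated over time.
    z0, z1 = random.uniform(0.6, 1.0), random.uniform(0.6, 1.0)
    x0, y0 = random.random(), random.random()
    x1, y1 = random.random(), random.random()

    frames = []
    for i in range(num_frames):
        t = i / max(num_frames - 1, 1)
        zoom = z0 + (z1 - z0) * t
        cw, ch = int(w * zoom), int(h * zoom)
        # Pan: move the crop window between the two random anchor positions.
        left = int((x0 + (x1 - x0) * t) * (w - cw))
        top = int((y0 + (y1 - y0) * t) * (h - ch))
        crop = image.crop((left, top, left + cw, top + ch))
        frames.append(crop.resize(out_size, Image.BICUBIC))
    return frames  # a list of PIL frames simulating camera motion
```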

Generation results

The description texts are: "Timelapse at the snow land with aurora in the sky.", "A candle is burning.", "An epic tornado attacking above a glowing city at night.", and "Aerial view of a white sandy beach on the shores of a beautiful sea." More visualizations can be found on the project homepage.


Figure 3. Visual comparison of the generation effect with existing methods

Finally, Figure 3 shows a visual comparison between the results generated by this article's model and those of the existing methods Make-A-Video [3] and Imagen Video [4], demonstrating the higher generation quality of this article's model.
