
The first multi-view autonomous driving scene video generation world model | DrivingDiffusion: New ideas for BEV data and simulation

WBOY
Release: 2023-10-23 11:13:01

Some personal thoughts of the author

In the field of autonomous driving, as BEV-based sub-tasks and end-to-end solutions develop, high-quality multi-view training data and the corresponding simulation scene construction become increasingly important. In response to the pain points of current tasks, "high quality" can be decoupled into three aspects:

  1. Long-tail scenarios along different dimensions: for example, close-range vehicles and precise heading angles during cut-in maneuvers in obstacle data, or lane lines with varying curvatures and hard-to-collect ramp/merge/split scenes in road-structure data. These usually rely on large-scale data collection and complex data mining strategies, both of which are costly.
  2. High consistency between 3D ground truth and images: current BEV data acquisition is often affected by errors in sensor installation/calibration, high-precision maps, and the reconstruction algorithm itself, which makes it difficult to guarantee that every [3D ground truth - image - sensor parameters] triplet in the data is accurate and consistent.
  3. Temporal data satisfying the above conditions: multi-view images of consecutive frames with corresponding ground truth, which are indispensable for current perception/prediction/planning and end-to-end tasks.

For simulation, video generation that meets the above conditions and can be driven directly by a layout is undoubtedly the most direct way to construct multi-agent sensor input. DrivingDiffusion tackles these problems from a new perspective.

What is DrivingDiffusion?

  • DrivingDiffusion is a diffusion-model framework for autonomous driving scene generation. It implements layout-controlled multi-view image and video generation, achieving SOTA on both tasks.
  • DrivingDiffusion-Future, as an autonomous driving world model, can predict future scene videos from a single frame image and influence the motion planning of the ego vehicle and other vehicles through language prompts.

What is the effect of DrivingDiffusion generation?

Interested readers can first take a look at the project homepage: https://drivingdiffusion.github.io

(1) DrivingDiffusion

Layout-controlled multi-view image generation


The figure shows the multi-view image generation effect with layout projection as input.

Adjust the layout: Precisely control the generated results


The upper part of the figure shows the diversity of generated results and the importance of the module designs described below. The lower part shows the results of perturbing the vehicle directly behind, including the generated effects of moving, turning, colliding, and even floating in the air.

Layout controlled multi-view video generation

Top: Video generation results of DrivingDiffusion after training on nuScenes data. Bottom: Video generation results of DrivingDiffusion after training on a large amount of private real-world data.

(2) DrivingDiffusion-Future

Generate subsequent frames based on the text description of the input frame


Using a single frame image as input, subsequent driving-scene frames are constructed according to text descriptions of the ego vehicle/other vehicles. The first three rows and the fourth row of the figure show the generation results when the text description controls the behavior of the ego vehicle and of other vehicles, respectively. (Green boxes are input, blue boxes are output.)

Generate subsequent frames directly based on the input frame


Without any other control, a single frame image is used as input to predict subsequent frames of the driving scene. (Green boxes are input, blue boxes are output.)

How does DrivingDiffusion solve the above problems?

DrivingDiffusion first artificially constructs all 3D ground truth in the scene (obstacles/road structure), projects it into layout images using the camera parameters, and feeds these to the model to obtain realistic images/videos from the multi-camera perspective. The reason the 3D ground truth (BEV views or encoded instances) is not used as model input directly, but is first projected with the camera parameters, is to eliminate systematic 3D-2D consistency errors. In such a dataset, both the 3D ground truth and the camera parameters are constructed artificially according to actual needs: the former brings the ability to construct rare-scene data at will, and the latter eliminates the geometric-consistency errors of traditional data production.
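The projection step can be sketched as a standard pinhole-camera transform. The matrices and values below are illustrative toy numbers, not DrivingDiffusion's actual calibration data:

```python
# Minimal sketch: project a 3D point from the ego frame into image pixels
# with a pinhole camera model. Intrinsics/extrinsics here are illustrative.

def project_point(point_3d, extrinsic, intrinsic):
    """point_3d: (x, y, z) in the ego frame.
    extrinsic: 4x4 ego-to-camera transform (list of rows).
    intrinsic: 3x3 camera matrix (list of rows).
    Returns (u, v) pixel coordinates, or None if behind the camera."""
    x, y, z = point_3d
    # Homogeneous transform into the camera frame (first three rows).
    cam = [sum(extrinsic[r][c] * v for c, v in enumerate((x, y, z, 1.0)))
           for r in range(3)]
    if cam[2] <= 0:          # point is behind the image plane
        return None
    # Perspective projection with the intrinsic matrix.
    u = intrinsic[0][0] * cam[0] / cam[2] + intrinsic[0][2]
    v = intrinsic[1][1] * cam[1] / cam[2] + intrinsic[1][2]
    return (u, v)

# Identity extrinsic (camera frame == ego frame) and toy intrinsics.
K = [[1000.0, 0.0, 640.0], [0.0, 1000.0, 360.0], [0.0, 0.0, 1.0]]
T = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
print(project_point((1.0, 0.5, 10.0), T, K))  # a point 10 m ahead -> (740.0, 410.0)
```

Projecting every box corner and lane point this way yields the layout image; since the camera parameters are chosen by hand, the 2D layout is geometrically consistent with the 3D ground truth by construction.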

One question remains at this point: can the quality of the generated images/videos meet the requirements of downstream use?

When it comes to constructing scenarios, people often think of simulation engines, but there is a large domain gap between the data they produce and real data, and the outputs of GAN-based methods often deviate from the distribution of actual real data. Diffusion models, which generate data by learning to reverse a Markov noising process, produce results of higher fidelity and are better suited to serve as substitutes for real data.
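The Markov noising process mentioned above has a well-known closed form: at step t the sample is x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps. A minimal sketch, assuming a generic linear beta schedule (not DrivingDiffusion's actual settings):

```python
import math
import random

# Sketch of the forward (noising) step of a diffusion model:
#   x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps,
# a closed-form sample from the Markov noising chain at step t.
# The linear beta schedule below is a common illustrative choice.

T_STEPS = 1000
betas = [1e-4 + (0.02 - 1e-4) * t / (T_STEPS - 1) for t in range(T_STEPS)]
alpha_bars = []
prod = 1.0
for b in betas:
    prod *= (1.0 - b)        # alpha_bar_t = product of (1 - beta_s) up to t
    alpha_bars.append(prod)

def noise_sample(x0, t, eps=None):
    """Return the noised value x_t of a scalar x0 at timestep t."""
    if eps is None:
        eps = random.gauss(0.0, 1.0)
    a = alpha_bars[t]
    return math.sqrt(a) * x0 + math.sqrt(1.0 - a) * eps

# At t=0 the signal is almost untouched; at t=T-1 it is nearly pure noise.
print(noise_sample(1.0, 0, eps=0.0))
```

The reverse model learns to undo exactly these noising steps, which is why the sampled outputs stay close to the training distribution.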

DrivingDiffusion directly generates sequential multi-view images from artificially constructed scenes and camera parameters. These can serve both as training data for downstream autonomous driving tasks and as the basis of a simulation system that provides feedback to autonomous driving algorithms.

The "artificially constructed scene" here only contains obstacle and road-structure information, but DrivingDiffusion's framework can easily introduce layout information such as signboards, traffic lights, and construction areas, and even low-level control modalities such as occupancy grids and depth maps.

Overview of DrivingDiffusion method

There are several difficulties when generating multi-view videos:

  • Compared with ordinary image generation, multi-view video generation adds two new dimensions: view and time. How do we design a framework that can generate long videos, and how do we maintain cross-view and cross-frame consistency?
  • From the perspective of autonomous driving tasks, instances in the scene are crucial. How do we ensure the quality of generated instances?
DrivingDiffusion designs a general training framework: using stable-diffusion-v1-4 as the pre-trained image model, it expands the original 2D image input with 3D pseudo-convolutions into a 3D-UNet that can process the new view/time dimensions. With a diffusion model that can handle these new dimensions, it then performs alternating, iterative video extension to ensure consistency over both short clips and the whole long video. In addition, DrivingDiffusion proposes a Consistency Module and a Local Prompt, which respectively address cross-view/cross-frame consistency and instance quality.

DrivingDiffusion's long-video generation process

  1. A single-frame multi-view model generates the multi-view key frames.
  2. A single-view temporal model, shared across views and taking the key frames as additional control, extends each view in time in parallel.
  3. A single-frame multi-view model, taking the generated results as additional control, fine-tunes the subsequent frames in parallel for cross-view consistency.
  4. New key frames are determined and the video is extended through a sliding window.
Training framework for the cross-view and temporal models
  • For the multi-view model and the temporal model, the extended dimensions of the 3D-UNet are view and time, respectively. Both share the same layout controller. The author believes that subsequent frames can obtain scene information from the multi-view key frames and implicitly learn the associations between different targets. The two models use different consistency attention modules but the same Local Prompt module.
  • Layout encoding: obstacle category/instance information and road-structure segmentation are rendered into RGB layout images with fixed per-class values, which are then encoded into a layout token.
  • Key-frame control: every temporal-extension process uses the multi-view images of a certain key frame, based on the assumption that subsequent frames within a short clip can obtain their information from that key frame. Every fine-tuning process uses the key frame plus the multi-view images of a subsequent frame generated from it as additional control, and outputs that frame's multi-view images with optimized cross-view consistency.
  • View-specific optical-flow prior: for the temporal model, only data from a single view is sampled during training. In addition, a pre-computed per-pixel optical-flow prior for that view's images is encoded as a camera-ID token and injected into the hidden layers of the diffusion process, similar to a time embedding.
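The alternating keyframe/extension/fine-tuning loop above can be sketched as a scheduling routine. The window size and keyframe interval below are made-up illustrative values, not the paper's actual settings:

```python
# Sketch of the long-video extension loop: generate a keyframe, extend each
# view in time up to the window edge, fine-tune the new frames, then promote
# a generated frame to keyframe and slide the window forward.
# Assumes key_interval <= window so no frames are skipped.

def long_video_schedule(total_frames, key_interval, window):
    """Return a list of (step_kind, frame_indices) tuples in generation order."""
    steps = [("multi_view_keyframe", [0])]
    key = 0
    while key + 1 < total_frames:
        end = min(key + window, total_frames - 1)
        # Temporal model extends every view in parallel from the keyframe.
        steps.append(("temporal_extension", list(range(key + 1, end + 1))))
        # Multi-view model fine-tunes the freshly generated frames.
        steps.append(("multi_view_finetune", list(range(key + 1, end + 1))))
        # Promote a generated frame to be the next keyframe and slide on.
        key = min(key + key_interval, total_frames - 1)
        if key + 1 < total_frames:
            steps.append(("multi_view_keyframe", [key]))
    return steps

for kind, frames in long_video_schedule(total_frames=8, key_interval=4, window=4):
    print(kind, frames)
```

The point of the alternation is that short-range consistency comes from the temporal model within each window, while long-range consistency comes from each new keyframe being conditioned on previously generated content.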

Consistency Module & Local Prompt


The Consistency Module consists of two parts: a consistency attention mechanism and a consistency correlation loss.

The consistency attention mechanism restricts interaction to adjacent views and temporally related frames. Specifically, for cross-view consistency, each view only interacts with its left and right neighboring views, which overlap with it; for the temporal model, each frame only attends to the key frame and the previous frame. This avoids the huge computational cost of global interaction.
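This restriction amounts to a sparse attention mask. A minimal sketch, assuming a ring of surround cameras and a keyframe at index 0 (both illustrative choices):

```python
# Sketch: boolean attention masks for the two consistency mechanisms.
# mask[i][j] == True means element i may attend to element j.

def cross_view_mask(num_views):
    """Each view attends to itself and its left/right neighbours on a camera ring."""
    return [[j in (i, (i - 1) % num_views, (i + 1) % num_views)
             for j in range(num_views)]
            for i in range(num_views)]

def temporal_mask(num_frames, key_frame=0):
    """Each frame attends to itself, the keyframe, and the previous frame."""
    return [[j in (i, key_frame, i - 1)
             for j in range(num_frames)]
            for i in range(num_frames)]

# 6 surround views: view 0 sees views 5, 0, 1 only.
print(cross_view_mask(6)[0])
# 5 frames: frame 3 sees the keyframe (0), frame 2, and itself.
print(temporal_mask(5)[3])
```

Compared with full attention over all views x frames, each query now attends to a constant number of neighbours, so the cost grows linearly rather than quadratically in sequence length.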

The consistency correlation loss adds geometric constraints through pixel-level correlation and pose regression, with gradients provided by a pre-trained pose regressor. The regressor adds a pose-regression head on top of LoFTR and is trained with ground-truth poses on the real data of the corresponding dataset. For the multi-view model and the temporal model, this module supervises the relative camera poses and the ego vehicle's motion pose, respectively.

Local Prompt cooperates with the global prompt to reuse the parameter semantics of CLIP and stable-diffusion-v1-4 and locally enhance the image regions of specific category instances. As shown in the figure, on top of the cross-attention between image tokens and the global text prompt, the author designs a local prompt for each category and lets the image tokens inside that category's mask region query it. This process makes maximal use of the open-domain, text-guided image-generation concepts already present in the original model parameters.
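The masked-query idea can be sketched as follows. The dot-product attention is standard; the token shapes, the residual add, and all names here are illustrative assumptions, not the paper's exact implementation:

```python
import math

# Sketch of the Local Prompt mechanism: only image tokens inside a
# category's mask region attend to that category's local prompt embeddings;
# tokens outside the mask pass through unchanged.

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def local_prompt_attend(image_tokens, mask, prompt_tokens):
    """image_tokens: list of d-dim vectors; mask: per-token bools;
    prompt_tokens: list of d-dim vectors for one category's local prompt."""
    out = []
    for tok, inside in zip(image_tokens, mask):
        if not inside:
            out.append(tok)          # outside the mask: untouched
            continue
        # Dot-product attention from the image token to the prompt tokens.
        scores = [sum(t * p for t, p in zip(tok, pt)) for pt in prompt_tokens]
        weights = softmax(scores)
        attended = [sum(w * pt[d] for w, pt in zip(weights, prompt_tokens))
                    for d in range(len(tok))]
        # Residual add, as in a standard cross-attention block.
        out.append([t + a for t, a in zip(tok, attended)])
    return out
```

Restricting the query to the mask region is what makes the enhancement local: the prompt for "vehicle", say, only reshapes features where the layout says a vehicle should be.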

Overview of DrivingDiffusion-Future method


For future scene construction, DrivingDiffusion-Future uses two approaches. The first predicts subsequent frames directly from the first frame image (the visual branch), using inter-frame optical flow as an auxiliary loss; this is relatively simple, but its ability to generate subsequent frames from text descriptions is mediocre. The second adds a concept branch on top of the first, which predicts the BEV views of subsequent frames from the first frame's BEV view; predicting the BEV view helps the model capture the core information of the driving scene and establish concepts. Here the text description acts on both branches at the same time, and the features of the concept branch are fed into the visual branch through a BEV2PV view-conversion module, some of whose parameters are pre-trained with ground-truth images replacing the noise input (and frozen during subsequent training). Notably, the controller for ego-vehicle control text and the controller for other-vehicle/environment text are decoupled.

Experimental Analysis

To evaluate the model, DrivingDiffusion uses the frame-level Fréchet Inception Distance (FID) to assess image quality and, correspondingly, FVD to assess video quality. All metrics are computed on the nuScenes validation set. As shown in Table 1, compared with BEVGen for image generation and DriveDreamer for video generation in autonomous driving scenarios, DrivingDiffusion has clear advantages in performance under different settings.
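FID and FVD both fit Gaussians to feature distributions of real and generated data and report the Fréchet distance between them. For 1-D Gaussians the distance has a simple closed form, which this toy sketch illustrates (real FID operates on multivariate Inception features, not scalars):

```python
import math

# 1-D Fréchet distance between Gaussian fits of two samples:
#   d^2 = (mu1 - mu2)^2 + sigma1^2 + sigma2^2 - 2*sigma1*sigma2
# This is the univariate special case of the statistic behind FID/FVD.

def frechet_1d(xs, ys):
    def stats(v):
        mu = sum(v) / len(v)
        var = sum((x - mu) ** 2 for x in v) / len(v)
        return mu, math.sqrt(var)
    mu1, s1 = stats(xs)
    mu2, s2 = stats(ys)
    return (mu1 - mu2) ** 2 + s1 ** 2 + s2 ** 2 - 2 * s1 * s2

# Identical spread, mean shifted by 1 -> distance 1.0.
print(frechet_1d([0.0, 2.0], [1.0, 3.0]))
```

Lower is better: identical distributions give 0, and either a mean shift or a spread mismatch increases the score.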


Although metrics such as FID are commonly used to measure image-synthesis quality, they neither fully reflect the design goals of this task nor the synthesis quality of different semantic classes. Since the task is dedicated to generating multi-view images consistent with the 3D layout, DrivingDiffusion proposes using BEV perception metrics to measure this consistency: with the official CVT and BEVFusion models as evaluators, images are generated conditioned on the same ground-truth 3D layouts as the nuScenes validation set; CVT and BEVFusion inference is run on each set of generated images, and the predictions are compared with the ground truth, using the mean Intersection over Union (mIoU) of the drivable area and the NDS over all object classes. The statistics are shown in Table 2. The results show that the perception metrics on the synthetic evaluation set are very close to those on the real evaluation set, reflecting both the high consistency of the generated results with the 3D ground truth and the high fidelity of the image quality.
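The drivable-area part of this check boils down to IoU between predicted and ground-truth BEV grids. A minimal sketch on toy binary grids (the real evaluation uses CVT/BEVFusion predictions on nuScenes, not these lists):

```python
# Sketch of the mIoU consistency check: compare a BEV perception model's
# predictions on generated images against the ground-truth layout.
# Both grids here are toy flattened binary maps (1 = drivable cell).

def iou(pred, gt):
    """pred, gt: equal-length lists of 0/1 cells for one class."""
    inter = sum(1 for p, g in zip(pred, gt) if p == 1 and g == 1)
    union = sum(1 for p, g in zip(pred, gt) if p == 1 or g == 1)
    return inter / union if union else 1.0

def mean_iou(preds, gts):
    """Average IoU over per-class (pred, gt) grid pairs."""
    return sum(iou(p, g) for p, g in zip(preds, gts)) / len(preds)

print(iou([1, 1, 0, 0], [1, 0, 1, 0]))  # intersection 1, union 3 -> ~0.333
```

If the generated images truly respect the 3D layout, a perception model run on them should reproduce that layout, so this score on synthetic data should approach the score on real data.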


In addition to the above experiments, DrivingDiffusion conducted experiments on adding synthetic data to training in order to address the main problem it targets: improving the performance of downstream autonomous driving tasks. Table 3 shows the performance improvements achieved by synthetic-data augmentation in BEV perception. The original training data has long-tail problems, especially for small targets, close-range vehicles, and vehicle orientation angles; DrivingDiffusion focuses on generating additional data for these under-sampled classes. After adding 2,000 frames of data focused on improving the distribution of obstacle orientation angles, NDS improved slightly while mAOE dropped significantly from 0.5613 to 0.5295. After using 6,000 frames of more comprehensive, rare-scene-focused synthetic data to assist training, a significant improvement is observed on the nuScenes validation set: NDS increased from 0.412 to 0.434 and mAOE decreased from 0.5613 to 0.5130. This demonstrates the gains that synthetic-data augmentation can bring to perception tasks. Users can analyze the distribution of each dimension in their data according to actual needs, and then supplement it with targeted synthetic data.

The significance and future work of DrivingDiffusion

DrivingDiffusion simultaneously realizes multi-view video generation for autonomous driving scenes and future prediction, which is of great significance for autonomous driving tasks. Since the layout and camera parameters are all constructed artificially, and the 3D-2D conversion is done by projection rather than by learnable model parameters, the geometric errors of the usual data-acquisition process are eliminated, giving the method strong practical value. At the same time, DrivingDiffusion is highly extensible, supporting new scene-content layouts and additional controllers, and its generation quality can be improved losslessly through super-resolution and video frame interpolation.

In autonomous driving simulation, there are more and more NeRF-based attempts. However, for street-view generation, separating dynamic and static content, large-scale block reconstruction, and decoupled appearance control over weather and other dimensions all bring a huge amount of work, and a NeRF typically supports novel-view synthesis for subsequent simulation only after being trained on a specific scene range. DrivingDiffusion naturally contains a certain general knowledge prior, including vision-text associations and a conceptual understanding of visual content, and can quickly create a scene on demand simply by constructing the layout. However, as mentioned above, the overall pipeline is relatively complex, and long-video generation requires post-hoc model fine-tuning and extension. DrivingDiffusion will continue to explore compression of the view and time dimensions, combine NeRF for novel-view generation and conversion, and keep improving generation quality and scalability.

source:51cto.com