
This high-definition video isn't real: 3D scenes rendered from a few photos are nearly impossible to tell from the real thing

PHPz
Release: 2024-08-05 20:15:51
[Animation: a 3D scene rendered in real time from several photos]
Note that the animation above is entirely a 3D scene rendered from several photos; its flaws are almost impossible for the human eye to detect.

So how is this achieved?

Meshes and points are the most common 3D scene representations; because they are explicit, they are well suited to fast GPU/CUDA-based rasterization. In contrast, state-of-the-art Neural Radiance Field (NeRF) methods are built on continuous scene representations, typically optimizing multi-layer perceptrons (MLPs) through volumetric ray marching to synthesize novel views of a captured scene. While the continuity of these methods helps optimization, the random sampling required for rendering is expensive and introduces noise.
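To make that cost contrast concrete, here is a minimal sketch (not the authors' code) of the volume-rendering quadrature a NeRF-style method evaluates along every camera ray. Each of the N samples requires an MLP query, which is what makes per-pixel rendering expensive; an explicit representation rasterized on the GPU avoids the per-sample network evaluation entirely.

```python
import numpy as np

def composite_ray(sigmas, colors, deltas):
    """NeRF-style volume-rendering quadrature along one camera ray.

    sigmas: (N,) densities predicted at N sampled points
    colors: (N, 3) RGB predicted at those points
    deltas: (N,) distances between consecutive samples
    """
    alphas = 1.0 - np.exp(-sigmas * deltas)                          # per-sample opacity
    trans = np.cumprod(np.concatenate(([1.0], 1.0 - alphas[:-1])))   # transmittance T_i
    weights = trans * alphas                                         # contribution of each sample
    return (weights[:, None] * colors).sum(axis=0)                   # final pixel color
```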

Researchers from Université Côte d'Azur introduced a new method that combines the advantages of both worlds: a 3D Gaussian representation that achieves state-of-the-art (SOTA) visual quality with competitive training times, and a tile-based splatting algorithm that delivers SOTA real-time rendering at 1080p resolution on several datasets.
Paper address: https://huggingface.co/papers/2308.04079
The research team set a clear goal: real-time rendering of scenes captured with multiple photos, with optimization times on typical real scenes as fast as the fastest prior methods. The method previously proposed by Fridovich-Keil et al. achieves fast training, but struggles to reach the visual quality of current SOTA NeRF methods, which can require up to 48 hours of training time. Other work has proposed fast but lower-quality radiance-field methods that, depending on the scene, achieve interactive rendering (10-15 frames per second), but these still fall short of real-time rendering at high resolution.

Next, let's look at how this is implemented. The team's solution consists of three main parts.

First, 3D Gaussians are introduced as a flexible and expressive scene representation. The input is similar to that of NeRF methods: cameras calibrated with structure-from-motion (SfM), and a set of 3D Gaussians initialized from the sparse point cloud produced as a by-product of the SfM process. Notably, the method obtains high-quality results using only these SfM points as input, and on the NeRF synthetic dataset it reaches high quality even from random initialization. The study shows that 3D Gaussians are a well-suited choice.
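As an illustration only (a sketch under assumed conventions, not the released implementation), initializing isotropic Gaussians from an SfM point cloud could look like the following; the field names and the initial opacity value are placeholders.

```python
import numpy as np
from scipy.spatial import cKDTree

def init_gaussians_from_sfm(points, colors, k=3):
    """Initialize isotropic 3D Gaussians from a sparse SfM point cloud.

    points: (P, 3) 3D positions from structure-from-motion
    colors: (P, 3) RGB colors of the SfM points
    """
    tree = cKDTree(points)
    # Distance to the k nearest neighbors (the first hit is the point itself).
    dists, _ = tree.query(points, k=k + 1)
    scales = dists[:, 1:].mean(axis=1)                  # isotropic scale per Gaussian
    return {
        "position": points,                             # refined later by optimization
        "scale": np.tile(scales[:, None], (1, 3)),      # one scale per axis
        "rotation": np.tile([1.0, 0.0, 0.0, 0.0], (len(points), 1)),  # identity quaternion
        "opacity": np.full(len(points), 0.1),           # low initial opacity (assumed value)
        "sh_dc": colors,                                # degree-0 SH color from the point
    }
```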
Second, the properties of the 3D Gaussians are optimized: 3D position, opacity α, anisotropic covariance, and spherical harmonic (SH) coefficients. This optimization, interleaved with adaptive density control, produces a rather compact, unstructured, and precise representation of the scene.
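One practical detail: a covariance matrix is only meaningful if it is positive semi-definite, which plain gradient descent would not preserve. The paper therefore optimizes a scale vector and a rotation quaternion and rebuilds the covariance as Σ = R S Sᵀ Rᵀ. A minimal sketch:

```python
import numpy as np

def quat_to_rotmat(q):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    q = np.asarray(q, dtype=float)
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def covariance(scale, quat):
    """Build Sigma = R S S^T R^T, positive semi-definite by construction."""
    R = quat_to_rotmat(quat)
    M = R @ np.diag(scale)
    return M @ M.T   # valid covariance for any scale/rotation values
```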

Third, a real-time rendering solution built on a fast GPU sorting algorithm. Thanks to the 3D Gaussian representation, anisotropic splatting can be performed while respecting visibility ordering, through sorting and alpha-blending; and by tracking the traversal of as many sorted splats as needed, a fast and accurate backward pass is enabled.
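Once the splats covering a tile are depth-sorted, the per-pixel work reduces to front-to-back alpha blending with early termination. A simplified scalar sketch for illustration (the real implementation is a tile-based CUDA kernel):

```python
def blend_pixel(sorted_splats):
    """Front-to-back alpha blending over depth-sorted Gaussian splats.

    sorted_splats: iterable of (alpha, rgb) for one pixel, nearest first,
    where alpha already includes the 2D Gaussian falloff at this pixel.
    """
    color = [0.0, 0.0, 0.0]
    transmittance = 1.0
    for alpha, rgb in sorted_splats:
        w = transmittance * alpha
        color = [c + w * ch for c, ch in zip(color, rgb)]
        transmittance *= (1.0 - alpha)
        if transmittance < 1e-4:   # early termination once the pixel is opaque
            break
    return color
```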

Overview of methods

In summary, this paper makes the following contributions:

Introduction of anisotropic 3D Gaussians as a high-quality, unstructured representation of radiance fields;
An optimization method for 3D Gaussian properties, interleaved with adaptive density control, that creates high-quality representations of captured scenes;

A fast, differentiable, visibility-aware rendering method for GPUs that allows anisotropic splatting and fast backpropagation, enabling high-quality novel view synthesis.
Experiments

The figure below compares the results of this method with those of previous methods.
[Figure: qualitative comparison with previous methods]
The scenes, from top to bottom: Bicycle, Garden, Counter, and Room from the Mip-NeRF360 dataset, and Playroom from the Deep Blending dataset (see the original paper for more comparisons). Noticeable differences between the methods are marked in the figure, such as the bicycle's spokes, the glass of the house at the far end of the garden, the bars of the iron basket, and the teddy bear.

It can be observed that this method preserves fine details better than previous approaches.

The difference is even more apparent in the video.

In addition, Figure 6 shows that even at 7K iterations (~5 minutes), the method already captures the train's details very well. At 30K iterations (~35 minutes), background artifacts are significantly reduced. For the garden scene, the difference is barely noticeable: 7K iterations (~8 minutes) already yields very high quality.
For consistent and meaningful comparisons, the research team adopted the methodology suggested by Mip-NeRF360: each dataset is divided into training and test sets, with every 8th photo held out for testing, and error metrics are computed with the standards most commonly used in the literature: PSNR, LPIPS, and SSIM. Table 1 gives the detailed numbers.
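This protocol is straightforward to reproduce. A sketch, assuming images scaled to [0, 1] (SSIM and LPIPS would come from standard packages such as scikit-image and lpips):

```python
import numpy as np

def train_test_split(images):
    """Mip-NeRF360 protocol: every 8th photo goes to the test set."""
    test = images[::8]
    train = [im for i, im in enumerate(images) if i % 8 != 0]
    return train, test

def psnr(pred, gt):
    """Peak signal-to-noise ratio between two images in [0, 1]."""
    mse = np.mean((pred.astype(np.float64) - gt.astype(np.float64)) ** 2)
    return 10.0 * np.log10(1.0 / mse)
```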
Table 1 presents a quantitative evaluation of the new method against previous work, computed across three datasets. Results marked with "†" are taken directly from the original papers; all other results come from the team's own experiments.
[Table: PSNR scores on the synthetic NeRF dataset]
On the synthetic NeRF dataset, the method achieves the better PSNR score in most cases, reaching state-of-the-art levels.

Ablation Experiments

The research team isolated each contribution and algorithmic choice and constructed a set of experiments to measure their effects. The following aspects of the algorithm were tested: initialization from SfM, the densification strategy, anisotropic covariance, allowing an unlimited number of splats to receive gradients, and the use of spherical harmonics. The table below summarizes the quantitative effect of each option.
[Table: quantitative effect of each ablated component]
Let's look at the effects more intuitively.
Initializing from SfM points produces noticeably better results.
Ablating the densification strategy for its two cases, Clone and Split.
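For intuition, here is a hedged sketch of what Clone and Split do during adaptive density control, following the paper's description; the gradient and size thresholds and the positional nudge are illustrative, while the split factor 1.6 is the value reported in the paper.

```python
import numpy as np

def densify(gaussians, grad_norms, grad_thresh=0.0002, size_thresh=0.01):
    """Adaptive density control: clone small Gaussians, split large ones.

    gaussians:  list of dicts with "position" (3,) and "scale" (3,) arrays
    grad_norms: view-space position-gradient magnitude per Gaussian
    """
    new = []
    for g, grad in zip(gaussians, grad_norms):
        if grad < grad_thresh:
            continue  # well-reconstructed region, leave it alone
        if g["scale"].max() < size_thresh:
            # Clone (under-reconstruction): duplicate and nudge the copy.
            new.append({**g, "position": g["position"] + 1e-4 * np.random.randn(3)})
        else:
            # Split (over-reconstruction): shrink this Gaussian and add a
            # second copy sampled nearby, yielding two smaller Gaussians.
            g["scale"] = g["scale"] / 1.6
            new.append({**g, "position": g["position"] + g["scale"] * np.random.randn(3)})
    return gaussians + new
```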
Limiting the number of points that receive gradients has a significant impact on visual quality. Left: gradients limited to 10 Gaussians. Right: the full method.

For more details, please read the original article.

Source: jiqizhixin.com