Neural Radiance Fields (NeRF) have become a popular approach to novel view synthesis. Although NeRF is rapidly being extended to a wider range of applications and datasets, directly editing the scenes that a NeRF represents remains a significant challenge. One important task is removing unwanted objects from a 3D scene while staying consistent with the surrounding scene; this task is called 3D inpainting. In 3D, a solution must be consistent across multiple views and geometrically valid.
In this article, researchers from Samsung, the University of Toronto, and other institutions propose a new 3D inpainting method to address these challenges. Given a small set of posed images and sparse annotations on a single input view, the proposed framework first quickly obtains a 3D segmentation mask of the target object. Using this mask, it then introduces a perceptual-optimization-based approach that leverages learned 2D image inpainters and distills their information into 3D space while ensuring view consistency.
The research also introduces a new benchmark for evaluating 3D scene inpainting methods, built on a challenging dataset of real-world scenes. In particular, the dataset contains views of the same scene with and without the target object, enabling more principled benchmarking of inpainting in 3D.
The following results show that, after objects are removed, the scene remains consistent with its surroundings:
A comparison between this method and others: the other methods show obvious artifacts, while artifacts from this method are barely noticeable:
The authors take an integrated approach to the challenges of 3D scene editing. The method takes multi-view images of a scene, extracts a 3D mask from user input, and fits a NeRF to the masked images so that the target object is replaced with plausible 3D appearance and geometry. Existing interactive 2D segmentation methods do not account for 3D structure, and current NeRF-based methods cannot obtain good results from sparse annotations or reach sufficient accuracy. While some existing NeRF-based algorithms allow object removal, they do not attempt to fill in the newly exposed regions of space. To the best of the authors' knowledge, this work is the first to handle both interactive multi-view segmentation and complete 3D inpainting in a single framework.
The researchers rely on off-the-shelf, 3D-agnostic models for segmentation and inpainting, and transfer their outputs to 3D space in a view-consistent manner. Building on work in 2D interactive segmentation, the proposed pipeline starts from a few points that the user clicks on the target object in one image. From these, the algorithm initializes per-view masks with a video-based segmentation model and trains them into a coherent 3D segmentation by fitting a semantic NeRF to the masks. A pre-trained 2D inpainter is then applied to the multi-view image set, and a NeRF fitting process reconstructs the inpainted 3D scene, using a perceptual loss to tolerate inconsistencies among the independently inpainted 2D images and using inpainted depth maps to regularize the geometry of the masked region. Overall, the result is a complete pipeline, from object selection to novel view synthesis of the inpainted scene, in a unified framework with minimal burden on the user, as shown in the figure below.
In summary, the contributions of this work are as follows:
Turning to the method itself, the study first describes how to initialize a rough 3D mask from single-view annotations. Denote the annotated source view as I_1. The sparse object annotations and the source view are fed to an interactive segmentation model, which estimates the initial source object mask. The training views are then treated as a video sequence and, together with this mask, given to a video instance segmentation model V to compute an initial guess of the object mask for each view I_i. These initial masks are often inaccurate near object boundaries, because the training views are not actually adjacent video frames and video segmentation models are typically unaware of 3D structure.
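As a rough illustration of this initialization step, the sketch below wires together two hypothetical wrappers, `interactive_segmenter` and `video_segmenter`, standing in for the click-based 2D segmentation model and the video instance segmentation model V; the names and signatures are assumptions, not the authors' actual interfaces.

```python
def initialize_masks(views, user_clicks, interactive_segmenter, video_segmenter):
    """views: list of HxWx3 training images; user_clicks: sparse (x, y) points on
    the target object in the annotated source view views[0]."""
    # Estimate the object mask in the source view I_1 from the sparse clicks.
    source_mask = interactive_segmenter(views[0], user_clicks)  # HxW boolean mask

    # Treat the training views as a video sequence and propagate the source mask
    # with a video instance segmentation model (the model V in the text).
    masks = video_segmenter(frames=views, first_frame_mask=source_mask)

    # These are only rough initial guesses; their boundaries are refined by the
    # multi-view segmentation NeRF described next.
    return masks
```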
The multi-view segmentation module takes the input RGB images, the corresponding camera intrinsic and extrinsic parameters, and the initial masks, and trains a semantic NeRF. The figure above depicts the network used for semantic NeRF: for a point x and a view direction d, in addition to the density σ and color c, it returns a pre-sigmoid object logit s(x). For fast convergence, the researchers use instant-NGP as the NeRF architecture. The objectness logit associated with a ray r is obtained by volume-rendering the logits of the points along r, instead of their colors, with respect to the density:
The classification loss is then used for supervision:
The overall loss used to supervise the NeRF-based multi-view segmentation model is:
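The following PyTorch-style sketch illustrates the two ingredients just described: volume-rendering per-point objectness logits along a ray with the usual density-derived weights, and supervising the rendered logits with a binary cross-entropy term alongside the photometric loss. The NeRF backbone (instant-NGP in the paper) is omitted, and `lambda_mask` is an assumed name for the balancing weight.

```python
import torch
import torch.nn.functional as F

def render_ray_logit(sigmas, logits, deltas):
    """Volume-render per-point objectness logits s(x) along each ray, using the
    same density-derived weights that NeRF uses for color.
    sigmas, logits, deltas: tensors of shape (num_rays, num_samples)."""
    alphas = 1.0 - torch.exp(-sigmas * deltas)                  # per-sample opacity
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alphas[:, :1]), 1.0 - alphas + 1e-10], dim=-1),
        dim=-1)[:, :-1]                                         # accumulated transmittance
    weights = trans * alphas
    return (weights * logits).sum(dim=-1)                       # per-ray objectness logit

def segmentation_loss(ray_logits, ray_colors, gt_masks, gt_colors, lambda_mask=0.01):
    """Photometric loss on rendered colors plus a classification loss on the
    rendered logits; the exact weighting is an assumption."""
    cls = F.binary_cross_entropy_with_logits(ray_logits, gt_masks.float())
    rgb = F.mse_loss(ray_colors, gt_colors)
    return rgb + lambda_mask * cls
```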
Finally, a two-stage optimization is used to further refine the masks: after the initial 3D mask is obtained, it is rendered from the training views and used to supervise a second multi-view segmentation model, serving as the initial guess in place of the video segmentation outputs. A sketch of this procedure is given below.
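A minimal sketch of this two-stage refinement, assuming a hypothetical `SemanticNeRF` wrapper with `fit` and `render_masks` methods; these names are illustrative rather than the authors' API.

```python
def refine_masks(images, poses, initial_masks, SemanticNeRF):
    """Two-stage mask refinement; SemanticNeRF is a hypothetical wrapper with
    fit(images, masks) and render_masks(poses) methods."""
    # Stage 1: fit a semantic NeRF to the initial (video-segmentation) masks.
    stage1 = SemanticNeRF()
    stage1.fit(images, initial_masks)
    # Render the learned, 3D-consistent mask back into every training view.
    rendered_masks = stage1.render_masks(poses)
    # Stage 2: retrain with the rendered masks as the initial guess.
    stage2 = SemanticNeRF()
    stage2.fit(images, rendered_masks)
    return stage2.render_masks(poses)
```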
The figure above shows an overview of the view-consistent inpainting approach. Since the lack of suitable data prevents directly training a 3D inpainting model, this study leverages existing 2D inpainting models to obtain depth and appearance priors, which then supervise the fitting of a NeRF to the completed scene. This inpainted NeRF is trained with the following loss:
The study proposes a view-consistent inpainting method whose input is the set of RGB images. First, each image-and-mask pair is passed to an image inpainter to obtain inpainted RGB images. Because each view is inpainted independently, directly using the inpainted views to supervise NeRF reconstruction is problematic. Instead of using a mean squared error (MSE) loss over the masked region, the researchers propose using the perceptual loss LPIPS to optimize the masked part of the image, while still using MSE for the unmasked part. This loss is calculated as follows:
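Below is a minimal sketch of a masked perceptual-plus-MSE objective of this kind, using the publicly available `lpips` package; the patch sampling, exact compositing, and weighting used by the authors are not shown, and `lambda_lpips` is an assumed hyperparameter name.

```python
import torch.nn.functional as F
import lpips  # pip install lpips

lpips_fn = lpips.LPIPS(net='vgg')  # perceptual distance; expects (N, 3, H, W) in [-1, 1]

def inpainting_rgb_loss(rendered, inpainted, mask, lambda_lpips=0.1):
    """rendered, inpainted: (N, 3, H, W) images in [0, 1];
    mask: (N, 1, H, W), 1 inside the removed object, 0 outside."""
    # MSE on the unmasked region, where the original scene content is reliable.
    mse = F.mse_loss(rendered * (1 - mask), inpainted * (1 - mask))
    # Perceptual term focused on the masked region: composite the rendering inside
    # the mask with the inpainted image outside it, then compare with LPIPS so the
    # independently inpainted views do not have to agree pixel-by-pixel.
    composite = rendered * mask + inpainted * (1 - mask)
    perceptual = lpips_fn(composite * 2 - 1, inpainted * 2 - 1).mean()
    return mse + lambda_lpips * perceptual
```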
Even with the perceptual loss, discrepancies between the inpainted views can still lead the model to converge to low-quality geometry (for example, "blurry" geometry may form near the camera to explain the differing information coming from each view). The researchers therefore use inpainted depth maps as additional guidance for the NeRF model, and detach the weights when computing the perceptual loss so that it fits only the color of the scene. To do this, a NeRF optimized on the images that still contain the unwanted object is used to render depth maps corresponding to the training views, computed by volume-rendering the distance to the camera instead of the color of each point:
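Following the standard NeRF rendering scheme, the sketch below replaces per-point colors with their distances along the ray, using the same density-derived weights as in the earlier segmentation sketch; this is a simplified illustration, not the authors' exact implementation.

```python
import torch

def render_ray_depth(sigmas, t_vals, deltas):
    """sigmas: per-sample densities; t_vals: per-sample distances to the camera;
    deltas: spacing between samples. All tensors of shape (num_rays, num_samples)."""
    alphas = 1.0 - torch.exp(-sigmas * deltas)
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alphas[:, :1]), 1.0 - alphas + 1e-10], dim=-1),
        dim=-1)[:, :-1]
    weights = trans * alphas
    # Accumulate distances instead of colors: the expected ray-termination depth.
    return (weights * t_vals).sum(dim=-1)
```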
The rendered depth maps are then fed to the inpainter model to obtain inpainted depth maps. The researchers found that using LaMa for depth, as for RGB, yields sufficiently high-quality results. This NeRF can be the same model used for multi-view segmentation; if the masks come from another source, such as human annotation, a new NeRF is fitted to the scene. The inpainted depth maps are then used to supervise the geometry of the inpainted NeRF, by penalizing the distance between its rendered depth and the inpainted depth:
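A sketch of this depth supervision term, assuming a simple squared distance between the two depth maps; the exact norm and any per-region weighting used by the authors are assumptions here.

```python
import torch

def depth_supervision_loss(rendered_depth, inpainted_depth):
    """rendered_depth: depth rendered from the NeRF being inpainted (see
    render_ray_depth above); inpainted_depth: the same view's depth map after
    2D inpainting (e.g., with LaMa). Both have shape (num_rays,)."""
    # Penalize the discrepancy between rendered and inpainted depth to guide
    # the geometry of the masked region.
    return torch.mean((rendered_depth - inpainted_depth) ** 2)
```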
## Experimental results
Qualitatively, the figure below compares the results of the researchers' segmentation model with the outputs of NVOS and several video segmentation approaches. Their model reduces noise and improves view consistency compared with the coarse edges produced by the 3D video segmentation models. Although NVOS uses scribbles rather than the sparse points used in the new model, the new model's MVSeg is visually superior to NVOS. Since the NVOS codebase is not available, the researchers reproduce its published qualitative results (see the supplementary material for more examples).
The following table compares the proposed approach with the baselines. Overall, the newly proposed method significantly outperforms other 2D and 3D inpainting methods. The table below further shows that removing the geometric guidance degrades the quality of the inpainted scene.
Qualitative results are shown in Figures 6 and 7. Figure 6 shows that the method reconstructs view-consistent scenes with detailed textures, including coherent views of glossy and matte surfaces. Figure 7 shows that the perceptual loss relaxes the exact-reconstruction constraint in the masked region, preventing the blur that appears when all inpainted images are used for direct supervision, while also avoiding the artifacts produced by single-view supervision.